EP-4738104-A1 - REGRESSION DETECTION USING INDICATORS FROM DEPENDENT SERVICES
Abstract
A system implements techniques for efficiently determining that an update deployed by a foundational service has caused a regression based on an aggregate health determination associated with tenant services and/or cloud resource provider services that depend upon the foundational service. The deployment of the update is initiated by an entity (e.g., an engineering team) tasked with operating and/or managing the foundational service. Accordingly, the system described herein can generate and provide a communication, to the foundational service (e.g., entity), indicating that a regression has likely been caused by the update and/or instructing the foundational service to halt the deployment of the update.
Inventors
- SUTHAR AKA GAJJAR, Meera Alpeshkumar
- NARASIMHAN, Arvind
- AGHAEI KHOUZANI, Hoda
- GANGAL, Ashish
- KUMAR, Rajive
- KWOK, Pui Yan
- XU, Zhangwei
- AGRAWAL, Laxmikant
Assignees
- Microsoft Technology Licensing, LLC
Dates
- Publication Date: 2026-05-06
- Application Date: 2025-10-24
Claims (15)
- A method (500) comprising: generating (502) a dependency graph that defines dependencies between foundational services and advanced services executing within geographic regions defined for a cloud computing environment; determining (504) that a particular foundational service is deploying an update via a rollout schedule associated with an order for the geographic regions; identifying (506), via the dependency graph, a set of advanced services that depend on the particular foundational service within a first geographic region in the order for the geographic regions; for an advanced service in the set of advanced services: retrieving (508) values for a plurality of service level indicators; categorizing (510) the advanced service as being one of healthy or unhealthy by applying an anomaly detection algorithm to the values; determining (512) that the update is causing a regression for the particular foundational service based on a number of unhealthy advanced services in the set of advanced services satisfying a threshold number of unhealthy advanced services; and providing (514) a regression notification to the particular foundational service, the regression notification instructing the particular foundational service to halt the deployment of the update to subsequent geographic regions in the order for the geographic regions in response to determining that the update is causing the regression for the particular foundational service.
- The method of claim 1, wherein: the anomaly detection algorithm is configured with threshold values that are specific to the advanced service; and the method further comprises learning, by a machine learning model, the threshold values by analyzing a training dataset for the advanced service over a training time period.
- The method of claim 1 or claim 2, further comprising establishing the threshold number of unhealthy advanced services by: calculating an average number of unhealthy advanced services across a defined number N of time units; calculating a standard deviation associated with the average number of unhealthy advanced services; and setting the threshold number of unhealthy advanced services to be a predefined number of standard deviations above the average number of unhealthy advanced services.
- The method of any preceding claim, wherein the foundational services include multiple types of foundational services in each of a compute foundational service category, a storage foundational service category, and a networking foundational service category.
- The method of any preceding claim, wherein the advanced services include tenant services and cloud resource provider services.
- The method of any preceding claim, further comprising determining the order for the geographic regions based on an amount of traffic registered for each geographic region in a defined time period, wherein the first geographic region in the order for the geographic regions has a lowest amount of traffic.
- The method of any preceding claim, wherein each advanced service and each foundational service comprises an identification parameter and at least one location parameter.
- A system (600) comprising: a processing system (602); and a computer readable storage medium storing instructions that, when executed by the processing system, cause the system to perform operations comprising: determining (504) that a particular foundational service is deploying an update via a rollout schedule associated with an order for the geographic regions; identifying (506), via a dependency graph, a set of advanced services that depend on the particular foundational service within a first geographic region in the order for the geographic regions; for an advanced service in the set of advanced services: retrieving (508) values for a plurality of service level indicators; categorizing (510) the advanced service as being one of healthy or unhealthy by applying an anomaly detection algorithm to the values; determining (512) that the update is causing a regression for the particular foundational service based on a number of unhealthy advanced services in the set of advanced services satisfying a threshold number of unhealthy advanced services; and providing (514) a regression notification to the particular foundational service in response to determining that the update is causing the regression for the particular foundational service.
- The system of claim 8, wherein: the anomaly detection algorithm is configured with threshold values that are specific to the advanced service; and the operations further comprise learning, by a machine learning model, the threshold values by analyzing a training dataset for the advanced service over a training time period.
- The system of claim 8 or claim 9, wherein the operations further comprise establishing the threshold number of unhealthy advanced services by: calculating an average number of unhealthy advanced services across a defined number N of time units; calculating a standard deviation associated with the average number of unhealthy advanced services; and setting the threshold number of unhealthy advanced services to be a predefined number of standard deviations above the average number of unhealthy advanced services.
- The system of any of claims 8 to 10, wherein the foundational services include multiple types of foundational services in each of a compute foundational service category, a storage foundational service category, and a networking foundational service category.
- The system of any of claims 8 to 11, wherein the advanced services include tenant services and cloud resource provider services.
- The system of any of claims 8 to 12, wherein the operations further comprise determining the order for the geographic regions based on an amount of traffic registered for each geographic region in a defined time period, wherein the first geographic region in the order for the geographic regions has a lowest amount of traffic.
- The system of any of claims 8 to 13, wherein each advanced service and each foundational service comprises an identification parameter and at least one location parameter.
- A computer readable storage medium (612) storing instructions that, when executed by a processing system, cause a system to perform operations comprising: determining (504) that a particular foundational service is deploying an update via a rollout schedule associated with an order for the geographic regions; identifying (506), via a dependency graph, a set of advanced services that depend on the particular foundational service within a first geographic region in the order for the geographic regions; for an advanced service in the set of advanced services: retrieving (508) values for a plurality of service level indicators; categorizing (510) the advanced service as being one of healthy or unhealthy by applying an anomaly detection algorithm to the values; determining (512) that the update is causing a regression for the particular foundational service based on a number of unhealthy advanced services in the set of advanced services satisfying a threshold number of unhealthy advanced services; and providing (514) a regression notification to the particular foundational service in response to determining that the update is causing the regression for the particular foundational service.
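The arithmetic recited in the claims can be illustrated with a short Python sketch. It orders regions from lowest to highest traffic (claims 6 and 13), sets the threshold number of unhealthy advanced services to the average over N time units plus a predefined number of standard deviations (claims 3 and 10), and decides whether the count of unhealthy services satisfies that threshold (step 512). The function names, the use of the population standard deviation, the reading of "satisfying" as "at or above", and all sample values are assumptions made for illustration only; the claims do not specify them. The per-service healthy/unhealthy flags are taken as given inputs, since the anomaly detection algorithm itself is not detailed here.

```python
from statistics import mean, pstdev


def order_regions(traffic_by_region):
    """Return region names sorted from lowest to highest traffic.

    Assumption: traffic is a single number per region for the defined period.
    """
    return sorted(traffic_by_region, key=traffic_by_region.get)


def unhealthy_threshold(history, num_std):
    """Average unhealthy count over N time units plus num_std standard deviations.

    Assumption: population standard deviation; the claims do not say which form.
    """
    return mean(history) + num_std * pstdev(history)


def should_halt(health_by_service, threshold):
    """True when the number of unhealthy services is at or above the threshold.

    Assumption: "satisfying a threshold" is read as greater than or equal to.
    """
    unhealthy = sum(1 for healthy in health_by_service.values() if not healthy)
    return unhealthy >= threshold


# Hypothetical sample data, not taken from the patent.
regions = order_regions({"region-a": 500, "region-b": 120, "region-c": 900})
threshold = unhealthy_threshold([2, 4, 4, 4, 5, 5, 7, 9], num_std=2)
health = {f"svc-{i}": (i == 0) for i in range(10)}  # only svc-0 is healthy
halt = should_halt(health, threshold)
```

With this sample history the average is 5 and the standard deviation is 2, so two standard deviations above the average gives a threshold of 9; nine unhealthy services in the first region (`region-b`, the lowest-traffic one) would therefore trigger a halt.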
Description
BACKGROUND
A cloud platform such as MICROSOFT AZURE, AMAZON WEB SERVICES, GOOGLE CLOUD, etc. is configured to provide network-based infrastructure and other resources for use by various tenants. A tenant may be a customer, a business, an organization, a client, an individual user, and so forth. An operator of a cloud platform configures and offers foundational services to support and/or enable the execution of tenant services (e.g., an application) and/or cloud resource provider services within a cloud computing environment. An entity (e.g., an engineering team) that manages a foundational service frequently deploys updates to the foundational service. An update includes modified code and/or other mechanisms configured to maintain, correct, add, and/or remove functionality (e.g., a feature) associated with the foundational service. Unfortunately, these frequently deployed updates can introduce or cause regressions that can result in functionality loss and/or sub-optimal experiences for the tenant services and/or cloud resource provider services that are supported and/or enabled by the foundational service. It is with respect to these and other considerations that the disclosure made herein is presented.
SUMMARY
The system described herein implements techniques for efficiently determining that an update deployed by a foundational service has caused a regression. The regression can impact the performance of tenant services and/or cloud resource provider services that depend upon the foundational service. The deployment of the update is initiated by an entity (e.g., an engineering team) tasked with operating and/or managing the foundational service.
Accordingly, the system described herein can generate and provide a communication, to the foundational service (e.g., the entity), indicating that a regression has likely been caused by the update and/or instructing the foundational service to halt the deployment of the update before further functionality loss and/or sub-optimal experiences for the tenant services and/or cloud resource provider services are realized.
To do this, the system generates a dependency graph that defines dependencies between the foundational services and advanced services executing within a cloud computing environment. The advanced services include the tenant services and/or the cloud resource provider services. An operator of a cloud computing environment offers the foundational services to support and/or enable the execution of the tenant services and/or the cloud resource provider services. Accordingly, the foundational services may be referred to as the "building blocks" of the cloud computing environment.
A node within the dependency graph represents an advanced service or a foundational service that can be identified, or registered, within the cloud computing environment. Accordingly, each node in the dependency graph includes an identification parameter (e.g., a name) that distinguishes one service from other services. Generally, an advanced service is dependent upon multiple foundational services. Consequently, the dependency graph includes edges that connect nodes in order to reflect the dependencies. In one example, a dependency between an advanced service and a foundational service can be implicitly added to the dependency graph based on a call from the advanced service to the foundational service (e.g., an "auto-generated" dependency).
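The graph structure described so far can be sketched in Python as follows. This is a minimal illustration, not the disclosed implementation: the class name `DependencyGraph`, its methods, the `"auto-generated"` origin label stored on each edge, and the service names are all hypothetical. Each node carries a name as its identification parameter, and an edge is added implicitly when an advanced service calls a foundational service.

```python
from collections import defaultdict


class DependencyGraph:
    """Hypothetical in-memory graph of advanced -> foundational dependencies."""

    def __init__(self):
        self.kinds = {}                 # service name -> "advanced" or "foundational"
        self.edges = defaultdict(dict)  # advanced name -> {foundational name: origin}

    def register(self, name, kind):
        """Register a node; the name serves as the identification parameter."""
        if kind not in ("advanced", "foundational"):
            raise ValueError(f"unknown service kind: {kind}")
        self.kinds[name] = kind

    def record_call(self, caller, callee):
        """Implicitly add an edge when an advanced service calls a foundational one."""
        if (self.kinds.get(caller) == "advanced"
                and self.kinds.get(callee) == "foundational"):
            self.edges[caller].setdefault(callee, "auto-generated")

    def dependents_of(self, foundational):
        """Return the advanced services that depend on the given foundational service."""
        return sorted(a for a, deps in self.edges.items() if foundational in deps)


# Hypothetical services, for illustration only.
graph = DependencyGraph()
graph.register("blob-storage", "foundational")
graph.register("virtual-network", "foundational")
graph.register("checkout-app", "advanced")
graph.register("search-app", "advanced")
graph.record_call("checkout-app", "blob-storage")
graph.record_call("checkout-app", "virtual-network")
graph.record_call("search-app", "blob-storage")
dependents = graph.dependents_of("blob-storage")
```

Storing an origin label on each edge is a design choice of this sketch; it leaves room for edges that are added by means other than an observed call.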
In another example, a dependency between an advanced service and a foundational service can be explicitly added to the dependency graph by an owner of the advanced service or the entity tasked with operating and/or managing the foundational service (e.g., a "user-defined" dependency).
Each node in the dependency graph that represents an advanced service or a foundational service further includes one or more location parameters that identify geographic regions of the cloud computing environment in which the advanced service or the foundational service is executing. The geographic regions in which the advanced service or the foundational service executes are defined by an operator of the cloud computing environment. The geographic regions can be smaller (e.g., cities, counties, states/provinces) or larger (e.g., parts of countries, continents).
The foundational services can be categorized into different categories of foundational services, such as "compute" foundational services, "storage" foundational services, and "networking" foundational services. Within the different categories of foundational services there are different types of foundational services configured to satisfy the varying needs and/or preferences of the advanced services. Therefore, owners of the advanced services (e.g., tenants, resource provider teams) select amongst the different types of foundational services in a given category. For example, an owner of an advanced service may select a type of compute foundational service, a type of storage foundational service, and a type