EP-4738127-A1 - OUTAGE PROJECTION IN CLOUD COMPUTING SYSTEMS

Abstract

Systems and methods to determine a measured risk of a service outage of a service in a cloud computing system. A system determines service dependencies and evaluates parity drift status information associated with the dependencies using an outage projection model (e.g., a machine learning model, heuristic, and/or a combination of models) trained/otherwise operative to identify a pattern of parity drift status information correlated to a historical pattern associated with a past service outage. The system determines an outage risk score and/or level representing the measured risk of a service outage occurring for the service based on the correlation. The system further provides the outage risk score and/or level (e.g., to a remediation and/or deployment orchestration system). In some examples, an alert is provided when the outage risk score and/or level satisfies a threshold (e.g., is highly indicative of a potential service outage) to proactively facilitate prevention of an outage.
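As a rough illustration of the pattern correlation described above, the following sketch compares the set of currently drifted dependencies against drift patterns recorded before past outages. It is only one possible reading: the use of Jaccard similarity as the correlation measure, the `HistoricalPattern` record, the function names, the dependency names, and the 0.8 alert threshold are all assumptions made for this example and are not taken from the disclosure, which leaves the outage projection model open (machine learning model, heuristic, or a combination).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class HistoricalPattern:
    """Hypothetical record: dependencies that were out of parity before a past outage."""
    out_of_parity: frozenset[str]


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Overlap between two sets of dependency names, in [0, 1]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def outage_risk_score(current: frozenset[str], history: list[HistoricalPattern]) -> float:
    """Assumed scoring rule: best match between the current drift pattern and any past one."""
    return max((jaccard(current, h.out_of_parity) for h in history), default=0.0)


def alert_needed(score: float, threshold: float = 0.8) -> bool:
    """Assumed meaning of 'satisfies a threshold': score at or above the threshold."""
    return score >= threshold


# Illustrative data only; the dependency names are invented.
history = [
    HistoricalPattern(frozenset({"auth", "storage"})),
    HistoricalPattern(frozenset({"queue"})),
]
current = frozenset({"auth", "storage", "cache"})
score = outage_risk_score(current, history)
print(f"outage risk score: {score:.2f}, alert: {alert_needed(score)}")
```

A set-overlap measure is used here only because it is easy to inspect by hand; a trained model could replace `outage_risk_score` without changing how the score is consumed.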

Inventors

  • KIM, GEORGE
  • LANEY, Christian
  • PEREZ, ANTHONY

Assignees

  • Microsoft Technology Licensing, LLC

Dates

Publication Date: 2026-05-06
Application Date: 2025-10-23

Claims (15)

  1. A method (500), comprising: identifying a first service in a cloud computing system; identifying (506) a second service in the cloud computing system, where the first service is dependent on the second service; receiving (508) parity drift status information of the second service in the cloud computing system; determining (512) a first outage risk score for the first service in the cloud computing system based on the parity drift status information of the second service; providing (514) an indication of the first outage risk score for the first service in the cloud computing system; providing (516) an alert corresponding to the second service being out of parity when the parity drift status information of the second service indicates the second service is out of parity and the first outage risk score satisfies an upper threshold; and triggering (520) a configuration change of the first service when the first outage risk score satisfies a lower threshold.
  2. The method of claim 1, further comprising: determining a second outage risk score for the first service in the cloud computing system based on identifying a correlation between a pattern of the parity drift status information of the second service and a pattern of historical parity drift status information corresponding to a past outage of the first service; and providing an indication of the second outage risk score for the first service in the cloud computing system.
  3. The method of claim 2, wherein the second service comprises a plurality of service dependencies of the first service.
  4. The method of claim 2, wherein determining the second outage risk score comprises: inputting the parity drift status information of the second service into an outage projection model; detecting the pattern of the parity drift status information of the second service using the outage projection model; correlating the pattern of the parity drift status information of the second service to the pattern of historical parity drift status information corresponding to the past outage of the first service using the outage projection model; calculating the second outage risk score based on the correlation using the outage projection model; and outputting the second outage risk score from the outage projection model.
  5. The method of claim 4, wherein calculating the second outage risk score comprises applying at least one of: a severity weight based on a severity level of the past outage of the first service; or a frequency weight based on a number of occurrences of the past outage of the first service.
  6. The method of claim 4, further comprising: receiving service outage information associated with the past outage of the first service; receiving the historical parity drift status information corresponding to the past outage of the first service; and configuring the outage projection model to: detect the pattern of the parity drift status information of the second service; correlate the pattern of the parity drift status information of the second service to the pattern of historical parity drift status information corresponding to the past outage of the first service; and calculate the second outage risk score based on the correlation.
  7. The method of claim 2, further comprising: determining an outage risk level based on the second outage risk score; and providing an indication of the outage risk level.
  8. A system (100), comprising: a processing system (602); and memory (604) storing instructions that, when executed, cause the system to perform operations comprising: identifying a service of interest in a first cloud computing system; identifying (506) a service dependency of the service of interest in the first cloud computing system; receiving (508) parity drift status information of the service dependency in the first cloud computing system; determining (512) a first outage risk score for the service of interest in the first cloud computing system based on the parity drift status information of the service dependency; and providing (516) an indication of the first outage risk score for the service of interest in the first cloud computing system.
  9. The system of claim 8, further comprising: determining a second outage risk score for the service of interest in the first cloud computing system based on identifying a correlation between a pattern of the parity drift status information of the service dependency in the first cloud computing system and a pattern of historical parity drift status information corresponding to a past outage of the service of interest; and providing an indication of the second outage risk score for the service of interest in the first cloud computing system.
  10. The system of claim 9, wherein: the system further comprises an outage projection model; and determining the second outage risk score comprises: inputting the parity drift status information of the service dependency into the outage projection model; detecting the pattern of the parity drift status information of the service dependency using the outage projection model; correlating the pattern of the parity drift status information of the service dependency to the pattern of historical parity drift status information corresponding to the past outage of the service of interest using the outage projection model; calculating the second outage risk score based on the correlation using the outage projection model; and outputting the second outage risk score from the outage projection model.
  11. The system of claim 10, further comprising: receiving service outage information associated with the past outage of the service of interest; receiving the historical parity drift status information corresponding to the past outage of the service of interest; and training the outage projection model to: detect the pattern of the parity drift status information of the service dependency; correlate the pattern of the parity drift status information of the service dependency to the pattern of historical parity drift status information corresponding to the past outage of the service of interest; and calculate the second outage risk score based on the correlation.
  12. The system of claim 11, further comprising providing an alert corresponding to the service dependency being out of parity when the parity drift status information of the service dependency indicates the service dependency is out of parity and the first outage risk score or the second outage risk score satisfies an upper threshold.
  13. The system of claim 11, further comprising triggering a configuration change of the service of interest when the first outage risk score or the second outage risk score satisfies a lower threshold.
  14. The system of claim 11, further comprising: receiving parity drift status information of the service dependency in a second cloud computing system; determining a third outage risk score for the service of interest in the second cloud computing system based on identifying a correlation between a pattern of the parity drift status information of the service dependency in the second cloud computing system and a pattern of historical parity drift status information corresponding to a past outage of the service of interest; and providing an indication of the third outage risk score for the service of interest for the second cloud computing system.
  15. A method (500), comprising: identifying a first service in a first cloud computing system; identifying (506) a second service in the first cloud computing system, wherein: the second service is a service dependency of the first service; and the second service comprises a plurality of service dependencies of the first service; receiving (508) parity drift status information of the second service in the first cloud computing system; determining (512) an outage risk score for the first service in the first cloud computing system based on identifying a correlation between a pattern of the parity drift status information of the second service and a pattern of historical parity drift status information corresponding to a past outage of the first service; and providing (516) an indication of the outage risk score for the first service in the first cloud computing system.
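The flow recited in claim 1 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the claimed implementation: the ratio-based score (fraction of dependencies out of parity) is borrowed from the "ratio-based outage risk scores" named for FIGURE 4A, "satisfies a threshold" is assumed to mean "is greater than or equal to", and the threshold values, the `RiskResult` type, the action strings, and the service names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RiskResult:
    """Hypothetical container for the score and the actions it triggered."""
    score: float
    actions: list[str] = field(default_factory=list)


def first_outage_risk_score(parity: dict[str, bool]) -> float:
    """Assumed ratio-based score: share of dependencies that are out of parity.

    `parity` maps a dependency name to True when it is in parity.
    """
    if not parity:
        return 0.0
    drifted = sum(1 for in_parity in parity.values() if not in_parity)
    return drifted / len(parity)


def evaluate(service: str, parity: dict[str, bool],
             lower: float = 0.25, upper: float = 0.75) -> RiskResult:
    """Score a service from its dependencies' parity status and pick actions."""
    score = first_outage_risk_score(parity)
    result = RiskResult(score)
    # Provide an indication of the score.
    result.actions.append(f"report:{service}:{score:.2f}")
    # Trigger a configuration change when the lower threshold is satisfied.
    if score >= lower:
        result.actions.append(f"config_change:{service}")
    # Alert on each out-of-parity dependency when the upper threshold is satisfied.
    if score >= upper:
        for dep, in_parity in parity.items():
            if not in_parity:
                result.actions.append(f"alert:{dep}")
    return result


# Illustrative data only; service and dependency names are invented.
result = evaluate(
    "checkout",
    {"auth": False, "storage": False, "cache": False, "queue": True},
)
print(result.score, result.actions)
```

In this example three of four dependencies are out of parity, so both thresholds are satisfied and the sketch reports the score, requests a configuration change, and raises one alert per drifted dependency.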

Description

BACKGROUND

A cloud computing system can be used to build, deploy, and manage applications and services. Cloud services of a cloud computing system are often subject to one or more distributed computing models, in which a plurality of cloud resources perform specific functions or provide specific capabilities. A dependency exists between a cloud service and a cloud resource when the service relies on that resource to function as intended. Thus, the one or more cloud resources are dependencies of the service. A software system deployed in a cloud computing system may include hundreds or thousands of different services and dependencies. Each of these services and dependencies can have multiple versions.

"Parity drift" in the context of cloud computing refers to a target cloud computing system starting to differ, or "drift", from a source or reference cloud computing system (e.g., a last known good version that has been tested and determined to not have any bugs). This can occur due to changes in configuration (e.g., an application programming interface change, a version upgrade), data, or state that are not synchronized between the two systems. Some instances of parity drift can cause inoperability issues and, in some cases, service outages. For instance, an inoperability issue may cause performance of a feature or functionality of the cloud computing system to degrade or become unstable.

It is with respect to these and other considerations that examples have been made. In addition, although relatively specific problems have been discussed, it should be understood that the examples should not be limited to solving the specific problems identified in the background.

SUMMARY

The technology described herein provides systems and methods to determine a measured risk of a service outage of a service in a cloud computing system.
An outage projection system determines dependencies of the service and evaluates parity drift status information associated with the dependencies. In some examples, the outage projection system uses a machine learning model trained to identify a pattern of parity drift status information that is correlated to a historical pattern associated with a past service outage. The system determines an outage risk score and/or level representing the measured risk of a service outage occurring for the service based on the correlation. In other examples, the outage projection system uses a heuristic model and/or a combination of models. The system further provides the outage risk score and/or level (e.g., to a remediation system). In some examples, an alert is provided when the outage risk score and/or level satisfies a threshold (e.g., is highly indicative of a potential service outage) to proactively facilitate prevention of an outage. In further examples, a new deployment or a rollback is triggered to prevent an outage. For instance, one or a combination of services can be rolled back to a latest known good state, rolled forward or back to a known state or combination of versions that is stable, etc.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure is illustrated by way of example by the accompanying figures, in which like references indicate similar elements. Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale.
FIGURE 1 is a block diagram of an example operating environment in which an outage projection system is implemented according to an example;
FIGURE 2 depicts an example dependency map of a service of interest according to an example;
FIGURE 3 depicts example parity drift status information of a service of interest and its service dependencies according to an example;
FIGURE 4A depicts an example outage risk result including ratio-based outage risk scores according to an example;
FIGURE 4B depicts an example outage risk result including ratio-based outage risk scores and indications of potential outage risk levels according to an example;
FIGURE 5 depicts a flow diagram depicting a first example method of determining a measured risk of a potential occurrence of a service outage according to an example; and
FIGURE 6 is a block diagram illustrating example physical components of a computing device with which examples of the disclosure may be practiced.

DETAILED DESCRIPTION

Implementations of the present disclosure use an outage projection system to determine a measured risk (potential) of a service outage of a service of interest in a cloud computing system according to examples. More specifically, the outage projection system determines dependencies of the service of interest