EP-4736010-A1 - AUTOMATED FAULT SCENARIO GENERATION FOR CHAOS ENGINEERING

Abstract

Aspects of the disclosure include methods and systems for performing automated fault scenario generation for chaos engineering. Aspects include obtaining a configuration of a service under test, obtaining a first plurality of fault scenarios, and applying each of the first plurality of fault scenarios to the service under test. Aspects also include recording telemetry data regarding an operation of the service under test under each of the fault scenarios, selecting, based on the telemetry data, a first fault scenario from the fault scenarios, and generating a second plurality of fault scenarios. Aspects further include applying each of the second plurality of fault scenarios to the service under test, recording telemetry data regarding the operation of the service under test under each of the second plurality of fault scenarios, and identifying a vulnerability of the service under test based on the recorded telemetry data.
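
The two-round procedure summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the names `FaultScenario`, `deviation`, `mutate`, and `explore` are hypothetical, the `apply` callback stands in for whatever mechanism injects a fault and records telemetry, the relative-deviation measure is an assumed choice, and the second plurality is produced only by random changes to the anomaly rate of the selected scenario.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple

Telemetry = Dict[str, float]  # service level indicator name -> observed value


@dataclass(frozen=True)
class FaultScenario:
    resource: str   # computing resource the anomaly is applied to
    anomaly: str    # hypothetical label, e.g. "latency"
    rate: float     # anomaly rate in [0, 1]


def deviation(observed: Telemetry, expected: Telemetry) -> float:
    """Largest relative deviation of any indicator (expected values must be non-zero)."""
    return max(abs(observed[k] - expected[k]) / abs(expected[k]) for k in expected)


def mutate(seed: FaultScenario, rng: random.Random, count: int) -> List[FaultScenario]:
    """Generate `count` variants of `seed` by randomly changing its anomaly rate."""
    return [
        replace(seed, rate=min(1.0, max(0.0, seed.rate + rng.uniform(-0.2, 0.2))))
        for _ in range(count)
    ]


def explore(
    first: List[FaultScenario],
    apply: Callable[[FaultScenario], Telemetry],  # assumed: injects the fault, returns telemetry
    expected: Telemetry,
    rng: random.Random,
    count: int = 4,
) -> Tuple[FaultScenario, List[Tuple[FaultScenario, float]]]:
    """Apply the first plurality, select the worst scenario, then apply its variants."""
    round_one = [(s, deviation(apply(s), expected)) for s in first]
    selected, _ = max(round_one, key=lambda pair: pair[1])
    round_two = [(s, deviation(apply(s), expected)) for s in mutate(selected, rng, count)]
    return selected, round_one + round_two
```

The returned list pairs every applied scenario with its deviation, so a caller can inspect which resources and anomalies recur among the most disruptive scenarios when looking for a vulnerability.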

Inventors

BAKER, William Tigard
WARREN, Dallas Allen
DIETRICH, Aaron Edward
GUPTA, Piyush

Assignees

Microsoft Technology Licensing, LLC

Dates

Publication Date: 2026-05-06
Application Date: 2024-06-15

Claims (1)

1. A method comprising: obtaining a configuration (114) of a service under test (112), the configuration (114) of the service includes a plurality of computing resources utilized by the service and a relationship between individual computing resources of the plurality of computing resources; obtaining a first plurality of fault scenarios (132), each fault scenario of the first plurality of fault scenarios (132) including an anomaly (504) that is applied to a corresponding computing resource (506) of the plurality of computing resources; applying each of the first plurality of fault scenarios (132) to the service under test (112); recording telemetry data (310) regarding an operation of the service under test (112) under each of the first plurality of fault scenarios (132); selecting, based on the telemetry data (130), a first fault scenario (402) from the first plurality of fault scenarios (132); generating, based at least in part on the first fault scenario (402), a second plurality of fault scenarios (132); applying each of the second plurality of fault scenarios (132) to the service under test (112); recording telemetry data (310) regarding the operation of the service under test under each of the second plurality of fault scenarios; and identifying a vulnerability (504) of the service under test (112) based on the recorded telemetry data (310).
2. The method of claim 1, wherein the anomaly includes an anomaly rate that is applied to the computing resource and a start time of the anomaly and end time of the anomaly.
3. The method of claim 1, wherein the first plurality of fault scenarios is obtained based at least in part on the plurality of computing resources utilized by the service under test.
4. The method of claim 1, where at least one of the first plurality of fault scenarios is randomly generated.
5. The method of claim 1, wherein the first fault scenario is selected from the first plurality of fault scenarios based on a determination that a service level indicator of the recorded telemetry data regarding the operation of the service under test corresponding to the first fault scenario deviates from an expected value by more than a threshold amount.
6. The method of claim 5, wherein the expected value of the service level indicator is obtained based on an analysis of telemetry data regarding the operation of the service under test under normal operating conditions.
7. The method of claim 1, wherein at least one of the second plurality of fault scenarios is generated by applying random changes to the first fault scenario.
8. The method of claim 1, wherein the vulnerability of the service under test is identified based on a commonality of anomalies of applied fault scenarios that correspond to recorded telemetry data having a service level indicator that deviates from an expected value by more than a threshold amount.
9. The method of claim 1, further comprising calculating a chaos severity score for each of the applied fault scenarios, the chaos severity score corresponding to recorded telemetry data having a service level indicator that deviates from an expected value by more than a threshold amount, and wherein the vulnerability of the service under test is identified based on the chaos severity score.
10. The method of claim 1, wherein at least one of the second plurality of fault scenarios is generated by a machine learning model based on the first fault scenario, the telemetry data regarding the operation of the service under test under the first fault scenario, and the configuration.
11. A method comprising: obtaining a configuration (114) of a service under test (112); recording a first set of telemetry data (310) regarding an operation of the service under test (112) under normal operating conditions; calculating an expected value for each of a plurality of service level indicators (312) of the service under test based on the first set of telemetry data (310); obtaining a first plurality of fault scenarios (132); applying each of the first plurality of fault scenarios (132) to the service under test (112); recording a second set of telemetry data (310) regarding the operation of the service under test (112) under each of the first plurality of fault scenarios (132); calculating a first value for each of the plurality of service level indicators (312) of the service under test (112) under each of the first plurality of fault scenarios (132) based on the second set of telemetry data (310); selecting, based on a difference between the first values and the expected values, a first fault scenario (402) from the first plurality of fault scenarios (132); generating, based at least in part on the first fault scenario (402), a second plurality of fault scenarios (132); applying each of the second plurality of fault scenarios (132) to the service under test (112); recording a third set of telemetry data (310) regarding the operation of the service under test (112) under each of the second plurality of fault scenarios (132); and identifying a vulnerability (504) of the service under test (112) based at least in part on the third set of telemetry data (310), wherein the configuration (114) of the service under test (112) includes a plurality of computing resources (506) utilized by the service under test (112) and a relationship between one or more of the plurality of computing resources (506) and wherein each fault scenario (132) includes an anomaly (504) that is applied to a computing resource of the configuration.
12. The method of claim 11, wherein the anomaly includes an anomaly rate that is applied to the computing resource and a start time of the anomaly and end time of the anomaly.
13. The method of claim 11, further comprising calculating a second value for each of the plurality of service level indicators of the service under test under each of the second plurality of fault scenarios based on the third set of telemetry data.
14. The method of claim 13, wherein the vulnerability of the service under test is identified based on a commonality of anomalies of one or more of the first plurality of fault scenarios and the second plurality of fault scenarios for which one of the first values and the second values deviate from the expected value by more than a threshold amount.
15. The method of claim 11, wherein the first plurality of fault scenarios is obtained based at least in part on the plurality of computing resources utilized by the service under test.
16. The method of claim 11, wherein at least one of the second plurality of fault scenarios is generated by a machine learning model based on the first fault scenario, the telemetry data regarding the operation of the service under test under the first fault scenario, and the configuration.
17. The method of claim 11, further comprising calculating a chaos severity score for each of the applied fault scenarios, the chaos severity score corresponding to recorded telemetry data having service level indicators that deviate from an expected value by more than a threshold amount, and wherein the vulnerability of the service under test is identified based on the chaos severity score.
18. A method comprising: obtaining a configuration (114) of a service under test (112), an expected value for each of a plurality of service level indicators (312) of the service under test (112), and a first plurality of fault scenarios (132); applying each of the first plurality of fault scenarios (132) to the service under test (112); recording a first set of telemetry data (310) regarding an operation of the service under test (112) under each of the first plurality of fault scenarios (132); calculating, based on the first set of telemetry data (310), a first value for each of the plurality of service level indicators (312) of the service under test (112) corresponding to each of the first plurality of fault scenarios (132); selecting, based on a difference between one or more of the first values and the expected values, a first fault scenario (402) from the first plurality of fault scenarios (132); generating, based at least in part on the first fault scenario (402), a second plurality of fault scenarios (132); applying each of the second plurality of fault scenarios (132) to the service under test (112); recording a second set of telemetry data (310) regarding the operation of the service under test (112) under each of the second plurality of fault scenarios (132); and identifying a vulnerability (504) of the service under test (112) based at least in part on the second set of recorded telemetry data (310), wherein the configuration (114) of the service under test (112) includes a plurality of computing resources (506) utilized by the service under test (112) and a relationship between one or more of the plurality of computing resources (506) and wherein each fault scenario (132) includes an anomaly (504) that is applied to a computing resource of the configuration.
19. The method of claim 18, wherein the first plurality of fault scenarios is obtained based at least in part on the plurality of computing resources utilized by the service under test.
20. The method of claim 18, wherein at least one of the second plurality of fault scenarios is generated by a machine learning model based on the first fault scenario, the telemetry data regarding the operation of the service under test under the first fault scenario, and the configuration.
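
The threshold test, chaos severity score, and commonality of anomalies recited in the claims can be illustrated with a short Python sketch. The claims do not give a formula for the score, so the one below (the sum of the relative deviations that exceed the threshold) is purely an assumption, as are the function names and the representation of a scenario as a `(resource, anomaly)` pair.

```python
from collections import Counter
from typing import Dict, List, Tuple

Scenario = Tuple[str, str]    # (computing resource, anomaly); hypothetical representation
Telemetry = Dict[str, float]  # service level indicator name -> value


def relative_deviations(observed: Telemetry, expected: Telemetry) -> Dict[str, float]:
    """Relative deviation of each indicator (expected values must be non-zero)."""
    return {k: abs(observed[k] - expected[k]) / abs(expected[k]) for k in expected}


def severity_score(observed: Telemetry, expected: Telemetry, threshold: float) -> float:
    """Assumed score: sum of the deviations that exceed the threshold, else 0."""
    deviations = relative_deviations(observed, expected).values()
    return sum(d for d in deviations if d > threshold)


def common_anomalies(
    runs: List[Tuple[Scenario, Telemetry]],
    expected: Telemetry,
    threshold: float,
) -> List[Tuple[Scenario, int]]:
    """Count how often each (resource, anomaly) pair appears among flagged runs."""
    flagged = Counter(
        scenario
        for scenario, observed in runs
        if severity_score(observed, expected, threshold) > 0
    )
    return flagged.most_common()
```

A pair that recurs among the flagged runs is a candidate vulnerability under this reading; a different score or grouping key would fit the claim language equally well.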

Description

AUTOMATED FAULT SCENARIO GENERATION FOR CHAOS ENGINEERING

INTRODUCTION

[0001] The subject disclosure relates to service validation, and particularly to automated fault scenario generation for chaos engineering.

[0002] In general, service validation of a computing system refers to the process of ensuring that a particular service or software system meets its intended requirements and functions correctly. Service validation involves testing and validating various aspects of the computing system to ensure its reliability, performance, security, and compliance with desired specifications.

[0003] Service validation typically begins by defining the requirements and expectations for the computing system or service, such as the functionality, performance targets, security measures, scalability, compatibility, and any other relevant criteria. Once the requirements and expectations are defined, a test plan is developed that outlines the test objectives, test cases, test scenarios, and testing methodologies to be employed. Next, the computing system or service's performance is assessed by conducting various tests, such as load testing, stress testing, and scalability testing. These tests help determine how the system performs under different workloads and ensure it can handle expected user traffic. Service validation is an iterative process that may involve multiple testing cycles and continuous improvement based on feedback and findings to ensure that the computing system delivers the intended functionality, reliability, performance, and security to meet the requirements of its users.

[0004] Chaos engineering is a discipline that involves intentionally introducing controlled disruptions or failures into a service or software system to test its resilience and identify potential weaknesses. One goal of chaos engineering is to discover and address vulnerabilities before they occur in real-world scenarios. Currently, chaos engineering systems require users to manually design experiments to simulate various failure scenarios. These experiments are then executed to inject failures or disruptions into the system. During the experiments, the behavior of the system is monitored, and relevant metrics and data are collected and analyzed to assess the system's response to the various failures.

[0005] Performing service validation using chaos engineering traditionally requires the manual configuration of each chaos experiment that will be applied. The manual creation of such experiments is a time-consuming task. In addition, manually crafted chaos experiments are static and will therefore require manual updating to keep the chaos experiments up to date with changes made to the service being tested. Furthermore, since each chaos experiment must be manually crafted, the scope and breadth of the chaos experiments are limited to the type and combination of failures that are foreseeable to the designer of the chaos experiments.

SUMMARY

[0006] Embodiments of the present disclosure are directed to methods for performing automated fault scenario generation for chaos engineering. An example method includes obtaining a configuration of a service under test, the configuration of the service includes a plurality of computing resources utilized by the service and a relationship between individual computing resources of the plurality of computing resources. The method also includes obtaining a first plurality of fault scenarios, each fault scenario of the first plurality of fault scenarios including an anomaly that is applied to a corresponding computing resource of the plurality of computing resources, and applying each of the first plurality of fault scenarios to the service under test. The method also includes recording telemetry data regarding an operation of the service under test under each of the first plurality of fault scenarios, selecting, based on the telemetry data, a first fault scenario from the first plurality of fault scenarios, and generating, based at least in part on the first fault scenario, a second plurality of fault scenarios. The method further includes applying each of the second plurality of fault scenarios to the service under test, recording telemetry data regarding the operation of the service under test under each of the second plurality of fault scenarios, and identifying a vulnerability of the service under test based on the recorded telemetry data. The configuration of the service under test includes a plurality of computing resources utilized by the service under test and a relationship between one or more of the plurality of computing resources, and each fault scenario includes an anomaly that is applied to a computing resource of the configuration.

[0007] Embodiments of the present disclosure are directed to methods for performing automated fault scenario generation for chaos engineering. An example method includes obtaining a configuration of a service under test, recording a first set of telemetry data regarding an