
EP-4736009-A1 - AUTOMATED SOFTWARE TESTING USING CHAOS ENGINEERING


Abstract

Aspects of the disclosure include methods and systems for performing automated software testing using chaos engineering. An exemplary method can include obtaining a plurality of fault scenarios and executing a test script on software under test during application of each of the plurality of fault scenarios, wherein the test script simulates the execution of a function of the software under test. The method also includes recording, for each of the plurality of fault scenarios, telemetry data regarding the execution of the function of the software under test and identifying a vulnerability of the software under test based on the recorded telemetry data.

Inventors

  • BAKER, WILLIAM TIGARD
  • WARREN, Dallas Allen
  • DIETRICH, AARON EDWARD
  • GUPTA, PIYUSH

Assignees

  • Microsoft Technology Licensing, LLC

Dates

Publication Date
20260506
Application Date
20240615

Claims (20)

  1. A method comprising: obtaining a plurality of fault scenarios (132); executing a test script (118) on software under test (112) during application of each of the plurality of fault scenarios (132), wherein the test script simulates the execution of a function (116) of the software under test (112); recording, for each of the plurality of fault scenarios (132), telemetry data (312) regarding the execution of the function (116) of the software under test (112); and identifying a vulnerability (808) of the software under test (112) based on the recorded telemetry data (312), wherein the software under test is operating in a computing environment (110) having a configuration (112) that comprises a plurality of computing resources (506) and wherein each of the fault scenarios includes an anomaly (504) that is applied to one of the plurality of computing resources of the configuration.
  2. The method of claim 1, wherein at least one of the plurality of fault scenarios includes a plurality of anomalies that include the anomaly, and wherein at least one of the plurality of anomalies is applied to a different computing resource of the configuration.
  3. The method of claim 1, wherein the anomaly includes an anomaly rate that is applied to the one of the plurality of computing resources, a start time of the anomaly, and an end time of the anomaly.
  4. The method of claim 1, wherein at least one of the fault scenarios is randomly generated.
  5. The method of claim 1, wherein the vulnerability of the software under test is identified based on one of: a commonality of a group of the fault scenarios that correspond to recorded telemetry data having a service level indicator that deviates from a service level objective by more than a threshold amount; fault scenarios that correspond to recorded telemetry data having the service level indicator that deviates from the service level objective by more than a threshold amount; and fault scenarios that correspond to a crash or error generated by the software under test.
  6. The method of claim 1, wherein the telemetry data includes whether an action associated with the function successfully completed.
  7. The method of claim 1, wherein the test script is created based on one or more of observed user interactions with the software under test and an application programming interface specification of the software under test.
  8. The method of claim 1, wherein the test script corresponds to the function of the software under test and the test script is obtained based on one or more of a user selection of the function and a priority level associated with the function.
  9. A method comprising: obtaining a first plurality of fault scenarios (132); executing a test script (118) on software under test (112) during application of each of the first plurality of fault scenarios (132), wherein the test script (118) simulates the execution of a function (116) of the software under test (112); recording, for each of the first plurality of fault scenarios (132), telemetry data (312) regarding the execution of the function (116) of the software under test (112); selecting, based on the telemetry data (312), a first fault scenario (908) from the first plurality of fault scenarios; generating, based at least in part on the first fault scenario, a second plurality of fault scenarios (132); executing the test script (118) on the software under test (112) during the application of each of the second plurality of fault scenarios; and identifying a vulnerability (916) of the software under test (112) based on the recorded telemetry data (312).
  10. The method of claim 9, wherein the software under test is operating in a computing environment having a configuration that comprises a plurality of computing resources and wherein each of the first plurality of fault scenarios and the second plurality of fault scenarios includes an anomaly that is applied to one of the plurality of computing resources of the configuration.
  11. The method of claim 10, wherein the first plurality of fault scenarios is obtained based at least in part on the configuration.
  12. The method of claim 11, wherein at least one of the first plurality of fault scenarios is randomly generated.
  13. The method of claim 9, further comprising calculating a service level indicator corresponding to each of the first plurality of fault scenarios, the service level indicator indicating a performance of the function.
  14. The method of claim 13, wherein the first fault scenario is selected based on a determination that the service level indicator corresponding to the first fault scenario deviates from an expected value by more than a threshold amount.
  15. The method of claim 9, wherein the test script is automatically created based on one or more of observed user interactions with the software under test and an application programming interface specification of the software under test.
  16. The method of claim 9, wherein the test script corresponds to the function of the software under test and the test script is obtained based on one or more of a user selection of the function and a priority level associated with the function.
  17. A method comprising: obtaining a plurality of fault scenarios (132); executing a test script (118) on software under test (112) during application of each of the plurality of fault scenarios (132), wherein the test script simulates the execution of a function (116) of the software under test (112); recording, for each of the plurality of fault scenarios, telemetry data (312) regarding the execution of the function (116) of the software under test (112); calculating, based on the recorded telemetry data, a service level indicator (314) regarding the execution of the function (116) of the software under test (112) during the application of each of the plurality of fault scenarios (132); determining that the service level indicator (314) deviates from a service level objective by more than a threshold amount; and identifying a vulnerability of the software under test.
  18. The method of claim 17, wherein the software under test is operating in a computing environment having a configuration that comprises a plurality of computing resources and wherein each of the fault scenarios includes an anomaly that is applied to one of the plurality of computing resources of the configuration.
  19. The method of claim 18, wherein the anomaly includes an anomaly rate that is applied to the one of the plurality of computing resources, a start time of the anomaly, and an end time of the anomaly.
  20. The method of claim 17, wherein the vulnerability of the software under test is identified based on one of: a commonality of a group of the fault scenarios that correspond to recorded telemetry data having a service level indicator that deviates from a service level objective by more than a threshold amount; fault scenarios that correspond to recorded telemetry data having the service level indicator that deviates from the service level objective by more than a threshold amount; and fault scenarios that correspond to a crash or error generated by the software under test.

Description

AUTOMATED SOFTWARE TESTING USING CHAOS ENGINEERING

INTRODUCTION

[0001] The subject disclosure relates to software testing, and particularly to automated software testing using chaos engineering.

[0002] Testing software prior to its release is often performed to ensure the quality and reliability of the software. Proper testing helps identify bugs, errors, and usability issues, allowing developers to fix them before the software reaches users. Traditionally, testing of new software was a manual task that required software developers to spend significant resources to ensure proper operation of the software. Attempts to reduce the time and resources required for testing new software products led to the use of test scripts to test software. Automated testing with test scripts significantly improved the efficiency of the software testing process because scripts can execute tests much faster than manual testing, allowing for quicker feedback on the software's quality and reducing the time required for testing. In addition, test scripts ensure that the same set of tests are executed consistently, eliminating human errors and variations in test execution.

[0003] Chaos engineering is a discipline that involves intentionally introducing controlled disruptions or failures into a service or software system to test its resilience and identify potential weaknesses. One goal of chaos engineering is to discover and address vulnerabilities before they occur in real-world scenarios. Currently, chaos engineering systems require users to manually design experiments to simulate various failure scenarios. These experiments are then executed to inject failures or disruptions into the system. During the experiments, the behavior of the system is monitored, and relevant metrics and data are collected and analyzed to evaluate the system's response to the various failures.
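As a toy illustration of the kind of controlled disruption described in [0003], the sketch below wraps a dependency so that a configurable fraction of calls fail, then monitors how many calls succeed during the experiment. All names are illustrative and not from the patent; production chaos tooling typically injects faults at the infrastructure level rather than in process.

```python
import random

def with_fault_injection(func, failure_rate, rng=None):
    """Return a wrapper around func that raises on a configurable fraction
    of calls, simulating an unreliable dependency during a chaos experiment."""
    rng = rng or random.Random()
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")  # the controlled disruption
        return func(*args, **kwargs)
    return wrapper

def run_experiment(dependency, calls=1000, failure_rate=0.3, seed=42):
    """Exercise the flaky dependency and collect simple success/failure
    metrics, standing in for the monitoring step of a chaos experiment."""
    flaky = with_fault_injection(dependency, failure_rate, random.Random(seed))
    ok = failed = 0
    for _ in range(calls):
        try:
            flaky()
            ok += 1
        except ConnectionError:
            failed += 1
    return {"ok": ok, "failed": failed}
```

The collected metrics play the role of the "relevant metrics and data" in [0003]: they show how the caller behaves while roughly 30% of dependency calls fail.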
[0004] Current software testing systems apply test scripts to software that is deployed in a computing environment operating under normal, or ideal, conditions. As a result, bugs, errors, and usability issues of the software that may occur in real-world scenarios due to a disruption or failure in the computing environment may not be discovered before release of the software.

SUMMARY

[0005] Embodiments of the present disclosure are directed to methods for automated software testing using chaos engineering. An example method includes obtaining a plurality of fault scenarios and executing a test script on software under test during application of each of the plurality of fault scenarios, wherein the test script simulates the execution of a function of the software under test. The method also includes recording, for each of the plurality of fault scenarios, telemetry data regarding the execution of the function of the software under test and identifying a vulnerability of the software under test based on the recorded telemetry data.

[0006] Embodiments of the present disclosure are directed to methods for automated software testing using chaos engineering. An example method includes obtaining a first plurality of fault scenarios, executing a test script on software under test during application of each of the first plurality of fault scenarios, wherein the test script simulates the execution of a function of the software under test, and recording, for each of the first plurality of fault scenarios, telemetry data regarding the execution of the function of the software under test. The method also includes selecting, based on the telemetry data, a first fault scenario from the first plurality of fault scenarios and generating, based at least in part on the first fault scenario, a second plurality of fault scenarios.
The method further includes executing the test script on the software under test during the application of each of the second plurality of fault scenarios and identifying a vulnerability of the software under test based on the recorded telemetry data.

[0007] Embodiments of the present disclosure are directed to methods for automated software testing using chaos engineering. An example method includes obtaining a plurality of fault scenarios, executing a test script on software under test during application of each of the plurality of fault scenarios, wherein the test script simulates the execution of a function of the software under test, and recording, for each of the plurality of fault scenarios, telemetry data regarding the execution of the function of the software under test. The method also includes calculating, based on the recorded telemetry data, a service level indicator regarding the execution of the function of the software under test during the application of each of the plurality of fault scenarios, determining that the service level indicator deviates from a service level objective by more than a threshold amount, and identifying a vulnerability of the software under test.

[0008] The above features and advantages, and other features and advantages of the dis
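The selection-and-refinement loop of [0006] and the service-level check of [0007] might be sketched as follows. The patent defines no concrete SLI or scenario-mutation strategy, so every name and the choice of "lowest SLI" as the selection criterion are assumptions for illustration.

```python
def sli_violates(sli, slo, threshold):
    """Per [0007]: the service level indicator deviates from the service
    level objective by more than a threshold amount."""
    return abs(sli - slo) > threshold

def refine(first_scenarios, measure_sli, slo, threshold, mutate, count=5):
    """Per [0006]: select the worst-performing fault scenario from the
    first plurality, then generate a second plurality of fault scenarios
    derived from it (here, by applying a caller-supplied mutate function)."""
    results = [(measure_sli(s), s) for s in first_scenarios]
    worst_sli, worst = min(results, key=lambda r: r[0])
    if not sli_violates(worst_sli, slo, threshold):
        return []  # no scenario stressed the function enough to refine
    return [mutate(worst) for _ in range(count)]  # second plurality
```

Running the test script again under the second plurality (as [0006] describes) would then narrow down which variation of the worst scenario exposes the vulnerability.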