US-12625795-B1 - Fault injection load testing

Abstract

Disclosed are various embodiments for performing fault injection tests on services, systems, or applications while they are under load. First, a load level for an application is identified. Then, load traffic is sent to the application, wherein the load traffic causes the application to operate at the load level. Then, a fault injection test is performed on the application while the application is at the load level.

Inventors

Adrian John Hornsby
Serafin Antonio Sedano Arenas

Assignees

AMAZON TECHNOLOGIES, INC.

Dates

Publication Date: 20260512
Application Date: 20230327

Claims (20)

1. A system comprising: a first set of computing devices including at least one first computing device, wherein the at least one first computing device of the first set of computing devices comprises: at least one non-transitory machine-readable medium including machine-readable instructions; and at least one processor in communication with the at least one non-transitory machine-readable medium, wherein the at least one processor executes the machine-readable instructions to at least: send load traffic to an application executed by a second set of computing devices; receive, from the application, a load signal generated in response to the load traffic; determine, while the load traffic is sent to the application and based on the load signal, that the application is operating at a load level, wherein the load level represents a threshold amount of traffic causing the application to operate at capacity; while the application is operating at the load level, perform a fault injection test on the application while the application is operating at the load level; determine, during performance of the fault injection test, that a metastable fault has occurred; generate a corrective action to address the metastable fault; and cause output of a report identifying the metastable fault.
2. The system of claim 1, wherein to determine, during performance of the fault injection test, that a metastable fault has occurred, the at least one processor executes further machine-readable instructions to at least: analyze one or more logs with a machine learning model, and determine that an application error is a result of the fault injection test based on an output of the machine learning model, and wherein the application error is unexpected when the application is under load.
3. The system of claim 1, wherein to identify the load level for the application, the at least one processor executes further machine-readable instructions to at least: analyze, with a machine learning model, one or more logs associated with the application, and identify, based on an output of the machine learning model, one or more load signals indicating when the application is under load.
4. The system of claim 1, wherein the at least one processor executes further machine-readable instructions to at least restore the application to a pre-test state in response to occurrence of the metastable fault.
5. A method comprising: sending load traffic to an application; receiving, from the application, a load signal generated in response to the load traffic; determining, while the load traffic is sent to the application and based on the load signal, that the application is operating at a load level, wherein the load level represents a threshold amount of traffic causing the application to operate at capacity; while the application is operating at the load level, performing a fault injection test on the application while the application is operating at the load level; detecting, during performance of the fault injection test, an application error for the application while the application is operating at the load level; and generating a corrective action to address the application error.
6. The method of claim 5, further comprising incrementally reducing the load traffic sent to the application in response to detecting the application error during the fault injection test.
7. The method of claim 5, wherein the application error comprises a metastable fault.
8. The method of claim 5, further comprising reporting the application error.
9. The method of claim 5, wherein identifying the load level for the application further comprises analyzing an application log of the application to look for a load signal indicating that the application is operating at the load level.
10. The method of claim 5, wherein the application is a test version of the application and sending load traffic to the application further comprises replaying network traffic sent to a production version of the application.
11. The method of claim 5, wherein sending load traffic to the application further comprises: creating synthetic network traffic; and sending the synthetic network traffic to the application.
12. The method of claim 5, wherein the application is a test version of the application and sending load traffic to the application further comprises: duplicating live network traffic sent to a production version of the application to generate duplicate network traffic; and sending the duplicate network traffic to the test version of the application.
13. A system comprising: a first set of computing devices including at least one first computing device, wherein the at least one first computing device of the first set of computing devices comprises: at least one non-transitory machine-readable medium including machine-readable instructions, and at least one processor in communication with the at least one non-transitory machine-readable medium, wherein the at least one processor executes the machine-readable instructions to at least: create a shadow copy of an application executed by a second set of computing devices; send load traffic to the application; receive, from the application, a load signal generated in response to the load traffic; determine, while the load traffic is sent to the application and based on the load signal, that the application is operating at a load level, wherein the load level represents a threshold amount of traffic causing the application to operate at capacity; while the application is operating at the load level, perform a fault injection test on the application while the application is operating at the load level; determine, during performance of the fault injection test, that an application error that is associated with the fault injection test occurred; generate a corrective action to address the application error; and restore the application to a pre-test state based at least in part on the shadow copy of the application.
14. The system of claim 13, wherein to determine, during the performance of the fault injection test, that the application error occurred, the at least one processor executes further machine-executable instructions to at least: analyze an application log during the fault injection test; and identify the application error in the application log.
15. The system of claim 13, wherein to determine, during the performance of the fault injection test, that the application error occurred, the at least one processor executes further machine readable instructions to at least determine that an alarm associated with the application has triggered during performance of the fault injection test.
16. The system of claim 13, wherein the at least one processor executes further machine-readable instructions to at least report the application error.
17. The system of claim 13, wherein the at least one processor executes further machine-readable instructions to at least identify the load level for the application.
18. The system of claim 13, wherein to send load traffic to the application, the at least one processor executes further machine-readable instructions to at least: create synthetic network traffic; and send the synthetic network traffic to the application.
19. The system of claim 13, wherein the load traffic comprises a random value for a function call of an application programming interface (API) provided by the application.
20. The system of claim 13, wherein to determine that the application error that is associated with the fault injection test occurred, the at least one processor executes further machine-readable instructions to at least: analyze one or more logs with a machine learning model; and determine from the analyzed one or more logs that the application error is unexpected when the application is under load.
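
By way of illustration only, the steps recited in claim 5, together with the incremental load reduction of claim 6, could be sketched as follows. This is a minimal sketch and not the disclosed implementation. Every name and value in it is an assumption introduced for the example: `SimulatedApp`, `run_fault_injection_load_test`, the utilization-based load signal, the 0.9 load threshold, and the modeling of an injected fault as a halving of capacity do not appear in the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedApp:
    """Hypothetical stand-in for the application under test."""
    capacity_rps: int = 100      # assumed requests/sec the app can serve
    fault_active: bool = False   # set True while a fault is injected
    log: list = field(default_factory=list)

    def handle(self, rps: int) -> float:
        """Accept load traffic and return a load signal (utilization)."""
        # Assumption: an injected fault halves the effective capacity.
        effective = self.capacity_rps // 2 if self.fault_active else self.capacity_rps
        utilization = rps / effective
        if utilization > 1.0:
            self.log.append(f"ERROR overloaded at {rps} rps")
        return utilization


def run_fault_injection_load_test(app, start_rps=10, step=10, load_threshold=0.9):
    """Ramp load to the load level, inject a fault, then back off on error."""
    rps = start_rps
    # Send load traffic and read the load signal until the load level is reached.
    while app.handle(rps) < load_threshold:
        rps += step
    load_level_rps = rps

    # Perform the fault injection test while the app is at the load level.
    app.fault_active = True
    app.handle(rps)
    error_detected = any(line.startswith("ERROR") for line in app.log)

    # Incrementally reduce the load traffic in response to the detected error.
    while error_detected and rps > 0 and app.handle(rps) > 1.0:
        rps -= step

    app.fault_active = False  # remove the injected fault once the test ends
    corrective_action = (
        f"limit traffic to {rps} rps or add capacity" if error_detected else None
    )
    return {
        "load_level_rps": load_level_rps,
        "error_detected": error_detected,
        "recovered_at_rps": rps,
        "corrective_action": corrective_action,
    }


if __name__ == "__main__":
    print(run_fault_injection_load_test(SimulatedApp()))
```

With the default values, the simulated application reaches the load level at 90 requests per second, reports an error once the fault is injected, and stops reporting errors after the load traffic is stepped down to 50 requests per second. A real deployment would replace the simulated application with calls to the service under test and would derive the load signal from logs, metrics, or alarms.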

Description

BACKGROUND

Chaos engineering is a term used to describe approaches to testing the resiliency of computer systems in the face of unexpected external conditions. Chaos engineering may include intentionally introducing unexpected or unplanned faults into a system to determine how the system will react in response to the fault. The results of such experiments can then be evaluated to determine whether the system can provide an adequate quality of service, or any service at all, when faced with unexpected or unplanned faults. For example, chaos engineering principles can be used to verify that a redundant system architecture provides an acceptable level of service in response to a failure of one or more components. As another example, chaos engineering principles can be used to identify the tipping point(s) at which a system would fail to provide adequate service in response to one or more failures or faults in the system.

BRIEF DESCRIPTION OF THE DRAWINGS

Many aspects of the present disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, with emphasis instead being placed upon clearly illustrating the principles of the disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.

FIG. 1 is a schematic block diagram of a cloud provider network according to various embodiments of the present disclosure.

FIG. 2 is a schematic block diagram of an application deployed within the cloud provider network of FIG. 1 according to various embodiments of the present disclosure.

FIG. 3 is a flowchart illustrating one example of functionality implemented within the cloud provider network of FIG. 1 according to various embodiments of the present disclosure.

FIG. 4 is a flowchart illustrating one example of functionality implemented within the cloud provider network of FIG. 1 according to various embodiments of the present disclosure.
DETAILED DESCRIPTION

Although chaos engineering principles are often used to identify the impact of faults or failures within a system, some faults or failures only present themselves within the system, or only have a noticeable or perceptible impact on the system, when the system is placed under load. Some of these faults or failures may be metastable faults or failures, wherein the fault or failure continues to exist within the system after the trigger for the metastable fault or failure is removed, such as when the system is no longer under load. Other faults or failures may be non-metastable, where the fault or failure resolves itself or otherwise disappears after the trigger for the non-metastable fault or failure is removed, such as when the system is no longer under load.

Metastable faults or failures often result from an unforeseen or unintended feedback loop that is usually associated with the exhaustion of one or more resources. For example, failed requests are often retried in order to resolve transient issues in a system (e.g., temporary network congestion or system load). However, retrying failed requests can increase the load on a system, because failed requests are retried while new or incoming requests are simultaneously processed, resulting in a new steady state of requests that overloads the system. Once overloaded, the system will remain overloaded as failed retry requests result in additional retry requests that will eventually fail due to the load on the system. The system could remain in this metastable fault or failure state indefinitely.

As another example, failover systems can cause metastable faults or failures. When a first instance of a system resource (e.g., a database, block storage service, object storage service, web service, application service, etc.) fails and becomes unavailable, a failover system can reroute requests intended for the first instance of the system resource to a second instance of the system resource.
This can cause cascading failures, because the requests or network traffic that caused the first instance of the system resource to fail are rerouted to the second instance of the system resource. Failures can continue to cascade as previously failed system resources are restored and the failover system reroutes the requests back to the restored system resources from the now unavailable secondary or redundant system resources. Although these examples illustrate the concept of a metastable fault or failure, other types of scenarios can result in a metastable fault or failure of a system, service, or application.

Non-metastable faults or failures often result from unforeseen behaviors when a system is operating at load. An illustrative example of a non-metastable fault or failure would be database connection issues that are masked by a cache. Due to caching, connection issues with a database may not be apparent when only a few requests are made to a system, service, application, etc. because responses could be prepared using res