US-12621335-B2 - Systems and methods for testing distributed systems using injected network partitions
Abstract
Disclosed herein are systems and methods for testing distributed systems using injected network partitions. A method may include monitoring communication between a plurality of computing devices in a distributed system to identify each communication link that exists between two respective computing devices in the distributed system. The method may include generating a communications list comprising a plurality of computing device pairs and injecting a network partition in at least one pair of the plurality of computing device pairs. The method may include detecting whether a performance degradation greater than a threshold performance occurs in response to the network partition. In response to detecting the performance degradation greater than the threshold performance, the method may include generating and transmitting a security report indicative of the performance degradation and the at least one pair of the plurality of computing device pairs causing the performance degradation.
Inventors
- Seba Tayser Khaleel
- Sreeharsha Udayashankar
- Samer Al-Kiswany
- Serg Bell
- Stanislav Protasov
Assignees
- ACRONIS INTERNATIONAL GMBH
Dates
- Publication Date
- 20260505
- Application Date
- 20231229
Claims (17)
- 1. A method for testing a distributed system, the method comprising: monitoring communication between a plurality of computing devices in a distributed system using a test operation; identifying, based on the monitoring, each communication link that exists between two respective computing devices in the distributed system; generating a communications list comprising a plurality of computing device pairs, wherein each pair comprises two of the plurality of computing devices directly connected by a respective communication link; injecting a network partition in at least one pair of the plurality of computing device pairs, wherein injecting the network partition comprises utilizing software-defined networking (SDN) rules to cut communication between computing devices of the at least one pair; detecting whether a performance degradation greater than a threshold performance occurs in response to the network partition; in response to detecting the performance degradation greater than the threshold performance, generating a security report indicative of the performance degradation and the at least one pair of the plurality of computing device pairs causing the performance degradation; and transmitting the security report to a device of a user associated with the distributed system.
- 2. The method of claim 1, further comprising: in response to not detecting the performance degradation greater than the threshold performance, reinstating a communication link between the at least one pair of the plurality of computing device pairs; and injecting the network partition in at least one different pair of the plurality of computing device pairs.
- 3. The method of claim 2, further comprising iterating through all pairs in the communications list by injecting the network partition in each pair and combination of pairs while assessing for performance degradation after each injection.
- 4. The method of claim 1, wherein the test operation comprises one or more of: reading or writing data from a storage device of the distributed system, producing and processing messages in a message queuing system of the distributed system, and reading or writing to a map in a distributed data structure of the distributed system.
- 5. The method of claim 1, wherein detecting whether the performance degradation greater than the threshold performance occurs comprises: calculating a respective performance value for each computing device of the plurality of computing devices during operation of a job when no network partition is injected; calculating a system-wide performance value of the distributed system using each calculated respective performance value; calculating another respective performance value for each computing device of the plurality of computing devices during operation of the job when the network partition is injected; calculating another system-wide performance value of the distributed system using each calculated respective performance value for when the network partition is injected; and calculating a difference between the system-wide performance value and the another system-wide performance value.
- 6. The method of claim 5, wherein calculating the respective performance value for each computing device comprises: retrieving one or more logs that include information about each computing device during the job, wherein the information comprises hardware performance information, network performance information, and job performance information; and generating the respective performance value based on the information in the one or more logs.
- 7. The method of claim 1, wherein injecting the network partition comprises changing iptables configuration of the at least one pair to cut communication between computing devices of the at least one pair.
- 8. The method of claim 1, further comprising monitoring, subsequent to injecting the network partition, for attributes indicative of critical issues, wherein the attributes comprise one or more of: a job failure, a device failure, data loss, dropped packets, and freezes.
- 9. A system for testing a distributed system, comprising: at least one memory; and at least one hardware processor coupled with the at least one memory and configured, individually or in combination, to: monitor communication between a plurality of computing devices in a distributed system using a test operation; identify, based on the monitoring, each communication link that exists between two respective computing devices in the distributed system; generate a communications list comprising a plurality of computing device pairs, wherein each pair comprises two of the plurality of computing devices directly connected by a respective communication link; inject a network partition in at least one pair of the plurality of computing device pairs, wherein injecting the network partition comprises utilizing software-defined networking (SDN) rules to cut communication between computing devices of the at least one pair; detect whether a performance degradation greater than a threshold performance occurs in response to the network partition; in response to detecting the performance degradation greater than the threshold performance, generate a security report indicative of the performance degradation and the at least one pair of the plurality of computing device pairs causing the performance degradation; and transmit the security report to a device of a user associated with the distributed system.
- 10. The system of claim 9, wherein the at least one hardware processor is configured to: in response to not detecting the performance degradation greater than the threshold performance, reinstate a communication link between the at least one pair of the plurality of computing device pairs; and inject the network partition in at least one different pair of the plurality of computing device pairs.
- 11. The system of claim 10, wherein the at least one hardware processor is configured to iterate through all pairs in the communications list by injecting the network partition in each pair and combination of pairs while assessing for performance degradation after each injection.
- 12. The system of claim 9, wherein the test operation comprises one or more of: reading or writing data from a storage device of the distributed system, producing and processing messages in a message queuing system of the distributed system, and reading or writing to a map in a distributed data structure of the distributed system.
- 13. The system of claim 9, wherein the at least one hardware processor is configured to detect whether the performance degradation greater than the threshold performance occurs by: calculating a respective performance value for each computing device of the plurality of computing devices during operation of a job when no network partition is injected; calculating a system-wide performance value of the distributed system using each calculated respective performance value; calculating another respective performance value for each computing device of the plurality of computing devices during operation of the job when the network partition is injected; calculating another system-wide performance value of the distributed system using each calculated respective performance value for when the network partition is injected; and calculating a difference between the system-wide performance value and the another system-wide performance value.
- 14. The system of claim 13, wherein the at least one hardware processor is configured to calculate the respective performance value for each computing device by: retrieving one or more logs that include information about each computing device during the job, wherein the information comprises hardware performance information, network performance information, and job performance information; and generating the respective performance value based on the information in the one or more logs.
- 15. The system of claim 9, wherein the at least one hardware processor is configured to inject the network partition by changing iptables configuration of the at least one pair to cut communication between computing devices of the at least one pair.
- 16. The system of claim 9, wherein the at least one hardware processor is configured to monitor, subsequent to injecting the network partition, for attributes indicative of critical issues, wherein the attributes comprise one or more of: a job failure, a device failure, data loss, dropped packets, and freezes.
- 17. A non-transitory computer readable medium storing thereon computer executable instructions for testing a distributed system, including instructions for: monitoring communication between a plurality of computing devices in a distributed system using a test operation; identifying, based on the monitoring, each communication link that exists between two respective computing devices in the distributed system; generating a communications list comprising a plurality of computing device pairs, wherein each pair comprises two of the plurality of computing devices directly connected by a respective communication link; injecting a network partition in at least one pair of the plurality of computing device pairs, wherein injecting the network partition comprises utilizing software-defined networking (SDN) rules to cut communication between computing devices of the at least one pair; detecting whether a performance degradation greater than a threshold performance occurs in response to the network partition; in response to detecting the performance degradation greater than the threshold performance, generating a security report indicative of the performance degradation and the at least one pair of the plurality of computing device pairs causing the performance degradation; and transmitting the security report to a device of a user associated with the distributed system.
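Illustrative note (not part of the claims): the comparison recited in claims 5 and 13 could be sketched in Python roughly as follows. The claims do not say how per-device values are combined into a system-wide value, what units are used, or what the threshold is, so the arithmetic mean, the sample scores, the device names, and the function names below are all assumptions made for illustration only.

```python
from statistics import mean


def system_wide_value(per_device):
    """Combine per-device performance values into one system-wide value.

    Using the arithmetic mean is an assumption; the claims only require
    that the system-wide value be calculated from the per-device values.
    """
    return mean(per_device.values())


def degradation(baseline, partitioned):
    """Difference between the baseline and partitioned system-wide values."""
    return system_wide_value(baseline) - system_wide_value(partitioned)


def exceeds_threshold(baseline, partitioned, threshold):
    """True when the measured degradation is greater than the threshold."""
    return degradation(baseline, partitioned) > threshold


# Hypothetical per-device scores for the same job, higher is better.
baseline = {"node-a": 100.0, "node-b": 90.0, "node-c": 110.0}
partitioned = {"node-a": 60.0, "node-b": 50.0, "node-c": 100.0}

drop = degradation(baseline, partitioned)            # 100.0 - 70.0
flagged = exceeds_threshold(baseline, partitioned, threshold=20.0)
```

In this sketch a higher score means better performance, so a positive difference indicates degradation; a metric where lower is better (for example, latency) would need the subtraction reversed.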
Description
FIELD OF TECHNOLOGY
The present disclosure relates to the field of data security, and, more specifically, to systems and methods for testing distributed systems using injected network partitions.
BACKGROUND
Modern networks are complex and can break. One complex network failure is a network partition, which severs communication between a subset of nodes. Failure reports indicate that this network fault can lead to catastrophic network failures.
Even a partial network partition may be deadly. A partial network partition is when some network nodes or devices lose connectivity with certain parts of the network while still being able to communicate with other nodes within the isolated portion. This can result in an asymmetrical network where some nodes are reachable, while others are not, leading to potential issues in data consistency and communication between different parts of the network.
SUMMARY
In one exemplary aspect, the techniques described herein relate to a method for testing a distributed system, the method including: monitoring communication between a plurality of computing devices in a distributed system using a test operation; identifying, based on the monitoring, each communication link that exists between two respective computing devices in the distributed system; generating a communications list including a plurality of computing device pairs, wherein each pair includes two of the plurality of computing devices directly connected by a respective communication link; injecting a network partition in at least one pair of the plurality of computing device pairs; detecting whether a performance degradation greater than a threshold performance occurs in response to the network partition; in response to detecting the performance degradation greater than the threshold performance, generating a security report indicative of the performance degradation and the at least one pair of the plurality of computing device pairs causing the performance degradation; and
transmitting the security report to a device of a user associated with the distributed system.
In some aspects, the techniques described herein relate to a method, further including: in response to not detecting the performance degradation greater than the threshold performance, reinstating a communication link between the at least one pair of the plurality of computing device pairs; and injecting the network partition in at least one different pair of the plurality of computing device pairs.
In some aspects, the techniques described herein relate to a method, further including iterating through all pairs in the communications list by injecting the network partition in each pair and combination of pairs while assessing for performance degradation after each injection.
In some aspects, the techniques described herein relate to a method, wherein the test operation includes one or more of: reading or writing data from a storage device of the distributed system, producing and processing messages in a message queuing system of the distributed system, and reading or writing to a map in a distributed data structure of the distributed system.
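As a rough illustration of the aspects above (building the communications list, then injecting the partition in each pair and combination of pairs), a minimal Python sketch might look like the following. The function names, the `(sender, receiver)` observation format, the `max_group_size` cap, and the `inject`, `reinstate`, and `degradation_exceeds_threshold` callbacks are hypothetical stand-ins; the disclosure does not prescribe any of them, and a real harness would replace the callbacks with actual partition injection and measurement.

```python
from itertools import combinations


def build_communications_list(observed_messages):
    """Return sorted, unique, unordered device pairs seen communicating.

    observed_messages is an iterable of (sender, receiver) tuples, an
    assumed format for whatever the monitoring step records.
    """
    pairs = {tuple(sorted((src, dst))) for src, dst in observed_messages if src != dst}
    return sorted(pairs)


def partition_candidates(pairs, max_group_size=2):
    """Yield every single pair, then every combination up to max_group_size."""
    for size in range(1, max_group_size + 1):
        yield from combinations(pairs, size)


def run_campaign(pairs, inject, reinstate, degradation_exceeds_threshold,
                 max_group_size=2):
    """Cut each candidate group of links, assess it, and always restore it."""
    findings = []
    for group in partition_candidates(pairs, max_group_size):
        inject(group)
        try:
            if degradation_exceeds_threshold(group):
                findings.append(group)  # would feed the report
        finally:
            reinstate(group)  # restore links before the next injection
    return findings


# Demo with stubs: links are "cut" by adding them to a set, and the
# system is pretended to degrade whenever the b-c link is down.
observed = [("a", "b"), ("b", "a"), ("b", "c"), ("a", "c")]
pairs = build_communications_list(observed)
cut = set()
findings = run_campaign(
    pairs,
    inject=lambda group: cut.update(group),
    reinstate=lambda group: cut.difference_update(group),
    degradation_exceeds_threshold=lambda group: ("b", "c") in cut,
)
```

The sketch restores links after every injection, including those that trigger a finding, so that each candidate is assessed in isolation; that is a design choice of this example. The number of combinations grows quickly with the size of the communications list, which is why the example caps the group size.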
In some aspects, the techniques described herein relate to a method, wherein detecting whether the performance degradation greater than the threshold performance occurs includes: calculating a respective performance value for each computing device of the plurality of computing devices during operation of a job when no network partition is injected; calculating a system-wide performance value of the distributed system using each calculated respective performance value; calculating another respective performance value for each computing device of the plurality of computing devices during operation of the job when the network partition is injected; calculating another system-wide performance value of the distributed system using each calculated respective performance value for when the network partition is injected; and calculating a difference between the system-wide performance value and the another system-wide performance value.
In some aspects, the techniques described herein relate to a method, wherein calculating the respective performance value for each computing device includes: retrieving one or more logs that include information about each computing device during the job, wherein the information includes hardware performance information, network performance information, and job performance information; and generating the respective performance value based on the information in the one or more logs.
In some aspects, the techniques described herein relate to a method, wherein injecting the network partition includes changing iptables configuration of the at least one pair to cut communication between computing devices of the at least one pair.
In some aspects, the techniques described herein relate to a method, wherein injecting the network partition includes utilizing software-defined networking (SDN) rules to c