CN-122019391-A - Automatic testing method and device for long-stable faults of distributed system


Abstract

The invention provides a method and a device for automatically testing long-stable faults of a distributed system. The method comprises: obtaining test environment parameters of the distributed system and executing a cluster health state inspection; executing preparation operations before the fault test based on the test environment inspection result; executing a fault injection operation through an integrated I/O test tool; confirming test points before and after fault injection respectively and, if any confirmation fails, recording error information and terminating the test; analyzing the performance data collected during the test to calculate the performance degradation percentage, the data cutoff duration and the system response delay; and marking the analysis results and the test events synchronously on a two-dimensional line graph to visualize the test process. The invention automates the full flow of long-stable fault testing of a distributed system, improves test efficiency and compatibility, and reduces manual intervention and the risk of omissions.

Inventors

LI JIAYING
DENG CAN
ZHANG PENG

Assignees

济南浪潮数据技术有限公司

Dates

Publication Date: 20260512
Application Date: 20260210

Claims (10)

1. An automatic testing method for long-stable faults of a distributed system, characterized by comprising the following steps: S1, acquiring test environment parameters of the distributed system and executing a cluster health state inspection; if the inspection passes, continuing the test flow, otherwise recording error information and terminating the test; S2, based on the test environment inspection result, performing preparation operations before the fault test, including creating a storage volume, and mounting and formatting a file system; if the preparation succeeds, continuing the test flow, otherwise recording error information and terminating the test; S3, performing a fault injection operation through an integrated I/O test tool, and confirming test points before and after fault injection respectively to verify the system response and the fault recovery state; if any confirmation fails, recording error information and terminating the test; and S4, analyzing the performance data acquired during the test, calculating the performance degradation percentage, the data cutoff duration and the system response delay, and marking the analysis results and the test events synchronously on a two-dimensional line graph to visualize the test process.
2. The method of claim 1, wherein S1 further comprises: S11, acquiring the running state, network connectivity and storage resource availability of the cluster nodes through a first preset function; and S12, if the cluster health status check fails, generating a corresponding error log according to the specific failure item and sending an alarm notification.
3. The method of claim 1, wherein S2 further comprises: S21, automatically selecting a file system format according to the operating system type of the target machine type, wherein the file system format comprises ext4, xfs or btrfs; and S22, automatically creating and mounting the storage volume through a second preset function, and verifying the read-write permission after mounting.
4. The method of claim 1, wherein S3 further comprises: S31, injecting faults using a third preset function, the fault types including, but not limited to, network delay, storage node failure and abnormal process exit; and S32, confirming the system state after fault injection and after fault recovery respectively through a fourth preset function, wherein the system state confirmation comprises checking system logs, service availability and data consistency.
5. The method of claim 1, further comprising: S5, executing an environment cleanup operation after the test, including unmounting the file system, deleting the storage volumes generated by the test and rolling back the injected fault configuration; and S6, packaging the test results to generate a standardized report, wherein the standardized report comprises a test flow log, performance analysis data and visual charts, and supports export to PDF or HTML format.
6. An automated testing device for long-stable faults of a distributed system, comprising: a test environment management module, used for acquiring test environment parameters of the distributed system and executing a cluster health state inspection; if the inspection passes, continuing the test flow, otherwise recording error information and terminating the test; a storage preparation module, used for executing preparation operations before the fault test based on the test environment inspection result, including creating a storage volume, and mounting and formatting a file system; if the preparation succeeds, continuing the test flow, otherwise recording error information and terminating the test; a fault injection and verification module, used for executing a fault injection operation through an integrated I/O test tool, and confirming test points before and after fault injection respectively to verify the system response and the fault recovery state; if any confirmation fails, recording error information and terminating the test; and a performance analysis and visualization module, used for analyzing the performance data acquired during the test, calculating the performance degradation percentage, the data cutoff duration and the system response delay, and marking the analysis results and the test events synchronously on a two-dimensional line graph to visualize the test process.
7. The apparatus of claim 6, wherein the test environment management module is further used for: acquiring the running state, network connectivity and storage resource availability of the cluster nodes through a first preset function; and, if the cluster health state inspection fails, generating a corresponding error log according to the specific failed item and sending an alarm notification.
8. The apparatus of claim 6, wherein the storage preparation module is further to: automatically selecting a file system format according to the type of an operating system of a target machine type, wherein the file system format comprises ext4, xfs or btrfs; and executing automatic creation and mounting of the storage volume through a second preset function, and verifying the read-write permission after mounting.
9. The apparatus of claim 8, wherein the fault injection and verification module is further used for: injecting faults using the third preset function, the fault types including, but not limited to, network delay, storage node failure and abnormal process exit; and confirming the system state after fault injection and after fault recovery respectively through a third preset function, wherein the system state confirmation comprises checking system logs, service availability and data consistency.
10. The apparatus of claim 6, further comprising: an environment cleanup module, used for executing the environment cleanup operation after the test, including unmounting the file system, deleting the storage volumes generated by the test and rolling back the injected fault configuration; and a report generation module, used for packaging the test results to generate a standardized report comprising a test flow log, performance analysis data and visual charts, and supporting export to PDF or HTML format.
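Claims 1 and 6 name three metrics (performance degradation percentage, data cutoff duration and system response delay) without giving formulas. The following Python sketch shows one plausible reading and is not the patented computation. Everything in it is an assumption made for illustration: the sample format (timestamp, throughput), the function names, taking "cutoff" to mean throughput at or below a threshold, and taking "response delay" to mean the time from fault injection until throughput first returns to 90% of the pre-fault baseline.

```python
from statistics import mean
from typing import List, Optional, Tuple

# Hypothetical sample format: (timestamp in seconds, throughput in MB/s).
Sample = Tuple[float, float]


def degradation_percent(samples: List[Sample], inject_t: float, recover_t: float) -> float:
    """Percentage drop of mean throughput during the fault window vs. the pre-fault baseline.

    Raises statistics.StatisticsError if either window contains no samples.
    """
    baseline = mean(v for t, v in samples if t < inject_t)
    faulted = mean(v for t, v in samples if inject_t <= t < recover_t)
    return (baseline - faulted) / baseline * 100.0


def cutoff_duration(samples: List[Sample], threshold: float = 0.0) -> float:
    """Longest stretch (seconds) in which throughput stayed at or below `threshold`."""
    longest = 0.0
    start: Optional[float] = None
    for t, v in samples:
        if v <= threshold:
            if start is None:
                start = t  # outage begins at this sample
        elif start is not None:
            longest = max(longest, t - start)  # outage ended at this sample
            start = None
    if start is not None:  # outage still in progress at the last sample
        longest = max(longest, samples[-1][0] - start)
    return longest


def response_delay(samples: List[Sample], inject_t: float, ratio: float = 0.9) -> Optional[float]:
    """Seconds from injection until throughput is back at `ratio` x baseline after a dip.

    Returns None if throughput never dipped or never recovered within the samples.
    """
    baseline = mean(v for t, v in samples if t < inject_t)
    dipped = False
    for t, v in samples:
        if t < inject_t:
            continue
        if v < ratio * baseline:
            dipped = True
        elif dipped:
            return t - inject_t
    return None


if __name__ == "__main__":
    data = [(0.0, 100.0), (1.0, 100.0), (2.0, 100.0), (3.0, 0.0),
            (4.0, 0.0), (5.0, 60.0), (6.0, 100.0), (7.0, 100.0)]
    print(degradation_percent(data, inject_t=3.0, recover_t=6.0))
    print(cutoff_duration(data))
    print(response_delay(data, inject_t=3.0))
```

Keeping the three metrics as separate pure functions over the same sample list would let the same values feed both the line-graph annotations and the report described in claims 5 and 10.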

Description

Automatic testing method and device for long-stable faults of distributed system

Technical Field

The disclosure relates to the field of automated long-stable fault testing of distributed systems, and in particular to a method and a device for automated long-stable fault testing of a distributed system.

Background

Distributed systems are a core technology in modern cloud computing, big data and virtualization architectures, and are widely used for data storage, processing and fault-tolerance management in enterprise IT. As distributed systems continue to grow in scale and their application scenarios become more complex, the requirements for verifying their robustness and fault tolerance have risen markedly. In the related art, long-stable fault testing generally depends on manual operation or semi-automatic tools and involves multiple stages such as environment inspection, fault injection, system response confirmation and performance analysis. Existing test platforms, however, mostly adopt a discrete modular design, lack a unified flow control and coordination mechanism, and can hardly achieve full-flow automation from cluster health state verification to fault recovery verification. Specifically, the testing process includes key steps such as data preparation, fault simulation, system response monitoring, performance evaluation and result visualization, among which environment pre-diagnosis, fault injection control and cross-platform compatibility are particularly critical.
However, existing long-stable fault testing methods directly adopt a manual or single-point automated flow and do not build a standardized, modular testing framework. As a result, test efficiency is low, the flow is error-prone and test results are hard to reproduce, which hampers robustness verification, multi-machine compatibility testing and improvements in operation and maintenance efficiency. Specifically, conventional test tools lack a real-time mechanism for confirming the system response after fault injection and do not perform integrity verification during the fault recovery stage, resulting in a high test omission rate. In addition, the prior art cannot mark performance data and test events synchronously, so test results are poorly visualized, the behavior of the system in a long-stable fault scenario is hard to see intuitively, and the analysis depth and practical value of the test results are limited.

Disclosure of Invention

The present invention aims to solve, at least to some extent, at least one of the technical problems in the related art. The invention provides an automatic testing method for long-stable faults of a distributed system. Another object of the present invention is to provide an automatic testing device for long-stable faults of a distributed system.
To achieve the above objective, an embodiment of a first aspect of the present invention provides an automated testing method for long-stable faults of a distributed system, including: S1, acquiring test environment parameters of the distributed system and executing a cluster health state inspection; if the inspection passes, continuing the test flow, otherwise recording error information and terminating the test; S2, based on the test environment inspection result, performing preparation operations before the fault test, including creating a storage volume, and mounting and formatting a file system; if the preparation succeeds, continuing the test flow, otherwise recording error information and terminating the test; S3, performing a fault injection operation through an integrated I/O test tool, and confirming test points before and after fault injection respectively to verify the system response and the fault recovery state; if any confirmation fails, recording error information and terminating the test; and S4, analyzing the performance data acquired during the test, calculating the performance degradation percentage, the data cutoff duration and the system response delay, and marking the analysis results and the test events synchronously on a two-dimensional line graph to visualize the test process.
In one embodiment of the present invention, S1 further includes: S11, acquiring the running state, network connectivity and storage resource availability of the cluster nodes through a first preset function; and S12, if the cluster health state inspection fails, generating a corresponding error log according to the specific failed item and sending an alarm notification.
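The gated flow of S1 to S4, in which each step either passes control on or records an error and terminates the test, together with the per-item failure reporting of S11 and S12, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed here: the names `RunRecord`, `StageFailed`, `check_cluster_health` and `run_pipeline` are hypothetical, the health status is modeled as a plain dictionary of booleans, and the alarm notification is stood in for by appending to a list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


class StageFailed(Exception):
    """Raised by a stage to signal that its check or operation did not pass."""


@dataclass
class RunRecord:
    """Hypothetical container for what one test run leaves behind."""
    log: List[str] = field(default_factory=list)     # ordered test flow log
    alarms: List[str] = field(default_factory=list)  # stand-in for alarm notifications
    passed: bool = False


def check_cluster_health(status: Dict[str, bool]) -> None:
    """S1/S11/S12: collect every failed item so the error names the specific cause."""
    items = ("nodes_running", "network_ok", "storage_available")  # assumed item names
    failed = [item for item in items if not status.get(item, False)]
    if failed:
        raise StageFailed("health check failed: " + ", ".join(failed))


def run_pipeline(stages: List[Tuple[str, Callable[[], None]]], run: RunRecord) -> RunRecord:
    """Run stages in order; on the first failure, record the error and stop."""
    for name, stage in stages:
        try:
            stage()
        except StageFailed as exc:
            run.log.append(f"{name}: ERROR {exc}")
            run.alarms.append(f"{name}: {exc}")
            return run  # terminate the test; later stages never run
        run.log.append(f"{name}: ok")
    run.passed = True
    return run


if __name__ == "__main__":
    status = {"nodes_running": True, "network_ok": False, "storage_available": True}
    result = run_pipeline(
        [
            ("S1", lambda: check_cluster_health(status)),
            ("S2", lambda: None),  # placeholder for volume creation and mounting
            ("S3", lambda: None),  # placeholder for fault injection and confirmation
            ("S4", lambda: None),  # placeholder for analysis and plotting
        ],
        RunRecord(),
    )
    print(result.passed, result.log)
```

Representing each step as a named callable keeps the termination rule in one place, so adding further steps would not require repeating the record-and-terminate logic.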
In one embodiment of the present invention, S2 further includes: S21, automatically selecting a file system format according to the operating system type of the target machine type, wherein the file system format comprises ext4, xfs or btrfs; S22, executing automati