CN-122020585-A - Method, device, equipment and storage medium for analyzing long-term stable use case operation faults

Abstract

The application discloses a method for analyzing operation faults of a long-term stability use case, relating to the technical field of fault detection. The method comprises: starting the long-term stability use case, and collecting node operation data of at least one storage node and service operation data corresponding to a distributed storage system; judging the operation state of a service according to a preset cutoff rule and the service operation data; in response to the service being in a cutoff state, determining a fault type according to a preset fault rule and the node operation data of the at least one storage node; obtaining a causal relationship network corresponding to the fault type according to the fault type and the node operation data of the at least one storage node; and generating a test report of the long-term stability use case according to the node operation data of the storage node, the service operation data, the fault type and the causal relationship network. This addresses the technical problem that complex causal links among fault factors are difficult to untangle, and achieves efficient analysis and accurate localization of storage system faults.

Inventors

MA CONG

Assignees

济南浪潮数据技术有限公司

Dates

Publication Date: 20260512
Application Date: 20260116

Claims (10)

1. A method for analyzing an operation fault of a long-term stability use case, the method being suitable for a distributed storage system including at least one storage node, the method comprising: starting a long-term stability use case, and collecting node operation data of the at least one storage node and service operation data corresponding to the distributed storage system; judging the operation state of a service according to a preset cutoff rule and the service operation data; in response to the service being in a cutoff state, determining a fault type according to a preset fault rule and the node operation data of the at least one storage node; obtaining a causal relationship network corresponding to the fault type according to the fault type and the node operation data of the at least one storage node; and generating a test report of the long-term stability use case according to the node operation data of the storage node, the service operation data, the fault type and the causal relationship network.
2. The method for analyzing an operation fault of a long-term stability use case according to claim 1, wherein said judging the operation state of the service according to a preset cutoff rule and the service operation data comprises: recording the times at which the service operation corresponding to the service times out and the transmission quantity of the service data is zero; counting the recorded times and judging whether they are continuous, and, in response to the times being continuous, judging whether the duration is greater than a time threshold; and in response to the duration being greater than the time threshold, determining that the operation state of the service is a cutoff state.
3. The method for analyzing an operation fault of a long-term stability use case according to claim 1, wherein the node operation data of the storage node comprises hardware data, network data, software data and service data, the preset fault rule comprises a hardware fault rule, a network fault rule, a software fault rule and a service fault rule, and said determining a fault type according to a preset fault rule and the node operation data of the at least one storage node in response to the service being in a cutoff state comprises: acquiring hardware data of the at least one storage node, and judging whether the distributed storage system has a hardware fault according to the hardware data and the hardware fault rule; acquiring network data of the at least one storage node, and judging whether the distributed storage system has a network fault according to the network data and the network fault rule; acquiring software data of the at least one storage node, and judging whether the distributed storage system has a storage fault according to the software data and the software fault rule; acquiring service data of the at least one storage node, and judging whether the distributed storage system has a service fault according to the service data and the service fault rule; and determining at least one fault type corresponding to the at least one storage node according to the judging results.
4. The method for analyzing an operation fault of a long-term stability use case according to claim 1, wherein the preset fault rule includes a fault threshold, the method further comprising: acquiring service load data, wherein the service load data comprises a time, a concurrency quantity and a cutoff label; discretizing the time and the concurrency quantity to obtain a discretized time and a discretized concurrency quantity; and taking the fault threshold, the cutoff label, the discretized time and the discretized concurrency quantity as input parameters, and inputting them into a reinforcement learning model trained on historical service load data to obtain an adjusted fault threshold.
5. The method for analyzing an operation fault of a long-term stability use case according to claim 1, wherein said obtaining a causal relationship network corresponding to the fault type according to the fault type and the node operation data of the at least one storage node comprises: acquiring historical node operation data and real-time node operation data of the at least one storage node, aligning the time axes of the historical node operation data and the real-time node operation data, and slicing them according to a preset time length to obtain time slices; extracting features of the time slices, wherein the features include trend features and abrupt-change features; calculating, through a sliding time window, the conditional probability between the fault type and the corresponding historical node operation data according to the features of the historical node operation data; taking the fault types that occurred during historical operation as nodes, taking the causal dependency relations among the fault types as edges, taking the conditional probabilities between the fault types and the corresponding historical node operation data as initial weights of the edges, and constructing an initial causal relationship network; calculating, through a sliding time window, the conditional probability between the fault type and the corresponding real-time node operation data according to the features of the real-time node operation data; and updating the weights of the edges of the initial causal relationship network according to the conditional probability between the fault type and the corresponding real-time node operation data and the conditional probability between the fault type and the corresponding historical node operation data, to obtain the causal relationship network.
6. The method for analyzing an operation fault of a long-term stability use case according to claim 5, further comprising: in response to a plurality of fault types occurring concurrently, calculating marginal contribution degrees of the plurality of fault types through a Bayesian network; and correcting the weights of the edges of the causal relationship network according to the marginal contribution degrees.
7. The method for analyzing an operation fault of a long-term stability use case according to claim 1, wherein said generating a test report of the long-term stability use case according to the node operation data of the storage node, the service operation data, the fault type and the causal relationship network comprises: recording node fault analysis data according to the node operation data of the storage node and the causal relationship network, wherein the node fault analysis data comprises the faulty storage node, the fault occurrence time and the fault influence range; recording service cutoff analysis data according to the service operation data, wherein the service cutoff analysis data comprises the start time, duration and influence range of the cutoff; matching corresponding fault repair suggestions according to the fault type, wherein different fault types correspond to different repair suggestion libraries; and filling the node fault analysis data, the service cutoff analysis data and the fault repair suggestions into a test report template to generate the test report.
8. A device for analyzing an operation fault of a long-term stability use case, the device being suitable for a distributed storage system including at least one storage node, the device comprising: a first processing module, configured to start a long-term stability use case and collect node operation data of the at least one storage node and service operation data corresponding to the distributed storage system; a second processing module, configured to judge the operation state of a service according to a preset cutoff rule and the service operation data; a third processing module, configured to determine, in response to the service being in a cutoff state, a fault type according to a preset fault rule and the node operation data of the at least one storage node; a fourth processing module, configured to obtain a causal relationship network corresponding to the fault type according to the fault type and the node operation data of the at least one storage node; and a fifth processing module, configured to generate a test report of the long-term stability use case according to the node operation data of the storage node, the service operation data, the fault type and the causal relationship network.
9. An electronic device, comprising: a memory for storing a computer program; and a processor for implementing, when executing the computer program, the steps of the method for analyzing an operation fault of a long-term stability use case according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, wherein the computer program, when executed by a processor, implements the steps of the method for analyzing an operation fault of a long-term stability use case according to any one of claims 1 to 7.
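
By way of non-limiting illustration, the cutoff judgment recited in claim 2 could be sketched as follows. The `Sample` record, the `max_gap` parameter (used here to decide whether recorded times are "continuous") and the time unit of seconds are assumptions made for this example; the claims do not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One observation of the service (hypothetical record layout)."""
    timestamp: float        # seconds since the use case started (assumed unit)
    timed_out: bool         # the service operation timed out
    bytes_transferred: int  # service data transmitted in this sampling interval


def is_cut_off(samples: List[Sample], time_threshold: float, max_gap: float) -> bool:
    """Return True if timeout-and-zero-transfer samples form a continuous run
    whose duration exceeds time_threshold.

    Two abnormal samples count as continuous when they are at most max_gap
    seconds apart and no normal sample lies between them (an assumption).
    """
    run_start: Optional[float] = None
    previous: Optional[float] = None
    for s in sorted(samples, key=lambda x: x.timestamp):
        abnormal = s.timed_out and s.bytes_transferred == 0
        if not abnormal:
            # A healthy sample breaks the run.
            run_start = previous = None
            continue
        if run_start is None or s.timestamp - previous > max_gap:
            # Start a new run when there is none or the gap is too large.
            run_start = s.timestamp
        previous = s.timestamp
        if s.timestamp - run_start > time_threshold:
            return True
    return False
```

In this sketch a single healthy sample resets the run, so brief recoveries prevent a cutoff from being declared; a real rule might tolerate short recoveries instead.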

Description

Method, device, equipment and storage medium for analyzing long-term stable use case operation faults

Technical Field

The present invention relates to the field of fault detection technologies, and in particular to a method, an apparatus, a device and a storage medium for analyzing operation faults of a long-term stability use case.

Background

With the rapid development of cloud computing and big data technology, distributed storage systems have become a core architecture of modern data processing and storage owing to their high scalability, high availability and high fault tolerance. In stability tests that evaluate the long-term reliability of a storage system, continuous operation of the storage long-term stability use case is critical. During testing, however, the service of the storage system may be cut off for many reasons, such as hardware failures, software vulnerabilities and network anomalies. At present, fault analysis relies mainly on manual experience or simple monitoring-threshold judgments, and suffers from problems such as untimely fault discovery and low analysis efficiency. Traditional approaches struggle to identify automatically whether the service has really been cut off, and manual troubleshooting across the many fault types of a distributed system is time-consuming, labor-intensive and prone to misjudgment. Meanwhile, service interruption in real scenarios is often caused by several interleaved factors; conventional monitoring means or simple rules can hardly untangle the complex causal links among them, so root causes are difficult to locate accurately. Therefore, in view of these shortcomings of the prior art, the invention provides a method for analyzing operation faults of a long-term stability use case.

Disclosure of Invention

The application provides a method, an electronic device and a storage medium for analyzing operation faults of a long-term stability use case, which at least solve the problems that complex causal links among fault factors are difficult to untangle and that root causes of faults are difficult to locate accurately. The application provides a method for analyzing operation faults of a long-term stability use case, suitable for a distributed storage system comprising at least one storage node. The method comprises: starting the long-term stability use case, and collecting node operation data of the at least one storage node and service operation data corresponding to the distributed storage system; judging the operation state of a service according to a preset cutoff rule and the service operation data; in response to the service being in a cutoff state, determining a fault type according to a preset fault rule and the node operation data of the at least one storage node; obtaining a causal relationship network corresponding to the fault type according to the fault type and the node operation data of the at least one storage node; and generating a test report of the long-term stability use case according to the node operation data of the storage node, the service operation data, the fault type and the causal relationship network.
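
By way of non-limiting illustration, the step of determining a fault type according to a preset fault rule might be sketched as below, using the four rule categories named in claim 3 (hardware, network, software, service). The metric names (`disk_errors`, `packet_loss`, `process_crashes`, `io_error_rate`) and all threshold values are hypothetical placeholders, not values given in the application.

```python
from typing import Callable, Dict, List

# Per-node data: category name -> metric name -> value (assumed layout).
NodeData = Dict[str, Dict[str, float]]

# Hypothetical preset fault rules. Each rule returns True when the data of
# its category indicates a fault; metrics and thresholds are illustrative.
FAULT_RULES: Dict[str, Callable[[Dict[str, float]], bool]] = {
    "hardware": lambda d: d.get("disk_errors", 0) > 0,
    "network": lambda d: d.get("packet_loss", 0.0) > 0.05,
    "software": lambda d: d.get("process_crashes", 0) > 0,
    "service": lambda d: d.get("io_error_rate", 0.0) > 0.01,
}


def determine_fault_types(node_data: NodeData) -> List[str]:
    """Apply every category rule to one node and list the categories that fire."""
    return [
        category
        for category, rule in FAULT_RULES.items()
        if rule(node_data.get(category, {}))
    ]


def classify_nodes(nodes: Dict[str, NodeData]) -> Dict[str, List[str]]:
    """Map each storage node name to the fault types detected on it."""
    return {name: determine_fault_types(data) for name, data in nodes.items()}
```

Keeping the rules in a table keyed by category means a node can report several fault types at once, which matches the "at least one fault type" wording of claim 3.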
The application further provides an electronic device comprising a memory for storing a computer program and a processor which, when executing the computer program, implements the following steps: starting the long-term stability use case, and collecting node operation data of at least one storage node and service operation data corresponding to a distributed storage system; judging the operation state of a service according to a preset cutoff rule and the service operation data; in response to the service being in a cutoff state, determining a fault type according to a preset fault rule and the node operation data of the at least one storage node; obtaining a causal relationship network corresponding to the fault type according to the fault type and the node operation data of the at least one storage node; and generating a test report of the long-term stability use case according to the node operation data of the storage node, the service operation data, the fault type and the causal relationship network.
The application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps: starting the long-term stability use case, and collecting node operation data of at least one storage node and service operation data corresponding to a distributed storage system; judging the operation state of a service according to a preset cutoff rule and the service operation data; in response to the service being in a cutoff state, determining a fault type according to a preset fault rule and the node operation data of the at least one storage node; obtaining a causal relationship network corresponding to the fault type according to the fault type and the node operation data of the at least one storage node; and generating a test report of the long-term stability use case according to the node operation data of the storage node, the service operation data, the fault type and the causal relationship network.
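
By way of non-limiting illustration, the causal relationship network step might be sketched as below, loosely following the edge-weight procedure of claim 5. The sketch simplifies considerably: each time slice is reduced to the set of fault types observed in it, the conditional probability is a plain count-based estimate over a sliding window of following slices, and historical and real-time weights are blended by linear interpolation with a factor `alpha`. The claims do not state the estimator or the update formula, so all three choices are assumptions of this example.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (cause fault type, effect fault type)


def conditional_weights(slices: List[Set[str]], window: int) -> Dict[Edge, float]:
    """Estimate P(effect in the next `window` slices | cause in this slice).

    `slices` is a time-ordered list; each element is the set of fault types
    observed in one time slice (a simplification of the extracted features).
    """
    cause_count: Dict[str, int] = defaultdict(int)
    pair_count: Dict[Edge, int] = defaultdict(int)
    for i, current in enumerate(slices):
        # Fault types seen in the sliding window that follows slice i.
        following: Set[str] = set().union(*slices[i + 1 : i + 1 + window])
        for cause in current:
            cause_count[cause] += 1
            for effect in following - {cause}:
                pair_count[(cause, effect)] += 1
    return {edge: n / cause_count[edge[0]] for edge, n in pair_count.items()}


def update_weights(
    historical: Dict[Edge, float], realtime: Dict[Edge, float], alpha: float = 0.5
) -> Dict[Edge, float]:
    """Blend historical and real-time edge weights (assumed linear update)."""
    edges = set(historical) | set(realtime)
    return {
        edge: (1 - alpha) * historical.get(edge, 0.0) + alpha * realtime.get(edge, 0.0)
        for edge in edges
    }
```

A usage example under the same assumptions: `update_weights(conditional_weights(history, 1), conditional_weights(live, 1))` yields the edge weights of the updated network, where `history` and `live` are lists of per-slice fault-type sets.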
The application also provides a computer program product, which comprises a computer p