CN-121478539-B - Method and device for repairing storage system faults, electronic equipment, medium and product
Abstract
The application discloses a method, a device, electronic equipment, a medium and a product for repairing storage system faults, and relates to the technical field of storage systems. The storage system is monitored, alarm events of the storage system are captured, and the request identifier in each alarm event is acquired. When an alarm event is captured, the track data corresponding to the request identifier is aggregated, and a fault feature sequence is constructed from the track data. A repair strategy corresponding to the fault feature sequence is then obtained from a pre-constructed fault repair database, so that known faults can be repaired. The process does not require operators to manually analyze call links, infer the fault cause, or intervene manually to complete the repair, thereby improving the efficiency and accuracy of fault repair.
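The flow summarized in the abstract can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the names `TraceRecord`, `RequestTraceStore`, `build_fault_feature_sequence` and `find_repair_strategy`, the record fields, and the exact-match dictionary lookup are all hypothetical. The claims describe a similarity-based match rather than an exact lookup.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRecord:
    """One processing link of one external request (field names are assumed)."""
    request_id: str      # generated when the external request arrives
    link_id: str         # identifies the processing link
    timestamp: float     # when the request entered this link
    service_name: str
    request_state: str   # e.g. "ok", "timeout", "error"


class RequestTraceStore:
    """In-memory stand-in for the request track database, keyed by request ID."""

    def __init__(self):
        self._by_request = defaultdict(list)

    def record(self, rec):
        # Store the record under its request identifier (the primary key).
        self._by_request[rec.request_id].append(rec)

    def aggregate(self, request_id):
        # Collect every record of one request, ordered by time.
        return sorted(self._by_request.get(request_id, []),
                      key=lambda r: r.timestamp)


def build_fault_feature_sequence(records):
    """Time-ordered tuple of (service name, request state) pairs."""
    ordered = sorted(records, key=lambda r: r.timestamp)
    return tuple((r.service_name, r.request_state) for r in ordered)


def find_repair_strategy(alarm, store, repair_db):
    """Look up a repair strategy for the alarm, or None if the fault is unknown.

    `alarm` is assumed to be a dict carrying a "request_id" key, and
    `repair_db` a dict mapping known fault feature sequences to strategies.
    Exact-match lookup is a simplification of the matching in the claims.
    """
    records = store.aggregate(alarm["request_id"])
    sequence = build_fault_feature_sequence(records)
    return repair_db.get(sequence)
```

A `None` result corresponds to a fault with no stored strategy, which would be handed to operations staff for manual repair.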
Inventors
- JIA WENLIANG
- GAO RUISHENG
- XIE PENG
Assignees
- 苏州元脑智能科技有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20260109
Claims (9)
- 1. A method for repairing a storage system fault, comprising: monitoring a storage system and capturing an alarm event of the storage system; acquiring a request identifier in the alarm event, wherein the request identifier is generated for an external request of the storage system; aggregating track data corresponding to the request identifier from a pre-constructed request track database, wherein the pre-constructed request track database stores track data of different external requests; determining, according to the track data corresponding to the request identifier, a time-ordered sequence corresponding to each link identifier, and constructing a fault feature sequence of the alarm event; obtaining a plurality of known fault feature sequences from a pre-constructed fault repair database, and sequentially matching the plurality of known fault feature sequences against the fault feature sequence, wherein the fault feature sequence comprises a request state and a service name; in the case that the sequence length of a known fault feature sequence is the same as the sequence length of the fault feature sequence, obtaining the matching degree of each known fault feature sequence based on the fault feature sequence and the time sequence; in the case that the sequence length of a known fault feature sequence differs from the sequence length of the fault feature sequence, multiplying the matching degree by a sequence length correction coefficient to obtain the final matching degree; determining the repair strategy corresponding to a known fault feature sequence whose matching degree exceeds a preset matching degree threshold as the repair strategy corresponding to the fault feature sequence; and repairing the alarm event according to the repair strategy.
- 2. The method according to claim 1, wherein constructing the fault feature sequence of the alarm event according to the track data corresponding to the request identifier comprises: acquiring a plurality of link identifiers from the track data corresponding to the request identifier, wherein the track data are data of different external requests in a plurality of processing links; acquiring a feature vector corresponding to each link identifier according to the track data corresponding to the request identifier; and obtaining the fault feature sequence of the alarm event according to the feature vector corresponding to each link identifier.
- 3. The method of claim 1, wherein obtaining the matching degree of each known fault feature sequence based on the fault feature sequence and the time sequence comprises: acquiring the service names shared by the plurality of known fault feature sequences and the fault feature sequence; determining the time sequence matching degree of each known fault feature sequence according to whether the time order of each shared service name in that known fault feature sequence is the same as its time order in the fault feature sequence, wherein an identical time order and a different time order correspond to different time sequence matching degrees; determining the request state matching degree of each known fault feature sequence according to whether the request state corresponding to each shared service name in that known fault feature sequence is the same as the request state corresponding to the same service name in the fault feature sequence, wherein identical request states and different request states correspond to different request state matching degrees; determining the service name matching degree of each known fault feature sequence according to whether the service names in that known fault feature sequence are the same as the service names in the fault feature sequence, wherein identical service names and different service names correspond to different service name matching degrees; and obtaining the matching degree of each known fault feature sequence through weighted fusion of the time sequence matching degree, the request state matching degree and the service name matching degree, using a preset time sequence matching degree weight, a preset request state matching degree weight and a preset service name matching degree weight.
- 4. The method according to claim 1, wherein the method further comprises: if the pre-constructed fault repair database does not contain a repair strategy corresponding to the fault feature sequence, sending the fault feature sequence to an operation and maintenance end, so that the operation and maintenance end carries out manual repair according to the fault feature sequence; receiving a manual repair strategy sent by the operation and maintenance end; and storing the fault feature sequence and the manual repair strategy in the pre-constructed fault repair database to update the pre-constructed fault repair database.
- 5. The method according to claim 1, wherein the method further comprises: acquiring a plurality of historical alarm events of the storage system and a historical repair strategy of each historical alarm event; acquiring the request identifier in each historical alarm event; acquiring track data corresponding to the request identifier from the pre-constructed request track database according to the request identifier; constructing a historical fault feature sequence of each historical alarm event according to the track data corresponding to the request identifier; and storing the historical fault feature sequence and the historical repair strategy of each historical alarm event as a mapping pair to obtain the pre-constructed fault repair database.
- 6. The method according to claim 1, wherein the method further comprises: when the storage system receives any external request, generating the request identifier corresponding to the current external request; when the current external request enters any processing link, collecting a plurality of attribute values of the current external request in the current processing link; and, taking the request identifier as the primary key, storing the attribute values of the current external request in each processing link in structured form to obtain the pre-constructed request track database.
- 7. The method of claim 6, wherein the plurality of attribute values includes a link identifier; accordingly, the method further comprises: generating the request identifier through a random generator; and generating the link identifier through the random generator.
- 8. The method of claim 6, wherein the method further comprises: converting the pre-constructed request track database into a request track database of a graph structure, wherein the conversion comprises: acquiring track data corresponding to each request identifier from the pre-constructed request track database; acquiring a plurality of link identifiers from the track data corresponding to the request identifier, wherein the track data are data of different external requests in a plurality of processing links; determining the plurality of link identifiers as a plurality of nodes; determining the plurality of attribute values corresponding to each link identifier as node attributes of each node, wherein the plurality of attribute values comprise a parent link identifier; determining the edge attributes among the nodes according to the parent link identifier; constructing a graph structure for each request identifier according to the plurality of nodes, the node attributes of each node and the edge attributes among the nodes; and obtaining the request track database of the graph structure according to the graph structure of each request identifier.
- 9. An electronic device, comprising: a memory for storing a computer program; and a processor for implementing the steps of the method for repairing a storage system fault according to any of claims 1-8 when executing said computer program.
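The matching described in claims 1 and 3 can be sketched as below. This is a hedged Python illustration, not the claimed implementation. The claims do not give numeric values, so the following are all assumptions: the weights, the 1/0 scoring for same/different, the use of absolute position as the time order, the `min/max` length correction coefficient, and the choice of the best match above the threshold. The names `Step`, `matching_degree` and `select_repair_strategy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One element of a fault feature sequence (fields are assumed)."""
    service_name: str
    request_state: str


# Hypothetical preset weights for the weighted fusion; they sum to 1.
WEIGHTS = {"order": 0.4, "state": 0.4, "name": 0.2}


def matching_degree(known, observed, weights=WEIGHTS):
    """Weighted fusion of time-order, request-state and service-name matches.

    Assumes each service name appears at most once per sequence.
    """
    if not known or not observed:
        return 0.0
    known_pos = {s.service_name: i for i, s in enumerate(known)}
    obs_pos = {s.service_name: i for i, s in enumerate(observed)}
    shared = [name for name in known_pos if name in obs_pos]

    # Service-name match: share of known service names seen in the observation.
    name_score = len(shared) / len(known_pos)
    if shared:
        # Time-order match: a shared service sits at the same position in both.
        order_score = sum(known_pos[n] == obs_pos[n] for n in shared) / len(shared)
        # Request-state match: a shared service reports the same state in both.
        state_score = sum(
            known[known_pos[n]].request_state == observed[obs_pos[n]].request_state
            for n in shared
        ) / len(shared)
    else:
        order_score = state_score = 0.0

    fused = (weights["order"] * order_score
             + weights["state"] * state_score
             + weights["name"] * name_score)

    # Assumed length correction coefficient, applied only when lengths differ.
    if len(known) != len(observed):
        fused *= min(len(known), len(observed)) / max(len(known), len(observed))
    return fused


def select_repair_strategy(repair_db, observed, threshold):
    """Return the strategy of the best match above the threshold, else None.

    `repair_db` is assumed to be a list of (known sequence, strategy) pairs.
    """
    best_score, best_strategy = threshold, None
    for known, strategy in repair_db:
        score = matching_degree(known, observed)
        if score > best_score:
            best_score, best_strategy = score, strategy
    return best_strategy
```

A `None` result from `select_repair_strategy` corresponds to the case in claim 4, where the fault feature sequence is sent to the operation and maintenance end for manual repair.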
Description
Method and device for repairing storage system faults, electronic equipment, medium and product
Technical Field
The present application relates to the field of storage system technologies, and in particular, to a method and apparatus for repairing a storage system fault, an electronic device, a medium, and a product.
Background
In modern enterprise-level storage systems, as business traffic and data volume grow, the storage system is typically made up of multiple distributed services, databases, and middleware components. When an external request addressed to the storage system arrives from outside and needs to be processed, these components cooperate through a complex call chain to complete data storage and processing tasks. When the storage system fails, for example with a service timeout or a database query failure, operation and maintenance personnel need to locate the cause of the fault in massive log and alarm data. However, the storage system may generate a large number of log records while running. When an alarm occurs, operation and maintenance personnel need to manually analyze information such as call links, error codes and timestamps in order to infer possible causes of the fault, and then intervene manually to repair it. This process is time-consuming and labor-intensive, and is prone to analysis errors caused by human oversight, which reduces the efficiency and accuracy of fault repair.
Disclosure of Invention
The application provides a method, a device, electronic equipment, a medium and a product for repairing storage system faults, which at least solve the problems of low efficiency and low accuracy of fault repair in the related art.
The application provides a method for repairing storage system faults, comprising: monitoring the storage system and capturing an alarm event of the storage system; acquiring a request identifier in the alarm event, wherein the request identifier is generated for an external request of the storage system; acquiring track data corresponding to the request identifier from a pre-constructed request track database, wherein the pre-constructed request track database stores track data of different external requests; constructing a fault feature sequence of the alarm event according to the track data corresponding to the request identifier; obtaining a repair strategy corresponding to the fault feature sequence from a pre-constructed fault repair database; and repairing the alarm event according to the repair strategy. The application also provides a device for repairing storage system faults. A capturing module is used for monitoring the storage system and capturing an alarm event of the storage system. A first acquisition module is used for acquiring a request identifier in the alarm event, wherein the request identifier is generated for an external request of the storage system. A second acquisition module is used for acquiring track data corresponding to the request identifier from a pre-constructed request track database, wherein track data of different external requests are stored in the pre-constructed request track database. A construction module is used for constructing a fault feature sequence of the alarm event according to the track data corresponding to the request identifier. A third acquisition module is used for obtaining a repair strategy corresponding to the fault feature sequence from the pre-constructed fault repair database.
A repair module is used for repairing the alarm event according to the repair strategy. The application also provides electronic equipment, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for implementing the steps of any one of the above methods for repairing a storage system fault when executing the computer program. The application also provides a computer-readable storage medium in which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any one of the above methods for repairing a storage system fault. The application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of any one of the above methods for repairing a storage system fault. According to the application, when an external request addressed to the storage system arrives from outside and needs to be processed, the components cooperate through a complex call chain to complete data storage and processing tasks. The storage system is monitored, and the alarm event of the storage system is captured, and because the alarm even