CN-122001744-A - Intelligent operation and maintenance fault processing method and system in multi-cloud network environment

Abstract

The application provides an intelligent operation and maintenance fault processing method and system in a multi-cloud network environment, and relates to the technical field of intelligent operation and maintenance. The technical scheme determines the topological connection relation between alarm objects based on a pre-constructed operation and maintenance knowledge graph, and aggregates the alarm objects that satisfy a preset time window condition and have the topological connection relation into a target fault event group, thereby compressing and correlating massive volumes of alarm information. A generative artificial intelligence model processes the fault analysis prompt to obtain a fault root cause analysis result, and an emergency disposal script is generated automatically, so that the path from fault diagnosis to repair scheme is automated and operation and maintenance personnel no longer need to perform extensive manual analysis. A cloud platform interface is then called to execute the repair operation in the emergency disposal script, so that fault root causes can be analyzed and handled rapidly and accurately, improving the operation and maintenance effect.

Inventors

  • Request for anonymity
  • Request for anonymity

Assignees

  • 北京联池系统科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-02-25

Claims (10)

  1. An intelligent operation and maintenance fault processing method in a multi-cloud network environment, characterized by comprising the following steps: acquiring multi-source alarm data in a multi-cloud network environment, and converting the multi-source alarm data into standard event data in a standard format; determining, based on a pre-constructed operation and maintenance knowledge graph, the topological connection relation between alarm objects in the standard event data, and aggregating a plurality of alarm objects that satisfy a preset time window condition and have the topological connection relation into a target fault event group; extracting characteristic information of the target fault event group, and retrieving, from an operation and maintenance knowledge base, historical operation and maintenance knowledge data matching the characteristic information; combining event description information of the target fault event group, the corresponding topological connection relation, and the historical operation and maintenance knowledge data into a fault analysis prompt; inputting the fault analysis prompt into a pre-trained generative artificial intelligence model to obtain a fault root cause analysis result, and generating an emergency disposal script according to the fault root cause analysis result; and in response to a user's instruction confirming execution of the emergency disposal script, invoking a cloud platform interface to execute a repair operation in the emergency disposal script.
  2. The method of claim 1, wherein said converting the multi-source alarm data into standard event data in a standard format comprises: parsing the multi-source alarm data, and extracting an alarm object identifier, an event occurrence time, and event description information in a standard format; determining, according to the alarm object identifier, the virtual private cloud and the availability zone to which the alarm object belongs; and taking the virtual private cloud and the availability zone as context labels, and packaging the context labels, the alarm object identifier, the event occurrence time, and the event description information into the standard event data.
  3. The method according to claim 1, wherein said determining the topological connection relation between the alarm objects in the standard event data based on the pre-constructed operation and maintenance knowledge graph comprises: locating, according to the alarm object identifiers in the standard event data, the graph nodes corresponding to the alarm objects in the pre-constructed operation and maintenance knowledge graph; traversing the association paths of the graph nodes in the operation and maintenance knowledge graph, and identifying a physical deployment attribution relation or a logical service invocation dependency relation between the graph nodes; and determining the identified physical deployment attribution relation or the identified logical service invocation dependency relation as the topological connection relation between the alarm objects.
  4. The method according to claim 1, wherein said aggregating the plurality of alarm objects that satisfy the preset time window condition and have the topological connection relation into a target fault event group comprises: screening out, according to the event occurrence time in the standard event data, a plurality of candidate event data falling within the preset time window; querying the graph node positions, in the operation and maintenance knowledge graph, of all alarm objects in the candidate event data, and identifying the alarm objects that are topologically associated with the same infrastructure node; clustering the plurality of candidate event data associated with the same infrastructure node to generate an incident ticket containing a fault title and a priority label; and combining the plurality of candidate event data associated in the incident ticket into the target fault event group.
  5. The method of claim 1, wherein said combining the event description information of the target fault event group, the corresponding topological connection relation, and the historical operation and maintenance knowledge data into a fault analysis prompt comprises: constructing a prompt template comprising an operation and maintenance expert role setting and a fault root cause analysis task instruction; converting the event description information of the target fault event group and the corresponding topological connection relation into structured text, and filling the structured text into a fault scene description area of the prompt template; and filling the historical operation and maintenance knowledge data, as a reference basis, into a knowledge enhancement area of the prompt template to generate the fault analysis prompt.
  6. The method of claim 1, wherein said generating an emergency disposal script according to the fault root cause analysis result comprises: parsing the fault root cause analysis result, and determining a target cloud resource object to be repaired and a corresponding repair operation type; generating, by using the generative artificial intelligence model, automatic execution code that calls a cloud platform interface to perform the repair operation type; and packaging the automatic execution code and the corresponding execution sequence parameters into the emergency disposal script.
  7. The method of claim 1, wherein said inputting the fault analysis prompt into a pre-trained generative artificial intelligence model to obtain a fault root cause analysis result further comprises: extracting a confidence score for the fault root cause analysis result, and determining a credibility level of the fault root cause analysis result; and outputting the credibility level and the confidence score to a confirmation interface of the user.
  8. An intelligent operation and maintenance fault processing system in a multi-cloud network environment, the system comprising: a data acquisition module, used for acquiring multi-source alarm data in a multi-cloud network environment and converting the multi-source alarm data into standard event data in a standard format; a data aggregation module, used for determining the topological connection relation between alarm objects in the standard event data based on a pre-constructed operation and maintenance knowledge graph, and aggregating a plurality of alarm objects that satisfy a preset time window condition and have the topological connection relation into a target fault event group; a historical data retrieval module, used for extracting characteristic information of the target fault event group and retrieving, from an operation and maintenance knowledge base, historical operation and maintenance knowledge data matching the characteristic information; a prompt generation module, used for combining the event description information of the target fault event group, the corresponding topological connection relation, and the historical operation and maintenance knowledge data into a fault analysis prompt; a root cause analysis module, used for inputting the fault analysis prompt into a pre-trained generative artificial intelligence model to obtain a fault root cause analysis result, and generating an emergency disposal script according to the fault root cause analysis result; and an operation module, used for invoking a cloud platform interface to execute the repair operation in the emergency disposal script in response to a user's instruction confirming execution of the emergency disposal script.
  9. A computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method of any one of claims 1 to 7.
  10. An electronic device comprising a processor, a memory and a transceiver, the memory configured to store instructions, the transceiver configured to communicate with other devices, and the processor configured to execute the instructions stored in the memory to cause the electronic device to perform the method of any one of claims 1 to 7.
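
The normalization and aggregation steps recited in claims 2 to 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the field maps, the inventory table, the graph edges, and the rule that two alarm objects are "topologically connected" when they are adjacent or share a neighbouring infrastructure node are all assumptions made for illustration, as is the pairwise time comparison used for the time window condition.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StandardEvent:
    object_id: str      # alarm object identifier
    occurred_at: float  # event occurrence time, in seconds
    description: str    # event description information
    vpc: str            # context label: virtual private cloud
    zone: str           # context label: availability zone


# Hypothetical per-source field names; real alarm schemas will differ.
FIELD_MAPS = {
    "cloud_a": {"object_id": "resourceId", "occurred_at": "ts", "description": "msg"},
    "cloud_b": {"object_id": "instance", "occurred_at": "time", "description": "summary"},
}
# Hypothetical inventory: object id -> (VPC, availability zone).
INVENTORY = {"vm-1": ("vpc-1", "zone-a"), "vm-2": ("vpc-1", "zone-a"), "vm-3": ("vpc-2", "zone-b")}
# Hypothetical knowledge-graph edges (VM deployed on host).
GRAPH_EDGES = [("vm-1", "host-a"), ("vm-2", "host-a"), ("vm-3", "host-b")]


def normalize(raw, source):
    """Convert one source-specific alarm record into a StandardEvent."""
    fields = FIELD_MAPS[source]
    object_id = raw[fields["object_id"]]
    vpc, zone = INVENTORY[object_id]
    return StandardEvent(object_id, float(raw[fields["occurred_at"]]),
                         raw[fields["description"]], vpc, zone)


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) edges."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def related(a, b, adjacency):
    """Assumed rule: same object, direct neighbours, or a shared neighbour."""
    na, nb = adjacency.get(a, set()), adjacency.get(b, set())
    return a == b or b in na or bool(na & nb)


def aggregate(events, adjacency, window_seconds):
    """Union events that are close in time and topologically related."""
    parent = list(range(len(events)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(events)):
        for j in range(i + 1, len(events)):
            close = abs(events[i].occurred_at - events[j].occurred_at) <= window_seconds
            if close and related(events[i].object_id, events[j].object_id, adjacency):
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, event in enumerate(events):
        groups[find(i)].append(event)
    # Only multi-event groups are treated as fault event groups here.
    return [group for group in groups.values() if len(group) > 1]


events = [
    normalize({"resourceId": "vm-1", "ts": 0, "msg": "CPU saturated"}, "cloud_a"),
    normalize({"instance": "vm-2", "time": 30, "summary": "disk latency"}, "cloud_b"),
    normalize({"instance": "vm-3", "time": 40, "summary": "packet loss"}, "cloud_b"),
]
groups = aggregate(events, build_adjacency(GRAPH_EDGES), window_seconds=60)
```

In this toy data set, `vm-1` and `vm-2` share the hypothetical node `host-a` and fall within the 60-second window, so they form one group, while `vm-3` stays ungrouped. The pairwise comparison is quadratic in the number of events; a production system would likely bucket events by time before comparing them.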

Description

Intelligent operation and maintenance fault processing method and system in multi-cloud network environment

Technical Field

The application relates to the technical field of intelligent operation and maintenance, and in particular to an intelligent operation and maintenance fault processing method and system in a multi-cloud network environment.

Background

As enterprise digital transformation deepens, business systems are commonly deployed on a hybrid multi-cloud architecture. To operate and maintain a complex multi-cloud network, an enterprise must simultaneously manage infrastructure resources from different cloud service providers and monitor large numbers of network nodes distributed across multiple data centers and cloud platforms. Complex topological connection relations and business dependency relations exist among the cloud resources in such an environment, and a fault often involves resource anomalies at several levels, which poses a great challenge to operation and maintenance management.

The most common approach in the prior art is to collect alarm information through a monitoring system and have operation and maintenance personnel analyze and troubleshoot it manually, based on experience. In a hybrid multi-cloud, large-scale distributed network environment, however, this approach faces massive and heterogeneous monitoring alarms and event information. Because the data volume is so large, operation and maintenance personnel cannot readily understand the association between alarms in a short time, cannot quickly locate the root cause of a fault, and cannot accurately judge its scope of impact and propagation path. Quick and accurate operation and maintenance response is therefore difficult to achieve: fault processing time is prolonged, repair efficiency is low, and the operation and maintenance effect is poor.
Disclosure of Invention

The application provides an intelligent operation and maintenance fault processing method and system in a multi-cloud network environment, which can rapidly and accurately analyze and handle the root cause of a fault, thereby improving the operation and maintenance effect.

In a first aspect, the present application provides an intelligent operation and maintenance fault processing method in a multi-cloud network environment, the method including: acquiring multi-source alarm data in a multi-cloud network environment, and converting the multi-source alarm data into standard event data in a standard format; determining, based on a pre-constructed operation and maintenance knowledge graph, the topological connection relation between alarm objects in the standard event data, and aggregating a plurality of alarm objects that satisfy a preset time window condition and have the topological connection relation into a target fault event group; extracting characteristic information of the target fault event group, and retrieving, from an operation and maintenance knowledge base, historical operation and maintenance knowledge data matching the characteristic information; combining the event description information of the target fault event group, the corresponding topological connection relation, and the historical operation and maintenance knowledge data into a fault analysis prompt; inputting the fault analysis prompt into a pre-trained generative artificial intelligence model to obtain a fault root cause analysis result, and generating an emergency disposal script according to the fault root cause analysis result; and in response to a user's instruction confirming execution of the emergency disposal script, invoking a cloud platform interface to execute the repair operation in the emergency disposal script.
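
The prompt-assembly step and the confirmation-gated execution step listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method as implemented by the applicant: the template wording, the section labels, the tuple layouts, and the `call_cloud_api` callback are all hypothetical, and no real cloud provider API or model call is shown.

```python
# Hypothetical template: role setting, task instruction, a fault scene
# description area, and a knowledge enhancement area.
PROMPT_TEMPLATE = """\
Role: You are a senior operation and maintenance expert for multi-cloud networks.
Task: Identify the most likely root cause of the fault described below.

[Fault scenario]
{scenario}

[Reference knowledge]
{knowledge}
"""


def build_prompt(events, relations, history):
    """Fill the template with structured event, topology, and history text.

    events:    iterable of (time, object_id, description) tuples
    relations: iterable of (source, relation_kind, target) tuples
    history:   iterable of historical knowledge snippets (strings)
    """
    lines = [f"- t={t} {obj}: {desc}" for t, obj, desc in events]
    lines += [f"- {a} --{kind}--> {b}" for a, kind, b in relations]
    knowledge = "\n".join(f"- {item}" for item in history) or "- (no match found)"
    return PROMPT_TEMPLATE.format(scenario="\n".join(lines), knowledge=knowledge)


def execute_if_confirmed(script_steps, confirmed, call_cloud_api):
    """Run the repair steps in order, but only after explicit user confirmation.

    call_cloud_api is a caller-supplied function standing in for the
    cloud platform interface; it receives one step and returns its result.
    """
    if not confirmed:
        return []
    return [call_cloud_api(step) for step in script_steps]


prompt = build_prompt(
    events=[(0, "vm-1", "CPU saturated"), (30, "vm-2", "disk latency")],
    relations=[("vm-1", "deployed_on", "host-a"), ("vm-2", "deployed_on", "host-a")],
    history=["Earlier host-level storage fault caused similar VM alarms."],
)
```

Keeping the confirmation check in a single function, separate from prompt construction and script generation, makes it harder for a generated repair step to reach the cloud platform interface without the user's approval.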
With this technical scheme, the topological connection relation between the alarm objects is determined based on the pre-constructed operation and maintenance knowledge graph, and the alarm objects that satisfy the preset time window condition and have the topological connection relation are aggregated into the target fault event group, so that massive volumes of alarm information are effectively compressed and correlated. Historical operation and maintenance knowledge data matching the characteristic information is retrieved from the operation and maintenance knowledge base, and the event description information, the topological connection relation, and the historical operation and maintenance knowledge data are combined into a fault analysis prompt, which provides complete context information and experience reference for subsequent analysis. The generative artificial intelligence model is used to process the fault analysis prompt to obtain a fault root cause analysis result, and an emergency disposal script is automatically generated, so that automatic generation from fault diagnosis to repair sche