CN-121984837-A - Alarm root cause analysis system and method

Abstract

The invention discloses an alarm root cause analysis system and method in the technical field of cloud computing alarm analysis. In the method, a process orchestration platform responds to an alarm event by matching a data docking template according to the alarm type. A data docking module collects raw operation and maintenance data from the designated operation and maintenance data source and standardizes it to obtain standard operation and maintenance data. The process orchestration platform generates a root cause reasoning prompt from the standard operation and maintenance data and inputs it into a large language model, which performs root cause reasoning and generates candidate root causes. For each candidate root cause, the process orchestration platform generates a verification query instruction and sends it to the data docking module, which queries the raw operation and maintenance data based on the instruction to obtain verification evidence and returns that evidence. The process orchestration platform then screens out the candidate root causes whose verification evidence meets a preset confidence condition, yielding the root cause analysis result. The invention builds a closed loop from model reasoning to data verification and addresses the lack of a verification mechanism in conventional root cause analysis of cloud computing alarms.
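
As an illustration only, the reason-then-verify loop described in the abstract can be sketched in Python. Every name below (`analyze_alarm`, the callables `collect`, `normalize`, `infer`, `make_query` and `verify`, the keyword-based confidence score, and the sample data) is a hypothetical stand-in and is not taken from the patent; the large language model call is replaced by a stub.

```python
def analyze_alarm(alarm, templates, collect, normalize, infer,
                  make_query, verify, min_confidence):
    """Run one pass of the loop: collect, reason, verify, filter."""
    template = templates[alarm["type"]]        # match a template by alarm type
    raw = collect(template, alarm)             # raw data from the template's source
    standard = normalize(raw)                  # standardized data for the prompt
    prompt = (f"Alarm type: {alarm['type']}\n"
              f"Data: {standard}\n"
              "List likely root causes.")
    candidates = infer(prompt)                 # stand-in for the LLM call
    confirmed = []
    for description in candidates:
        query = make_query(description)        # verification query per candidate
        confidence = verify(query, raw)        # evidence comes from the raw data
        if confidence >= min_confidence:       # assumed form of the confidence condition
            confirmed.append((description, confidence))
    return confirmed


def keyword_confidence(query, raw):
    """Toy evidence score: share of raw records that contain the query text."""
    if not raw:
        return 0.0
    return sum(1 for record in raw if query in record) / len(raw)


# Hypothetical sample data and stubs.
raw_logs = ["oom-killer invoked", "oom-killer invoked",
            "disk ok", "oom-killer invoked"]
queries = {"memory exhaustion": "oom-killer", "disk failure": "disk error"}

result = analyze_alarm(
    alarm={"type": "cpu_high"},
    templates={"cpu_high": {"source": "logs"}},
    collect=lambda template, alarm: raw_logs,
    normalize=lambda raw: sorted(set(raw)),
    infer=lambda prompt: ["memory exhaustion", "disk failure"],
    make_query=lambda description: queries[description],
    verify=keyword_confidence,
    min_confidence=0.5,
)
print(result)
```

In this sketch the candidate "disk failure" is dropped because no raw record supports it, which is the point of verifying model output against the raw data rather than the standardized summary the model saw.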

Inventors

  • ZHU XUEFENG
  • CHENG XIAOZHONG
  • XUE LEI

Assignees

  • 上海安畅网络科技股份有限公司

Dates

Publication Date
20260505
Application Date
20260316

Claims (10)

  1. An alarm root cause analysis system, characterized by comprising a process orchestration platform and a data docking module; the process orchestration platform responds to an alarm event of a cloud platform and matches a corresponding data docking template in a preset template library according to the alarm type of the alarm event, wherein the template library stores data docking templates corresponding to different alarm types; the data docking module collects raw operation and maintenance data associated with the alarm event from the operation and maintenance data source specified by the data docking template, and standardizes the raw operation and maintenance data to obtain standard operation and maintenance data; the process orchestration platform generates a root cause reasoning prompt for the alarm event according to the standard operation and maintenance data, and inputs the root cause reasoning prompt into a large language model for root cause reasoning so as to generate at least one candidate root cause of the alarm event; for each candidate root cause, the process orchestration platform generates a verification query instruction according to the description of the candidate root cause and sends the verification query instruction to the data docking module; the data docking module queries the raw operation and maintenance data based on the verification query instruction, obtains verification evidence for the candidate root cause, and returns the verification evidence to the process orchestration platform; and the process orchestration platform screens out, from the at least one candidate root cause, the candidate root causes whose verification evidence meets a preset confidence condition, and obtains the root cause analysis result of the alarm event.
  2. The system according to claim 1, wherein the process orchestration platform is specifically configured to: generate an initial reasoning prompt for the alarm event according to the standard operation and maintenance data; based on a preset prompt optimization rule, supplement the initial reasoning prompt with the operation and maintenance data required by the alarm type of the alarm event, thereby generating an optimized reasoning prompt for the alarm event; and input the optimized reasoning prompt into a large language model for root cause reasoning so as to generate at least one candidate root cause of the alarm event.
  3. The system according to claim 1, further comprising a template management module for managing the template library, wherein the template library further stores alarm analysis templates corresponding to different fault scenarios, and each alarm analysis template encapsulates a corresponding analysis flow; the template management module is specifically configured to: receive an analysis flow that a user configures for a specified fault scenario by dragging and dropping components, and encapsulate the completed analysis flow in the corresponding alarm analysis template; configure adjustable runtime parameters for each alarm analysis template, wherein the runtime parameters comprise at least one of a data collection time range, a model reasoning timeout, and a confidence condition; and classify each alarm analysis template according to resource type and fault level, and store the alarm analysis templates in the template library.
  4. The system according to claim 3, wherein the process orchestration platform is specifically configured to: respond to an alarm event of the cloud platform and extract attribute information of the alarm event, wherein the attribute information comprises at least a resource type and a fault level; match a corresponding alarm analysis template in the template library based on the resource type and the fault level; invoke the matched alarm analysis template, provide the user with a visual interface for the adjustable runtime parameters, and receive the user's adjustments to the runtime parameters; and execute the analysis flow encapsulated in the alarm analysis template based on the adjusted runtime parameters so as to analyze the root cause of the alarm event.
  5. The system according to claim 1 or 2, further comprising a template management module, wherein the template management module is specifically configured to: create, based on a standardized data access protocol, data docking templates respectively corresponding to a plurality of operation and maintenance data sources, wherein each data docking template comprises a standard query statement and query parameters for the corresponding operation and maintenance data source; and store each data docking template in the template library, and configure a corresponding version number and access permission parameters for each data docking template.
  6. The system according to claim 1, wherein the data docking module is specifically configured to: perform an integrity check on the collected raw operation and maintenance data, the integrity check comprising verifying whether the timestamps of the raw operation and maintenance data cover a preset complete period before and after the occurrence of the alarm event, and verifying whether the raw operation and maintenance data contains a predefined key field; when the raw operation and maintenance data fails the integrity check, collect new raw operation and maintenance data from the operation and maintenance data source again; and when the number of re-collections reaches a preset number and the new raw operation and maintenance data still fails the integrity check, generate a data exception notification and return it to the visual interface of the process orchestration platform.
  7. The system according to claim 1, wherein the process orchestration platform is further configured to: associate, on a visual interface, the root cause analysis result of the alarm event with the verification evidence supporting each root cause; record key information from the alarm analysis process of the alarm event, wherein the key information comprises at least the alarm event, the data docking template used, the root cause reasoning prompt input into the large language model, the list of candidate root causes, the verification evidence and verification result for each candidate root cause, and the root cause analysis result of the alarm event; and generate a corresponding root cause analysis report based on the key information.
  8. An alarm root cause analysis method, applied to the system according to any one of claims 1 to 7, comprising: responding to an alarm event of a cloud platform, and matching a corresponding data docking template in a preset template library according to the alarm type of the alarm event, wherein the template library stores data docking templates corresponding to different alarm types; collecting raw operation and maintenance data associated with the alarm event from the operation and maintenance data source specified by the data docking template, and standardizing the raw operation and maintenance data to obtain standard operation and maintenance data; generating a root cause reasoning prompt for the alarm event according to the standard operation and maintenance data, and inputting the root cause reasoning prompt into a large language model for root cause reasoning so as to generate at least one candidate root cause of the alarm event; for each candidate root cause, generating a verification query instruction according to the description of the candidate root cause; querying the raw operation and maintenance data based on the verification query instruction to obtain verification evidence for the candidate root cause; and screening out, from the at least one candidate root cause, the candidate root causes whose verification evidence meets a preset confidence condition, and obtaining the root cause analysis result of the alarm event.
  9. The method according to claim 8, wherein generating a root cause reasoning prompt for the alarm event according to the standard operation and maintenance data and inputting the root cause reasoning prompt into a large language model for root cause reasoning so as to generate at least one candidate root cause of the alarm event comprises: generating an initial reasoning prompt for the alarm event according to the standard operation and maintenance data; based on a preset prompt optimization rule, supplementing the initial reasoning prompt with the operation and maintenance data required by the alarm type of the alarm event, thereby generating an optimized reasoning prompt for the alarm event; and inputting the optimized reasoning prompt into a large language model for root cause reasoning so as to generate at least one candidate root cause of the alarm event.
  10. The method according to claim 8, wherein the template library further stores alarm analysis templates corresponding to different fault scenarios, each alarm analysis template encapsulating a corresponding analysis flow, the method further comprising: receiving an analysis flow that a user configures for a specified fault scenario by dragging and dropping components, and encapsulating the completed analysis flow in the corresponding alarm analysis template; configuring adjustable runtime parameters for each alarm analysis template, wherein the runtime parameters comprise at least one of a data collection time range, a model reasoning timeout, and a confidence condition; and classifying each alarm analysis template according to resource type and fault level, and storing the alarm analysis templates in the template library.
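
The integrity check and bounded re-collection recited in claim 6 could look roughly like the following Python sketch. The function names, the record layout (dictionaries with a `timestamp` key), the symmetric time window, and the notification text are all assumptions made for illustration; the claims do not specify any of them.

```python
from datetime import datetime, timedelta


def passes_integrity_check(records, alarm_time, window, required_fields):
    """True if the records span the window around the alarm and carry all key fields."""
    if not records:
        return False
    # Every record must carry a timestamp plus the predefined key fields.
    for record in records:
        for field in ("timestamp", *required_fields):
            if field not in record:
                return False
    timestamps = [record["timestamp"] for record in records]
    # Assumed reading of "complete period": data exists on both sides of the alarm.
    return (min(timestamps) <= alarm_time - window
            and max(timestamps) >= alarm_time + window)


def collect_with_retries(fetch, alarm_time, window, required_fields, max_retries=3):
    """Collect once, then re-collect up to max_retries times.

    Returns (records, None) on success or (None, notification) on failure.
    """
    records = fetch()
    retries = 0
    while not passes_integrity_check(records, alarm_time, window, required_fields):
        if retries >= max_retries:
            return None, f"data exception: integrity check failed after {retries} re-collections"
        records = fetch()
        retries += 1
    return records, None


# Hypothetical usage: the first fetch is truncated, the second is complete.
alarm_time = datetime(2026, 1, 1, 12, 0)
window = timedelta(minutes=5)
complete = [
    {"timestamp": alarm_time - timedelta(minutes=6), "host": "vm-01", "cpu": 35},
    {"timestamp": alarm_time + timedelta(minutes=6), "host": "vm-01", "cpu": 97},
]
attempts = iter([complete[:1], complete])
records, notification = collect_with_retries(
    lambda: next(attempts), alarm_time, window, ["host", "cpu"]
)
```

Returning the notification as a value rather than raising keeps the sketch close to the claim wording, where the data exception is reported to the visual interface instead of aborting the flow.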

Description

Alarm root cause analysis system and method

Technical Field

The invention belongs to the technical field of cloud computing alarm analysis, and particularly relates to an alarm root cause analysis system and method.

Background

With the wide adoption of cloud-native architecture and the steady growth in service complexity, alarm data in cloud environments is growing explosively and its associated links are becoming complex. To ensure service continuity, the operation and maintenance team needs to locate the root cause of a fault quickly and take repair measures within a short time. At present, there are three common approaches to alarm root cause analysis. In the first, operation and maintenance personnel log in to the various monitoring and log platforms and investigate manually, which is inefficient and limited by personal experience. In the second, reasoning is performed by a preset rule engine, which has a high rule maintenance cost and adapts poorly to a dynamic, changing cloud environment. In the third, a large language model is introduced to assist the analysis, but because a standardized data docking and verification mechanism is lacking, the model output cannot be cross-verified against the raw operation and maintenance data, and there is a risk of misjudgment.

However, the root cause reasoning process in these related technical solutions lacks a closed-loop verification mechanism. Specifically, root cause analysis based on a large language model only makes an estimate from the input alarm information, so the reliability of the reasoning result is low and secondary confirmation through manual intervention is still required. This not only reduces the accuracy of root cause localization but also increases the time cost and operational complexity of the operation and maintenance response.
Disclosure of Invention

The embodiments of the invention aim to provide an alarm root cause analysis system and method that can solve the problems described in the Background.

To solve the above technical problems, the invention is implemented as follows:

In a first aspect, an embodiment of the invention provides an alarm root cause analysis system comprising a process orchestration platform and a data docking module;

the process orchestration platform responds to an alarm event of a cloud platform and matches a corresponding data docking template in a preset template library according to the alarm type of the alarm event, wherein the template library stores data docking templates corresponding to different alarm types;

the data docking module collects raw operation and maintenance data associated with the alarm event from the operation and maintenance data source specified by the data docking template, and standardizes the raw operation and maintenance data to obtain standard operation and maintenance data;

the process orchestration platform generates a root cause reasoning prompt for the alarm event according to the standard operation and maintenance data, and inputs the root cause reasoning prompt into a large language model for root cause reasoning so as to generate at least one candidate root cause of the alarm event;

for each candidate root cause, the process orchestration platform generates a verification query instruction according to the description of the candidate root cause and sends the verification query instruction to the data docking module;

the data docking module queries the raw operation and maintenance data based on the verification query instruction, obtains verification evidence for the candidate root cause, and returns the verification evidence to the process orchestration platform;

and the process orchestration platform screens out, from the at least one candidate root cause, the candidate root causes whose verification evidence meets a preset confidence condition, and obtains the root cause analysis result of the alarm event.

Optionally, the process orchestration platform is specifically configured to: generate an initial reasoning prompt for the alarm event according to the standard operation and maintenance data; based on a preset prompt optimization rule, supplement the initial reasoning prompt with the operation and maintenance data required by the alarm type of the alarm event, thereby generating an optimized reasoning prompt for the alarm event; and input the optimized reasoning prompt into a large language model for root cause reasoning so as to generate at least one candidate root cause of the alarm event.

The system comprises a template library, a template management module, an alarm analysis module and a control module, wherein the template library is used for storing alarm analysis templates corresponding to different fault scenarios; the template management module is specifically used for: receiving an an