CN-122027443-A - Service system fault root cause analysis method and device based on AGENT large model and storage medium
Abstract
The invention provides a service system fault root cause analysis method and device based on an AGENT large model, and a storage medium, and relates to the technical field of IT operation and maintenance support. The method comprises: collecting original alarm data of a service system to be analyzed and structuring it to obtain aggregated alarm data; performing hierarchical node fault analysis on the aggregated alarm data, and performing entity fault propagation analysis in combination with the software and hardware topological relations to obtain fault propagation path information; judging the fault grade according to the fault propagation path information and fault grading rules to generate a fault diagnosis result; classifying the associated alarms by service architecture layer, extracting the topological association relations of each layer and converting them into topological text information; filtering the alarms and extracting core features based on the AGENT large model and layered screening rules to generate a layered summary report; and completing fault root cause positioning and influence range analysis by combining the report, the diagnosis result and an operation and maintenance knowledge base. The method improves the integrity and consistency of the analysis and provides operation and maintenance personnel with a one-stop solution.
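As a minimal illustrative sketch of the structuring step named above (and detailed in claim 2), the following Python fragment filters dirty records, standardizes fields, and merges identical alarms per device while counting repetitions and the collected value range. The field names `device`, `metric` and `value`, the `AggregatedAlarm` type and the function `aggregate_alarms` are assumptions introduced only for illustration; the source does not specify a data schema or an implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AggregatedAlarm:
    # Hypothetical record layout; the patent does not define one.
    device: str
    metric: str
    repeat_count: int
    value_min: float
    value_max: float


def aggregate_alarms(raw_alarms):
    """Filter dirty records, standardize fields, merge identical alarms per device."""
    groups = defaultdict(list)
    for alarm in raw_alarms:
        # Field standardization: trim whitespace and lower-case identifiers.
        device = str(alarm.get("device") or "").strip().lower()
        metric = str(alarm.get("metric") or "").strip().lower()
        try:
            value = float(alarm.get("value"))
        except (TypeError, ValueError):
            continue  # dirty data: missing or non-numeric value
        if not device or not metric:
            continue  # dirty data: missing device or metric
        groups[(device, metric)].append(value)
    # One aggregated record per (device, metric) pair.
    return [
        AggregatedAlarm(device, metric, len(values), min(values), max(values))
        for (device, metric), values in sorted(groups.items())
    ]


raw = [
    {"device": "VM-01 ", "metric": "CPU", "value": "91.5"},
    {"device": "vm-01", "metric": "cpu", "value": 97},
    {"device": "vm-01", "metric": "cpu", "value": None},
    {"device": "", "metric": "cpu", "value": 50},
    {"device": "db-01", "metric": "disk", "value": 88},
]
result = aggregate_alarms(raw)
```

Here the two valid `vm-01` CPU alarms collapse into one record with a repetition count of 2 and a value range of 91.5 to 97.0, while the two malformed records are dropped.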
Inventors
- DENG GAOQIANG
Assignees
- 北京思特奇信息技术股份有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20251231
Claims (10)
- 1. A service system fault root cause analysis method based on an AGENT large model, characterized by comprising the following steps: collecting original alarm data from a service system to be analyzed, and performing data structuring processing on the original alarm data to obtain aggregated alarm data; performing hierarchical node fault analysis on the aggregated alarm data to obtain a node fault identification result, and performing entity fault propagation analysis on the node fault identification result based on a software-hardware topological relation to obtain fault propagation path information; performing fault grading based on the fault propagation path information and a fault grading rule to obtain a fault diagnosis result comprising a fault existence state, a propagation path and a grade; performing hierarchical classification on the associated alarms according to different service architectures based on the fault diagnosis result to obtain alarm data of each hierarchy, extracting the topological association relations corresponding to the alarm data of each hierarchy based on a system full-scale software and hardware topological graph, and converting the topological association relations into topological text information to obtain topological text information of each hierarchy; performing layered alarm filtering on the alarm data of each hierarchy based on an AGENT large model in combination with a preset layered screening rule to obtain core alarm data, and extracting features of the core alarm data based on the topological text information to obtain a layered summary report; and performing fault root cause positioning and influence range analysis based on the layered summary report, the fault diagnosis result and an operation and maintenance knowledge base to obtain a fault root cause tracing and influence assessment result.
- 2. The method for analyzing the root cause of a fault in a service system according to claim 1, wherein collecting original alarm data from the service system to be analyzed, and performing data structuring processing on the original alarm data to obtain aggregated alarm data, comprises: collecting original alarm data from the service system to be analyzed, performing dirty data filtering on the original alarm data, and performing field standardization on the filtered alarm data to obtain standardized alarm data; and, for the standardized alarm data, merging the multiple pieces of identical alarm data generated by the same device, taking the device entity as a unit, counting the number of alarm repetitions and the range of collected alarm values, and generating the aggregated alarm data based on the merged alarm data, the number of alarm repetitions and the range of collected alarm values.
- 3. The method for analyzing the root cause of a fault in a service system according to claim 1, wherein performing hierarchical node fault analysis on the aggregated alarm data to obtain a node fault identification result, and performing entity fault propagation analysis on the node fault identification result based on a software-hardware topological relation to obtain fault propagation path information, comprises: performing association matching between the alarm information in the aggregated alarm data and the node types of the service system, and determining a target node corresponding to each piece of alarm information, wherein the node types at least comprise a service node, a process node, a virtual machine node, a physical machine node and a middleware node; performing fault judgment on the target nodes of each level according to preset IaaS layer, PaaS layer and SaaS layer service architecture levels, wherein the IaaS layer comprises the bottom-layer base nodes, the PaaS layer and the SaaS layer comprise the upper-layer service nodes, and the fault judgment process comprises: if a direct influence index of the target node raises an alarm and the number of consecutive alarms reaches a preset direct fault threshold, judging the target node to be a direct fault node; and if an indirect influence index of the target node raises an alarm, the number of alarms reaches a preset indirect fault threshold, and the indirect influence index belongs to a preset user attention index list, judging the target node to be an indirect fault node; summarizing the information of the direct fault nodes and the indirect fault nodes of the IaaS layer, the PaaS layer and the SaaS layer, determining the core attributes of each fault node based on that information, and generating the node fault identification result; retrieving software and hardware topological relation data of the service system to be analyzed, wherein the software and hardware topological relation data comprise attribute information of the nodes of each level and the calling, deployment and containment dependency relations among the nodes; and, taking each fault node in the node fault identification result as a starting node, determining, by path traversal and node association matching according to the dependency relations among the nodes in the software and hardware topological relation data, the conduction links along which the fault propagates from the bottom-layer base nodes to the upper-layer service nodes, and recording the intermediate conduction nodes, the end node, the inter-node dependency types, and the propagation order and timing information of each conduction link, to form fault propagation path information comprising complete propagation links.
- 4. The service system fault root cause analysis method according to claim 3, wherein performing fault grading based on the fault propagation path information and a fault grading rule to obtain a fault diagnosis result comprising a fault existence state, a propagation path and a grade, comprises: configuring the fault grading rule, wherein the fault grading rule comprises a severe level, an important level and a general level, and the judgment condition of the severe level is that the proportion of fault nodes supporting a core function of the service system reaches a preset proportion threshold or the service interruption time exceeds a preset time threshold; collecting real-time operation data of the business services, and quantifying the degree of influence of the fault on business functions by node association mapping, based on the fault propagation path information and the real-time operation data; and matching the degree of influence of the fault on business functions against the fault grading rule to obtain the fault diagnosis result comprising the fault existence state, the propagation path and the grade.
- 5. The method for analyzing the root cause of a fault in a service system according to claim 4, wherein performing hierarchical classification on the associated alarms according to different service architectures based on the fault diagnosis result to obtain alarm data of each hierarchy, extracting the topological association relations corresponding to the alarm data of each hierarchy based on a system full-scale software and hardware topological graph, and converting the topological association relations into topological text information to obtain topological text information of each hierarchy, comprises: screening, from the historical alarm data of the service system to be analyzed, the associated alarm data related to the current fault, based on the fault propagation path and fault node information in the fault diagnosis result; classifying the associated alarm data into IaaS layer alarm data, PaaS layer alarm data and SaaS layer alarm data by hierarchical label matching, according to preset IaaS layer, PaaS layer and SaaS layer service architecture division rules and the hierarchy to which the fault node corresponding to each piece of associated alarm data belongs, to obtain the alarm data of each hierarchy; for the alarm data of each hierarchy, extracting from the system full-scale software and hardware topological graph a topology segment directly or indirectly associated with the alarming nodes of that hierarchy, and extracting the topological association relations among the nodes in the segment by node attribute matching and dependency relation analysis, wherein the topological association relations are deployment, calling and resource dependency relations; and converting the extracted topological association relations into a structured text description comprising node names, node types, the hierarchy to which each node belongs and the association modes among nodes, to generate the topological text information corresponding to each hierarchy.
- 6. The method for analyzing the root cause of a fault in a service system according to claim 5, wherein performing layered alarm filtering on the alarm data of each hierarchy based on an AGENT large model in combination with a preset layered screening rule to obtain core alarm data, and extracting features of the core alarm data based on the topological text information to obtain a layered summary report, comprises: configuring differentiated alarm screening rules for the IaaS layer, the PaaS layer and the SaaS layer respectively, wherein the alarm screening rule of the IaaS layer comprises retaining hardware resource alarms within a preset time length before and after the fault and rejecting alarms of retired equipment; the alarm screening rule of the PaaS layer comprises retaining middleware alarms whose resource utilization fluctuation exceeds a preset fluctuation threshold and rejecting alarms within a normal maintenance window; and the alarm screening rule of the SaaS layer comprises retaining alarms of core service modules and rejecting alarms of gray-release version services; matching, based on the AGENT large model, the alarm data of each hierarchy against the alarm screening rule of the corresponding layer, so as to remove irrelevant alarm data and obtain the core alarm data of each hierarchy; extracting, based on the AGENT large model in combination with the topological text information corresponding to each hierarchy, the core features of each piece of core alarm data, wherein the extracted core features comprise the time distribution of alarm occurrences, the fault nodes with the highest alarm frequency, the association relations between alarms and service modules, and the dependency links corresponding to the alarms; and integrating the alarm analysis information of each hierarchy according to a preset report template based on the extracted core features, to generate a layered summary report containing the alarm core features of the IaaS layer, the PaaS layer and the SaaS layer and the associated topological text information.
- 7. The method for analyzing the root cause of a fault in a service system according to claim 1, wherein performing fault root cause positioning and influence range analysis based on the layered summary report, the fault diagnosis result and the operation and maintenance knowledge base to obtain the fault root cause tracing and influence assessment result, comprises: retrieving a preset operation and maintenance knowledge base, wherein the operation and maintenance knowledge base comprises historical fault cases, root cause judgment rules, influence evaluation standards and a solution library; performing, through the AGENT large model, feature similarity matching between, on the one hand, the alarm core features of each hierarchy in the layered summary report and the fault propagation path and grade information in the fault diagnosis result and, on the other hand, the historical fault data in the operation and maintenance knowledge base, and extracting the historical root cause analysis experience and influence evaluation reference data whose association degree exceeds a preset matching threshold; integrating, by the AGENT large model, the alarm core features of the IaaS layer, the PaaS layer and the SaaS layer in the layered summary report, and determining the source node of the fault as the root cause node in combination with the fault propagation path information in the fault diagnosis result; quantitatively analyzing the fault influence, based on the software and hardware topological association relations of the root cause node and the fault propagation path information, by traversing the radiation range of the node and mapping service associations, to obtain an influence level; and structurally integrating the root cause positioning result and the influence level, and generating the fault root cause tracing and influence assessment result according to a preset standardized format.
- 8. An AGENT large model-based service system fault root cause analysis device, characterized by comprising: an alarm preprocessing module, used for collecting original alarm data from a service system to be analyzed, and performing data structuring processing on the original alarm data to obtain aggregated alarm data; a service fault diagnosis module, used for performing hierarchical node fault analysis on the aggregated alarm data to obtain a node fault identification result, and performing entity fault propagation analysis on the node fault identification result based on a software-hardware topological relation to obtain fault propagation path information; performing fault grading based on the fault propagation path information and a fault grading rule to obtain a fault diagnosis result comprising a fault existence state, a propagation path and a grade; performing hierarchical classification on the associated alarms according to different service architectures based on the fault diagnosis result to obtain alarm data of each hierarchy, extracting the topological association relations corresponding to the alarm data of each hierarchy based on a system full-scale software and hardware topological graph, and converting the topological association relations into topological text information to obtain topological text information of each hierarchy; performing layered alarm filtering on the alarm data of each hierarchy based on an AGENT large model in combination with a preset layered screening rule to obtain core alarm data, and extracting features of the core alarm data based on the topological text information to obtain a layered summary report; and a fault root cause analysis module, used for performing fault root cause positioning and influence range analysis based on the layered summary report, the fault diagnosis result and the operation and maintenance knowledge base to obtain a fault root cause tracing and influence assessment result.
- 9. A service system fault root cause analysis device based on an AGENT large model, characterized by comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the service system fault root cause analysis method based on an AGENT large model according to any one of claims 1 to 7 when executing the computer program.
- 10. A computer readable storage medium storing a computer program, characterized in that the service system fault root cause analysis method based on an AGENT large model according to any one of claims 1 to 7 is implemented when the computer program is executed by a processor.
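The path traversal described in claim 3 can be sketched, under stated assumptions, as a depth-first walk over the dependency relations. In this hypothetical example the topology is a plain dictionary mapping each node to the nodes that depend on it, together with the dependency type; the node names, the dictionary layout and the function `trace_propagation_paths` are illustrative only, and the timing information that the claim also records is omitted for brevity.

```python
def trace_propagation_paths(fault_nodes, topology):
    """Walk from each fault node toward the upper-layer nodes that depend on it.

    topology maps a node to a list of (dependent_node, dependency_type) pairs.
    Each returned path is a list of (node, dependency_type_used_to_reach_it).
    """
    paths = []

    def walk(node, path):
        visited = {name for name, _ in path}
        # Skip nodes already on the path so cyclic dependencies terminate.
        next_hops = [(dep, kind) for dep, kind in topology.get(node, [])
                     if dep not in visited]
        if not next_hops:
            if len(path) > 1:  # keep only links that actually propagate
                paths.append(path)
            return
        for dep, kind in next_hops:
            walk(dep, path + [(dep, kind)])

    for start in fault_nodes:
        walk(start, [(start, None)])
    return paths


# Hypothetical topology: physical machine -> VM -> process/middleware -> service.
topology = {
    "pm-01": [("vm-01", "deployment")],
    "vm-01": [("proc-order", "deployment"), ("redis-01", "deployment")],
    "proc-order": [("svc-order", "containment")],
    "redis-01": [("svc-order", "calling")],
}
paths = trace_propagation_paths(["pm-01"], topology)
node_sequences = [[name for name, _ in path] for path in paths]
```

Each resulting path lists the starting fault node, the intermediate conduction nodes and the end node in propagation order, with the dependency type attached to every hop. A production system would more likely query a graph store than an in-memory dictionary; that choice is outside what the source specifies.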
Description
Service system fault root cause analysis method and device based on AGENT large model and storage medium
Technical Field
The invention mainly relates to the technical field of IT operation and maintenance support, and in particular to a service system fault root cause analysis method and device based on an AGENT large model, and a storage medium.
Background
With the rapid development of cloud computing and big data technology, service systems have gradually evolved toward distributed multi-level architectures (such as the IaaS/PaaS/SaaS three-layer architecture). The number of software and hardware nodes in a system keeps growing, and the calling, deployment and resource dependency relations among the nodes are increasingly complex. During operation of a service system, problems such as hardware faults, middleware abnormalities, interrupted service dependencies and configuration errors may all cause faults. Such faults propagate easily across layers along the topological dependency relations, producing massive bursts of alarm data and greatly increasing the difficulty of locating the fault root cause.
The existing service system fault root cause analysis technology has the following defects:
- Alarm data processing is inefficient: the original alarm data is mixed with a large amount of dirty and repeated data, and a unified standardized processing flow is lacking, so the data available for subsequent analysis is of poor quality and contains much interfering information.
- Fault node identification is disconnected from propagation path analysis: fault nodes are judged in isolation based on manual experience, and fault propagation links are not traced in combination with the software and hardware topological relations of the system, so the conduction logic and influence range of a fault are difficult to determine.
- Fault grading is highly subjective: the lack of objective judgment rules based on quantitative indexes easily leads to unreasonable allocation of operation and maintenance resources and delayed response to high-priority faults.
- Root cause positioning is insufficiently accurate: the alarm characteristics of the layered architecture and historical operation and maintenance experience are not fully utilized, and positioning mostly depends on manual investigation, which is inefficient, is limited by the experience of the personnel, easily causes misjudgment or missed judgment, and prolongs fault repair time.
- Topology association information is not effectively analyzed and utilized: topological relations are mostly presented in graphical form, which cannot directly provide parsable data support for an intelligent analysis model, and this restricts the intelligence level of fault analysis.
Therefore, an analysis method capable of standardized processing of alarm data, accurate tracing of fault propagation paths, objective judgment of fault grades and intelligent positioning of root causes is needed, so as to solve the problems of low efficiency, poor precision and insufficient intelligence of fault root cause analysis in the prior art, and to improve the stability and efficiency of service system operation and maintenance.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a service system fault root cause analysis method and device based on an AGENT large model, and a storage medium. The technical scheme for solving the above technical problem is as follows: a service system fault root cause analysis method based on an AGENT large model comprises the following steps: collecting original alarm data from a service system to be analyzed, and performing data structuring processing on the original alarm data to obtain aggregated alarm data; performing hierarchical node fault analysis on the aggregated alarm data to obtain a node fault identification result, and performing entity fault propagation analysis on the node fault identification result based on a software-hardware topological relation to obtain fault propagation path information; performing fault grading based on the fault propagation path information and a fault grading rule to obtain a fault diagnosis result comprising a fault existence state, a propagation path and a grade; performing hierarchical classification on the associated alarms according to different service architectures based on the fault diagnosis result to obtain alarm data of each hierarchy, extracting the topological association relations corresponding to the alarm data of each hierarchy based on a system full-scale software and hardware topological graph, and converting the topological association relations into 
topological text information to obtain topological text information of each hierarchy; carrying out layered ala