CN-122019221-A - Fault self-healing method and system based on business semantic graph and large language model
Abstract
The invention provides a fault self-healing method, system, electronic device and storage medium based on a business semantic graph and a large language model. Through a full-link closed loop of knowledge modeling, real-time perception, intelligent reasoning, automatic repair and continuous evolution, unstructured enterprise business knowledge is converted into a dynamic business semantic graph, which is combined with the large language model to achieve fault self-healing. The business semantic graph is constructed and updated; multidimensional observable data of the production environment are collected in real time and anomalies are detected; the data and the graph are jointly input into the model, which outputs the root cause and repair suggestions; for high-confidence faults a repair request is generated automatically and released after verification; and after fault recovery the post-incident review report is fed back to update the graph, forming a self-evolving closed loop. With this scheme, the recovery time for more than 80% of routine faults is shortened from hours to minutes, MTTR is reduced, organization-level knowledge is accumulated so that the risk of knowledge gaps is eliminated, manual intervention and regression risk are reduced, system intelligence improves continuously, and the stability of the production system and R&D efficiency are safeguarded.
Inventors
- LI NING
Assignees
- 璞华国际科技(武汉)有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20251226
Claims (10)
- 1. A fault self-healing method based on a business semantic graph and a large language model, characterized by comprising the following steps: S1, extracting, from preset business knowledge data, entities, relations and attributes that represent business logic, entity relations and historical fault patterns, and vectorizing them to construct and continuously update a dynamic business semantic graph; S2, collecting multidimensional observable data of a production environment in real time, analyzing the multidimensional observable data in real time, and generating a trigger event when the analysis result indicates an anomaly; S3, in response to the trigger event, jointly inputting the current multidimensional observable data and the business semantic graph into a large language model to output root cause localization information and corresponding repair suggestions; S4, if the confidence of the root cause localization information is higher than a preset threshold and the fault scenario corresponding to the root cause localization information meets a preset automatic repair condition, automatically generating, in a designated version control system and based on the repair suggestions, a repair request containing a code modification or configuration change; S5, pushing the repair request for review, and automatically executing the build, test and release processes after the review is passed, so as to complete the repair; and S6, after fault recovery, automatically generating a structured post-incident review report based on the current handling process, and updating the business semantic graph according to the post-incident review report.
- 2. The fault self-healing method based on a business semantic graph and a large language model according to claim 1, wherein in S1, the preset business knowledge data comprises at least one of business architecture documents, core link source code, historical post-incident review reports, standard operating procedures (SOPs) and business link topology graphs; the vectorization processing comprises: semantically encoding the extracted entities, relations and attributes with a pre-trained language model to generate corresponding vector representations; and the business semantic graph is updated in real time, through an event bus, in synchronization with newly added business knowledge data.
- 3. The fault self-healing method based on a business semantic graph and a large language model according to claim 1, wherein in S2, the multidimensional observable data comprises server hardware metrics, distributed trace data, full traffic logs, database slow query logs and status data of external dependent services, wherein the server hardware metrics comprise CPU usage, memory occupancy, IO throughput and network bandwidth.
- 4. The fault self-healing method based on a business semantic graph and a large language model according to claim 1, wherein in S2, analyzing the multidimensional observable data in real time and generating a trigger event when the analysis result indicates an anomaly comprises: performing unified normalization on the collected multidimensional observable data; extracting real-time data features from the normalized data by an anomaly detection engine; comparing the real-time data features with a preset normal threshold range and/or a historical fault feature library; and when the comparison result meets a preset anomaly judgment condition, generating a trigger event, wherein the trigger event comprises an anomaly type, an occurrence time and associated service node information.
- 5. The fault self-healing method based on a business semantic graph and a large language model according to claim 1, wherein step S3 specifically comprises: in response to the trigger event, retrieving from the business semantic graph the service nodes and historical fault patterns associated with the anomaly represented by the trigger event, to form a business context subgraph; concatenating and formatting the current multidimensional observable data and the business context subgraph to construct prompt information that the large language model can process; inputting the prompt information into the large language model and performing causal analysis and reasoning; and receiving and parsing the output of the large language model to generate structured root cause localization information and corresponding repair suggestions, wherein the root cause localization information comprises at least a root cause entity, an affected link and a confidence, and the repair suggestions comprise operation steps and a risk level.
- 6. The fault self-healing method based on a business semantic graph and a large language model according to claim 1, wherein step S4 specifically comprises: when the confidence of the root cause localization information is higher than a preset threshold, judging according to preset rules whether the corresponding fault scenario meets an automatic repair condition, wherein the automatic repair condition comprises that the fault root cause belongs to a defined standardized type and that the repair operation has a verified safe scheme; if so, automatically generating, based on the repair suggestions, a repair script containing specific code modifications or configuration change items; calling an application program interface of the version control system to create, in a designated code repository, a repair request containing the repair script, a root cause analysis summary and the modification context; and associating the repair request with the trigger event, and pushing a generation notification to a designated review terminal.
- 7. The fault self-healing method based on a business semantic graph and a large language model according to claim 1, wherein step S6 specifically comprises: automatically generating a structured post-incident review report based on the trigger event, the root cause localization information, the repair suggestions and the repair execution result, wherein the post-incident review report comprises at least the fault phenomenon, the root cause conclusion, the repair measures and prevention suggestions; taking the post-incident review report as a new knowledge source, and identifying and extracting newly added or changed entities, relations and attributes from it; updating the business semantic graph according to the extraction result, including adding new graph nodes and edges and updating the vector representations of existing nodes; and persistently storing the key metrics of the fault handling and the knowledge update records, so as to drive the continuous evolution of the business semantic graph and the association analysis model.
- 8. A fault self-healing system based on a business semantic graph and a large language model, characterized by comprising: a business semantic graph construction and updating module, configured to extract, from preset business knowledge data, entities, relations and attributes that represent business logic, entity relations and historical fault patterns, and to vectorize them to construct and continuously update a dynamic business semantic graph; a multidimensional observable data collection and anomaly detection module, configured to collect multidimensional observable data of a production environment in real time, analyze the multidimensional observable data in real time, and generate a trigger event when the analysis result indicates an anomaly; a root cause analysis and repair suggestion generation module, configured to, in response to the trigger event, jointly input the current multidimensional observable data and the business semantic graph into a large language model to output root cause localization information and corresponding repair suggestions; an automatic repair request generation module, configured to automatically generate, in a designated version control system and based on the repair suggestions, a repair request containing a code modification or configuration change if the confidence of the root cause localization information is higher than a preset threshold and the fault scenario corresponding to the root cause localization information meets a preset automatic repair condition; a review and release execution module, configured to push the repair request for review and automatically execute the build, test and release processes after the review is passed, so as to complete the repair; and a post-incident review and graph updating module, configured to automatically generate a structured post-incident review report based on the current handling process after fault recovery, and to update the business semantic graph according to the post-incident review report.
- 9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the business semantic graph and large language model based fault self-healing method according to any one of claims 1 to 7 when the program is executed.
- 10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the business semantic graph and large language model based fault self-healing method according to any one of claims 1 to 7.
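The threshold comparison of claim 4 and the automatic-repair gate of claim 6 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the metric names, normal ranges, confidence threshold, fault types and the identifiers `detect_anomaly` and `should_auto_repair` are all hypothetical, and the metrics are assumed to be already normalized to fractions between 0 and 1.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# All names, ranges and fault types below are hypothetical illustrations,
# not values taken from the patent.
NORMAL_RANGES = {"cpu_usage": (0.0, 0.85), "memory_occupancy": (0.0, 0.90)}
CONFIDENCE_THRESHOLD = 0.9
# Standardized root-cause types assumed to have a verified safe repair scheme.
AUTO_REPAIRABLE_TYPES = {"config_timeout_too_low", "connection_pool_exhausted"}


@dataclass
class TriggerEvent:
    anomaly_type: str
    occurred_at: datetime
    service_node: str


@dataclass
class RootCause:
    entity: str
    fault_type: str
    confidence: float


def detect_anomaly(node: str, metrics: dict[str, float]) -> Optional[TriggerEvent]:
    """Compare each normalized metric with its normal range.

    Returns a trigger event for the first metric outside its range,
    or None when every metric is within range (unknown metrics pass).
    """
    for name, value in metrics.items():
        low, high = NORMAL_RANGES.get(name, (float("-inf"), float("inf")))
        if not low <= value <= high:
            return TriggerEvent(f"{name}_out_of_range", datetime.now(timezone.utc), node)
    return None


def should_auto_repair(root_cause: RootCause) -> bool:
    """Gate in the spirit of claim 6: confidence strictly above the threshold
    AND a standardized fault type with a verified safe repair scheme."""
    return (root_cause.confidence > CONFIDENCE_THRESHOLD
            and root_cause.fault_type in AUTO_REPAIRABLE_TYPES)
```

In this sketch a fault that fails either condition would fall through to manual handling; the comparison against a historical fault feature library mentioned in claim 4 is omitted for brevity.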
Description
Fault self-healing method and system based on business semantic graph and large language model
Technical Field
The embodiments of the invention relate to the technical field of enterprise-level observability and fault emergency response, and in particular to a fault self-healing method and system based on a business semantic graph and a large language model.
Background
Enterprise-level observability and fault emergency response are core technical fields for the stable operation of Internet enterprise production systems. They mainly address key scenarios such as log investigation, fault localization, emergency repair and on-call collaboration, with the core goals of shortening fault recovery time and improving system stability and R&D efficiency. Current mainstream industry practice includes: a traditional log investigation mode in which engineers search by keyword and inspect logs line by line; root cause localization that depends heavily on the personal experience of a few senior staff; a repair process in which code is modified ad hoc and an emergency release is issued; and an on-call wake-up mechanism in which, when a fault occurs outside working hours, staff are notified in turn by telephone or group message. Some teams have introduced a centralized log system but still rely mainly on manual analysis, so the level of intelligence is low. This mode is still widely used in small and medium-sized teams and in some large enterprises, and it bears directly on the optimization of the core metric mean time to recovery (MTTR, Mean Time To Recovery). To improve efficiency, some enterprises have introduced centralized log systems and monitoring and alerting tools, achieving unified data collection and preliminary anomaly detection.
However, these tools are still insufficiently intelligent at the root cause analysis and repair execution levels. They cannot effectively combine semantic information such as business logic and system architecture with real-time observation data, so diagnostic accuracy is limited. The process from fault localization to repair and release remains fragmented, the degree of automated closed loop is low, and fast, safe fault self-healing cannot be achieved. When the production log volume is huge, manual search and analysis are extremely inefficient, and locating a single fault usually takes hours or even days. Key knowledge such as core business links and historical fault patterns is highly concentrated in a few senior staff; once they are off duty or leave the company, serious knowledge gaps and a loss of response capability result. Fault repair usually depends on ad hoc code modification and emergency release, which is time-consuming and carries a high regression risk. When a fault occurs outside working hours, the relevant staff must be called repeatedly by telephone or in group chats, and failure to reach them promptly significantly lengthens response and recovery times.
Disclosure of Invention
The embodiments of the invention provide a fault self-healing method, system, electronic device and storage medium based on a business semantic graph and a large language model, to solve the following technical problems in the prior art: traditional fault investigation depends on manual experience, is inefficient and tends to form knowledge silos; the process from fault diagnosis to repair is fragmented, with a low degree of automation and high operational risk; and operation and maintenance systems cannot continuously learn from handling experience and lack self-evolution capability.
In a first aspect, an embodiment of the invention provides a fault self-healing method based on a business semantic graph and a large language model, comprising: S1, extracting, from preset business knowledge data, entities, relations and attributes that represent business logic, entity relations and historical fault patterns, and vectorizing them to construct and continuously update a dynamic business semantic graph. S2, collecting multidimensional observable data of the production environment in real time, analyzing the multidimensional observable data in real time, and generating a trigger event when the analysis result indicates an anomaly. S3, in response to the trigger event, jointly inputting the current multidimensional observable data and the business semantic graph into a large language model to output root cause localization information and corresponding repair suggestions. S4, if the confidence of the root cause localization information is higher than a preset threshold and the fault scenario corresponding to the root cause localization information meets a preset automatic repair condition, automatically generating a repair request containing code modification or configurat