CN-122019224-A - Data center fault diagnosis and repair method based on multi-agent collaborative reasoning

Abstract

The invention relates to a data center fault diagnosis and repair method based on multi-agent collaborative reasoning. A multi-agent collaborative reasoning mechanism overcomes the insufficient knowledge depth and hallucination of a single model, enabling multi-perspective cross-validation and accurate root cause localization. Digital twin simulation with negative-feedback iterative optimization removes the risk of secondary faults caused by directly executing a repair script, ensuring that the repair scheme is safe and reliable. Semantic mapping of multi-modal data improves the large model's understanding of heterogeneous data. A graded execution system with real-time fusing (circuit-breaking) and rollback eliminates the hidden danger of service interruption during automated execution. Together these measures move data center operation and maintenance from manual assistance to safe autonomy and markedly improve its accuracy, efficiency, and reliability.

Inventors

  • MA YANPENG
  • GUO SHUAI
  • WU ZILONG
  • WU ZHENGZHONG
  • ZHONG JING
  • YANG YALONG

Assignees

  • 中国人民解放军61618部队

Dates

Publication Date
2026-05-12
Application Date
2025-12-30

Claims (10)

  1. A data center fault diagnosis and repair method based on multi-agent collaborative reasoning, characterized by comprising the following steps: collecting time-series monitoring metrics and system log data in real time, and converting them into a serialized environment state vector; based on the environment state vector, generating initial hypotheses in parallel by a plurality of vertical-domain agents, and outputting an optimal root cause conclusion by a judgment agent after convergence through multiple rounds of cross-interaction; and generating an atomic operation sequence according to the optimal root cause conclusion, and performing simulation verification and iterative optimization in a digital twin environment to obtain a verified optimal repair scheme.
  2. The data center fault diagnosis and repair method based on multi-agent collaborative reasoning according to claim 1, wherein collecting the time-series monitoring metrics and system log data in real time via infrastructure probes and converting them into a serialized environment state vector comprises: smoothing the time-series monitoring metrics and computing features with Savitzky-Golay convolution, extracting first-order trend features and second-order fluctuation features, and applying dynamic semantic labels through Z-score dynamic threshold judgment to form semantically enhanced features of the time-series monitoring metrics; based on the semantically enhanced features, applying online template extraction based on weighted edit distance and regular-expression masking of variables to the system logs to generate a standardized event set, so as to semantically align the logs with the metrics; and calculating TF-IDF weights over the standardized event set, pruning and sorting, and then constructing a composite environment state vector.
  3. The data center fault diagnosis and repair method based on multi-agent collaborative reasoning according to claim 2, wherein generating initial hypotheses in parallel by the plurality of vertical-domain agents and outputting the optimal root cause conclusion by the judgment agent through multi-round cross-interaction convergence comprises: generating, by the plurality of vertical-domain agents, initial hypotheses based on the composite environment state vector, each initial hypothesis including a unique hypothesis identifier, a source agent identifier, a root cause code, a set of evidence chains, and an initial confidence, so as to form a structured multi-dimensional set of fault hypotheses; iteratively updating, by the judgment agent, the confidence of each hypothesis through multiple rounds of cross-interaction, the confidence update formula being: [formula missing from the source text]; where the terms of the formula are the confidence of a given hypothesis in a given iteration round, the learning-rate step size, the domain authority weight of agent j, the feedback polarity, and the evidence strength (the symbols themselves are not preserved in the source text); and, based on the iteratively updated confidence, judging convergence and outputting the optimal root cause conclusion when the confidence meets a preset convergence condition.
  4. The data center fault diagnosis and repair method based on multi-agent collaborative reasoning according to claim 3, wherein outputting the optimal root cause conclusion by the judgment agent through multi-round cross-interaction convergence further comprises: calculating an evidence integrity score, the formula being: [formula missing from the source text]; where the terms of the formula are the number of features actually matched and the standard indicator base of the fault type as defined in the knowledge base (the symbols themselves are not preserved in the source text); and, based on the evidence integrity score, marking a risk of insufficient evidence in the root cause analysis report when the score is lower than a preset threshold, and integrating the marked report into the optimal root cause conclusion.
  5. The data center fault diagnosis and repair method based on multi-agent collaborative reasoning according to claim 4, wherein generating an atomic operation sequence according to the optimal root cause conclusion and performing simulation verification and iterative optimization in a digital twin environment comprises: generating a repair scheme consisting of an atomic operation sequence based on the optimal root cause conclusion integrated with the insufficient-evidence risk mark, wherein each atomic operation in the sequence is bound to a corresponding rollback instruction and a verification metric; calculating a feasibility score after the repair scheme is executed on a digital twin instance, the formula being: [formula missing from the source text]; where the terms of the formula are the repair scheme, the system state observed after the atomic operation sequence is executed in the digital twin environment, the recovery weight, the degree of service restoration, the risk weight, the risk value of a system crash or secondary failure, the cost weight, and the execution time cost (the symbols themselves are not preserved in the source text); and, based on the feasibility score, when the score is lower than a preset safety threshold, generating a differential diagnosis report and returning it as a negative feedback signal to trigger iterative optimization of the repair scheme, until a verified optimal repair scheme satisfying all safety constraints is obtained.
  6. The data center fault diagnosis and repair method based on multi-agent collaborative reasoning according to claim 1, further comprising executing the verified optimal repair scheme in stages in the physical production environment, which specifically comprises: screening low-importance nodes based on topological betweenness centrality to serve as a pilot batch on which the optimal repair scheme is executed on a trial basis; calculating, in real time within a high-frequency observation window, the relative deviation of the key performance indicators, the formula being: [formula missing from the source text]; where the terms of the formula are the value of the j-th dimension of the real-time state vector, the value of the j-th dimension of the baseline state vector, and a smoothing term that prevents a zero denominator (the symbols themselves are not preserved in the source text); and, based on the relative deviation, performing dynamic fusing (circuit-breaker) detection to control the execution risk, and taking the detection result as the quantitative basis for subsequent branch switching.
  7. The data center fault diagnosis and repair method based on multi-agent collaborative reasoning according to claim 6, wherein performing the dynamic fusing (circuit-breaker) detection through real-time metric monitoring to achieve either fusing or rolling release comprises: triggering a hard interrupt and executing the pre-bound atomic rollback instruction when any metric deviation exceeds a preset hard-constraint threshold or a high-risk log appears, so as to restore the system state to the baseline level and prevent the risk from spreading; and, based on the execution result of the atomic rollback instruction, performing rolling release with an exponential stepping strategy when all metric deviations are lower than a preset safety threshold, until all affected nodes are covered, thereby achieving smooth recovery of the business service.
  8. A data center fault diagnosis and repair system based on multi-agent collaborative reasoning, characterized by comprising a semantic mapping module, a root cause analysis module, and a repair verification module; the semantic mapping module is used for collecting time-series monitoring metrics and system log data in real time and converting them into a serialized environment state vector; the root cause analysis module is used for generating initial hypotheses in parallel by a plurality of vertical-domain agents based on the environment state vector, and outputting an optimal root cause conclusion by a judgment agent after convergence through multiple rounds of cross-interaction; the repair verification module is used for generating an atomic operation sequence according to the optimal root cause conclusion, and performing simulation verification and iterative optimization in a digital twin environment to obtain a verified optimal repair scheme.
  9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program executable on the processor, and the processor, when executing the program, performs the steps of the data center fault diagnosis and repair method based on multi-agent collaborative reasoning according to any one of claims 1-7.
  10. A storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the data center fault diagnosis and repair method based on multi-agent collaborative reasoning according to any one of claims 1-7.
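
The time-series branch of claim 2 can be illustrated with a minimal Python sketch: smooth a metric with a Savitzky-Golay filter, derive first-order trend and second-order fluctuation features, and attach a semantic label through a z-score threshold. The 5-point quadratic window, the threshold of 2.0, the use of whole-series statistics instead of a dynamic window, the label strings, and all function names are assumptions made for illustration; the claim does not specify them.

```python
from statistics import mean, pstdev

# Classic 5-point quadratic Savitzky-Golay kernels for unit sample spacing.
SMOOTH = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]
DERIV1 = [-2 / 10, -1 / 10, 0.0, 1 / 10, 2 / 10]
DERIV2 = [2 / 7, -1 / 7, -2 / 7, -1 / 7, 2 / 7]


def apply_kernel(series, kernel):
    """Slide a centered kernel over the series; edge samples without a full window are dropped."""
    half = len(kernel) // 2
    return [
        sum(k * series[i + j - half] for j, k in enumerate(kernel))
        for i in range(half, len(series) - half)
    ]


def semantic_features(series, z_threshold=2.0):
    """Return one feature dict per interior sample (hypothetical helper, not from the patent)."""
    if len(series) < len(SMOOTH):
        raise ValueError("series is shorter than the filter window")
    smooth = apply_kernel(series, SMOOTH)
    trend = apply_kernel(series, DERIV1)   # first-order trend feature
    fluct = apply_kernel(series, DERIV2)   # second-order fluctuation feature
    # Assumption: whole-series statistics stand in for the "dynamic" threshold.
    mu, sigma = mean(smooth), pstdev(smooth)
    features = []
    for s, d1, d2 in zip(smooth, trend, fluct):
        z = 0.0 if sigma == 0 else (s - mu) / sigma
        if z > z_threshold:
            label = "abnormally_high"
        elif z < -z_threshold:
            label = "abnormally_low"
        else:
            label = "normal"
        features.append(
            {"value": s, "trend": d1, "fluctuation": d2, "z": z, "label": label}
        )
    return features
```

A production implementation would more likely use a library filter and a rolling window for the mean and standard deviation; the fixed kernels here only keep the sketch dependency-free.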
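
The confidence iteration of claim 3 and the evidence integrity score of claim 4 can be sketched as follows. The formulas themselves are not preserved in this text, so the forms used here are assumptions built only from the named terms: an additive update `c + eta * sum_j(w_j * polarity * strength)` clipped to [0, 1], and a ratio of matched features to the knowledge-base indicator base. The `Feedback` type, the convergence tolerance, and every function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    agent: str       # reviewing agent j
    polarity: int    # +1 supports the hypothesis, -1 refutes it
    strength: float  # evidence strength, assumed to lie in [0, 1]


def update_confidence(conf, feedback, weights, eta=0.2):
    """One round of the ASSUMED update rule, clipped to [0, 1]."""
    delta = sum(weights[f.agent] * f.polarity * f.strength for f in feedback)
    return min(1.0, max(0.0, conf + eta * delta))


def evidence_integrity(matched_features, standard_indicators):
    """ASSUMED ratio: matched feature count over the knowledge-base indicator base."""
    if standard_indicators <= 0:
        raise ValueError("standard_indicators must be positive")
    return matched_features / standard_indicators


def adjudicate(hypotheses, rounds, weights, eta=0.2, tol=1e-3):
    """Apply feedback round by round; stop early once no confidence moves by more than tol."""
    conf = dict(hypotheses)
    for round_feedback in rounds:
        new = {
            h: update_confidence(c, round_feedback.get(h, []), weights, eta)
            for h, c in conf.items()
        }
        converged = max(abs(new[h] - conf[h]) for h in conf) < tol
        conf = new
        if converged:
            break
    best = max(conf, key=conf.get)  # hypothesis with the highest final confidence
    return best, conf
```

Under these assumptions, a hypothesis backed by a high-authority agent gains confidence faster than one refuted by a low-authority agent, and a low `evidence_integrity` value would be the trigger for the insufficient-evidence mark described in claim 4.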
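
The scoring and staged-execution logic of claims 5 to 7 can be sketched in the same hedged way. Because the formulas are missing from this text, the linear feasibility score (recovery reward minus weighted risk and time cost) and the relative deviation `|x_j - b_j| / (|b_j| + eps)` are assumed forms consistent with the named terms. The default weights, the doubling batch factor, and all function names are hypothetical, and the node list is assumed to be pre-sorted by ascending betweenness centrality.

```python
def feasibility_score(recovery, risk, time_cost,
                      w_recovery=1.0, w_risk=1.0, w_cost=0.1):
    """ASSUMED linear form: reward service recovery, penalize risk and time cost."""
    return w_recovery * recovery - w_risk * risk - w_cost * time_cost


def relative_deviation(current, baseline, eps=1e-6):
    """ASSUMED per-dimension form |x_j - b_j| / (|b_j| + eps); eps avoids a zero denominator."""
    return [abs(x - b) / (abs(b) + eps) for x, b in zip(current, baseline)]


def should_trip(current, baseline, hard_limit, high_risk_log=False):
    """Trip the breaker if any deviation exceeds the hard limit or a high-risk log appears."""
    deviations = relative_deviation(current, baseline)
    return high_risk_log or any(d > hard_limit for d in deviations)


def exponential_batches(nodes, first=1, factor=2):
    """Split nodes into batches of size first, first*factor, ... until all are covered."""
    if first < 1 or factor < 1:
        raise ValueError("first and factor must be at least 1")
    batches, start, size = [], 0, first
    while start < len(nodes):
        batches.append(nodes[start:start + size])
        start += size
        size *= factor
    return batches
```

For example, `exponential_batches(list(range(7)))` yields `[[0], [1, 2], [3, 4, 5, 6]]`: a single pilot node first, then batches that double in size as long as `should_trip` stays false after each batch.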

Description

Data center fault diagnosis and repair method based on multi-agent collaborative reasoning

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a data center fault diagnosis and repair method based on multi-agent collaborative reasoning.

Background

With the rapid development of cloud computing, 5G, the Internet of Things, and artificial intelligence, the scale and complexity of the data center, the core infrastructure that carries massive computing and storage workloads, have grown rapidly. The number of hardware devices such as servers, storage arrays, and network devices has multiplied, accompanied by massive amounts of multi-modal monitoring data (time-series metrics, system logs, topological relationships) and highly coupled system dependencies.

Traditional data center operation and maintenance relies mainly on the experience and manual operation of senior experts, and suffers from slow response, knowledge that is hard to pass on, and difficulty in scaling. Later rule-based automation systems made the decision logic more explicit, but their rule bases are costly to maintain, slow to update, and inflexible, so they struggle to cover complex and changing fault scenarios.

In recent years, breakthroughs in Large Language Model (LLM) technology have brought new opportunities for intelligent data center operation and maintenance. In the prior art, an operation and maintenance assistant system is built by combining an LLM with a vector database; it retrieves equipment parameters, historical cases, and operation and maintenance knowledge to answer fault queries and suggest repairs for operation and maintenance personnel, which markedly improves information processing efficiency.
However, such schemes are in essence still "open-loop" auxiliary tools: the generated diagnostic conclusions and repair scripts must be judged, verified, and executed manually, so a truly automated closed loop is not achieved. In addition, a single general-purpose large model has insufficient knowledge depth in vertical domains and is prone to "machine hallucination", which limits diagnostic accuracy. Its semantic understanding of multi-modal heterogeneous data is weak, making it difficult to locate deep root causes accurately. Its output is applied directly to the production environment without safety simulation verification, which creates a risk of secondary faults. Finally, the execution process lacks an effective real-time feedback and risk-blocking mechanism, so the hidden danger of service interruption is high.

Disclosure of Invention

The invention aims to provide a data center fault diagnosis and repair method based on multi-agent collaborative reasoning, which solves the following problems in the prior art: decision and execution are disconnected, a single large model is prone to hallucination, safety simulation verification before execution is lacking, and understanding of multi-modal data is insufficient.
In order to achieve one of the above objects, an embodiment of the present invention provides a data center fault diagnosis and repair method based on multi-agent collaborative reasoning, the method comprising: collecting time-series monitoring metrics and system log data in real time, and converting them into a serialized environment state vector; based on the environment state vector, generating initial hypotheses in parallel by a plurality of vertical-domain agents, and outputting an optimal root cause conclusion by a judgment agent after convergence through multiple rounds of cross-interaction; and generating an atomic operation sequence according to the optimal root cause conclusion, and performing simulation verification and iterative optimization in a digital twin environment to obtain a verified optimal repair scheme.

As a further improvement of one embodiment of the present invention, collecting the time-series monitoring metrics and system log data in real time through infrastructure probes and converting them into a serialized environment state vector comprises: smoothing the time-series monitoring metrics and computing features with Savitzky-Golay convolution, extracting first-order trend features and second-order fluctuation features, and applying dynamic semantic labels through Z-score dynamic threshold judgment to form semantically enhanced features of the time-series monitoring metrics; based on the semantically enhanced features, applying online template extraction based on weighted edit distance and regular-expression masking of variables to the system logs to generate a standardized event set, so as to semantically align the logs with the metrics; A