CN-121998622-A - Intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion

Abstract

The invention relates to the technical field of server cluster operation and maintenance risk assessment and discloses an intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion. The system acquires multi-modal telemetry data streams comprising processor utilization rate, memory allocation gradient and network connection relation complexity, and extracts architecture design reference constraints to construct a multi-dimensional reference model of the operation state. It calculates a topology deviation value of the real-time state vector from the model, decouples steady-state deviation from dynamic disturbance by feature decomposition, and locates logic abnormal nodes. It then determines a compensation parameter set according to a consistency objective function minimization rule, drives the parameters of the associated nodes to converge to the design constraints, performs a spectrum distribution audit using the Laplacian operator, and verifies feature spectrum consistency.
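
The deviation and decoupling steps summarized above can be illustrated with a small sketch. It is not the patented method: the abstract does not disclose the feature decomposition in detail, so this example swaps in a much simpler split, taking the window mean of each dimension's deviation as the steady-state part and the population standard deviation as the dynamic part. The function names, the sample values and both thresholds are hypothetical.

```python
from statistics import mean, pstdev

def deviation_series(samples, reference):
    """Deviation of each observed state vector from the reference vector."""
    return [[x - r for x, r in zip(sample, reference)] for sample in samples]

def split_deviation(deviations):
    """Split the deviation of each dimension over the window into two parts.

    steady:  window mean of the deviation (a persistent offset)
    dynamic: population standard deviation (fluctuation around that offset)
    This mean/spread split is an assumed stand-in for the feature
    decomposition named in the abstract.
    """
    columns = list(zip(*deviations))
    steady = [mean(col) for col in columns]
    dynamic = [pstdev(col) for col in columns]
    return steady, dynamic

def classify(steady, dynamic, steady_limit, dynamic_limit):
    """Label each dimension; both limits are hypothetical tuning values."""
    labels = []
    for s, d in zip(steady, dynamic):
        if abs(s) > steady_limit:
            labels.append("steady-state deviation")
        elif d > dynamic_limit:
            labels.append("dynamic disturbance")
        else:
            labels.append("nominal")
    return labels

# Hypothetical window of state vectors for one node, ordered as
# (processor utilization, memory allocation gradient, connection complexity).
reference = [0.5, 0.1, 3.0]
samples = [
    [0.9, 0.1, 3.0],
    [0.9, 0.3, 3.0],
    [0.9, -0.1, 3.0],
    [0.9, 0.1, 3.0],
]
steady, dynamic = split_deviation(deviation_series(samples, reference))
labels = classify(steady, dynamic, steady_limit=0.2, dynamic_limit=0.1)
```

In this toy window the first dimension carries a constant offset, the second fluctuates around its reference value, and the third matches the reference, so each of the three labels appears once.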

Inventors

  • DUAN JUN
  • SHEN KAI

Assignees

  • 上海特华计算机系统集成有限公司

Dates

Publication Date
20260508
Application Date
20260128

Claims (10)

  1. An intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion, characterized by comprising: a multi-modal telemetry data acquisition module, configured to acquire multi-modal telemetry data streams of a target server cluster, wherein the multi-modal telemetry data streams comprise processor utilization rate, memory allocation gradient and network connection relation complexity; an operation state reference model construction module, configured to perform semantic analysis on a system architecture configuration document to extract design reference constraints, and to map the processor utilization rate, the memory allocation gradient and the network connection relation complexity into parameterized nodes in a logic topology, so as to construct an operation state multi-dimensional reference model that satisfies the design reference constraints; an operation deviation recognition module, configured to extract a real-time state vector from the multi-modal telemetry data streams, calculate a topology deviation value of the real-time state vector from the operation state multi-dimensional reference model, and perform feature decomposition on the topology deviation value to extract a steady-state deviation component representing hardware failure and a dynamic disturbance component representing timing instability, so as to locate a logic abnormal node in the server cluster; a logic self-healing engine module, configured to calculate, for the logic abnormal node and according to a consistency objective function minimization rule, a compensation parameter set that brings the global logic state to a steady state, to adjust the memory allocation weight and the request distribution step size of the associated nodes using the compensation parameter set, and to make the logic operation layer of the server cluster satisfy the design reference constraints through iterative parameter convergence; and an operation state consistency auditing module, configured to perform a feature spectrum distribution audit on the reconstructed logic spectrum using the Laplacian operator, and to verify the consistency of the real-time feature spectrum with the original reference feature spectrum, so as to confirm that the service logic link has been restored to the reference operation state.
  2. The intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion according to claim 1, wherein, when constructing the operation state multi-dimensional reference model, the operation state reference model construction module performs a feature extraction operation to define the processor utilization rate as the flow-dimension feature of the logic topology and the memory allocation gradient as the load-dimension feature of the logic topology, and orthogonalizes the flow-dimension feature and the load-dimension feature using the design reference constraints.
  3. The intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion according to claim 1, wherein, when performing feature decomposition, the operation deviation recognition module constructs a deviation covariance matrix, identifies a principal component of the topology deviation value that exhibits monotonic evolution as the steady-state deviation component, and defines the high-frequency component of the topology deviation value as the dynamic disturbance component.
  4. The intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion according to claim 1, wherein the logic self-healing engine module comprises a policy driving sub-module configured to retrieve a repair instruction sequence from a fault handling knowledge graph according to the hierarchy level of the logic abnormal node, the repair instruction sequence comprising a logic parameter reset instruction, a business process migration instruction and a device logic state change instruction.
  5. The intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion according to claim 1, wherein, when performing the feature spectrum distribution audit, the operation state consistency auditing module determines the audit result by calculating the generalized Euclidean distance between the real-time feature spectrum and the original reference feature spectrum, the determination rule following the formula: $D = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2} \le \varepsilon$, wherein $D$ is the generalized Euclidean distance, $n$ is the total number of feature dimensions, $w_i$ is the weight coefficient of the $i$-th dimensional feature, $x_i$ is the $i$-th feature value in the real-time feature spectrum, $y_i$ is the $i$-th feature value in the original reference feature spectrum, and $\varepsilon$ is the preset consistency determination threshold.
  6. The intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion according to claim 4, wherein the logic self-healing engine module further comprises an execution monitoring sub-module configured to trigger a secondary feature comparison operation after the repair instruction sequence has been executed, and to calculate the reduction in the overall connection complexity of the server cluster after repair, so as to verify that the topology deviation value has been eliminated.
  7. The intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion according to claim 1, wherein the multi-modal telemetry data acquisition module is further configured to acquire system log alarm information and to perform text vectorization on the system log alarm information, the result serving as an input variable for root cause determination by the operation deviation recognition module.
  8. The intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion according to claim 1, wherein, when performing the adjustment operation, the logic self-healing engine module introduces a parameter compensation operator into a second-order constraint equation to coordinate the load weights of the parameterized nodes, so as to eliminate secondary logic oscillation caused by local parameter changes.
  9. The intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion according to claim 1, wherein the operation state reference model construction module further comprises a semantic extraction unit configured to perform path recognition on the system logic description file using a text recognition algorithm, so as to determine the logic boundary of each parameterized node in the design reference constraints.
  10. The intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion according to claim 1, wherein the operation state consistency auditing module further comprises a link verification unit configured to measure the response latency of the service logic link by means of simulated service requests, based on the result of the feature spectrum distribution audit.
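
The consistency test of claim 5 compares a real-time feature spectrum with a reference spectrum through a weighted distance and a threshold. The sketch below assumes the common weighted Euclidean form (square root of the weighted sum of squared differences); the claim text does not confirm the exact expression, and the function names and sample values are hypothetical.

```python
import math

def weighted_distance(realtime, reference, weights):
    """Weighted Euclidean distance between two feature spectra.

    Assumed form: sqrt(sum_i w_i * (x_i - y_i)**2).
    """
    if not (len(realtime) == len(reference) == len(weights)):
        raise ValueError("spectra and weights must have the same length")
    return math.sqrt(
        sum(w * (x - y) ** 2 for w, x, y in zip(weights, realtime, reference))
    )

def is_consistent(realtime, reference, weights, threshold):
    """Audit passes when the distance does not exceed the preset threshold."""
    return weighted_distance(realtime, reference, weights) <= threshold

# Hypothetical two-dimensional spectra with unit weights.
distance = weighted_distance([3.0, 0.0], [0.0, 4.0], [1.0, 1.0])
passed = is_consistent([3.0, 0.0], [0.0, 4.0], [1.0, 1.0], threshold=5.0)
```

With unit weights the measure reduces to the ordinary Euclidean distance; raising a weight makes the audit more sensitive to drift in that feature dimension.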

Description

Intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion

Technical Field

The invention belongs to the technical field of server cluster operation and maintenance risk assessment, and particularly relates to an intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion.

Background

In a large-scale dynamic cluster environment, traditional risk handling mechanisms mainly calibrate the parameters of local components by calling fixed repair scripts. However, the cluster contains highly complex topological dependencies, and a repair action aimed at a local anomaly often generates unquantified logic stress because it ignores the global constraints of the service topology. Such isolated repair actions easily upset the existing resource balance of the system and introduce nonlinear disturbance into the service link, leading to secondary logic oscillation or network-wide topology collapse.

At present, the industry tries to improve self-healing accuracy by expanding expert experience rules or by introducing monitoring indicators in more dimensions. Analysis shows that these approaches remain static pattern matching in nature and cannot calculate in real time the geometric deviation between the operating features and the initial design manifold. When the operating state features undergo dynamic distortion, existing schemes lack the ability to regress dynamically toward global logic consistency, so the repair process is uncertain. The contradiction between local repair effectiveness and global stability has become a fundamental problem restricting the management and control efficiency of large-scale server clusters.

For example, Chinese patent publication No. CN112988444A discloses a processing method for server cluster fault diagnosis in which, when automatic diagnosis fails, automatic reporting and work order distribution are realized by keyword matching. That technology is essentially workflow compensation based on known pattern matching. It focuses on closed-loop management of the diagnosis workflow rather than on dynamic decoupling of fault root causes at the global topology manifold level. Faced with dynamic distortion caused by high-frequency service load switching, it lacks the ability to distinguish steady-state failure from dynamic disturbance, has difficulty using the geometric deviation between the real-time operating features and the initial design manifold to drive parameter adjustment, and can hardly ensure that the logic state regresses to the preset global logic consistency.

Therefore, how to realize intelligent risk diagnosis and self-healing instruction orchestration under global topology constraints, so that the system state can return to a preset logic steady state after being disturbed, is the technical problem to be solved by the invention.

Disclosure of Invention

The invention provides an intelligent server fault diagnosis and logic self-healing system based on AI multi-modal feature fusion, which comprises:

a multi-modal telemetry data acquisition module, configured to acquire multi-modal telemetry data streams of a target server cluster, wherein the multi-modal telemetry data streams comprise processor utilization rate, memory allocation gradient and network connection relation complexity;

an operation state reference model construction module, configured to perform semantic analysis on a system architecture configuration document to extract design reference constraints, and to map the processor utilization rate, the memory allocation gradient and the network connection relation complexity into parameterized nodes in a logic topology, so as to construct an operation state multi-dimensional reference model that satisfies the design reference constraints;

an operation deviation recognition module, configured to extract a real-time state vector from the multi-modal telemetry data streams, calculate a topology deviation value of the real-time state vector from the operation state multi-dimensional reference model, and perform feature decomposition on the topology deviation value to extract a steady-state deviation component representing hardware failure and a dynamic disturbance component representing timing instability, so as to locate a logic abnormal node in the server cluster;

a logic self-healing engine module, configured to calculate, for the logic abnormal node and according to a consistency objective function minimization rule, a compensation parameter set that brings the global logic state to a steady state, to adjust the memory allocation weight and the request distribution step size of the associated nodes using the compensation parameter set, and to make the logic operation layer of the server cluster meet de