CN-121996455-A - GPU server fault pre-diagnosis system and self-healing method
Abstract
The invention provides a GPU server fault pre-diagnosis system and self-healing method. The method comprises: periodically collecting operation data of the GPU; computing window averages of the SM utilization and power draw, a window cumulative value of the ECC error count, and the power-draw change slope; constructing a state vector from the operation data and normalizing it to obtain the running state vector; inputting the running state vector into a pre-trained autoencoder model and computing the mean square error between the state vector and the reconstructed vector as the reconstruction error; adding a regularization term to the reconstruction error to obtain a risk score; computing a benefit score for each repair action based on the risk score, the operation data, and environmental context information; and executing the selected repair action, wherein the repair actions comprise holding, frequency reduction, migration, and reset, and updating the node state flag according to the execution result.
Inventors
- Zhou Kengran
- Yang Hongdong
Assignees
- Shangyang Technology Co., Ltd. (尚阳科技股份有限公司)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-01-13
Claims (10)
- 1. A GPU server fault pre-diagnosis and self-healing method, the method comprising: periodically collecting operation data of the GPU, wherein the operation data comprises SM utilization, ECC error count, power-draw readings, and accumulated running time; computing window averages of the SM utilization and power draw, a window cumulative value of the ECC error count, and the power-draw change slope; constructing a state vector from the operation data and normalizing it to obtain the running state vector; inputting the running state vector into a pre-trained autoencoder model and computing the mean square error between the state vector and the reconstructed vector as the reconstruction error; adding a regularization term to the reconstruction error to obtain a risk score; computing a benefit score for each repair action based on the risk score, the operation data, and environmental context information, wherein the benefit score is a weighted combination of the risk score, task importance, resource margin, and action cost; and executing the selected repair action, wherein the repair actions comprise holding, frequency reduction, migration, and reset, and updating the node state flag according to the execution result.
- 2. The GPU server fault pre-diagnosis and self-healing method according to claim 1, wherein the periodic collection specifically comprises setting a main acquisition period and, within that period, sampling at shorter time intervals, thereby forming multiple sampled-data sequences within a time window.
- 3. The GPU server fault pre-diagnosis and self-healing method according to claim 1, wherein the autoencoder model has a symmetric neural network structure, the encoder portion progressively compresses the input state vector into a low-dimensional latent space, the decoder portion progressively reconstructs the original vector from the latent representation, and both the encoder and the decoder are implemented as multi-layer fully connected neural networks.
- 4. The GPU server fault pre-diagnosis and self-healing method according to claim 1, wherein the environmental context information specifically comprises a task importance score provided by the cluster task scheduling system, identifying how critical the task currently running on the GPU is, and a load margin assessment value provided by the resource management platform, reflecting the schedulable computing resources currently remaining on the corresponding GPU node.
- 5. The GPU server fault pre-diagnosis and self-healing method according to claim 1, wherein, in the benefit score calculation, the factor weights are configured so that the risk score carries the dominant weight, task importance a secondary weight, and the resource margin an auxiliary weight, embodying a decision logic of risk priority tempered by business impact.
- 6. The GPU server fault pre-diagnosis and self-healing method according to claim 1, wherein the action costs are set according to the potential impact of each operation on system continuity and task stability, ordered from high to low as reset, migration, frequency reduction, and holding, with the holding operation incurring no cost.
- 7. The method according to claim 1, wherein the repair actions are executed by calling the corresponding system interfaces: the frequency-reduction operation adjusts the core clock by calling the device management command-line tool provided by the GPU vendor, the migration operation evicts and reschedules tasks through the application programming interface of a container orchestration platform or job scheduling system, and the reset operation performs a soft reset through the device control interface exposed by the GPU driver layer.
- 8. The GPU server fault pre-diagnosis and self-healing method according to claim 1, wherein the node state flag is updated as follows: if the selected action is holding, the flag is set to idle; if the action executes successfully, the flag is set to executed; if the action fails during execution due to a system cause, the flag is set to failed; and if the action is not executed due to external conditions, the flag is set to skipped; the node state flag is used to decide whether a retry or an escalation is required in a subsequent cycle.
- 9. The GPU server fault pre-diagnosis and self-healing method according to claim 1, wherein applying the risk score comprises setting grading thresholds: when the score is below a first threshold the GPU is considered healthy; when the score lies between the first and second thresholds the GPU is judged to be in a sub-health state requiring attention; and when the score exceeds the second threshold the GPU is judged to be at high risk and the repair mechanism must be triggered with priority.
- 10. A GPU server fault pre-diagnosis and self-healing system, the system comprising: a data acquisition module for periodically collecting operation data of the GPU, wherein the operation data comprises SM utilization, ECC error count, power-draw readings, and accumulated running time, and for computing window averages of the SM utilization and power draw, a window cumulative value of the ECC error count, and the power-draw change slope; a state vector construction module for building a state vector from the operation data and normalizing it to obtain the running state vector; a risk assessment module for inputting the running state vector into a pre-trained autoencoder model, computing the mean square error between the state vector and the reconstructed vector as the reconstruction error, and adding a regularization term to obtain the risk score; an action generation module for computing a benefit score for each repair action based on the risk score, the operation data, and environmental context information, wherein the benefit score is a weighted combination of the risk score, task importance, resource margin, and action cost; and an action execution module for executing the selected repair action, wherein the repair actions comprise holding, frequency reduction, migration, and reset, and for updating the node state flag according to the execution result.
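As an illustrative, non-normative sketch of the monitoring and risk-scoring chain in claims 1, 3, and 9: the patent specifies no concrete formulas or constants, so the min-max normalization bounds, the single-layer encoder/decoder stand-in (the patent describes a deeper symmetric multi-layer perceptron), and the threshold values below are all assumptions.

```python
import numpy as np

def extract_features(sm_util, power, ecc_errors):
    """Claim-1 features for one sampling window: window averages of SM
    utilization and power draw, the window cumulative ECC error count,
    and the power-draw slope (least-squares fit over sample index)."""
    t = np.arange(len(power))
    return np.array([
        np.mean(sm_util),            # SM utilization, window average
        np.mean(power),              # power draw, window average
        np.sum(ecc_errors),          # ECC errors, window cumulative value
        np.polyfit(t, power, 1)[0],  # power-draw change slope
    ])

def normalize(vec, lo, hi):
    """Min-max normalization into [0, 1]; lo/hi are per-feature
    calibration bounds (hypothetical, not specified in the patent)."""
    return np.clip((vec - lo) / (hi - lo), 0.0, 1.0)

def reconstruct(x, w_enc, w_dec):
    """Stand-in for the pre-trained autoencoder of claim 3: one encoder
    layer compressing into the latent space and one symmetric decoder
    layer (the patent describes a deeper symmetric MLP)."""
    z = np.maximum(w_enc @ x, 0.0)   # encode: project to latent space
    return w_dec @ z                 # decode: rebuild the state vector

def risk_score(x, x_hat, reg_term=0.0):
    """Reconstruction MSE plus a regularization term, per claim 1."""
    return float(np.mean((x - x_hat) ** 2)) + reg_term

def classify(score, t1, t2):
    """Claim-9 two-threshold grading of the risk score."""
    if score < t1:
        return "healthy"
    if score <= t2:
        return "sub-health"
    return "high-risk"
```

In a deployment, the trained weights would come from fitting the autoencoder on healthy-state vectors, so that degraded states reconstruct poorly and push the score past the sub-health or high-risk threshold.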
Description
GPU server fault pre-diagnosis system and self-healing method

Technical Field

The invention belongs to the technical field of computer equipment, and in particular relates to a GPU server fault pre-diagnosis system and self-healing method.

Background

With the rapid development of artificial intelligence, big data, and high-performance computing, GPU servers have become the computing core of scientific research, cloud computing, and enterprise data centers. Under sustained high-load operation, a GPU server faces latent risks such as power-draw fluctuation, abnormal thermal control, hardware aging, and memory errors. These risks often trigger no immediate system error; they accumulate gradually and eventually cause serious faults. Once a GPU node becomes abnormal, the failure is typically sudden, fast-propagating, and hard to localize, easily interrupting training tasks or paralyzing the serving system.

Existing GPU server fault handling relies mainly on manual investigation by operations personnel, aided by log analysis, temperature monitoring, or hardware self-test tools. This approach responds slowly and, in multi-GPU parallel and cross-platform deployments where fault features couple in complex ways, readily produces misjudgments and missed detections. Some platforms introduce threshold-based automated alerting, but because fixed thresholds adapt poorly to different tasks and hardware environments, they usually fail to recognize a GPU in an early degraded "sub-health" state, and they lack a closed loop covering the full fault-handling chain.

Current technology therefore offers no complete scheme combining early pre-diagnosis, fine-grained repair, and system self-healing for GPU servers, and falls short of the stability and continuity requirements of high-performance computing environments.

Disclosure of Invention

The invention aims to provide a GPU server fault pre-diagnosis system and self-healing method that can identify the sub-health state of a GPU, adaptively generate suitable repair paths across deployment environments and task types, and improve operational stability and overall recovery capability through a state-flag mechanism, thereby achieving engineered, intelligent management of GPU servers from pre-diagnosis through self-healing.

To achieve the above object, in a first aspect the invention provides a GPU server fault pre-diagnosis and self-healing method, the method comprising: periodically collecting operation data of the GPU, wherein the operation data comprises SM utilization, ECC error count, power-draw readings, and accumulated running time; computing window averages of the SM utilization and power draw, a window cumulative value of the ECC error count, and the power-draw change slope; constructing a state vector from the operation data and normalizing it to obtain the running state vector; inputting the running state vector into a pre-trained autoencoder model and computing the mean square error between the state vector and the reconstructed vector as the reconstruction error; adding a regularization term to the reconstruction error to obtain a risk score; computing a benefit score for each repair action based on the risk score, the operation data, and environmental context information, wherein the benefit score is a weighted combination of the risk score, task importance, resource margin, and action cost; and executing the selected repair action, wherein the repair actions comprise holding, frequency reduction, migration, and reset, and updating the node state flag according to the execution result.

Further, the periodic collection specifically comprises setting a main acquisition period and, within that period, sampling at shorter time intervals, thereby forming multiple sampled-data sequences within a time window.

Further, the autoencoder model has a symmetric neural network structure: its encoder portion progressively compresses the input state vector into a low-dimensional latent space, its decoder portion progressively reconstructs the original vector from the latent representation, and both the encoder and the decoder are implemented as multi-layer fully connected neural networks.

Further, the environmental context information specifically comprises a task importance score provided by the cluster task scheduling system, identifying the criticality of the task currently running on the GPU, and a load margin assessment value provided by the resource management platform, reflecting the schedulable computing resources currently remaining on the corresponding GPU node.

Further, in the benefit score calculation, the factor weights are configured according to the following rules: the risk score carries the dominant weight, task importance a secondary weight, and the resource margin an auxiliary weight.
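The benefit-score weighting and state-flag rules of claims 5, 6, and 8 can be sketched as follows. The patent fixes only orderings, not numbers: the weight values, the action costs, the per-action effectiveness table, and the action and flag identifiers below are illustrative assumptions.

```python
# Hypothetical cost/effect tables. Claim 6 fixes only the cost ORDER
# (reset > migration > frequency reduction > holding, holding free);
# the concrete numbers, and the effectiveness values, are assumptions.
ACTION_COST = {"hold": 0.0, "downclock": 0.2, "migrate": 0.5, "reset": 0.9}
ACTION_EFFECT = {"hold": 0.0, "downclock": 0.4, "migrate": 0.7, "reset": 1.0}

def benefit(action, risk, importance, margin,
            w_risk=0.6, w_task=0.3, w_margin=0.1, w_cost=0.3):
    """Weighted combination per claims 1 and 5: risk dominant, task
    importance secondary, resource margin auxiliary. The cost term
    penalizes disruptive actions, scaled up for important tasks."""
    gain = (w_risk * risk + w_margin * margin) * ACTION_EFFECT[action]
    penalty = (w_task * importance + w_cost) * ACTION_COST[action]
    return gain - penalty

def select_action(risk, importance, margin):
    """Pick the repair action with the highest benefit score."""
    return max(ACTION_COST, key=lambda a: benefit(a, risk, importance, margin))

def update_flag(action, outcome):
    """Claim-8 node state flag: 'idle' when holding; otherwise
    'executed' on success, 'failed' on a system error, and 'skipped'
    when external conditions block execution."""
    if action == "hold":
        return "idle"
    return {"ok": "executed", "error": "failed", "blocked": "skipped"}[outcome]
```

With these assumed numbers, a high-risk node running an unimportant task selects a reset, a low-risk node with a critical task holds, and a mid-risk node with a critical task settles on down-clocking, matching the risk-first, business-aware logic of claim 5.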