CN-121996452-A - Abnormal node detection method, device, electronic equipment and storage medium

Abstract

The disclosure provides an abnormal node detection method, an abnormal node detection device, electronic equipment and a storage medium, and relates to the technical field of computers. The method comprises: before training starts, performing first silent data corruption detection on each of a plurality of original computing nodes in a distributed training system, and determining a plurality of target computing nodes to participate in training according to the first detection results obtained; during the distributed training process, performing second silent data corruption detection on each of the target computing nodes, identifying abnormal computing nodes according to the second detection results obtained, and removing the abnormal computing nodes from the target computing nodes; and obtaining training state data corresponding to the abnormal computing nodes, and recovering the distributed training process based on the training state data. By screening nodes before training and detecting continuously during training, the method and the device remove and replace abnormal computing nodes in a timely manner, thereby ensuring the stability and reliability of the distributed training process.

Inventors

Request for anonymity
Request for anonymity

Assignees

摩尔线程智能科技(北京)股份有限公司

Dates

Publication Date: 2026-05-08
Application Date: 2025-12-18

Claims (14)

1. An abnormal node detection method, comprising: before training starts, performing first silent data corruption detection on each of a plurality of original computing nodes in a distributed training system to obtain first detection results of the plurality of original computing nodes, and determining, according to the plurality of first detection results, a plurality of target computing nodes participating in training; during the distributed training process, performing second silent data corruption detection on each of the plurality of target computing nodes to obtain second detection results of the plurality of target computing nodes; identifying an abnormal computing node according to the plurality of second detection results, and removing the abnormal computing node from the plurality of target computing nodes; and acquiring training state data corresponding to the abnormal computing node, and recovering the distributed training process based on the training state data.
2. The abnormal node detection method of claim 1, wherein the first silent data corruption detection comprises aggregate communication detection, first matrix multiplication detection, and model training detection; and wherein performing the first silent data corruption detection on each of the plurality of original computing nodes in the distributed training system to obtain the first detection results of the plurality of original computing nodes comprises: for any original computing node in the distributed training system, performing the aggregate communication detection on the original computing node to obtain a first detection sub-result; performing the first matrix multiplication detection on the original computing node to obtain a second detection sub-result; and, in a case where the first detection sub-result and the second detection sub-result both satisfy a preset detection passing condition, performing the model training detection on the original computing node based on a preset model training task to obtain the first detection result.
3. The abnormal node detection method according to claim 2, wherein performing the aggregate communication detection on the original computing node to obtain the first detection sub-result comprises: performing an aggregation operation on data to be aggregated of the original computing node to obtain first aggregate data; transmitting the first aggregate data to other original computing nodes and receiving second aggregate data transmitted by the other original computing nodes, wherein the other original computing nodes are the original computing nodes in the distributed training system other than said original computing node; and comparing the first aggregate data and the received second aggregate data against a preset matching condition to obtain the first detection sub-result.
4. The abnormal node detection method according to claim 2, wherein performing the first matrix multiplication detection on the original computing node to obtain the second detection sub-result comprises: performing, based on a preset first matrix, a matrix multiplication calculation on the original computing node to obtain a first matrix multiplication result; and comparing the first matrix multiplication result with a first preset calculation result to obtain the second detection sub-result.
5. The abnormal node detection method according to claim 2, wherein performing the model training detection on the original computing node based on the preset model training task to obtain the first detection result comprises: running, based on preset training data and a preset training step, the preset model training task on the original computing node to obtain a training result; and comparing the training result with a preset reference result to obtain the first detection result.
6. The abnormal node detection method according to claim 2, characterized in that the method further comprises: determining that silent data corruption exists on the original computing node in a case where any one of the first detection sub-result, the second detection sub-result and the first detection result does not satisfy the preset detection passing condition; removing the original computing node from the distributed training system, and selecting a first candidate computing node from standby computing resources to join the distributed training system; and re-executing the first silent data corruption detection on the first candidate computing node.
7. The abnormal node detection method of claim 1, wherein the distributed training system employs pipeline parallel training, and the second silent data corruption detection comprises second matrix multiplication detection; wherein performing the second silent data corruption detection on each of the plurality of target computing nodes to obtain the second detection results of the plurality of target computing nodes comprises: for any one target computing node of the plurality of target computing nodes, performing the second matrix multiplication detection on the target computing node in a pipeline idle period corresponding to the target computing node to obtain the second detection result; and wherein performing the second matrix multiplication detection on the target computing node to obtain the second detection result comprises: performing, based on a preset second matrix, a matrix multiplication calculation on the target computing node to obtain a second matrix multiplication result; and comparing the second matrix multiplication result with a second preset calculation result to obtain the second detection result.
8. The abnormal node detection method according to claim 1, characterized in that the method further comprises: acquiring running state information of any one target computing node of the plurality of target computing nodes; and comparing the running state information with a preset state threshold, and determining, according to the comparison result, whether the target computing node is an abnormal computing node; wherein the running state information includes at least one of an operating temperature, an operating voltage, and an operating frequency of the target computing node.
9. The abnormal node detection method according to claim 1, wherein performing, during the distributed training process, the second silent data corruption detection on each of the plurality of target computing nodes to obtain the second detection results of the plurality of target computing nodes comprises: for any one target computing node of the plurality of target computing nodes, performing the second silent data corruption detection on the target computing node a plurality of times in different training periods to obtain a plurality of second detection results of the target computing node; and wherein identifying the abnormal computing node according to the plurality of second detection results comprises: for any target computing node in the distributed training system, determining an abnormal proportion of the target computing node according to the second detection results of the target computing node and the number of the second detection results; and identifying the target computing node as an abnormal computing node in a case where the abnormal proportion exceeds a preset threshold.
10. The abnormal node detection method according to claim 8 or 9, wherein acquiring the training state data corresponding to the abnormal computing node and recovering the distributed training process based on the training state data comprises: determining a training rollback position of the distributed training system according to the training state data; and recovering, at the training rollback position, the distributed training process based on the training state data.
11. The abnormal node detection method according to claim 1, wherein after identifying the abnormal computing node according to the plurality of second detection results and removing the abnormal computing node from the plurality of target computing nodes, the method further comprises: selecting a second candidate computing node from standby computing resources to replace the abnormal computing node; and synchronizing the training state data to the second candidate computing node and joining the second candidate computing node to the distributed training system.
12. An abnormal node detection apparatus, comprising: a first detection module, configured to perform, before training starts, first silent data corruption detection on each of a plurality of original computing nodes in a distributed training system to obtain first detection results of the plurality of original computing nodes, and to determine, according to the plurality of first detection results, a plurality of target computing nodes participating in training; a second detection module, configured to perform, during the distributed training process, second silent data corruption detection on each of the plurality of target computing nodes to obtain second detection results of the plurality of target computing nodes; a node removal module, configured to identify an abnormal computing node according to the plurality of second detection results and to remove the abnormal computing node from the plurality of target computing nodes; and a training recovery module, configured to acquire training state data corresponding to the abnormal computing node and to recover the distributed training process based on the training state data.
13. An electronic device, comprising: a processing unit; and a storage unit configured to store executable instructions of the processing unit; wherein the processing unit is configured to perform the abnormal node detection method of any one of claims 1 to 11 via execution of the executable instructions.
14. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processing unit, implements the abnormal node detection method of any one of claims 1 to 11.
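
The matrix multiplication detection of claims 4 and 7 and the abnormal-proportion rule of claim 9 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `reference_matmul`, `matmul_check`, `faulty_matmul`, `abnormal_proportion` and `is_abnormal`, the tolerance parameter, the example matrices and the 0.25 threshold are all assumptions made for illustration. A real node would run the product on its accelerator; here a plain function stands in for it.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def reference_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product, standing in for a healthy node."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def faulty_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Hypothetical faulty node: one output element is silently off by one."""
    result = reference_matmul(a, b)
    result[0][0] += 1
    return result


def matmul_check(node_matmul: Callable[[Matrix, Matrix], Matrix],
                 a: Matrix, b: Matrix, expected: Matrix,
                 tol: float = 0.0) -> bool:
    """Multiply the preset matrices on the node and compare with the preset result."""
    actual = node_matmul(a, b)
    if len(actual) != len(expected):
        return False
    return all(
        len(row_actual) == len(row_expected)
        and all(abs(x - y) <= tol for x, y in zip(row_actual, row_expected))
        for row_actual, row_expected in zip(actual, expected)
    )


def abnormal_proportion(results: Sequence[bool]) -> float:
    """Fraction of failed checks among all checks recorded for one node."""
    if not results:
        return 0.0
    return sum(1 for passed in results if not passed) / len(results)


def is_abnormal(results: Sequence[bool], threshold: float) -> bool:
    """A node is flagged only when its failure proportion exceeds the threshold."""
    return abnormal_proportion(results) > threshold


# Preset inputs and preset calculation result (illustrative values).
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
EXPECTED = [[19.0, 22.0], [43.0, 50.0]]

# Four checks taken in different training periods: two pass, two fail.
history = [
    matmul_check(reference_matmul, A, B, EXPECTED),
    matmul_check(faulty_matmul, A, B, EXPECTED),
    matmul_check(reference_matmul, A, B, EXPECTED),
    matmul_check(faulty_matmul, A, B, EXPECTED),
]
flagged = is_abnormal(history, threshold=0.25)
```

Judging a node on the proportion of failed checks, rather than on a single failure, is what keeps one transient mismatch from evicting it. The strict `>` comparison mirrors the claim wording "exceeds a preset threshold".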

Description

Abnormal node detection method, device, electronic equipment and storage medium

Technical Field

The disclosure relates to the technical field of computers, and in particular to an abnormal node detection method, an abnormal node detection device, electronic equipment and a storage medium.

Background

With the rapid growth in the scale of artificial intelligence models, large-scale distributed training has become an important foundation for building complex models. In such training tasks, a large number of computing nodes run in parallel under high load for long periods, and each computing node depends on hardware stability and communication correctness when performing tensor computation, gradient communication and parameter synchronization.

However, in the related art, when part of the hardware or a link experiences a sporadic, slight abnormality, no explicit error is raised. The small computational deviations that the system ignores may accumulate across training iterations and lead to abnormal training results or failure of the model to converge. Because such errors have no distinct signature when they occur, conventional detection mechanisms have difficulty identifying them in time, and the errors propagate unnoticed through the training system.

Therefore, a new detection scheme is needed to reduce the risk of hidden computational deviations accumulating during distributed training, so as to improve the stability and reliability of the whole training process.

It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention

The disclosure aims to provide an abnormal node detection method, an abnormal node detection device, electronic equipment and a storage medium, which remove and replace abnormal computing nodes in a timely manner through screening before training and continuous detection during training, thereby ensuring the stability and reliability of the distributed training process.

According to a first aspect of the present disclosure, there is provided an abnormal node detection method, including: before training starts, performing first silent data corruption detection on each of a plurality of original computing nodes in a distributed training system to obtain first detection results of the plurality of original computing nodes, and determining, according to the plurality of first detection results, a plurality of target computing nodes participating in training; during the distributed training process, performing second silent data corruption detection on each of the plurality of target computing nodes to obtain second detection results of the plurality of target computing nodes; identifying an abnormal computing node according to the plurality of second detection results, and removing the abnormal computing node from the plurality of target computing nodes; and acquiring training state data corresponding to the abnormal computing node, and recovering the distributed training process based on the training state data.
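
The overall flow of the first aspect (screen before training, detect during training, remove the abnormal node, then recover from saved state) can be sketched in Python as follows. This is a simplified simulation under stated assumptions, not the disclosed implementation: the `Node` class, its `healthy` flag, the `sdc_check` placeholder, the list of spare nodes and the `checkpoint_step` integer are all hypothetical stand-ins for real hardware checks, standby computing resources and training state data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    name: str
    healthy: bool = True  # hypothetical flag standing in for real hardware behaviour


def sdc_check(node: Node) -> bool:
    """Placeholder silent data corruption check; True means the node passed."""
    return node.healthy


def screen_nodes(original: List[Node], spares: List[Node],
                 check: Callable[[Node], bool] = sdc_check) -> List[Node]:
    """Pre-training screening: keep passing nodes, replace failing ones from spares."""
    targets: List[Node] = []
    for node in original:
        candidate: Optional[Node] = node
        # A replacement is itself checked before it is admitted.
        while candidate is not None and not check(candidate):
            candidate = spares.pop(0) if spares else None
        if candidate is not None:
            targets.append(candidate)
    return targets


def detect_and_recover(targets: List[Node], spares: List[Node],
                       checkpoint_step: int,
                       check: Callable[[Node], bool] = sdc_check
                       ) -> Tuple[List[Node], Optional[int]]:
    """In-training pass: drop failing nodes, add spares, return the rollback step."""
    abnormal = [n for n in targets if not check(n)]
    if not abnormal:
        return targets, None  # nothing to recover from
    survivors = [n for n in targets if check(n)]
    for _ in abnormal:
        if spares:
            survivors.append(spares.pop(0))
    # Training resumes from the last saved state (assumed here to be a step number).
    return survivors, checkpoint_step


# Example run: n1 fails screening, n2 degrades after training has started.
spares = [Node("s0"), Node("s1")]
targets = screen_nodes([Node("n0"), Node("n1", healthy=False), Node("n2")], spares)
screened_names = [n.name for n in targets]

targets[2].healthy = False  # simulate corruption appearing mid-training
targets, rollback = detect_and_recover(targets, spares, checkpoint_step=100)
final_names = [n.name for n in targets]
```

In this sketch the replacement drawn during training is admitted without a further check, whereas a replacement drawn before training is re-screened; how a production system treats spare nodes is a design choice the sketch does not settle.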
In one exemplary embodiment of the present disclosure, the first silent data corruption detection comprises aggregate communication detection, first matrix multiplication detection, and model training detection; and performing the first silent data corruption detection on each of the plurality of original computing nodes in the distributed training system to obtain the first detection results of the plurality of original computing nodes comprises: for any original computing node in the distributed training system, performing the aggregate communication detection on the original computing node to obtain a first detection sub-result; performing the first matrix multiplication detection on the original computing node to obtain a second detection sub-result; and, in a case where the first detection sub-result and the second detection sub-result both satisfy a preset detection passing condition, performing the model training detection on the original computing node based on a preset model training task to obtain the first detection result.

In an exemplary embodiment of the present disclosure, performing the aggregate communication detection on the original computing node to obtain the first detection sub-result comprises: performing an aggregation operation on data to be aggregated of the original computing node to obtain first aggregate data; transmitting the first aggregate data to other original computing nodes and receiving second aggregate data transmitted by the other original computing nodes, wherein the other original computing nodes are the original computing nodes in the distributed training system other than said original computing node; and comparing the first aggregate data and the received second aggregate data with a preset m