CN-121979724-A - Method and device for processing process faults in distributed training
Abstract
An embodiment of the invention provides a method and a device for processing process faults in distributed training, applied to the technical field of distributed training. The method comprises: monitoring the state of each process during distributed training of a model; and, in response to at least one first process of a first communication group being in a fault state, sending a first instruction to at least one second process of the first communication group that is in a healthy state, wherein the first instruction instructs the second process to continue the distributed training based on a second communication group, the second communication group being different from the first communication group. The embodiment thereby handles process faults in distributed training by reconstructing the communication group, improving the fault handling effect.
Inventors
- Request for anonymity
- Request for anonymity
- Request for anonymity
- Request for anonymity
- Request for anonymity
- Request for anonymity
Assignees
- Moore Threads Intelligent Technology (Beijing) Co., Ltd. (摩尔线程智能科技(北京)股份有限公司)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-12
Claims (18)
- 1. A method for process failure handling in distributed training, the method comprising: monitoring the state of each process during distributed training of a model; and, in response to at least one first process of a first communication group being in a fault state, sending a first instruction to at least one second process of the first communication group that is in a healthy state, wherein the first instruction instructs the second process to continue the distributed training based on a second communication group, the second communication group being different from the first communication group.
- 2. The method of claim 1, wherein after sending the first instruction to at least one second process in a healthy state in the first communication group, the method further comprises: sending a second parameter slicing strategy to the second process, wherein the second parameter slicing strategy is used by the second process to perform parameter slicing again to obtain a second parameter; and sending a second instruction to the second process, wherein the second instruction instructs the second process to continue the distributed training based on the second communication group and the second parameter.
- 3. The method of claim 2, further comprising: determining the second parameter slicing strategy according to the number of second processes.
- 4. The method of claim 3, wherein determining the second parameter slicing strategy according to the number of second processes comprises: determining global parameters of the distributed training; and performing parameter slicing again on the global parameters according to the number of second processes to obtain the second parameter slicing strategy, wherein the second parameter slicing strategy at least causes the parameter blocks previously assigned to the first process to be redistributed to the second processes.
- 5. The method of any of claims 1-4, wherein monitoring the state of each process during distributed training of the model comprises: receiving a notification message sent by a monitoring agent during the distributed training of the model; and determining the first process in the fault state according to the notification message.
- 6. The method of claim 5, wherein the first process is determined by the monitoring agent according to the heartbeat state of each monitored process, and wherein the state of a monitored process is one of: a healthy state, in which the monitored process has no abnormal heartbeat over a plurality of consecutive detection periods; a suspected-fault state, in which the monitored process has a first number of abnormal heartbeats over the plurality of consecutive detection periods; and a fault state, in which the monitored process has a second number of abnormal heartbeats over the plurality of consecutive detection periods, the second number being greater than the first number.
- 7. The method of claim 5, further comprising: obtaining a mapping relation describing the communication group corresponding to each process; and determining the first communication group to which the first process is mapped according to the mapping relation.
- 8. A method for process failure handling in distributed training, the method comprising: receiving a first instruction during distributed training of a model, wherein the first instruction is sent to at least one second process of a first communication group that is in a healthy state when it is detected that at least one first process of the first communication group is in a fault state; and, in response to the first instruction, continuing the distributed training based on a second communication group, the second communication group being different from the first communication group.
- 9. The method of claim 8, further comprising: receiving a second parameter slicing strategy; performing parameter slicing again according to the second parameter slicing strategy to obtain a second parameter; and receiving a second instruction and, in response to the second instruction, continuing the distributed training based on the second communication group and the second parameter.
- 10. The method of claim 9, further comprising: discarding the optimizer state configured based on the first parameter; and reinitializing the optimizer state according to the second parameter.
- 11. The method of claim 9, further comprising: during the continued distributed training, performing collective communication operations based on the second communication group and the second parameter slicing strategy.
- 12. The method of any of claims 9-11, wherein performing parameter slicing again according to the second parameter slicing strategy to obtain the second parameter comprises: determining, from the global parameters of the distributed training and according to the second parameter slicing strategy, the parameter blocks for which the second process is responsible; and determining the second parameter according to those parameter blocks.
- 13. The method of any of claims 9-11, wherein the first instruction carries a communication group identifier of the first communication group and a process identifier list of the second processes, and wherein continuing the distributed training based on the second communication group in response to the first instruction comprises: destroying the first communication group according to its communication group identifier in response to the first instruction; and continuing the distributed training based on the second communication group constructed according to the process identifier list of the second processes.
- 14. An apparatus for process failure handling in distributed training, the apparatus comprising: a state monitoring module configured to monitor the state of each process during distributed training of a model; and a first instruction sending module configured to, in response to at least one first process of a first communication group being in a fault state, send a first instruction to at least one second process of the first communication group that is in a healthy state, wherein the first instruction instructs the second process to continue the distributed training based on a second communication group, the second communication group being different from the first communication group.
- 15. An apparatus for process failure handling in distributed training, the apparatus comprising: a first instruction receiving module configured to receive a first instruction during distributed training of a model, wherein the first instruction is sent to at least one second process of a first communication group that is in a healthy state when at least one first process of the first communication group is in a fault state; and a first instruction response module configured to, in response to the first instruction, continue the distributed training based on a second communication group, the second communication group being different from the first communication group.
- 16. An electronic device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method of any one of claims 1 to 13.
- 17. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of claims 1 to 13.
- 18. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 13.
Description
Method and device for processing process faults in distributed training

Technical Field
The invention relates to the technical field of distributed training, and in particular to a method and a device for processing process faults in distributed training.

Background
In distributed training of a model, multiple processes are generally used to perform parameter update tasks. When a process fault occurs, a fault-tolerant scheme based on periodic checkpoints (Checkpoint) or on duplicate groups (Duplicate Group) is generally adopted, but such fault-tolerant schemes have a poor fault handling effect.

Disclosure of Invention
In view of the above problems, a method and an apparatus for process failure handling in distributed training are provided, including: a method of process failure handling in distributed training, the method comprising: monitoring the state of each process during distributed training of a model; and, in response to at least one first process of a first communication group being in a fault state, sending a first instruction to at least one second process of the first communication group that is in a healthy state, wherein the first instruction instructs the second process to continue the distributed training based on a second communication group, the second communication group being different from the first communication group.

Optionally, after sending the first instruction to at least one second process in a healthy state in the first communication group, the method further comprises: sending a second parameter slicing strategy to the second process, wherein the second parameter slicing strategy is used by the second process to perform parameter slicing again to obtain a second parameter; and sending a second instruction, wherein the second instruction instructs the second process to continue the distributed training based on the second communication group and the second parameter.
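The core flow above — dropping the faulty first processes from the group and instructing the surviving second processes to rebuild a new, different communication group — can be sketched as a minimal pure-Python simulation. The `CommGroup` class, `handle_fault` function, and the dictionary shape of the "first instruction" are illustrative assumptions, not structures defined in the patent:

```python
from dataclasses import dataclass

@dataclass
class CommGroup:
    group_id: str
    members: list  # process ranks belonging to this communication group

def handle_fault(group: CommGroup, states: dict):
    """Given per-rank states ("healthy" or "faulty"), drop the faulty
    ranks (first processes) and build a replacement group from the
    survivors (second processes). Returns the new group and the
    instruction that would be sent to each survivor."""
    faulty = [r for r in group.members if states.get(r) == "faulty"]
    if not faulty:
        return group, None  # no fault: keep the original group
    survivors = [r for r in group.members if r not in faulty]
    new_group = CommGroup(group_id=group.group_id + "-rebuilt",
                          members=survivors)
    # The "first instruction": destroy the old group, rebuild with survivors
    # (cf. claim 13, which has it carry the group id and a rank list).
    instruction = {"destroy": group.group_id, "rebuild_with": survivors}
    return new_group, instruction

g = CommGroup("dp0", [0, 1, 2, 3])
new_g, instr = handle_fault(g, {0: "healthy", 1: "faulty",
                                2: "healthy", 3: "healthy"})
print(new_g.members)     # [0, 2, 3]
print(instr["destroy"])  # dp0
```

In a real framework the instruction would trigger destruction of the old process group and creation of a new one over the surviving ranks; here both are represented as plain data for clarity.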
Optionally, the method further comprises determining the second parameter slicing strategy according to the number of second processes.

Optionally, determining the second parameter slicing strategy according to the number of second processes comprises: determining global parameters of the distributed training; and performing parameter slicing again on the global parameters according to the number of second processes to obtain the second parameter slicing strategy, wherein the second parameter slicing strategy at least causes the parameter blocks previously assigned to the first process to be redistributed to the second processes.

Optionally, monitoring the state of each process during distributed training of the model comprises: receiving a notification message sent by a monitoring agent during the distributed training of the model; and determining the first process in the fault state according to the notification message.

Optionally, the first process is determined by the monitoring agent according to the heartbeat state of each monitored process, and the state of a monitored process is one of the following: a healthy state, in which the monitored process has no abnormal heartbeat over a plurality of consecutive detection periods; a suspected-fault state, in which the monitored process has a first number of abnormal heartbeats over the plurality of consecutive detection periods; and a fault state, in which the monitored process has a second number of abnormal heartbeats over the plurality of consecutive detection periods, the second number being greater than the first number.
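The three heartbeat-derived states (healthy, suspected fault, fault) can be sketched as a small state machine. The concrete thresholds, the reset-on-good-heartbeat behavior, and all names here are illustrative assumptions; the patent only requires that the fault threshold (the "second number") exceed the suspected-fault threshold (the "first number"):

```python
HEALTHY, SUSPECTED, FAULTY = "healthy", "suspected", "faulty"

# Assumed thresholds: "first number" and larger "second number" of
# abnormal heartbeats within consecutive detection periods.
FIRST_THRESHOLD = 2
SECOND_THRESHOLD = 4

def classify(abnormal_count: int) -> str:
    """Map a count of abnormal heartbeats in consecutive detection
    periods to one of the three states."""
    if abnormal_count >= SECOND_THRESHOLD:
        return FAULTY
    if abnormal_count >= FIRST_THRESHOLD:
        return SUSPECTED
    return HEALTHY

class MonitorAgent:
    """Tracks per-process abnormal-heartbeat counts and reports the
    ranks that have entered the fault state (the notification message
    the controller would act on)."""
    def __init__(self):
        self.abnormal = {}

    def record(self, rank: int, heartbeat_ok: bool):
        if heartbeat_ok:
            self.abnormal[rank] = 0  # assumption: a good heartbeat resets
        else:
            self.abnormal[rank] = self.abnormal.get(rank, 0) + 1

    def faulty_ranks(self):
        return [r for r, n in self.abnormal.items() if classify(n) == FAULTY]

agent = MonitorAgent()
for _ in range(4):            # four consecutive missed heartbeats for rank 1
    agent.record(1, False)
agent.record(2, True)
print(agent.faulty_ranks())   # [1]
print(classify(2))            # suspected
```

The intermediate suspected-fault state gives the agent a chance to tolerate transient hiccups before the controller tears down and rebuilds the communication group.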
Optionally, the method further comprises: obtaining a mapping relation describing the communication group corresponding to each process; and determining the first communication group to which the first process is mapped according to the mapping relation.

A method of process failure handling in distributed training, the method comprising: receiving a first instruction during distributed training of a model, wherein the first instruction is sent to at least one second process of a first communication group that is in a healthy state when it is detected that at least one first process of the first communication group is in a fault state; and, in response to the first instruction, continuing the distributed training based on a second communication group, the second communication group being different from the first communication group.

Optionally, the method further comprises: receiving a second parameter slicing strategy; performing parameter slicing again according to the second parameter slicing strategy to obtain a second parameter; and receiving a second instruction and, in response to the second instruction, continuing the distributed training based on the second communication group and the second parameter.
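The re-slicing step described above — repartitioning the global parameters across the surviving second processes so that blocks owned by the failed first process are redistributed, then discarding and reinitializing the optimizer state for the new slice — can be sketched as follows. Even contiguous chunking is an assumed slicing strategy, and zero-initialized state stands in for real optimizer buffers such as momenta:

```python
def reslice(global_params: list, num_survivors: int) -> list:
    """Repartition the flat list of global parameters across the
    surviving processes. Blocks previously owned by failed processes
    are absorbed into the survivors' new slices."""
    n = len(global_params)
    base, extra = divmod(n, num_survivors)
    slices, start = [], 0
    for i in range(num_survivors):
        size = base + (1 if i < extra else 0)  # spread the remainder
        slices.append(global_params[start:start + size])
        start += size
    return slices

def reinit_optimizer_state(param_slice: list) -> list:
    """Discard state tied to the old slice and initialize fresh
    per-parameter state for the new one (zeros as a stand-in)."""
    return [0.0 for _ in param_slice]

params = list(range(10))      # 10 global parameter blocks
old = reslice(params, 4)      # 4 processes before the fault
new = reslice(params, 3)      # 3 survivors after one fault
print([len(s) for s in old])  # [3, 3, 2, 2]
print([len(s) for s in new])  # [4, 3, 3]
```

Every global block appears in exactly one survivor's slice, so subsequent collective communication over the rebuilt group still covers the full parameter set.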