CN-122027434-A - Collective communication method, collective communication system and related device


Abstract

An embodiment of the present application discloses a collective communication method applied to a collective communication system. The collective communication system comprises a management node and N computing nodes, the N computing nodes are configured to process a first communication task in parallel, and N is an integer greater than 1. The method comprises: the management node receives first information indicating that at least one first link has failed, each first link being a communication link among the N computing nodes that is used for processing the first communication task; the management node receives N pieces of second information from the N computing nodes, each piece of second information indicating whether the original data of the first communication task stored locally by one computing node has been changed; and if the original data of none of the N computing nodes has been changed, the management node sends third information to the N computing nodes, the third information instructing the N computing nodes to reprocess the first communication task based on the original data. In this way, the service recovery time after a communication failure can be reduced.
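The management node's decision rule in the abstract can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the names `SecondInfo`, `ThirdInfo`, and `decide_reprocess`, and the message fields, are hypothetical. The source does not say what happens when some node reports changed data, so the sketch simply returns `None` in that case.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SecondInfo:
    """Report from one computing node (hypothetical message layout)."""
    node_id: int
    data_changed: bool  # True if the node's local original data was modified


@dataclass(frozen=True)
class ThirdInfo:
    """Instruction to reprocess the task from the original data (hypothetical)."""
    task_id: str


def decide_reprocess(task_id: str, n_nodes: int,
                     reports: Dict[int, SecondInfo]) -> Optional[ThirdInfo]:
    """Return a ThirdInfo only if all N nodes reported unchanged original data."""
    if len(reports) != n_nodes:
        return None  # reports from some nodes are still missing
    if any(report.data_changed for report in reports.values()):
        return None  # this case is not specified in the source text
    return ThirdInfo(task_id=task_id)
```

In this sketch the third information is produced only once all N reports are present and every one of them says the original data is unchanged.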

Inventors

CHEN CHAO
ZHANG WENBIN
XU BIN

Assignees

Huawei Technologies Co., Ltd. (华为技术有限公司)

Dates

Publication Date: 20260512
Application Date: 20241112

Claims (20)

1. A collective communication method, characterized in that the method is applied to a collective communication system comprising a management node and N computing nodes, the N computing nodes being configured to process a first communication task in parallel, N being an integer greater than 1, the method comprising: the management node receives first information, wherein the first information is used for indicating that at least one first link has failed, and each first link is a communication link among the N computing nodes that is used for processing the first communication task; the management node receives N pieces of second information from the N computing nodes, wherein each piece of second information is used for indicating whether the original data of the first communication task stored locally by one computing node has been changed; and if the original data of none of the N computing nodes has been changed, the management node sends third information to the N computing nodes, wherein the third information is used for instructing the N computing nodes to reprocess the first communication task based on the original data.
2. The method of claim 1, wherein the first link is a communication link between a first computing node and a second computing node, the first computing node and the second computing node being two computing nodes of the N computing nodes, the third information further being for indicating a second link, the second link being a communication link between the first computing node and the second computing node for reprocessing the first communication task.
3. The method of claim 2, wherein the third information comprises an identification of the second link.
4. A method according to claim 2 or 3, characterized in that the first information is specifically used for indicating that the first link has an optical module failure.
5. The method of claim 4, wherein the second link indicated by the third information is the first link.
6. The method of any of claims 2-5, wherein the first computing node comprises a first network port and a second network port, the first link being a communication link based on the first network port, the second link being a communication link based on the second network port.
7. The method of any of claims 1-6, wherein before the management node receives the N pieces of second information from the N computing nodes, the method further comprises: the management node sends fourth information to each of the N computing nodes, each piece of fourth information being used for requesting the second information.
8. A collective communication method, characterized in that the method is applied to a collective communication system comprising a management node and N computing nodes, the N computing nodes being configured to process a first communication task in parallel, N being an integer greater than 1, the method comprising: a first computing node sends first information to the management node, wherein the first information is used for indicating that a first link has failed, the first link is a communication link between the first computing node and a second computing node that is used for processing the first communication task, and the first computing node and the second computing node are two of the N computing nodes; the first computing node sends second information to the management node, wherein the second information is used for indicating whether the original data of the first communication task stored locally by the first computing node has been changed; the first computing node receives third information from the management node, the third information being used for instructing the first computing node to reprocess the first communication task based on the original data, wherein the third information is determined by the management node on the basis that the original data of none of the N computing nodes has been changed; and the first computing node reprocesses the first communication task according to the third information and the original data.
9. The method of claim 8, wherein before the first computing node sends the first information to the management node, the method further comprises: the first computing node receives first indication information, wherein the first indication information indicates that data transmission of the first communication task has failed; and the first computing node determines, according to the first indication information, that the first link has failed.
10. The method according to claim 8 or 9, wherein the third information is further used to indicate a second link, the second link being a data transmission link between the first computing node and the second computing node for reprocessing the first communication task.
11. The method of claim 10, wherein the third information comprises an identification of the second link.
12. The method according to claim 10 or 11, wherein the first information is specifically used for indicating that the first link has an optical module failure.
13. The method of claim 12, wherein the second link indicated by the third information is the first link.
14. The method of any of claims 10-12, wherein the first computing node comprises a first network port and a second network port, the first link being a data transmission link based on the first network port, the second link being a data transmission link based on the second network port.
15. The method of any of claims 8-14, wherein before the first computing node sends the second information to the management node, the method further comprises: the first computing node receives fourth information from the management node, the fourth information being used for requesting the second information.
16. A collective communication system, characterized by comprising a management node and N computing nodes, wherein the N computing nodes are configured to process a first communication task in parallel, and N is an integer greater than 1; when at least one first link fails, the management node is configured to receive N pieces of first information from the N computing nodes, wherein each piece of first information is used for indicating whether the original data of the first communication task stored locally by one computing node has been changed, and each first link is a communication link among the N computing nodes that is used for processing the first communication task; if the original data of none of the N computing nodes has been changed, the management node sends second information to the N computing nodes, wherein the second information is used for instructing the N computing nodes to reprocess the first communication task based on the original data; and the N computing nodes reprocess the first communication task based on the second information and the original data.
17. A communication device, characterized in that the communication device is applied to a collective communication system, the collective communication system comprises a management node and N computing nodes, the N computing nodes are configured to process a first communication task in parallel, N is an integer greater than 1, the communication device is the management node, and the communication device comprises a transceiver unit and a processing unit; the transceiver unit is configured to perform the transmitting step or the receiving step in the method of any one of the preceding claims 1 to 7; and the processing unit is configured to perform the steps of the method of any one of the preceding claims 1 to 7 other than the transmitting step and the receiving step.
18. A communication device, characterized in that the communication device is applied to a collective communication system, the collective communication system comprises a management node and N computing nodes, the N computing nodes are configured to process a first communication task in parallel, N is an integer greater than 1, the communication device is a first computing node, the first computing node is one of the N computing nodes, and the communication device comprises a transceiver unit and a processing unit; the transceiver unit is configured to perform the transmitting step or the receiving step in the method of any one of the preceding claims 8 to 15; and the processing unit is configured to perform the steps of the method of any one of the preceding claims 8 to 15 other than the transmitting step and the receiving step.
19. A communication device comprising at least one processor, the at least one processor being coupled to a memory; the memory is used for storing programs or instructions; and the at least one processor is configured to execute the programs or instructions to cause the apparatus to implement the method of any one of claims 1 to 7 or to cause the apparatus to implement the method of claim 8 or 15.
20. The communication device of claim 19, wherein the communication device is a chip or a system-on-chip.
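
Claims 2 to 6 and 10 to 14 describe how the third information can indicate a second link for reprocessing: in one variant the second link is the first link itself (the optical module failure case of claims 4 to 5 and 12 to 13), and in another it is a link on a different network port of the same node pair (claims 6 and 14). One way to read these variants as a selection rule is sketched below. The names `FaultKind`, `Link`, and `choose_second_link`, and the `reuse_on_optical_fault` flag, are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class FaultKind(Enum):
    OPTICAL_MODULE = "optical_module"
    OTHER = "other"


@dataclass(frozen=True)
class Link:
    link_id: str   # identification carried in the third information
    src_node: int
    dst_node: int
    src_port: int  # network port on the first computing node


def choose_second_link(first_link: Link, fault: FaultKind,
                       candidates: Sequence[Link],
                       reuse_on_optical_fault: bool = True) -> Optional[Link]:
    """Pick the link to use when reprocessing the first communication task."""
    # Variant of claims 5 and 13: on an optical module failure the second
    # link may be the first link itself (assumed here to be a policy choice).
    if fault is FaultKind.OPTICAL_MODULE and reuse_on_optical_fault:
        return first_link
    # Variant of claims 6 and 14: a link between the same two nodes that
    # uses a different network port on the first computing node.
    for link in candidates:
        same_pair = (link.src_node, link.dst_node) == (
            first_link.src_node, first_link.dst_node)
        if same_pair and link.src_port != first_link.src_port:
            return link
    return None  # no usable second link among the candidates
```

The returned `link_id` corresponds to the identification of the second link mentioned in claims 3 and 11.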

Description

Collective communication method, collective communication system and related device
Technical Field
The present application relates to the field of communications, and in particular to a collective communication method, a collective communication system, and a related device.
Background
Shortening the training period of AI training tasks and accommodating larger models requires training on large-scale AI clusters. As AI clusters grow, the number of devices (switching devices, optical modules, and so on) increases greatly, and the failure rates of individual devices accumulate, so the stability of the overall cluster decreases steadily as its scale increases. As a result, the probability that a training task is interrupted by a communication failure with a hardware cause keeps rising.
AI training advances by updating intermediate variables over multiple iterations. When a communication failure occurs, recovery relies primarily on the checkpoint mechanism, which rolls back to a previously saved state and restarts training. However, this kind of recovery requires pausing training, migrating data to the computing nodes, and restarting, which is costly.
Disclosure of Invention
Embodiments of the present application provide a collective communication method, a collective communication system, and a related device, which are used to reduce the service recovery time when a communication failure occurs.
In a first aspect, an embodiment of the present application provides a collective communication method. The method is applied to a collective communication system that includes a management node and N computing nodes, where the N computing nodes are configured to process a first communication task in parallel and N is an integer greater than 1. The method includes: the management node receives first information, where the first information indicates that at least one first link has failed, and each first link is a communication link among the N computing nodes that is used for processing the first communication task; the management node receives N pieces of second information from the N computing nodes, where each piece of second information indicates whether the original data of the first communication task stored locally by one computing node has been changed; and if the original data of none of the N computing nodes has been changed, the management node sends third information to the N computing nodes, where the third information instructs the N computing nodes to reprocess the first communication task based on the original data.
The collective communication method in the present application focuses on the recovery mechanism used when a fault is encountered during communication. The computing node is a GPU card or an NPU card, and the management node is a host CPU. When the network side fails, data transmission between the computing nodes fails, the first communication task is interrupted, and failure information is returned to the computing nodes to indicate that the first link has failed. The computing nodes report the communication failure to the management node, and the management node negotiates the states of the computing nodes, that is, it determines whether the original data stored locally on each computing node has been changed. Each computing node performs the check and then reports its detection result to the management node.
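
The text above does not say how a computing node detects whether its original data has been changed. As a minimal sketch under that caveat, one possible approach is to record a digest of the input buffer before the communication task starts and compare it after the failure. The `ComputeNode` class, its method names, and the digest-based check are assumptions for illustration, not the mechanism of the application.

```python
import hashlib


class ComputeNode:
    """Toy computing node that can tell whether its input buffer was modified."""

    def __init__(self, node_id: int, original_data: bytes) -> None:
        self.node_id = node_id
        self.buffer = bytearray(original_data)
        # Digest recorded before the communication task starts
        # (assumed detection mechanism, not stated in the source).
        self._digest = hashlib.sha256(self.buffer).digest()

    def original_data_changed(self) -> bool:
        """Compare the current buffer contents against the recorded digest."""
        return hashlib.sha256(self.buffer).digest() != self._digest

    def second_information(self) -> dict:
        """Build the detection result reported to the management node."""
        return {"node_id": self.node_id,
                "data_changed": self.original_data_changed()}
```

A node whose buffer was partly overwritten before the link failed would report `data_changed` as true, and the management node would then not instruct a re-execution from the original data.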
Based on the detection results, the management node decides whether to re-execute at the level of the collective communication operator. In one possible implementation, each GPU card or NPU card corresponds to one host CPU, and after receiving the fault information, a host CPU needs to negotiate with the other host CPUs to determine whether to execute the first communication task again at the level of the collective communication operator. This method adds a re-execution mechanism for the collective communication operator and improves the stability of the AI cluster. After a communication failure, the collective communication system only needs to roll back one collective communication operator, and the loss is on the order of seconds, so the computing resources and time required for service recovery are reduced.
In one possible implementation, the first link is a communication link between a first computing node and a second computing node, the first computing node and the second computing node are two of the N computing nodes, the third information is further used to indicate a second link, and the second link is a communication link between the first computing node and the second computing node that is used for reprocessing the first communication task. In the present application, there is at least one backup second link between the first computing node and the second computing node. The first computing node uses the backup link through a borrowing mechanism and re-executes the first communication task. Wherein the third information