CN-121997994-A - Data communication method, distributed system and related equipment


Abstract

The application provides a data communication method, a distributed system, and related equipment. Before distributed training of an AI model starts, the distributed system determines in advance, based on the converged communication bandwidth between regions, the number M of concurrent cards (processors) suited to cross-region communication. After distributed training starts, whenever the system needs to perform collective communication, M processors take part in each cross-region communication, so that the communication volume of those M processors does not exceed the cross-region bandwidth. This keeps cross-region communication free of congestion, avoids the back-pressure mechanism that congestion triggers and the further reduction of cross-region bandwidth that follows, and thereby improves the overall efficiency of collective communication.

Inventors

CHEN HUAN
ZHANG JINNIAN
LIU YANLIN

Assignees

Huawei Technologies Co., Ltd. (华为技术有限公司)

Dates

Publication Date: 20260508
Application Date: 20241107

Claims (19)

1. A data communication method, the method being applied to a distributed system, the distributed system comprising a first region and a second region, each region comprising a plurality of processors, the distributed system being configured to perform a distributed training task of an AI model, each processor being configured to perform a subtask of the distributed training task, the method comprising: a first processor in the first region acquires first data to be communicated; the first processor performs intra-region communication with other processors in the first region, sends the first data to be communicated to the other processors in the first region, receives data sent by the other processors in the first region, and obtains an intra-region communication result; the first processor performs cross-region communication with a second processor in the second region, sends the intra-region communication result to the second processor, and receives data sent by the second processor to obtain a cross-region communication result, wherein when the first processor and the second processor perform cross-region communication, M-1 processors in the first region also perform cross-region communication, and the total communication bandwidth of the M processors formed by the first processor and the M-1 processors matches the total communication bandwidth between the first region and the second region; and the first processor performs intra-region communication with the other processors in the first region, sends the cross-region communication result to the other processors in the first region, and receives data sent by the other processors in the first region to obtain a collective communication result.
2. The method of claim 1, wherein each of the regions comprises M communication groups, each of the M communication groups comprising at least one processor, and the M processors participating in the cross-region communication belong to different communication groups.
3. The method of claim 2, wherein the communication type of the cross-region communication comprises sequential communication, the method further comprising, after the first processor performs the cross-region communication with the second processor in the second region: the first processor sends a notification message to a third processor, wherein the notification message instructs the third processor to perform cross-region communication with a fourth processor in the second region, the first processor and the third processor belong to a first communication group, and the second processor and the fourth processor belong to a second communication group.
4. The method of claim 2, wherein the communication type of the cross-region communication comprises proxy communication, and wherein the first processor performing cross-region communication with the second processor in the second region comprises: the first processor performs cross-region communication with the second processor in the second region multiple times, wherein the first processor is the proxy processor of a first communication group, the first processor holds the data to be communicated of all processors in the first communication group, and a total of M proxy processors participate in each cross-region communication.
5. The method of claim 4, wherein the first processor performing intra-region communication with other processors in the first region, sending the first data to be communicated to the other processors in the first region, receiving data sent by the other processors in the first region, and obtaining an intra-region communication result comprises: the first processor performs inter-group communication with processors of other communication groups in the first region, sends the first data to be communicated to the processors of the other communication groups in the first region, receives data sent by the processors of the other communication groups, and obtains an inter-group communication result; and the first processor performs intra-group communication with other processors in the first communication group, sends the inter-group communication result to the other processors in the first communication group, receives data sent by the other processors in the first communication group, and obtains the intra-region communication result.
6. The method of any of claims 3 to 5, wherein each region comprises at least one computing node, the at least one computing node comprising a network communication library and at least one processor, the method further comprising, before the first processor in the first region acquires the first data to be communicated: the first processor receives a communication policy sent by the network communication library of a first computing node, wherein the first computing node comprises the first processor, the communication policy is determined by the network communication library based on the number M of communication groups, and the number M of communication groups is determined by the network communication library by first determining the total communication bandwidth between the first region and the second region based on network topology information of the distributed system and then combining the communication type and the processing bandwidth of each processor.
7. A data communication method, the method being applied to a distributed system comprising a first region and a second region, each region comprising a plurality of computing nodes, each computing node comprising a network communication library and at least one processor, the distributed system being configured to perform a distributed training task of an AI model, each processor being configured to perform a subtask of the distributed training task, the method comprising: a network communication library of a first computing node among the plurality of computing nodes acquires network topology information of the distributed system; the network communication library of the first computing node determines, based on the network topology information, the total communication bandwidth between the first region and the second region and the bandwidth of each processor; the network communication library of the first computing node generates a communication policy based on the total communication bandwidth and the bandwidth of each processor, wherein the communication policy specifies that each time a processor of the first region performs cross-region communication with a processor of the second region, M processors in the first region participate in the cross-region communication, and the total bandwidth of the M processors matches the total communication bandwidth between the first region and the second region; and the network communication library of the first computing node sends the communication policy to at least one processor on the first computing node.
8. The method of claim 7, wherein each of the regions comprises M communication groups, each of the M communication groups comprising at least one processor, and wherein the communication policy specifies that the M processors participating in cross-region communication belong to different communication groups.
9. A distributed system comprising a first region and a second region, each region comprising a plurality of processors, the distributed system being configured to perform a distributed training task of an AI model, each processor being configured to perform a subtask of the distributed training task, the distributed system comprising a first processor in the first region and a second processor in the second region; the first processor is configured to acquire first data to be communicated; the second processor is configured to acquire second data to be communicated; the first processor is configured to perform intra-region communication with other processors in the first region, send the first data to be communicated to the other processors in the first region, and receive data sent by the other processors in the first region to obtain a communication result in the first region; the second processor is configured to perform intra-region communication with other processors in the second region, send the second data to be communicated to the other processors in the second region, and receive data sent by the other processors in the second region to obtain a communication result in the second region; the first processor is configured to perform cross-region communication with the second processor, send the communication result in the first region to the second processor, and receive data sent by the second processor to obtain a first cross-region communication result; the second processor is configured to perform cross-region communication with the first processor, send the communication result in the second region to the first processor, and receive data sent by the first processor to obtain a second cross-region communication result, wherein when the first processor performs cross-region communication with the second processor, M-1 processors in the first region also perform cross-region communication, and the total communication bandwidth of the M processors formed by the first processor and the M-1 processors matches the total communication bandwidth between the first region and the second region; the first processor is configured to perform intra-region communication with the other processors in the first region, send the first cross-region communication result to the other processors in the first region, and receive data sent by the other processors in the first region to obtain a collective communication result; and the second processor is configured to perform intra-region communication with the other processors in the second region, send the second cross-region communication result to the other processors in the second region, and receive data sent by the other processors in the second region to obtain a collective communication result.
10. The distributed system of claim 9, wherein each of the regions comprises M communication groups, each of the M communication groups comprising at least one processor, and the M processors participating in the cross-region communication belong to different communication groups.
11. The distributed system of claim 10, wherein the communication type of the cross-region communication comprises sequential communication; the first processor is configured to send a notification message to a third processor, wherein the notification message instructs the third processor to perform cross-region communication with a fourth processor in the second region, the first processor and the third processor belong to a first communication group, and the second processor and the fourth processor belong to a second communication group; and the second processor is configured to send a notification message to the fourth processor, wherein the notification message instructs the fourth processor to perform cross-region communication with the third processor in the first region.
12. The distributed system of claim 11, wherein the communication type of the cross-region communication comprises proxy communication; the first processor is configured to perform cross-region communication with the second processor multiple times, wherein the first processor is the proxy processor of a first communication group, the first processor holds the data to be communicated of all processors in the first communication group, and a total of M proxy processors in the first region participate in each cross-region communication; and the second processor is configured to perform cross-region communication with the first processor multiple times, wherein the second processor is the proxy processor of a second communication group, the second processor holds the data to be communicated of all processors in the second communication group, and a total of M proxy processors in the second region participate in each cross-region communication.
13. The distributed system of claim 12, wherein: the first processor is configured to perform inter-group communication with processors of other communication groups in the first region, send the first data to be communicated to the processors of the other communication groups in the first region, receive data sent by the processors of the other communication groups to obtain a first inter-group communication result, perform intra-group communication with other processors of the first communication group, send the first inter-group communication result to the other processors of the first communication group, and receive data sent by the other processors of the first communication group to obtain the communication result in the first region; and the second processor is configured to perform inter-group communication with processors of other communication groups in the second region, send the second data to be communicated to the processors of the other communication groups in the second region, receive data sent by the processors of the other communication groups to obtain a second inter-group communication result, perform intra-group communication with other processors of the second communication group, send the second inter-group communication result to the other processors of the second communication group, and receive data sent by the other processors of the second communication group to obtain the communication result in the second region.
14. The distributed system of any of claims 11 to 13, wherein each region comprises at least one computing node, each computing node comprising a network communication library and at least one processor; the first processor is configured to receive a communication policy sent by the network communication library of a first computing node, wherein the first computing node comprises the first processor, the communication policy is determined by the network communication library based on the number M of communication groups, and the number M of communication groups is determined by first determining the total communication bandwidth between the first region and the second region from network topology information of the distributed system and then combining the communication type and the processing bandwidth of each processor; and the second processor is configured to receive a communication policy sent by the network communication library of a second computing node, wherein the second computing node comprises the second processor, and the communication policy is determined by the network communication library based on the number M of communication groups.
15. A network communication library, the network communication library being applied to a distributed system, the distributed system comprising a first region and a second region, each region comprising a plurality of computing nodes, each computing node comprising the network communication library and at least one processor, the distributed system being configured to perform a distributed training task of an AI model, each processor being configured to perform a subtask of the distributed training task, the network communication library comprising: an acquisition unit, configured to acquire network topology information of the distributed system; a determining unit, configured to determine, based on the network topology information, the total communication bandwidth between the first region and the second region and the bandwidth of each processor; the determining unit being further configured to generate a communication policy based on the total communication bandwidth and the bandwidth of each processor, wherein the communication policy specifies that each time a processor in the first region performs cross-region communication with a processor in the second region, a total of M processors in the first region participate in the cross-region communication, and the total bandwidth of the M processors matches the total communication bandwidth between the first region and the second region; and a sending unit, configured to send the communication policy to at least one processor on the first computing node.
16. The network communication library of claim 15, wherein each region comprises M communication groups, each of the M communication groups comprising at least one processor, and the communication policy specifies that the M processors participating in cross-region communication belong to different communication groups.
17. A chip, characterized in that the chip comprises a power supply unit and a processing unit, the power supply unit being adapted to supply power to the processing unit such that the processing unit implements the method according to any of claims 1 to 6.
18. A computer readable storage medium comprising computer program instructions which, when executed by a computing device, perform the method of claim 7 or 8.
19. A computer program product containing instructions that, when executed by a computing device, cause the computing device to perform the method of claim 7 or 8.
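As an informal illustration of the policy generation recited in claims 7 and 8 and the sequential communication of claim 3, the following Python sketch derives M from two bandwidth figures, splits a region's processors into M communication groups, and lists which processors communicate across regions in each round. It is a minimal sketch under stated assumptions, not the claimed implementation: the `Topology` fields, the floor-division rule for M, the round-robin grouping, and all function names are hypothetical. The claims only require that the total bandwidth of the M processors match the inter-region bandwidth.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical summary of the network topology information."""
    inter_region_gbps: float    # total converged bandwidth between the two regions
    per_processor_gbps: float   # bandwidth a single processor can drive
    processors_per_region: int  # number of processors in one region


def concurrent_count(topo: Topology) -> int:
    """Assumed rule: largest M whose combined bandwidth fits the inter-region link."""
    m = int(topo.inter_region_gbps // topo.per_processor_gbps)
    # Always allow at least one sender, and never more than the region holds.
    return max(1, min(m, topo.processors_per_region))


def communication_groups(topo: Topology) -> list[list[int]]:
    """Split local ranks 0..N-1 into M groups (round-robin, an assumption)."""
    m = concurrent_count(topo)
    return [list(range(g, topo.processors_per_region, m)) for g in range(m)]


def sequential_rounds(groups: list[list[int]]) -> list[list[int]]:
    """Round r: the r-th member of every group communicates across regions.

    Each group contributes at most one processor per round, so at most M
    processors use the inter-region link at the same time.
    """
    longest = max(len(g) for g in groups)
    return [[g[r] for g in groups if r < len(g)] for r in range(longest)]
```

With made-up figures of 400 Gb/s between regions, 100 Gb/s per processor, and 8 processors per region, the sketch gives M = 4, groups `[[0, 4], [1, 5], [2, 6], [3, 7]]`, and two rounds of four concurrent processors each.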

Description

Data communication method, distributed system and related equipment

Technical Field

The present application relates to the field of artificial intelligence (AI), and more particularly to a data communication method, a distributed system, and related equipment.

Background

With the rapid development of AI models, the parameter scale of large models and their training data sets keep growing, and so does the scale of the distributed systems needed to train them. The hardware cost of model training has risen markedly, and to reduce it, networking forms with bandwidth convergence have gradually emerged. Bandwidth convergence means that a distributed system contains multiple regions: bandwidth within each region is high, while bandwidth between the spine nodes that connect different regions is relatively low. This reduces the number of switches and the amount of optical fiber and cabling, which lowers hardware cost.

However, because bandwidth converges between the spine nodes, cross-region communication becomes congested when data traffic is too large. To cope with congestion, the network protocol typically triggers a back-pressure mechanism, under which senders reduce their data transmission rate. The effective bandwidth of the cross-region interconnect then drops further, which degrades the efficiency of collective communication as a whole and, in turn, the overall training efficiency of the model.

Disclosure of Invention

The application provides a data communication method, a distributed system, and related equipment, to address the low efficiency of cross-region communication in a distributed system, which degrades the overall efficiency of collective communication.
In a first aspect, a data communication method is provided. The method is applied to a distributed system that includes a first region and a second region, each region including a plurality of processors. The distributed system is used to execute a distributed training task of an AI model, and each processor executes a subtask of the distributed training task. The method includes the following steps: a first processor in the first region obtains first data to be communicated; the first processor performs intra-region communication with the other processors in the first region, sends the first data to be communicated to them, and receives the data they send, obtaining an intra-region communication result; the first processor performs cross-region communication with a second processor in the second region, sends the intra-region communication result to the second processor, and receives the data sent by the second processor, obtaining a cross-region communication result, where, while the first processor and the second processor perform cross-region communication, M-1 other processors in the first region also perform cross-region communication, and the total communication bandwidth of the M processors formed by the first processor and the M-1 processors matches the total communication bandwidth between the first region and the second region; and the first processor performs intra-region communication with the other processors in the first region, sends the cross-region communication result to them, and receives the data they send, obtaining a collective communication result.
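The steps of the first aspect can be sketched as a small pure-Python simulation. The sketch assumes, purely for illustration, that the collective operation is a sum all-reduce, that the first intra-region step is a reduce-scatter, and that the final intra-region step is an all-gather; the text does not fix any of these choices. The function name and the contiguous batching of processors into rounds are likewise hypothetical.

```python
def hierarchical_all_reduce(region_a, region_b, m):
    """Simulate intra-region, cross-region, and intra-region phases.

    region_a[p] and region_b[p] are the vectors held by processor p of each
    region. Every vector has length n, where n is the number of processors
    per region, so processor p can own exactly one element (its shard).
    Returns the all-reduced vector and the list of cross-region rounds.
    """
    n = len(region_a)
    assert len(region_b) == n
    assert all(len(v) == n for v in region_a + region_b)
    assert m >= 1

    # Phase 1 (intra-region, assumed reduce-scatter): processor p ends up
    # holding element p of its own region's sum.
    shard_a = [sum(v[p] for v in region_a) for p in range(n)]
    shard_b = [sum(v[p] for v in region_b) for p in range(n)]

    # Phase 2 (cross-region): processor p of region A exchanges its shard
    # with processor p of region B. At most m processor pairs are active in
    # a round, so the traffic stays within the converged bandwidth.
    total = [0] * n
    rounds = []
    for start in range(0, n, m):
        active = list(range(start, min(start + m, n)))
        for p in active:
            total[p] = shard_a[p] + shard_b[p]
        rounds.append(active)

    # Phase 3 (intra-region, assumed all-gather): every processor collects
    # all shards, so each one ends with the full vector `total`.
    return total, rounds


# Example: 4 processors per region, at most 2 concurrent cross-region pairs.
a = [[1, 2, 3, 4], [1, 1, 1, 1], [0, 0, 0, 0], [2, 2, 2, 2]]
b = [[1, 0, 0, 0] for _ in range(4)]
result, schedule = hierarchical_all_reduce(a, b, m=2)
```

In this example `result` is `[8, 5, 6, 7]` and `schedule` is `[[0, 1], [2, 3]]`: the cross-region phase takes two rounds instead of one, but each round stays within the assumed capacity of the inter-region link.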
Before distributed training of the AI model starts, the distributed system determines in advance, based on the converged communication bandwidth between regions, the number M of concurrent cards suited to cross-region communication. After distributed training starts, whenever collective communication is needed, M processors take part in each cross-region communication, which ensures that the traffic of those M processors does not exceed the cross-region bandwidth. Cross-region communication therefore stays free of congestion, the back-pressure mechanism that congestion triggers and the resulting further drop in cross-region bandwidth are avoided, and the overall efficiency of collective communication improves. In one possible implementation, each region includes M communication groups, each of the M communication groups includes at least one processor, and the M processors participating in the cross-region communication belong to different communication groups. In this implementation, only one processor in each communication group communicates externally during cross-region communication, so that M processors from the M communication groups perform cross-region communication simultaneously, and the traffic volume during each cross-region communication does not exceed the cross-region