Search

WO-2026091617-A1 - COLLECTIVE COMMUNICATION METHOD AND COMPUTING DEVICE CLUSTER

WO2026091617A1WO 2026091617 A1WO2026091617 A1WO 2026091617A1WO-2026091617-A1

Abstract

The present application relates to the field of computer technology, and discloses a collective communication method and a computing device cluster. The method comprises: during the process of N computing units in a computing device cluster executing a collective communication operation, a first computing unit receives first data from a second computing unit and/or sends second data to a third computing unit, wherein when data in each computing unit among the N computing units is greater than a first threshold, the data in each computing unit is segmented into N pieces of slice data, the number of pieces of slice data in the second computing unit comprised in the first data is inversely proportional to a physical distance between the first computing unit and the second computing unit, and the number of pieces of slice data in the first computing unit comprised in the second data is inversely proportional to a physical distance between the first computing unit and the third computing unit. Thus, cross‑switch traffic conflicts during the process of N computing units executing a collective communication operation can be reduced, and the efficiency of collective communication can be improved.

Inventors

  • XU, BIN
  • CHEN, YI
  • LI, Biaozhi

Assignees

  • 华为技术有限公司

Dates

Publication Date
20260507
Application Date
20250628
Priority Date
20241028

Claims (17)

  1. A clustered communication method, characterized in that it is applied to a computing device cluster, the computing device cluster comprising N computing units, where N is a positive integer; the method includes: During the collective communication operation performed by the N computing units, the first computing unit receives first data from the second computing unit and/or sends second data to the third computing unit; Wherein, if the data in each of the N computing units is greater than a first threshold, the data in each computing unit is divided into N slices of data. The number of slices of data in the second computing unit included in the first data is inversely proportional to a first physical distance, and the number of slices of data in the first computing unit included in the second data is inversely proportional to a second physical distance. The first physical distance is the physical distance between the first computing unit and the second computing unit, and the second physical distance is the physical distance between the first computing unit and the third computing unit.
  2. According to the method of claim 1, the N computing units are adjacent in number or the computing units with the smallest and largest in number are adjacent computing units, and the physical distance between any one of the N computing units and an adjacent computing unit is less than the physical distance between any one computing unit and a non-adjacent computing unit.
  3. According to the method of claim 2, the data of each of the N computing units is greater than the first threshold, and the aggregate communication operation includes a global collection operation, which includes P steps. As a rounding operator, the method further includes: During the (K+1)th step of the global collection operation performed by the N computing units, the first computing unit receives the first data from the second computing unit and sends the second data to the third computing unit; Wherein, the first calculation unit is the (i+1)th calculation unit among the N calculation units, the second calculation unit is the (i-2 PK-1 + N)%N+1th calculation unit among the N calculation units, the third calculation unit is the (i+2 PK-1 )%N+1th calculation unit among the N calculation units, K is a natural number less than or equal to P-1, i is a natural number less than or equal to N-1, and % is the modulo operator; The first data includes D slice data from the second computing unit, and the second data includes D slice data from the first computing unit. round(*) is the rounding operator.
  4. According to the method of claim 2, the data of each of the N computing units is greater than the first threshold, the set communication operation includes a reduction-distribution operation, and the reduction-distribution operation includes P steps. As a rounding operator, the method further includes: During the K+1 step of the reduction and distribution operation performed by the N computing units, the first computing unit receives the first data from the second computing unit and sends the second data to the third computing unit; Wherein, the first calculation unit is the (i+1)th calculation unit among the N calculation units, the second calculation unit is the (i+ 2K )%N+1th calculation unit among the N calculation units, and the third calculation unit is the (i- 2k +N)%N+1th calculation unit among the N calculation units, where K is a natural number less than or equal to P-1, i is a natural number less than or equal to N-1, and % is the modulo operator; The first data includes D slice data from the second computing unit, and the second data includes D slice data from the first computing unit. round(*) is the rounding operator.
  5. According to the method of claim 2, the data of each of the N computing units is greater than the first threshold, the aggregate communication operation includes a distribution operation, and the distribution operation includes P steps. As a rounding operator, the method further includes: During the (K+1)th step of the distributed operation performed by the N computing units, if the sequence distance between the first computing unit and the fourth computing unit is greater than or equal to a second threshold and less than a third threshold, then the first computing unit receives the first data from the second computing unit; or, During the (K+1)th step of the distributed operation performed by the N computing units, if the sequence number distance between the first computing unit and the fourth computing unit is less than the second threshold, then the first computing unit sends the second data to the third computing unit. Wherein, the first calculation unit is the (i+1)th calculation unit among the N calculation units, the second calculation unit is the (i+ 2K )%N+1th calculation unit among the N calculation units, the third calculation unit is the (i- 2K +N)%N+1th calculation unit among the N calculation units, and the fourth calculation unit is the calculation unit that stores data in the N calculation units before the N calculation units perform the scatter operation, where K is a natural number less than or equal to P-1, i is a natural number less than or equal to N-1, and % is the modulo operator; The first data includes D slice data from the second computing unit, and the second data includes D slice data from the first computing unit. round(*) is the rounding operator.
  6. The method according to claim 5, characterized in that, When the fourth calculation unit is the (j+1)th calculation unit among the N calculation units, the index distance between the first calculation unit and the fourth calculation unit satisfies the following relationship: W1=(j-i+N)%N Where W1 is the sequence distance between the first calculation unit and the fourth calculation unit, and j is a natural number less than or equal to N-1; When N is not a power of 2 and K equals P-1, the second threshold and the third threshold satisfy the following relationship: When N is a power of 2 or K is not equal to P-1, the second threshold and the third threshold satisfy the following relationship: Wherein, Q1 is the second threshold and Q2 is the third threshold.
  7. The method according to claim 3 is characterized in that the first data includes the ( d1-2PK ·m)%N+1th slice data in the second computing unit, and the second data includes the ( d2-2PK ·m)%N+1th slice data in the first computing unit; Where d1 = (i - 2PK-1 + N)%N, d2 = i, and m is a natural number less than D.
  8. The method according to any one of claims 4-5 is characterized in that the first data includes the ( d1-2K +1 ·m)%N+1th slice data in the second computing unit, and the second data includes the ( d2-2K +1 ·m)%N+1th slice data in the first computing unit. Where d1 = i, d2 = (i - 2K + N)%N, and m is a natural number less than D.
  9. The method according to any one of claims 3-5 is characterized in that all or part of the D slice data are slice data with consecutive serial numbers.
  10. According to the method of claim 2, the data of each of the N computing units is less than or equal to the first threshold, and the set communication operation includes a one-step reduction operation, which includes P steps. As a rounding operator, the method further includes: During the (K+1)th step of the reduction operation performed by the N computing units, if the sequence distance between the first computing unit and the first computing unit among the N computing units satisfies a first relationship, then the first computing unit receives the first data from the second computing unit; or, During the (K+1)th step of the reduction operation performed by the N computing units, if the sequence distance between the first computing unit and the first computing unit among the N computing units satisfies the second relationship, then the first computing unit sends the second data to the third computing unit. Wherein, the first calculation unit is the (i+1)th calculation unit among the N calculation units, the second calculation unit is the (i+ 2K )%N+1th calculation unit among the N calculation units, and the third calculation unit is the (i- 2K +N)%N+1th calculation unit among the N calculation units, where K is a natural number less than or equal to P-1, i is a natural number less than or equal to N-1, and % is the modulo operator; The first data includes all the data in the second computing unit, and the second data includes all the data in the first computing unit.
  11. The method according to claim 10, characterized in that, The first relationship is specifically: The second relationship is specifically as follows: Wherein, W2 is the sequence distance between the first computing unit and the first computing unit among the N computing units.
  12. According to the method of claim 2, the data of each of the N computing units is less than or equal to the first threshold, and the aggregate communication operation includes a one-step broadcast operation, which includes P steps. As a rounding operator, the method further includes: During the (K+1)th step of the broadcast operation performed by the N computing units, if the sequence distance between the first computing unit and the fifth computing unit satisfies the third relationship, then the first computing unit receives the first data from the second computing unit; or, During the (K+1)th step of the broadcast operation performed by the N computing units, if the sequence distance between the first computing unit and the fifth computing unit satisfies the fourth relationship, then the first computing unit sends the second data to the third computing unit. Wherein, the first calculation unit is the (i+1)th calculation unit among the N calculation units, the second calculation unit is the (i-2 PK-1 + N)%N+1th calculation unit among the N calculation units, the third calculation unit is the (i+2 PK-1 )%N+1th calculation unit among the N calculation units, and the fifth calculation unit is the calculation unit that stores the data to be broadcast in the N calculation units before the N calculation units perform the one-step broadcast operation, where K is a natural number less than or equal to P-1, i is a natural number less than or equal to N-1, and % is the modulo operator; The first data includes all the data in the second computing unit, and the second data includes all the data in the first computing unit.
  13. The method according to claim 12, characterized in that, When the fifth calculation unit is the (j+1)th calculation unit among the N calculation units, the third relation is specifically as follows: The fourth relationship is as follows: Where j is a natural number less than or equal to N-1, and W2 is the ordinal distance between the first calculation unit and the fifth calculation unit.
  14. The method according to any one of claims 1-13 is characterized in that the aggregate communication operation includes one or more of the following: reduction-dispersion operation, global collection operation, global reduction operation, dispersion operation, broadcast operation, one-step reduction operation, and one-step broadcast operation; Wherein, if the data in each of the N computing units is greater than the first threshold, the global reduction operation includes the reduction-distribution operation and the global collection operation, and the broadcast operation includes the distribution operation and the global collection operation; or... When the data in each of the N computing units is less than or equal to the first threshold, the global reduction operation includes the one-step reduction operation and the one-step broadcast operation, and the broadcast operation includes the one-step broadcast operation.
  15. A computing device cluster, characterized in that the computing device cluster includes at least one computing device, each of the at least one computing device including a processor and a memory; the memory is used to store computer program instructions; The processor invokes computer program instructions stored in the memory to execute the method as described in any one of claims 1-14.
  16. A computer-readable storage medium, characterized in that it includes computer program instructions, which, when executed by a computing device, perform the method as described in any one of claims 1-14.
  17. A computer program product comprising instructions, characterized in that, when the instructions are executed by a computing device, the computing device performs the method as described in any one of claims 1-14.

Description

A clustered communication method and computing device cluster Cross-reference of related applications This application claims priority to Chinese Patent Application No. 202411518348.X, filed on October 28, 2024, entitled "A Cluster Communication Method and Computing Device Cluster", the entire contents of which are incorporated herein by reference. Technical Field This application relates to the field of computer technology, and in particular to a clustered communication method and a cluster of computing devices. Background Technology In the process of training artificial intelligence (AI) models or high-performance computing (HPC), collective communication operations are required to complete data distribution or synchronization between multiple computing units in a computing device cluster. Currently, aggregate communication operations can be implemented using communication algorithms such as the ring algorithm and the halving doubling (HD) algorithm. The ring algorithm connects multiple computing units into a ring structure, where communication occurs between these units. Each unit only communicates with its left and right adjacent units. During ring communication, data is transmitted along the ring structure until communication is complete and the data reaches the destination unit. The HD algorithm, on the other hand, involves pairwise communication between multiple computing units. The communication distance and the amount of data transmitted are halved or doubled at each step of the HD algorithm's implementation. However, for aggregate communication operations based on ring algorithms, the number of communication steps and communication latency are directly proportional to the number of computing units in the computing device cluster. Therefore, the more computing units in the cluster, the more communication steps and latency there are, resulting in lower communication efficiency. For aggregate communication operations based on the HD algorithm, computing device clusters with a power of two computing units have a theoretically optimal number of communication steps at the logarithmic level and lower communication latency. However, for computing device clusters with a non-power of two computing units, there is no theoretically optimal number of communication steps at the logarithmic level, resulting in higher communication latency. Furthermore, in each step of the HD algorithm, the communication object changes, which can easily lead to cross-switch traffic conflicts in high-traffic scenarios, resulting in lower communication efficiency. Summary of the Invention This application provides a collective communication method and a computing device cluster to improve the efficiency of collective communication. In a first aspect, embodiments of this application provide a collective communication method applied to a computing device cluster comprising N computing units, where N is a positive integer. The method includes: during the collective communication operation performed by the N computing units, a first computing unit receives first data from a second computing unit and/or sends second data to a third computing unit; wherein, when the data in each of the N computing units exceeds a first threshold, the data in each computing unit is divided into N slices, the number of slices from the second computing unit included in the first data being inversely proportional to a first physical distance, and the number of slices from the first computing unit included in the second data being inversely proportional to a second physical distance, where the first physical distance is the physical distance between the first computing unit and the second computing unit, and the second physical distance is the physical distance between the first computing unit and the third computing unit. In this embodiment, if the data in each of the N computing units is large, the data in each computing unit can be divided into N slices. For each of the N slices in the computing unit, during the execution of the aggregated communication operation, the communication step with the largest amount of communication data (i.e., the largest number of slices sent and/or received) occurs between the computing units with the smallest physical distance, and the communication step with the smallest amount of communication data (i.e., the smallest number of slices sent and/or received) occurs between the computing units with the largest physical distance. Since the smaller the physical distance between computing units, the fewer switches are needed for communication between computing units, and the larger the physical distance between computing units, the more switches are needed for communication between computing units, this reduces cross-switch traffic conflicts during the aggregated communication operation of the N computing units, thus improving the efficiency of aggregated communication. In one possible implementation, the N computing units are a