CN-115204376-B - Computing system and computing method for multi-processing unit interconnection
Abstract
The present disclosure provides a computing system and computing method for multi-processing unit interconnection. The computing system includes one or more parallel processing unit groups, the parallel processing units of each group being organized into subsets of parallel processing units. Within the same subset, each parallel processing unit is configurably coupled to its two nearest neighbor parallel processing units by two communication links and to its farthest neighbor parallel processing unit by one communication link, and each parallel processing unit is configurably coupled to a respective parallel processing unit of the other subset by two communication links. By carrying out inter-node communication functions such as the Reduce function and the all_reduce function at the chip level, the method and system improve the bandwidth utilization and efficiency of inter-chip communication.
Inventors
- HAN LIANG
Assignees
- 平头哥(上海)半导体技术有限公司 (Pingtouge (Shanghai) Semiconductor Technology Co., Ltd.)
Dates
- Publication Date
- 20260512
- Application Date
- 20220217
- Priority Date
- 20210329
Claims (6)
- 1. A computing system, comprising: one or more parallel processing unit groups, wherein each parallel processing unit group comprises eight parallel processing units organized in two subsets, each subset comprising four parallel processing units; in the same subset, each parallel processing unit is configurably coupled to its two nearest neighbor parallel processing units by two communication links, and each parallel processing unit is configurably coupled to its farthest neighbor parallel processing unit by one communication link, and each parallel processing unit is configurably coupled to a respective parallel processing unit of the other subset by two communication links; wherein the eight parallel processing units of the parallel processing unit group are configured according to one of the following ways: configured as one computation group, the eight parallel processing units of the parallel processing unit group being coupled together through three communication rings; configured as two computation groups, each computation group comprising four parallel processing units coupled together through two communication rings; configured as four computation groups, each computation group comprising two parallel processing units coupled together through one communication ring; or configured as a first computation group comprising four parallel processing units and two second computation groups each comprising two parallel processing units, the four parallel processing units of the first computation group being coupled together by two communication rings and the two parallel processing units of each second computation group being coupled together by a respective communication ring; wherein the communication links comprise bi-directional communication links.
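The link topology recited in claim 1 can be sketched as a small multigraph check. This is one illustrative reading, not claim language: the PPU labels 0-7, the ring order within each subset, and the assumption of two links per nearest-neighbor pair are all hypothetical choices made for the sketch.

```python
from collections import Counter

def build_group_links():
    """Link multiset for one 8-PPU group, under the assumed reading:
    PPUs 0-3 form one subset and PPUs 4-7 the other; within a subset,
    each nearest-neighbor pair gets 2 links, each farthest-neighbor
    pair gets 1 link; counterpart PPUs across subsets get 2 links."""
    links = Counter()

    def add(a, b, n):
        links[tuple(sorted((a, b)))] += n

    for base in (0, 4):                          # the two subsets
        ring = [base, base + 1, base + 2, base + 3]
        for i in range(4):
            add(ring[i], ring[(i + 1) % 4], 2)   # nearest neighbors: 2 links
        add(ring[0], ring[2], 1)                 # farthest neighbors: 1 link
        add(ring[1], ring[3], 1)
    for i in range(4):
        add(i, i + 4, 2)                         # counterpart in other subset: 2 links
    return links

degree = Counter()
for (a, b), n in build_group_links().items():
    degree[a] += n
    degree[b] += n
print(dict(degree))   # under these assumptions, every PPU terminates 7 links
```

Under this reading a group uses 28 bi-directional links in total, enough to be partitioned into the three, two, or one parallel communication rings recited in the configuration modes.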
- 2. The computing system of claim 1, wherein the communication links of a parallel processing unit group are configured into one or more computing groups based on computing parameters, each computing group including a respective plurality of communication loops.
- 3. The computing system of claim 2, wherein each of the one or more computing groups is configured to compute a respective Reduce or all_reduce function on the respective input data using a parallel loop Reduce or all_reduce algorithm.
- 4. The computing system of claim 2, wherein the computing parameters include a number of computing groups of parallel processing units and an amount of computing processing bandwidth.
- 5. A computing method, comprising: configuring the communication links of a parallel processing unit group into one or more computation groups based on computation parameters, each computation group comprising a corresponding plurality of communication loops; and computing a function on input data by the computation group using a parallel communication loop algorithm; wherein the parallel processing unit group comprises eight parallel processing units organized in two subsets, each subset comprising four parallel processing units, the parallel processing unit group further comprising, in each subset, two bi-directional communication links between each parallel processing unit and each of its nearest neighbor parallel processing units, one bi-directional communication link between each parallel processing unit and its farthest neighbor parallel processing unit, and two bi-directional communication links between corresponding parallel processing units of the two subsets; wherein configuring the communication links of the parallel processing unit group into one or more computation groups comprises: configuring the bi-directional communication links as three parallel communication loops coupling the eight parallel processing units into one computation group, the computing method further comprising dividing the input data into six data groups and loading corresponding pairs of data groups into corresponding parallel processing units; or configuring the bi-directional communication links as sets of parallel communication loops, each set comprising two parallel communication loops and coupling a respective subset into a respective computation group, the computing method further comprising dividing the input data into four data groups and loading corresponding pairs of data groups into corresponding parallel processing units of a computation group consisting of four parallel processing units.
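As an illustrative summary (not claim language), the four configuration modes recited in the claims can be tabulated as (group size, parallel rings) pairs; the mode labels and the helper name are hypothetical:

```python
def ring_configuration(mode):
    """Map a configuration mode of one 8-PPU group to its computation
    groups, each given as (group_size, parallel_rings), per the modes
    recited in the claims. The mode names are labels for this sketch."""
    modes = {
        "one_group":   [(8, 3)],                  # eight PPUs, three parallel rings
        "two_groups":  [(4, 2), (4, 2)],          # four PPUs each, two rings each
        "four_groups": [(2, 1)] * 4,              # two PPUs each, one ring each
        "mixed":       [(4, 2), (2, 1), (2, 1)],  # one 4-PPU group + two 2-PPU groups
    }
    return modes[mode]

# Every mode accounts for all eight PPUs of the group.
for mode in ("one_group", "two_groups", "four_groups", "mixed"):
    assert sum(size for size, _ in ring_configuration(mode)) == 8
```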
- 6. The computing method of claim 5, wherein the computing parameters include a number of computing groups of parallel processing units and an amount of computing processing bandwidth.
Description
Computing system and computing method for multi-processing unit interconnection

Technical Field

The present disclosure relates to the field of information processing and communication in chip interconnect networks, and more particularly to computing systems and computing methods for multi-processing unit interconnection.

Background

Currently, one method of distributed parallel training of deep neural networks deploys a large number of simultaneously executed small-batch (minibatch) stochastic gradient descent (Stochastic Gradient Descent, SGD) processes onto a plurality of distributed compute nodes to exploit data-parallel acceleration. Referring to FIG. 1, FIG. 1 illustrates an exemplary small-batch stochastic gradient descent process running on a CPU host, together with pseudo code. Small-batch stochastic gradient descent processing is subject to a synchronization section, which creates a bottleneck for the overall parallel acceleration process. As shown in FIG. 2, to reduce this bottleneck, it is desirable to increase the bandwidth of the accelerator-side network and/or to reduce the frequency of communication between the host and the accelerator. There are many synchronization algorithms for small-batch stochastic gradient descent processing. Common functions that implement the communication mode between computing nodes are the Reduce function and the all_reduce function.

Referring now to FIG. 3, FIG. 3 illustrates the Reduce function. In the Reduce function, a set of values from each of the plurality of nodes 310-340 is passed to a given node 310 of the plurality of nodes 310-340, and the given node 310 adds the corresponding values together. The given node 310 stores the sums of the sets of values. For example, node 310 receives the values 5, 2, 7, and 4 from the plurality of nodes 310-340, adds the received values 5, 2, 7, and 4 together, and stores the resulting sum 18.
Node 310 also adds together the values 1, 3, 8, and 2 and stores the resulting sum 14.

Referring now to FIG. 4, FIG. 4 illustrates the all_reduce function. In the all_reduce function, the set of values from each of the plurality of nodes 410-440 is passed to a given node 410 of the plurality of nodes 410-440, and the given node 410 adds the corresponding values together. The set of sums is then broadcast by the given node 410 to the plurality of nodes 410-440, and the plurality of nodes 410-440 store the set of sums. For example, node 410 adds together the values 5, 2, 7, and 4 received from the plurality of nodes 410-440. Node 410 also adds the values 1, 3, 8, and 2 together. Node 410 then broadcasts the sums 18 and 14 to the plurality of nodes 410-440, and each of the plurality of nodes 410-440 stores the sums 18 and 14. As indicated above, the Reduce function and the all_reduce function are applied to a batch of variables simultaneously. While simple topology implementations of the Reduce function and the all_reduce function are tree-based, ring-based implementations may achieve higher bandwidth utilization and efficiency.

Referring now to FIG. 5, FIG. 5 is a diagram illustrating a conventional ring-based all_reduce function implemented in a distributed computing system. In the ring-based all_reduce function, each of the N nodes of the distributed computing system communicates 2 x (N-1) times with its two peer nodes. During each communication, the node transmits and receives a set of values. In the first (N-1) iterations, the received values are added to the values in the buffer of the corresponding node. In the second (N-1) iterations, the values held in the buffer of the corresponding node are replaced with the received values. For example, three nodes (N=3) are shown in 510 of FIG. 5, each node buffering a respective set of input values. In a first iteration 520, the first node passes the first set of input values to the second node.
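The Reduce and all_reduce behavior described for FIG. 3 and FIG. 4 can be sketched functionally. This is an illustrative simulation of the semantics, not the patented chip-level implementation; the function names are chosen for the sketch:

```python
def reduce_fn(node_values, root=0):
    """Reduce as in FIG. 3: elementwise sums of all nodes' value sets
    are gathered and stored at one root node."""
    sums = [sum(vals) for vals in zip(*node_values)]
    stored = [None] * len(node_values)
    stored[root] = sums            # only the root stores the set of sums
    return stored

def all_reduce_fn(node_values):
    """all_reduce as in FIG. 4: the sums are broadcast, so every node
    ends up storing the full set of sums."""
    sums = [sum(vals) for vals in zip(*node_values)]
    return [list(sums) for _ in node_values]

# The worked example from the description: four nodes, two values each.
nodes = [[5, 1], [2, 3], [7, 8], [4, 2]]
print(reduce_fn(nodes))      # [[18, 14], None, None, None]
print(all_reduce_fn(nodes))  # every node holds [18, 14]
```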
The second node adds the set of input values received from the first node to the corresponding input values maintained by the second node. The first node also receives a third set of input values from a third node and adds the set of input values received from the third node to the corresponding values maintained by the first node. In the first iteration 520, the second node and the third node likewise pass and add the corresponding sets of input values. In a second iteration 530, the first node passes the third set of input values to the second node, which adds the third set of input values to the corresponding values held by the second node. The first node also receives a second set of values from the third node, and the first node adds the second set of values to the corresponding values maintained by the first node. In the second iteration 530, the second node and the third node again pass and add the corresponding sets of values. In a third iteration 540, the first node passes the second set of sums to the second node, which stores the second set of sums. The first node also receives the first set of sums from the third node and stores the first set of sums.
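The 2 x (N-1)-step iteration pattern described above can be sketched as a serial simulation of the ring-based all_reduce (a reduce-scatter phase followed by an all-gather phase). The chunk indexing below is one standard choice, not taken from the patent:

```python
def ring_all_reduce(node_chunks):
    """Serially simulate a ring all_reduce over N nodes, each holding
    N chunks. First N-1 steps: each node sends one chunk to its
    successor, which adds it in (reduce-scatter). Second N-1 steps:
    the completed sums circulate and overwrite (all-gather). Each node
    communicates 2 * (N - 1) times, matching the FIG. 5 description."""
    n = len(node_chunks)
    buf = [[list(c) for c in chunks] for chunks in node_chunks]

    # Reduce-scatter: at step s, node i forwards chunk (i - s) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, list(buf[i][(i - step) % n]))
                 for i in range(n)]
        for i, k, data in sends:          # snapshot, then apply: models the
            dst = (i + 1) % n             # simultaneous pairwise exchange
            buf[dst][k] = [a + b for a, b in zip(buf[dst][k], data)]

    # All-gather: node i now holds the full sum of chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, list(buf[i][(i + 1 - step) % n]))
                 for i in range(n)]
        for i, k, data in sends:
            buf[(i + 1) % n][k] = data    # replace, do not add
    return buf

nodes = [[[1], [2], [3]], [[4], [5], [6]], [[7], [8], [9]]]
print(ring_all_reduce(nodes))   # every node holds [[12], [15], [18]]
```

Because every node is busy sending and receiving on every step, the ring keeps all links utilized, which is the bandwidth-efficiency advantage over tree-based implementations noted above.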