US-12626137-B1 - Gradient exchange using adaptable hyper-rectangle
Abstract
Systems and methods are provided to configure M number of processing nodes as a network represented as a hyper-rectangle with N dimensions. Configuration may include determining N numeric factors of M and assigning a number of processing nodes in each of the N dimensions to correspond to one of the N numeric factors. Each processing node can compute a set of data. The set of data can be exchanged among the M number of processing nodes using a scatter-reduce operation according to a sequence of the N dimensions, and an all-gather operation according to a reverse order of the sequence of the N dimensions.
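The factor-determination step described in the abstract can be sketched as follows. This is an illustrative sketch only, not the patent's implementation; the helper name `prime_factors` is an assumption, and Python is used purely for illustration:

```python
def prime_factors(m):
    """Return the prime factors of m in ascending order, with multiplicity.

    Illustrative only: models the abstract's step of determining N numeric
    factors of M, so that the M processing nodes can be arranged as a
    hyper-rectangle whose dimension sizes are these factors.
    """
    factors = []
    d = 2
    while d * d <= m:
        while m % d == 0:
            factors.append(d)
            m //= d
        d += 1
    if m > 1:
        factors.append(m)
    return factors

# For example, M = 12 nodes could be arranged as a 2 x 2 x 3 hyper-rectangle.
dims = prime_factors(12)
```

Claim 7 additionally covers the case where M is prime (only trivial factors exist): one or more processing nodes are added until a non-prime count is reached before factoring.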
Inventors
- Thiam Khean Hah
- Yongseok Koh
Assignees

- AMAZON TECHNOLOGIES, INC.
Dates
- Publication Date: 2026-05-12
- Application Date: 2021-03-31
Claims (19)
- 1. A computer-implemented method for configuring M number of on-chip collective parameter server nodes of a distributed neural network training system to perform a training process of a neural network, the method comprising: determining N prime factors of M, wherein M is factorable into a non-prime factor that is greater than one and less than M; configuring the M number of on-chip collective parameter server nodes as a network represented as a hyper-rectangle with N dimensions, wherein a number of on-chip collective parameter server nodes in each of the N dimensions is one of the N prime factors; identifying two or more dimensions of the hyper-rectangle having a combined number of on-chip collective parameter server nodes being less than a node limit; collapsing the identified two or more dimensions of the hyper-rectangle into a single dimension having the combined number of on-chip collective parameter server nodes; initiating the M number of on-chip collective parameter server nodes to perform a training process, wherein each of the M on-chip collective parameter server nodes performs forward and backward propagation to compute a set of weight gradients as part of training the neural network; triggering a scatter-reduce operation to exchange respective sets of weight gradients among the M number of on-chip collective parameter server nodes according to a sequence of dimensions to generate a set of reduced weight gradients; and triggering an all-gather operation to exchange respective sets of reduced weight gradients among the M number of on-chip collective parameter server nodes according to a reverse order of the sequence of dimensions.
- 2. The computer-implemented method of claim 1, wherein collapsing the identified two or more dimensions of the hyper-rectangle into the single dimension is further based on one or more of network topology, bandwidth of each link in the network, latency of each link in the network, or characteristics of connectivity between the collective parameter server nodes and overall network comprising the M number of collective parameter server nodes.
- 3. The computer-implemented method of claim 1, wherein a respective set of collective parameter server nodes on each edge in a given dimension is assigned to one of the N prime factors based on network topology, bandwidth of each link in the network, or latency of each link in the network.
- 4. The computer-implemented method of claim 1, wherein each on-chip collective parameter server node includes a hierarchy of on-chip collective parameter server nodes, wherein each on-chip collective parameter server node in a hierarchical level is capable of performing distributed training within the same hierarchical level.
- 5. A method for configuring M number of processing nodes of a distributed computing system to perform data exchange operations for training a neural network, the method comprising: determining N prime factors based on the M number of processing nodes, wherein M is factorable into a non-prime factor that is greater than one and less than M; configuring the M number of processing nodes as a network represented as a hyper-rectangle with N dimensions, wherein a number of processing nodes in each of the N dimensions is one of the N prime factors; identifying two or more dimensions of the hyper-rectangle having a combined number of processing nodes being less than a node limit; collapsing the identified two or more dimensions of the hyper-rectangle into one dimension having the combined number of processing nodes; initiating each processing node in the M number of processing nodes to perform forward and backward propagation of the neural network to generate a respective set of data for exchange during the training process; and triggering a data exchange operation to exchange respective sets of data among the M number of processing nodes configured as the hyper-rectangle by: triggering a scatter-reduce operation to exchange respective sets of data according to a sequence of dimensions to generate a set of reduced data; and triggering an all-gather operation to exchange respective sets of reduced data according to a reverse order of the sequence of dimensions.
- 6. The method of claim 5, wherein the identified two or more dimensions of the hyper-rectangle are collapsed further based on one or more of network topology, bandwidth of each link in the network, latency of each link in the network, or network characteristics.
- 7. The method of claim 5, wherein determining the N prime factors based on the M number of processing nodes includes: determining that an initial number of processing nodes is a prime number; adding one or more additional processing nodes to the initial number of processing nodes to reach a non-prime number of processing nodes; and determining the N prime factors based on M being the non-prime number of processing nodes.
- 8. The method of claim 7, wherein the one or more additional processing nodes are added to balance bandwidth of the network, and wherein the one or more additional processing nodes include one or more regular processing nodes, or one or more collective compute parameter servers.
- 9. The method of claim 5, wherein a respective placement of the processing nodes in each dimension results in minimizing the maximum bandwidth.
- 10. The method of claim 5, wherein a respective placement of the processing nodes in each dimension results in maximizing the minimum bandwidth.
- 11. The method of claim 5, wherein a placement of the processing nodes on a same edge for a given dimension is based on balancing a respective speed of all edges for the given dimension.
- 12. The method of claim 5, wherein each processing node in the M number of processing nodes computes a set of weight gradients to be exchanged for the training process of the neural network, and wherein the data exchange operation includes a gradient exchange process.
- 13. The method of claim 12, wherein the method further includes re-configuring the M number of processing nodes upon completion of the gradient exchange process of a portion of the set of weight gradients.
- 14. The method of claim 12, wherein the scatter-reduce operation is performed for a dimension with a fewer number of processing nodes before a dimension with a greater number of processing nodes.
- 15. The method of claim 5, wherein each processing node includes multiple neural network processors.
- 16. The method of claim 5, wherein the scatter-reduce operation for a dimension is performed in parallel with processing nodes of an orthogonal dimension.
- 17. A non-transitory computer readable medium having stored therein instructions that, when executed by one or more processors, cause the one or more processors to perform a method comprising: determining N prime factors based on M number of processing nodes of a distributed computing system to perform a data exchange operation for training a neural network, wherein M is factorable into a non-prime factor that is greater than one and less than M; configuring the M number of processing nodes as a network represented as a hyper-rectangle with N dimensions, wherein a number of processing nodes in each of the N dimensions is one of the N prime factors; identifying two or more dimensions of the hyper-rectangle having a combined number of processing nodes being less than a node limit; collapsing the identified two or more dimensions of the hyper-rectangle into one dimension having the combined number of processing nodes; initiating each processing node in the M number of processing nodes to perform forward and backward propagation of the neural network to generate a respective set of data for exchange during a training process; and triggering a data exchange operation to exchange respective sets of data among the M number of processing nodes by: triggering a scatter-reduce operation to exchange respective sets of data according to a sequence of dimensions to generate a set of reduced data; and triggering an all-gather operation to exchange respective sets of reduced data according to a reverse order of the sequence of dimensions.
- 18. The non-transitory computer readable medium of claim 17, wherein the identified two or more dimensions of the hyper-rectangle are collapsed into one dimension further based on one or more of network topology, bandwidth of each link in the network, latency of each link in the network, or network characteristics.
- 19. The non-transitory computer readable medium of claim 17, wherein the scatter-reduce operation for a dimension is performed in parallel with processing nodes of an orthogonal dimension.
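The dimension-ordered exchange recited in claims 1, 5, and 17 can be modeled with a small simulation. This sketch checks only the end-to-end semantics (every node finishes with the element-wise sum of all nodes' gradients); it does not model the bandwidth-efficient chunking of a real scatter-reduce/all-gather, and all names here are hypothetical:

```python
from itertools import product

def dimension_ordered_reduce(grads, dims):
    """Reduce gradients over each hyper-rectangle dimension in sequence.

    grads maps a node coordinate (a tuple, one index per dimension) to that
    node's gradient vector. After reducing over every dimension, each node
    holds the element-wise global sum -- the result that a scatter-reduce
    followed by a reverse-order all-gather would distribute.
    """
    coords = list(product(*(range(d) for d in dims)))
    for axis in range(len(dims)):
        reduced = {}
        for c in coords:
            # Sum over the group of nodes that differ only along `axis`.
            group = (grads[c[:axis] + (k,) + c[axis + 1:]] for k in range(dims[axis]))
            reduced[c] = [sum(vals) for vals in zip(*group)]
        grads = reduced
    return grads

# 2 x 3 hyper-rectangle, one-element gradient vectors 0..5.
nodes = {c: [float(i)] for i, c in enumerate(product(range(2), range(3)))}
result = dimension_ordered_reduce(nodes, (2, 3))
```

Because each phase reduces over one axis of the hyper-rectangle, reducing over every axis in sequence is equivalent to a global reduction; this is why the all-gather phase can simply retrace the dimensions in the reverse order.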
Description
BACKGROUND
Neural networks can be used to perform tasks such as recognizing an object in an image. In a neural network, input data is combined with weights to derive output data using activation functions. For example, a neural network may take an image as input data, and output a decision or likelihood that a certain object is in the image. The set of weights used in a neural network can be determined by a training process, in which the neural network can learn how to perform a certain computing task for an application. The training process involves supplying a neural network model with training input data and a corresponding reference output which supports a particular decision (e.g., a detection or a non-detection of an object in an image). The neural network can perform computations to combine the weights with the training input data to generate training output data, and the training output data can be compared against the reference output data to assess the accuracy of the neural network model. During training, different training input data sets can be provided to generate different training output data sets. The weights of the neural network can be adjusted to minimize the differences between the training output data and the reference output data. To improve the likelihood of the neural network generating a correct decision, a large volume of training input data covering a large number of scenarios can be used to train the neural network. As a result, training a neural network may take a lot of time and computational resources.
BRIEF DESCRIPTION OF THE DRAWINGS
Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:
FIG. 1 illustrates an example of a training process to train a neural network;
FIGS. 2A-2F illustrate a gradient exchange process based on the ring network topology to exchange a set of weight gradients among 3 processing nodes;
FIGS. 3A-3B illustrate steps for gradient exchange for different dimensions of hypercubes;
FIGS. 4A-4E illustrate example hyper-rectangles for different numbers of processing nodes;
FIG. 5 illustrates a flow chart, which can be used to describe an algorithm to perform data exchange among processing nodes of an N-dimensional hyper-rectangle using all-reduce operations, according to certain embodiments;
FIGS. 6A-6C illustrate a gradient exchange process based on the algorithm to exchange a set of weight gradients among 3 processing nodes configured as collective parameter server nodes;
FIGS. 7A-7F illustrate a gradient exchange process performed by a 2×3 hyper-rectangle based on the algorithm, according to certain embodiments;
FIG. 8 illustrates a process, which can be used to describe performing the gradient exchange process in parallel on all the dimensions of a hyper-rectangle, according to certain embodiments;
FIGS. 9A-9D illustrate different configurations for a hyper-rectangle according to certain embodiments;
FIG. 10 illustrates an example of internal components of a computing device that can be used as a processing node of a distributed computing system, in certain embodiments;
FIG. 11 is a block diagram illustrating an example of an integrated circuit device;
FIG. 12 includes a block diagram that illustrates an example of an acceleration engine;
FIG. 13 includes a block diagram that illustrates an example of a host system;
FIG. 14 is a flow chart illustrating an example of a method to configure a plurality of processing nodes as a network represented as a hyper-rectangle with N dimensions, according to certain embodiments;
FIG. 15 illustrates a flow chart for a computer-implemented method for distributed training in a system of processing nodes configured as a hyper-rectangle network of N-dimensions, according to certain embodiments; and
FIG. 16 includes a diagram of an example network.
DETAILED DESCRIPTION
A neural network typically includes a number of cascading neural network layers, each associated with a set of weights. In an inference operation, a first neural network layer can receive an input data set, combine the input data set with the weights (e.g., by multiplying the input data set with the weights and then summing the products) to generate a first output data set for the layer, and propagate the output data set to a second neural network layer in a forward propagation operation. The second neural network layer performs another set of forward propagation operations on the first output data set from the first layer to generate a second output data set, and propagates the second output data set to higher neural network layers. The forward propagation operations can start at the first neural network layer and end at the highest neural network layer. The forward propagation operations at each neural network layer can represent different stages of extraction and processing of information from the input data set. A decision can then be made based on the output data of the highest neural network layer.
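The per-layer forward propagation described above (multiply the inputs by the weights, sum the products, apply an activation function) can be sketched as follows. The sigmoid activation and the function name `layer_forward` are assumptions for illustration; the description does not fix a particular activation function:

```python
import math

def layer_forward(inputs, weights, bias=0.0):
    """One neural-network layer's forward propagation: combine the input data
    set with the weights (multiply and sum), then apply an activation
    function (sigmoid here, chosen only for illustration)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Cascading layers: each layer's output propagates forward as the next
# layer's input, from the first layer up to the highest layer.
hidden = layer_forward([0.5, -1.0], [0.8, 0.2])
output = layer_forward([hidden], [1.5])
```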