CN-117376284-B - Distributed machine learning gradient synchronization method and system based on in-network computing
Abstract
The invention discloses a distributed machine learning gradient synchronization method based on in-network computing. The method comprises: constructing a distributed machine learning cluster; constructing a training set and distributing a training set subset to each computing node; constructing a convolutional neural network at each computing node; training the convolutional neural network at each computing node with its assigned training set subset to generate gradient data on that node; performing data mixing synchronization on the gradient data blocks of all computing nodes by means of a programmable switch and the servers, so that the gradient data on all computing nodes are fused into complete gradient data; updating the parameters of the convolutional neural network with the complete gradient data; and iteratively training the convolutional neural network to obtain a trained convolutional neural network model. The invention can greatly reduce the amount of data transmitted over the links, relieve congestion at the network cards of the server nodes, and improve the communication efficiency of the parameter-server synchronization mode.
Inventors
- YU XIAOSHAN
- REN ZEANG
- GU HUAXI
- WANG JIAKUN
- WANG KUN
Assignees
- Xidian University (西安电子科技大学)
Dates
- Publication Date: 20260505
- Application Date: 20220628
Claims (7)
- 1. A distributed machine learning gradient synchronization method based on intra-network computing, comprising: S1, constructing a distributed machine learning cluster, wherein the distributed machine learning cluster comprises a programmable switch and a plurality of servers respectively connected to the switch, and each server is configured with a plurality of computing nodes; S2, constructing a training set and distributing a training set subset to each computing node of each server; S3, constructing a convolutional neural network at each computing node, and training the convolutional neural network at the current computing node with the training set subset distributed to each computing node, so as to generate gradient data on each computing node; S4, performing data mixing synchronization on the gradient data blocks on all computing nodes by means of the programmable switch and the servers, so that each computing node obtains complete gradient data in which the gradient data of all computing nodes are fused; S5, updating the parameters of each convolutional neural network with the complete gradient data; and S6, iteratively training the convolutional neural network to obtain a trained convolutional neural network model. The step S4 comprises: S41, performing reduce-scatter on the gradient data blocks on all computing nodes, so that each computing node in each server obtains gradient data fused across all nodes in the current server; S42, slicing the gradient block fused at each computing node into a plurality of data packets and sending the data packets to the corresponding aggregation slots of the programmable switch; S43, performing data aggregation on the received data packets of the computing nodes in the corresponding aggregation slots of the programmable switch, and broadcasting the aggregated data to the corresponding computing nodes of all servers; S44, after receiving the aggregated data, overwriting and replacing, at the corresponding computing nodes in all servers, the original data in the current server; and S45, performing a final aggregation of the current data of all computing nodes in each server, so that each computing node in each server obtains complete gradient data in which the gradient data of all computing nodes are fused. The step S43 comprises: S43a, setting a counter B for each slot of the programmable switch, with an initial value of 0; S43b, when the programmable switch receives a data packet, aggregating the vector of the data packet into the slot addressed by the pool index of the data packet, wherein each time the corresponding slot receives a data packet from a server, the vector in the current data packet is added to the vector already aggregated in the slot and the value of the counter B of the slot is incremented by 1; and S43c, when the counter value of a slot reaches the number of servers, the programmable switch updates the vector in the original data packet p with the aggregated data in the slot, broadcasts the updated data packet to each server, and simultaneously sets the aggregated vector and the counter B of the current slot to zero.
- 2. The distributed machine learning gradient synchronization method based on intra-network computing of claim 1, wherein S2 comprises: S2a, selecting a plurality of picture samples, wherein each picture carries a target type label, and forming a training set from the picture samples and the corresponding target type labels; and S2b, dividing the training set into a number of training set subsets equal to the total number of computing nodes, and distributing an independent training set subset to each computing node.
- 3. The distributed machine learning gradient synchronization method based on intra-network computing of claim 1, wherein the convolutional neural network at each computing node comprises a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a third pooling layer, a fourth convolutional layer, and a fully-connected layer, which are sequentially connected.
- 4. The distributed machine learning gradient synchronization method based on intra-network computing of claim 1, wherein S41 comprises: S41a, numbering the computing nodes in each server, wherein the n-th computing node in the s-th server is denoted accordingly, and N represents the total number of nodes in each server; S41b, dividing the gradient data set on each computing node into N gradient blocks, and numbering each gradient block; S41c, setting left and right neighbors for the computing nodes in each server, wherein the left neighbor is the previous computing node of the current computing node, the right neighbor is the next computing node of the current computing node, the right neighbor of the last computing node is the first computing node, and the left neighbor of the first computing node is the last computing node; S41d, executing N-1 data transfers, wherein at the i-th data transfer each computing node receives a numbered gradient block from its left neighbor and simultaneously transmits one of its own numbered gradient blocks to its right neighbor, i having an initial value of 1 and ranging from 1 to N-1; S41e, at each computing node, adding each gradient block received during the N-1 data transfers to the local gradient block of the same number; and S41f, incrementing i and repeating steps S41c-S41e after each increment, so that each computing node in each server finally obtains one gradient block fused across all nodes in the current server.
- 5. The distributed machine learning gradient synchronization method based on intra-network computing of claim 4, wherein S42 comprises: S42a, dividing the gradient block fused at each computing node into a plurality of segments, wherein the size of each segment is consistent with the size of the vector in each slot of the programmable switch; S42b, constructing each segment into a data packet, wherein the data packet comprises the segmented data vector and carries a pool index idx and a computing node number z; and S42c, controlling the first computing node of each server to send a number of data packets equal to the number of pools in the programmable switch, each data packet being sent, in index order, to the aggregation slot of the programmable switch having the same index number.
- 6. The distributed machine learning gradient synchronization method based on intra-network computing of claim 5, wherein S44 comprises: S44a, the first computing node in each server receives the aggregated data packet from the programmable switch and, according to the offset field of the aggregated data packet, inserts the vector in the aggregated data packet into the first gradient data segment at the position corresponding to the offset; S44b, the offset of the aggregated data packet is incremented, the vector corresponding to the current offset is looked up, and a new data packet is constructed; if the offset value of the new data packet is smaller than the data quantity of one gradient block, steps S42c-S42d are repeated; otherwise the data packet is not processed until the value of the counter A reaches the number of segments contained in each gradient block, at which point all data packets of the gradient block in the current computing node have been sent, and the value of the counter A is set to 0; all servers then perform steps S42, S43 and S44a-S44d in order for the gradient data of the second gradient block of the second computing node, the third gradient block of the third computing node, and so on up to the N-th gradient block of the N-th computing node.
- 7. A distributed machine learning gradient synchronization system based on intra-network computing for performing the distributed machine learning gradient synchronization method of any one of claims 1 to 6, the system comprising a programmable switch and a plurality of servers respectively connected to the switch, each server being configured with a plurality of computing nodes, each computing node being provided with a convolutional neural network.
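The intra-server reduce-scatter of claim 4 (S41) is the standard ring pattern: each of N nodes holds N gradient blocks, and after N-1 neighbor-to-neighbor transfers each node holds the full sum of one block. A minimal NumPy sketch of this idea follows; the exact block-indexing convention is an illustrative assumption, since the numbering formulas in the claim did not survive extraction.

```python
# Hedged sketch of the intra-server reduce-scatter of S41 (ring pattern).
# blocks_per_node[n][b] is gradient block b held by node n. After N-1
# transfers, node n holds the fused sum of block (n + 1) % N under the
# indexing convention assumed here.
import numpy as np

def ring_reduce_scatter(blocks_per_node):
    n_nodes = len(blocks_per_node)
    for i in range(1, n_nodes):  # N-1 data transfers (S41d)
        # Snapshot what each node sends this round (transfers are simultaneous).
        sends = []
        for n in range(n_nodes):
            send_idx = (n - i + 1) % n_nodes   # block accumulated last round
            sends.append((send_idx, blocks_per_node[n][send_idx].copy()))
        # Each node receives from its left neighbor and adds in place (S41e).
        for n in range(n_nodes):
            idx, data = sends[(n - 1) % n_nodes]
            blocks_per_node[n][idx] += data
    return blocks_per_node
```

For example, with three nodes whose blocks all hold the node's own index, every fused block ends up equal to 0 + 1 + 2 = 3 at exactly one node.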
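The in-switch aggregation of S43 can be summarized in software: each aggregation slot keeps a vector accumulator plus a counter B, adds each arriving packet's vector into the slot addressed by the packet's pool index, and broadcasts and resets once every server has contributed. The sketch below is a host-side model of that control logic, not switch dataplane code; the slot count, vector width, and packet layout are illustrative assumptions.

```python
# Hedged software model of the S43 slot-aggregation logic of the
# programmable switch: per-slot accumulator + counter B (S43a-S43c).
import numpy as np

class AggregationSwitch:
    def __init__(self, num_slots, vec_len, num_servers):
        self.num_servers = num_servers
        self.slots = [np.zeros(vec_len) for _ in range(num_slots)]
        self.counter_b = [0] * num_slots  # counter B per slot, initially 0 (S43a)

    def receive(self, packet):
        """packet: dict carrying pool index 'idx' and gradient 'vector'."""
        idx = packet["idx"]
        self.slots[idx] += packet["vector"]          # aggregate into addressed slot (S43b)
        self.counter_b[idx] += 1
        if self.counter_b[idx] == self.num_servers:  # every server contributed (S43c)
            packet["vector"] = self.slots[idx].copy()  # rewrite packet with aggregate
            self.slots[idx][:] = 0                   # zero the slot vector
            self.counter_b[idx] = 0                  # and its counter B
            return packet                            # would be broadcast to all servers
        return None                                  # still waiting for more packets
```

With two servers, the first packet into a slot returns nothing, and the second triggers the broadcast carrying the element-wise sum.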
Description
Distributed machine learning gradient synchronization method and system based on in-network computing

Technical Field

The invention belongs to the technical field of distributed machine learning, and particularly relates to a distributed machine learning gradient synchronization method and system based on in-network computing, which can effectively reduce the communication overhead of a distributed machine learning system.

Background

In recent years, the scale of machine learning models and data sets applied to fields such as image recognition and natural language processing has increased dramatically, greatly increasing the training time of neural network models. Distributed machine learning makes full use of the computing resources in a cluster and extends the training of a neural network model from a single node on a single machine to multiple nodes on multiple machines, which can effectively reduce training time. However, as the scale of distributed machine learning grows, the communication overhead generated by synchronization among the many nodes worsens and seriously limits the acceleration that distributed machine learning can deliver. The two classical distributed gradient synchronization modes are the parameter server (PARAMETER SERVER) and all-reduce (allreduce). As the volume of parameter data and the number of working nodes grow, congestion very easily occurs on the parameter server side. In each round of training, all working nodes must upload their trained gradients to the parameter server node; the large amount of gradient data backs up at the parameter server node, and the resulting hot-spot traffic can cause network congestion.
In the iterative process of distributed machine learning, a large amount of burst traffic reaches the parameter server within a few milliseconds, and packet loss, congestion, load imbalance and other problems can occur in the network, increasing the completion time of the whole distributed application. In 2019, ByteDance released BytePS, a distributed training framework based on the parameter server architecture. The framework uses additional CPU resources as parameter servers, thereby improving communication performance. However, for the parameter servers in BytePS to achieve theoretical communication efficiency, the same number of parameter server nodes as working nodes would need to be introduced. Furthermore, when BytePS is extended to a distributed multi-machine cluster, additional CPU servers need to be configured to act as parameter server nodes. Moreover, the working nodes of multiple servers periodically upload gradients to the parameter server, which also creates a network bottleneck.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a distributed machine learning gradient synchronization method and system based on in-network computing, which aim to offload part of the gradient synchronization to programmable devices and relieve the communication pressure caused by synchronizing a large number of gradients, so as to improve system throughput.
The technical problem to be solved by the invention is achieved by the following technical scheme.

One aspect of the present invention provides a distributed machine learning gradient synchronization method based on in-network computing, comprising: S1, constructing a distributed machine learning cluster, wherein the distributed machine learning cluster comprises a programmable switch and a plurality of servers respectively connected to the switch, and each server is configured with a plurality of computing nodes; S2, constructing a training set and distributing a training set subset to each computing node of each server; S3, constructing a convolutional neural network at each computing node, and training the convolutional neural network at the current computing node with the training set subset distributed to each computing node, so as to generate gradient data on each computing node; S4, performing data mixing synchronization on the gradient data blocks on all computing nodes by means of the programmable switch and the servers, so that each computing node obtains complete gradient data in which the gradient data of all computing nodes are fused; S5, updating the parameters of each convolutional neural network with the complete gradient data; and S6, iteratively training the convolutional neural network to obtain a trained convolutional neural network model.

In one embodiment of the present invention, S2 comprises: S2a, selecting a plurality of picture samples, wherein each picture carries a target type label, and forming a training set from the picture samples and the corresponding target type labels; S2b, dividing the training set into training set subsets with th