
CN-121981206-A - Distributed deep learning communication scheduling optimization method based on selective gradient compression


Abstract

The invention discloses a distributed deep learning communication scheduling optimization method based on selective gradient compression, comprising the following steps: constructing a convergence time model; constructing a priority index for each gradient tensor; constructing a candidate compression tensor set; obtaining an execution order for the gradient tensor communication operations; computing local gradient tensors; and compressing and communicating the gradient tensors according to the compression tensor set and the communication execution order to obtain a globally synchronized gradient. By optimally balancing per-iteration time against the convergence loss caused by compression, the method determines the gradient compression subset that minimizes total training time, and improves the parallelism between the synchronous aggregation stage and computation, thereby accelerating distributed deep learning training.

Inventors

  • Hong Mingfa
  • Dong Pingping
  • Tang Wensheng

Assignees

  • Hunan Normal University (湖南师范大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-22

Claims (7)

  1. A distributed deep learning communication scheduling optimization method based on selective gradient compression, characterized by comprising the following steps: Step 1, construct a single-iteration time model T_iter and an iteration-count model I, and multiply the two to construct a convergence time model T = T_iter × I; Step 2, run iterative training of the deep learning model and, for the cases in which all gradient tensors are compressed, part of the gradient tensors are compressed, and no gradient tensor is compressed, record the norms of the corresponding gradient tensors and the global norm as fitting data, then fit the iteration-count model I and the convergence time model T to the fitting data to obtain a fitted iteration-count model and a fitted convergence time model; Step 3, construct a priority index for each gradient tensor; Step 4, using the tensor priority index, minimize the fitted convergence time model with a greedy algorithm to obtain the set of gradient tensors that are compressed when the convergence time model is minimized, thereby constructing the candidate compression tensor set; Step 5, determine the execution order of the gradient tensor communication operations during training of the deep learning model through the selective gradient compression scheduling algorithm; and Step 6, during formal training, each local worker computes its local gradient tensors through its copy of the deep learning model, then compresses and communicates the local gradient tensors according to the candidate compression tensor set obtained in Step 4 and the communication execution order obtained in Step 5, so as to obtain the globally synchronized gradient. (Illustrative sketches of the compression-set selection and of the scheduling follow the claims.)
  2. The selective gradient compression-based distributed deep learning communication scheduling optimization method according to claim 1, wherein the quantities of the single-iteration time model T_iter and the iteration-count model I are defined as follows: A, B, D, and F are the coefficients to be fitted; ||G|| denotes the time-averaged global gradient norm; T_f is the forward propagation time and T_b is the back-propagation time; T_c is the processing time of the compression procedure for the gradient tensors, and t_c,i is the compression time of the i-th tensor; T_s,c is the synchronous aggregation time of the compressed gradient tensors, t_s,i is the synchronous aggregation time of the i-th tensor, and T_s is the synchronous aggregation time of the uncompressed gradient tensors; ||g_k|| is the gradient norm of the k-th compressed tensor, and S is the number of compressed tensors.
  3. The selective gradient compression-based distributed deep learning communication scheduling optimization method according to claim 1, wherein in Step 3 the gradient tensor priority index is constructed from the following quantities: P_i is the priority index of the i-th gradient tensor; Δt_s,i is the reduction in the gradient tensor's synchronous aggregation time after compression; t_c,i is the processing time of the compression procedure for the gradient tensor; and ΔI_i is the increase in the number of iterations caused by compressing the i-th gradient tensor, obtained from the iteration-count model I.
  4. The distributed deep learning communication scheduling optimization method based on selective gradient compression according to claim 1, wherein in Step 5 the selective gradient compression scheduling algorithm is as follows: S21, divide the gradient tensor scheduling into two stages, with the end of back-propagation as the boundary: the back-propagation process is the first stage and the remaining time is the second stage; divide the gradient tensors into several tensor blocks of a fixed block size, and take the WFBP (wait-free back-propagation) communication pattern as the initial order of the synchronous aggregation operations; S22, in the first stage, according to whether a gradient tensor belongs to the compression set and to the start and end time points of its synchronous aggregation, either perform segmented synchronous aggregation on the tensor or move the operation to the second stage so that it runs in parallel with the forward propagation of the next iteration; segmented synchronous aggregation delays part of the tensor blocks to the second stage, so that the gradient tensor is synchronously aggregated in two passes, the second of which incurs an additional communication startup overhead; S23, in the second stage, compute the size of the tensors whose synchronous aggregation is not parallel to the forward propagation computation, and move those tensors to be synchronously aggregated in the first-stage time periods not yet overlapped with the back-propagation computation, so as to obtain the final execution order of the synchronous aggregation operations. (See the scheduling sketch after the claims.)
  5. The distributed deep learning communication scheduling optimization method based on selective gradient compression according to claim 4, wherein the specific method of step S22 is as follows: S221, compute the start and end time points of synchronous aggregation, gradient tensor by gradient tensor, following the back-propagation computation; in this back-propagation process the gradient tensor computed at the deepest layer is numbered L and the gradient tensor computed at the shallowest layer is numbered 1, the synchronous aggregation process transmits from the L-th tensor down to the 1st tensor, and the start and end time points of synchronous aggregation are computed sequentially in the order L to 1; S222, when a gradient tensor belongs to the compression set, compute the effect of the tensor, before and after compression, on the start time of the next gradient tensor's synchronous aggregation; if the compression operation advances that start time, keep the compression operation, otherwise remove the tensor from the compression set; and S223, apply the corresponding gradient tensor scheduling according to how each synchronous aggregation operation overlaps with the back-propagation computation.
  6. The selective gradient compression-based distributed deep learning communication scheduling optimization method according to claim 5, wherein the specific steps of step S223 are as follows: if the synchronous aggregation of the current gradient tensor is completely parallel to the back-propagation computation of the other gradient tensors, continue the synchronous aggregation as scheduled, where completely parallel means that the communication operation is exactly covered by one back-propagation computation; if the current gradient tensor's synchronous aggregation starts parallel to the back-propagation computation of one gradient tensor but ends parallel to the back-propagation computation of the next gradient tensor, compute the non-overlapped time period within the forward propagation of the next iteration, and when this non-overlapped period exceeds the startup overhead of a synchronous aggregation, split off the part that is parallel to the next tensor's back-propagation computation and delay it to the second stage, in parallel with the forward propagation of the next iteration; if the current gradient tensor's synchronous aggregation starts parallel to the back-propagation computation of one gradient tensor but ends parallel to the back-propagation computation of the K-th gradient tensor after it, where K ≥ 2, delay the entire synchronous aggregation to run in parallel with the forward propagation of the next iteration.
  7. The distributed deep learning communication scheduling optimization method based on selective gradient compression according to claim 4, wherein the specific steps of S23 are as follows: S231, compute the size of the gradient tensors that are not completely parallel to the forward propagation computation, and compute the size of the gradient tensors that can be adjusted to overlap with the back-propagation computation; and S232, according to the result of step S231, adjust the execution order of the gradient tensor communication operations one by one, starting from the last gradient tensor, until the synchronous aggregation of every gradient tensor is completely parallel to either the back-propagation computation or the forward propagation computation.
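
The two sketches below are illustrative only. First, a minimal Python sketch of how steps 3 and 4 of claim 1 could be realized: each gradient tensor receives a priority index trading the aggregation time saved by compression against the extra iterations compression causes (the quantities of claim 3), and a greedy pass keeps a candidate only while the fitted convergence time keeps decreasing. The index formula, the class and function names, and the convergence_time callback are assumptions, since the patent's own formulas are not reproduced in the source text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TensorStats:
    name: str
    dt_sync_saved: float  # reduction in synchronous aggregation time after compression (claim 3)
    t_compress: float     # processing time of the compression procedure (claim 3)
    d_iters: float        # extra iterations caused by compressing this tensor, from the fitted model I

def priority(t: TensorStats) -> float:
    # Hypothetical index: net per-iteration time saved per extra iteration incurred.
    return (t.dt_sync_saved - t.t_compress) / max(t.d_iters, 1e-9)

def select_compression_set(tensors, convergence_time):
    """Greedy step 4: add tensors in priority order while the fitted convergence time decreases."""
    chosen = frozenset()
    best = convergence_time(chosen)
    for t in sorted(tensors, key=priority, reverse=True):
        trial = chosen | {t.name}
        cand = convergence_time(trial)
        if cand < best:
            chosen, best = trial, cand
    return chosen
```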
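
Second, a much-simplified sketch of the two-stage scheduling of claims 4 to 6: communication operations are laid out against the back-propagation timeline (stage 1, deepest tensor L first, as in S221), and an operation that spills past the end of back-propagation is either split, with its tail delayed to overlap the next iteration's forward propagation (stage 2), or delayed entirely. The timing fields, the fwd_idle parameter, and the threshold handling are hypothetical simplifications of S22/S223.

```python
from dataclasses import dataclass

@dataclass
class CommOp:
    tensor_id: int
    duration: float         # synchronous aggregation time of this tensor
    bp_end: float           # time at which back-propagation finishes producing this tensor
    startup: float = 0.001  # communication startup overhead per message

def schedule(ops, bp_finish, fwd_idle):
    """Stage-1 layout with the simplified split-or-delay rule of S22/S223."""
    t, stage1, stage2 = 0.0, [], []
    for op in sorted(ops, key=lambda o: -o.tensor_id):   # deepest layer L first (S221)
        start = max(t, op.bp_end)                        # wait for the tensor and for the link
        end = start + op.duration
        if end <= bp_finish:                             # fully covered by back-propagation
            stage1.append((op.tensor_id, start, end))
            t = end
        elif start < bp_finish and fwd_idle > op.startup:
            covered = bp_finish - start                  # split: front part stays in stage 1
            stage1.append((op.tensor_id, start, bp_finish))
            stage2.append((op.tensor_id, op.duration - covered + op.startup))
            t = bp_finish
        else:                                            # delay the whole op to stage 2
            stage2.append((op.tensor_id, op.duration + op.startup))
    return stage1, stage2
```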

Description

Distributed deep learning communication scheduling optimization method based on selective gradient compression

Technical Field

The invention relates to the field of information technology, and in particular to a distributed deep learning communication scheduling optimization method based on selective gradient compression.

Background

With the rapid development of artificial intelligence technology, massive amounts of data such as speech, images, and video are continuously emerging, providing abundant raw data for artificial intelligence algorithms and models. For example, the ImageNet dataset for the image classification task contains 14 million images covering more than 20,000 categories. At the same time, the parameter scale of deep learning models keeps expanding; the parameter scale of Microsoft's topic model LightLDA, for instance, exceeds 200. The enormous data scale and complex model structures pose significant challenges to conventional single-machine training strategies in terms of data storage, processing, and computing power. As model sizes continue to grow and data volumes increase rapidly, distributed deep learning has become an important means of handling large-scale training. By distributing the training tasks of large-scale data and complex models across multiple computing nodes and exploiting parallel processing and resource optimization, distributed deep learning significantly improves training efficiency. Taking data-parallel distributed training as an example, a large-scale dataset is divided into several subsets, each assigned to a different computing node; every node independently performs forward and backward propagation on its data subset, the gradients produced by backward propagation are synchronized through communication operations (such as All-Reduce) to obtain the global gradient, and the model parameters are updated with that global gradient. In this way, data parallelism fully utilizes the computing resources of multiple machines and meets the challenges posed by growing data scale and model complexity, thereby accelerating the training of large-scale data and complex models and improving training efficiency and model performance. In practice, however, distributed deep learning also faces communication bottlenecks. These bottlenecks manifest mainly in two ways: 1) as the number of nodes grows, communication overhead increases linearly, synchronization performance cannot be guaranteed, and communication becomes the biggest bottleneck limiting training speed; 2) the amount of communicated data is extremely large, and for neural networks with many parameters the communication share of each iteration also rises, making distributed training even more prone to bottlenecks. Existing methods for alleviating the communication bottleneck of distributed deep learning training include gradient compression and communication scheduling. Gradient compression methods increase training speed by reducing the total amount of communication, but inevitably cause reduced training accuracy and additional compression overhead, as illustrated by the sketch below.
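
As a concrete picture of this trade-off, here is a minimal, generic sketch of one data-parallel synchronization step with optional top-k gradient compression, using torch.distributed. It is not the patent's method: the helper names (sync_step, topk_sparsify) and the ratio parameter are hypothetical, and real systems would also feed the dropped entries back as a residual.

```python
import torch
import torch.distributed as dist

def topk_sparsify(grad: torch.Tensor, ratio: float = 0.01) -> torch.Tensor:
    """Keep only the largest-magnitude `ratio` fraction of entries (generic top-k compression)."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, idx = flat.abs().topk(k)
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]              # dropped entries are the compression error
    return out.view_as(grad)

def sync_step(model: torch.nn.Module, compress: bool = False) -> None:
    """One data-parallel synchronization: All-Reduce local gradients into the global gradient."""
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is None:
            continue
        g = topk_sparsify(p.grad) if compress else p.grad.clone()
        dist.all_reduce(g, op=dist.ReduceOp.SUM)   # sum gradients across workers
        p.grad.copy_(g / world)                    # average -> globally synchronized gradient
```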
Compared with gradient compression, communication scheduling improves the parallel efficiency of computation and communication during training by adjusting the execution order of the gradient tensors' synchronous aggregation operations; however, the rapid improvement of GPU performance keeps shrinking computation time relative to communication time, so the training acceleration that scheduling alone achieves on certain models is limited.

Disclosure of Invention

The invention aims to provide a distributed deep learning communication scheduling optimization method based on selective gradient compression. To achieve the above purpose, the technical scheme of the invention is as follows. A distributed deep learning communication scheduling optimization method based on selective gradient compression comprises the following steps. Step 1, construct a single-iteration time model T_iter and an iteration-count model I, and multiply the two to construct the convergence time model T = T_iter × I. Step 2, run iterative training of the deep learning model and, for the cases in which all gradient tensors are compressed, part of the gradient tensors are compressed, and no gradient tensor is compressed, record the norms of the corresponding gradient tensors and the global norm as fitting data; fit the iteration-count model I and the convergence time model T to these data to obtain a fitted iteration-count model and a fitted convergence time model (a sketch of this fitting step follows).
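
To make the model structure of steps 1 and 2 concrete, here is a minimal sketch, under an assumed functional form, of fitting an iteration-count model to gradient-norm measurements and multiplying by the per-iteration time. The patent's actual expressions (with coefficients A, B, D, F, per claim 2) are not reproduced in the source, so iter_model below is only an illustrative stand-in, and the data shown are synthetic placeholders.

```python
import numpy as np
from scipy.optimize import curve_fit

def iter_model(x, A, B, D, F):
    # Assumed stand-in for the iteration-count model I fitted in step 2;
    # x bundles the time-averaged global norm and the compressed-tensor norm sum.
    global_norm, compressed_norm = x
    return A + B * global_norm + D * compressed_norm + F * compressed_norm / global_norm

def convergence_time(t_iter, x, coeffs):
    """Convergence time model of claim 1: T = T_iter * I."""
    return t_iter * iter_model(x, *coeffs)

# Synthetic measurements for the no / partial / full compression runs of step 2.
norms = (np.array([1.0, 0.95, 0.85, 0.8, 0.7]),   # time-averaged global gradient norms
         np.array([0.0, 0.2, 0.5, 0.6, 0.9]))     # norms of the compressed tensors
iters = np.array([100.0, 107.0, 120.0, 126.0, 142.0])  # observed iterations to converge

coeffs, _ = curve_fit(iter_model, norms, iters)
print(convergence_time(t_iter=0.05, x=(1.0, 0.4), coeffs=coeffs))
```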