CN-121981211-A - Fine granularity distributed training method and system based on gradient quantization sparse compression
Abstract
The invention discloses a fine-grained distributed training method and system based on gradient quantization and sparse compression. It provides a dynamic tensor fusion technique that automatically searches for the optimal tensor-fusion buffer threshold by binary search and merges the gradient tensors of multiple layers into a larger buffer for transmission, achieving the optimal balance between communication latency and bandwidth utilization without manual tuning. A communication-decoupled fine-grained pipeline scheduling mechanism is provided: the gradient synchronization process is decoupled into a sparse communication task in the backward propagation stage and a quantized communication task in the forward propagation stage, both of which overlap at fine granularity with the forward and backward computation tasks, maximizing the utilization of computing resources and fully hiding communication latency. The method adopts a sliced Top-k sparsification plus AlltoAll routing strategy in the backward propagation stage, which avoids transmitting large numbers of zero values and reduces the communication complexity of backward propagation, and adopts a low-bit sparse quantization plus AllGather strategy in the forward propagation stage, which resolves the bandwidth bottleneck caused by traditional sparsification methods reverting to dense data in the parameter synchronization stage.
Inventors
- WU XI
- LI JIAQI
- PENG JING
- WANG TIEJUN
- GU JIQING
Assignees
- Chengdu University of Information Technology (成都信息工程大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-04-07
Claims (5)
- 1. The fine-grained distributed training method based on gradient quantization and sparse compression is characterized by: designing a communication-decoupled fine-grained pipeline scheduling mechanism that decouples the gradient synchronization process into a sparse communication task in the backward propagation stage and a quantized communication task in the forward propagation stage, achieving fine-grained overlap with the forward and backward computation tasks; adopting a sliced Top-k sparsification plus AlltoAll routing strategy in backward propagation and a low-bit sparse quantization plus AllGather strategy in forward propagation; and designing a dynamic tensor fusion technique, applied before backward and forward propagation, that automatically searches for the optimal tensor-fusion threshold by binary search. The method specifically comprises the following steps (each illustrated by a corresponding sketch after the claims):
Step 1, execute sliced Top-k sparsification gradient compression and AlltoAll communication in the backward propagation stage. Residual feedback and gradient correction mechanisms are introduced: the currently computed gradient is added to the residual left over from the previous iteration to obtain a corrected gradient. A sliced Top-k sparsification operation is performed on the corrected gradient, i.e., each computing node screens out only the k elements with the largest absolute values from each gradient slice it is responsible for, communicates them to the other computing nodes, and writes the unselected parts back into a local residual buffer. The sparse gradient values screened by the Top-k algorithm and their indexes are exchanged among all computing nodes using the AlltoAll collective communication primitive; after receiving the sparse fragments from the other nodes, the receiving end accumulates them locally into the dense gradient used for model updating.
Step 2, perform low-bit sparse quantization gradient compression and AllGather communication in the forward propagation stage. Based on the dense gradient aggregated in Step 1, the current gradient parameters are first compensated with the accumulated quantization error through an error feedback mechanism. A grouped quantization strategy divides the gradient parameters into several groups, independently computes and maintains a scaling factor for each group, and compresses the compensated gradient parameters into a low-bit signed-integer representation. The low-bit quantized gradient parameters and the corresponding quantization metadata are synchronized among the computing nodes using the AllGather collective communication primitive. After each node receives the quantized gradient parameters and the quantization metadata, it performs inverse quantization with the transmitted scaling factors to reconstruct the gradient parameters and updates a local quantization-error buffer.
Step 3, communication-decoupled fine-grained pipeline scheduling. Communication-decoupled deep-overlap scheduling is adopted in the forward and backward propagation stages, decoupling the gradient synchronization process into the sparse communication task of the backward propagation stage and the quantized communication task of the forward propagation stage, and achieving fine-grained overlap of communication with the forward and backward computation tasks. In the backward propagation stage, exploiting the sequential dependence of layer-by-layer computation in deep neural networks, as soon as the gradient of a layer has been computed and its Top-k sparsification compression of Step 1 has completed, the AlltoAll communication task of that layer is triggered immediately and scheduled onto an independent CUDA stream, so that it executes in parallel with the backward computation of the next layer. Similarly, in the forward propagation stage, the AllGather synchronization tasks for the gradient parameters are scheduled ahead of time to overlap with the forward computation currently in progress, ensuring that the required parameters have been transmitted, inversely quantized, and reconstructed through Step 2 before computation of the next layer starts.
Step 4, construct a dynamic tensor fusion strategy optimized by binary search. Before the gradient compression step, gradient tensors are fused: the gradients of several consecutive layers are merged into one large buffer for unified transmission. In the initial stage of training, different threshold configurations in the search space are explored automatically by binary search, with the current system throughput monitored as the feedback signal.
- 2. The fine-grained distributed training method of claim 1, wherein the data processing of Step 1 specifically comprises (see the first sketch after the claims): Step 1.1, residual correction: before sparsification, a residual buffer is maintained locally and the corrected gradient is computed first, i.e., at the t-th iteration the gradient residual not transmitted in the previous round is added to the current true gradient; Step 1.2, sliced Top-k sparsification: assuming n GPUs are available, the local gradient is logically divided into n slices; each node independently screens out, from each slice it is responsible for, the k gradient values with the largest absolute value, and the unselected values are written back into the local residual buffer to compensate the next iteration; Step 1.3, AlltoAll sparse gradient exchange: the screened sparse gradient values and their indexes are exchanged through the underlying AlltoAll collective communication primitive; each node transmits its sparse gradient fragments point-to-point to the corresponding target nodes, and after the receiving end has received the sparse gradient fragments from all other nodes, it accumulates and aggregates them locally to form the dense gradient update of its slice.
- 3. The fine-grained distributed training method of claim 2, wherein the data processing of Step 2 specifically comprises (see the second sketch after the claims): Step 2.1, quantization error compensation and grouped quantization: the gradient tensor parameters are divided into several subgroups and a scaling factor is computed independently for each subgroup; an error feedback mechanism is applied, i.e., before quantization the previously accumulated quantization error is added to the current gradient tensor parameters to obtain a compensated value, which is then quantized into signed integers; Step 2.2, AllGather low-bit transmission and inverse quantization: the compressed low-bit data stream is broadcast to all nodes by the AllGather collective communication primitive; after each node receives the compressed gradient data and quantization metadata, it performs the inverse quantization operation to reconstruct the high-precision gradient parameters, computes the current quantization error, and stores it in the error buffer.
- 4. The fine-grained distributed training method according to claim 3, wherein the tensor-fusion threshold optimization algorithm based on binary search in Step 4 outputs the optimal fusion threshold through the following procedure (see the fourth sketch after the claims): first, set a minimum threshold, a maximum threshold, the number of search iterations K, and a throughput evaluation function; initialize the search space by setting the current lower bound of the search to the minimum threshold and the upper bound to the maximum threshold; enter the search loop and, while the iteration count is smaller than K and the lower bound is smaller than the upper bound, compute a midpoint and a probe point; evaluate system performance, recording the average system throughput at the midpoint and at the probe point respectively; adjust the search interval, updating the lower or upper bound according to the recorded average throughputs; finally, stop and output, returning the midpoint of the current interval as the optimal threshold after the loop finishes.
- 5. The fine-grained distributed training system based on gradient quantization and sparse compression is characterized by being used to realize the fine-grained distributed training method based on gradient quantization and sparse compression, and comprises a distributed optimization layer, a tensor fusion layer, a gradient compression layer, and a collective communication layer, specifically: the distributed optimization layer performs the communication-decoupled fine-grained pipeline scheduling, adopting communication-decoupled deep-overlap scheduling in the forward and backward propagation stages and decoupling the gradient synchronization process into the sparse communication task of the backward propagation stage and the quantized communication task of the forward propagation stage, so as to achieve fine-grained overlap of communication with the forward and backward computation tasks; the tensor fusion layer merges the gradients of several consecutive layers into one large buffer for unified transmission, automatically explores different threshold configurations in the search space by binary search in the initial stage of training to obtain the optimal tensor-fusion buffer threshold, and monitors the current system throughput as the feedback signal; the gradient compression layer executes the sliced Top-k sparse gradient compression in the backward propagation stage and the low-bit sparse quantization gradient compression in the forward propagation stage; the collective communication layer uses the AlltoAll collective communication primitive to exchange the sparse gradient values screened by the Top-k algorithm and their indexes among all computing nodes, and uses the AllGather collective communication primitive to synchronize the forward-propagation low-bit quantized gradient parameters and corresponding quantization metadata among the computing nodes.
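The four sketches below are illustrative only: they show, in PyTorch-style Python, one plausible realization of each claimed step. All function names, buffer shapes, and default parameters are assumptions for exposition, not the patent's reference implementation. The first sketch covers the backward-stage processing of claim 2 (residual correction, sliced Top-k, AlltoAll exchange of values and indexes), assuming torch.distributed is initialized with an NCCL backend, the flattened gradient length divides evenly by the world size, and k does not exceed the slice length.

```python
import torch
import torch.distributed as dist

def slice_topk_alltoall(grad: torch.Tensor, residual: torch.Tensor, k: int) -> torch.Tensor:
    """Backward stage: residual correction, sliced Top-k, AlltoAll exchange."""
    world, rank = dist.get_world_size(), dist.get_rank()
    corrected = grad.flatten() + residual            # gradient correction (step 1.1)
    slices = corrected.chunk(world)                  # one logical slice per node
    send_vals, send_idx = [], []
    for s in slices:
        _, idx = s.abs().topk(k)                     # k largest-magnitude entries (step 1.2)
        send_vals.append(s[idx].contiguous())
        send_idx.append(idx.contiguous())
    # Everything not selected stays behind in the local residual buffer.
    residual.copy_(corrected)
    offset = 0
    for s, idx in zip(slices, send_idx):
        residual[offset + idx] = 0                   # transmitted entries leave the residual
        offset += s.numel()
    # Exchange sparse values and their indexes with all peers (step 1.3).
    recv_vals = [torch.empty(k, dtype=grad.dtype, device=grad.device) for _ in range(world)]
    recv_idx = [torch.empty(k, dtype=torch.long, device=grad.device) for _ in range(world)]
    dist.all_to_all(recv_vals, send_vals)
    dist.all_to_all(recv_idx, send_idx)
    # Accumulate the fragments received for this node's slice into a dense update.
    dense_slice = torch.zeros(slices[rank].numel(), dtype=grad.dtype, device=grad.device)
    for v, i in zip(recv_vals, recv_idx):
        dense_slice.index_add_(0, i, v)
    return dense_slice
```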
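The second sketch covers the forward-stage processing of claim 3: error feedback, grouped quantization with per-group scaling factors, AllGather of the low-bit payload plus metadata, and inverse quantization. For brevity the low-bit values are carried in int8 containers rather than bit-packed, and params.numel() is assumed divisible by group_size; group_size and bits are illustrative defaults.

```python
import torch
import torch.distributed as dist

def grouped_quant_allgather(params: torch.Tensor, err: torch.Tensor,
                            group_size: int = 256, bits: int = 4) -> list:
    """Forward stage: error feedback, grouped low-bit quantization, AllGather."""
    comp = params.flatten() + err                    # error compensation (step 2.1)
    groups = comp.view(-1, group_size)               # grouped quantization
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for signed 4-bit values
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / qmax
    q = torch.round(groups / scale).clamp(-qmax, qmax).to(torch.int8)
    # Record the new quantization error for the next iteration's compensation.
    err.copy_(comp - (q.float() * scale).flatten())
    # Synchronize the low-bit payload and the per-group scales (metadata).
    world = dist.get_world_size()
    q_all = [torch.empty_like(q) for _ in range(world)]
    s_all = [torch.empty_like(scale) for _ in range(world)]
    dist.all_gather(q_all, q)
    dist.all_gather(s_all, scale)
    # Inverse quantization (step 2.2): reconstruct each peer's parameters
    # with the transmitted scaling factors.
    return [(qi.float() * si).flatten() for qi, si in zip(q_all, s_all)]
```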
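The third sketch illustrates the pipeline scheduling of claim 1, Step 3: a per-layer gradient hook launches that layer's collective on an independent CUDA stream as soon as the gradient is ready, so communication overlaps with the next layer's backward computation. LayerCommState, attach_overlap_hook, and the stand-in payload are all assumed scaffolding; a real pipeline would send the Step 1 compressed values and indexes and wait on the handle before the optimizer step.

```python
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()        # independent CUDA stream for collectives

class LayerCommState:
    """Per-layer buffers; recv_buf must match the sent payload size."""
    def __init__(self, numel: int, device: torch.device):
        self.recv_buf = torch.empty(numel, device=device)
        self.handle = None               # async work handle, waited on before use

def attach_overlap_hook(param: torch.Tensor, state: LayerCommState) -> None:
    """Launch this layer's AlltoAll the moment its gradient is ready (step 3)."""
    def hook(grad):
        # Stand-in payload; the real pipeline would send the Top-k values and
        # indexes produced by the Step 1 compression instead.
        payload = grad.flatten().contiguous()
        comm_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(comm_stream):
            # Runs on the side stream, overlapping with the backward
            # computation of the next layer on the default stream.
            state.handle = dist.all_to_all_single(state.recv_buf, payload,
                                                  async_op=True)
        return grad
    param.register_hook(hook)
```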
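The fourth sketch follows the threshold-search procedure of claim 4: a bounded binary search over the tensor-fusion buffer threshold that compares measured throughput at a midpoint and a nearby probe point to decide which half of the interval to keep. measure_throughput is an assumed callback that runs a few training iterations at the given threshold (in bytes) and returns average throughput; K and step are illustrative defaults.

```python
def search_fusion_threshold(measure_throughput, t_min: int, t_max: int,
                            K: int = 10, step: int = 1 << 20) -> int:
    """Claim 4: binary search for the fusion threshold, throughput as feedback."""
    low, high = t_min, t_max                 # initialize the search interval
    for _ in range(K):                       # at most K search iterations
        if low >= high:
            break
        mid = (low + high) // 2              # midpoint of the current interval
        probe = min(mid + step, high)        # probe point just above the midpoint
        thr_mid = measure_throughput(mid)    # average throughput at the midpoint
        thr_probe = measure_throughput(probe)
        if thr_probe > thr_mid:
            low = probe                      # throughput still rising: search right
        else:
            high = mid                       # throughput falling: search left
    return (low + high) // 2                 # midpoint of the final interval
```

Because throughput as a function of the fusion threshold is roughly unimodal (small buffers are latency-bound, oversized buffers delay overlap), comparing the midpoint against a probe point lets the search converge in O(K) measurements instead of sweeping the whole space.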
Description
Fine granularity distributed training method and system based on gradient quantization sparse compression

Technical Field
The invention relates to the technical field of large language model distributed training, in particular to a fine-grained distributed training method and system based on gradient quantization and sparse compression.

Background
With the explosive growth of deep learning in fields such as computer vision and natural language processing, model parameter counts and training data set sizes have increased exponentially. A single GPU cannot meet the training requirements of large-scale deep neural networks, so distributed training on multi-GPU clusters has become the industry-wide solution. Data parallelism is the most common parallel mode; to guarantee model convergence, all nodes must perform a global synchronization of gradients before updating parameters. The classical synchronization method uses the AllReduce collective communication primitive; however, as cluster size grows, the communication overhead of AllReduce increases dramatically and gradually becomes the core bottleneck limiting training speed.

To address this communication bottleneck, researchers first made many optimizations at the system architecture level, for example the Ring-AllReduce algorithm proposed by Baidu, the wait-free backpropagation (WFBP) strategy proposed by Vogel et al., and the tensor fusion technique introduced by Shi et al. Although system-level optimization improves throughput, in very-large-scale model training the amount of data transferred remains enormous. The research focus has therefore gradually shifted to gradient compression techniques, which fall into two main directions: sparsification and quantization.

Existing distributed training schemes primarily combine the compression techniques described above with communication primitives. One mainstream scheme is TopKAllGather; while this approach retains the important gradient information, it must transmit additional index data, since the standard AllReduce operator does not support direct aggregation of sparse data. Another improvement is the decoupled communication strategy (e.g., deAR), which decomposes AllReduce into ReduceScatter and AllGather and schedules them separately to overlap with computation. The technology most similar to the invention, the TopKAA algorithm, innovates on this basis: it uses the AlltoAll communication primitive to exchange sparse gradient slices in the backward propagation stage, avoiding the full-broadcast characteristic of AllGather, and attempts to reduce traffic in the backward stage. In addition, some prior attempts, such as JointSQ, combine sparsification and quantization to obtain the compression benefits of both, but suffer certain disadvantages in communication efficiency.

The main disadvantages of the prior art are as follows.
1. Bandwidth and index bottlenecks when sparse communication is scaled up. Existing Top-k sparsification schemes rely heavily on the AllGather operator to synchronize the sparse gradients and their indexes. The overhead of this communication pattern grows linearly with the number of GPUs in the cluster. When the number of nodes is large, aggregating the sparse slices of all nodes causes the total data volume to grow rapidly, even canceling out the compression bonus of sparsification, so scalability on large clusters is poor.
2. Incomplete staged compression, resulting in significant forward-propagation communication overhead. Although methods represented by TopKAA achieve sparse communication via AlltoAll in the backward propagation phase, the sparse slices are restored to a dense state after local aggregation on each node. Consequently, in the subsequent forward propagation phase (which typically synchronizes parameters with AllGather), the system must transmit uncompressed or only slightly compressed dense data. Because the forward-phase communication cannot be fully compressed, the end-to-end communication cost of the overall training remains high.
3. Very-low-bit quantization suffers accuracy collapse in the absence of system-level coordination. Although quantization techniques exist, when aggressive compression with very low bit widths (e.g., 2-bit or 4-bit) is attempted, existing quantization methods often introduce large numerical errors, causing models (especially precision-sensitive large language models) to fail to converge or to suffer a significant drop in generalization ability. In addition, existing schemes often treat quantization and sparsification separately, lack a mechanism capable of deeply fusing very-low-bit quantization with pipeline scheduling, and cannot