CN-122019191-A - Efficient heterogeneous pipeline parallel training method for commercial server and large language model
Abstract
The invention belongs to the field of parallel training of large language models, and in particular relates to an efficient heterogeneous pipeline parallel training method for commercial servers and large language models, comprising offline performance data acquisition, offline model partitioning, preparation and initialization, and an online training stage. In the online training stage, a prefetch-aware layer-packet partitioning strategy divides the model into layer packets, so that the prefetch overhead is fully overlapped with computation and GPU memory usage is optimized; the limit that CPU memory capacity places on the trainable model size is removed by eliminating redundant parameter copies in CPU memory and by a memory-reuse strategy. In addition, asynchronous parameter prefetching and CPU-side parameter updating are achieved by exchanging data between heterogeneous devices asynchronously in units of layers. The invention significantly reduces GPU and CPU memory occupation during training and improves training performance.
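For orientation, the sketch below illustrates the kind of asynchronous, layer-unit exchange between heterogeneous devices that the abstract describes: parameters are copied host-to-device on one side stream while gradients are copied device-to-host on another, so both transfers can overlap with compute. This is a minimal PyTorch-style sketch, not the patented implementation; the stream objects, the pinned-memory buffers, and the `prefetch_layer`/`offload_grads` helpers are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the patent's code): overlap per-layer
# parameter prefetch (CPU -> GPU) and gradient offload (GPU -> CPU) with
# computation by issuing the copies on separate CUDA streams.
import torch

compute_stream = torch.cuda.current_stream()
prefetch_stream = torch.cuda.Stream()   # carries host-to-device parameter copies
offload_stream = torch.cuda.Stream()    # carries device-to-host gradient copies

def prefetch_layer(cpu_params, gpu_buffer):
    """Asynchronously copy one layer's parameters from pinned CPU memory to the GPU."""
    with torch.cuda.stream(prefetch_stream):
        for name, src in cpu_params.items():
            gpu_buffer[name].copy_(src, non_blocking=True)  # src must be pinned

def offload_grads(gpu_grads, cpu_grad_buffer):
    """Asynchronously copy one layer's gradients from the GPU to pinned CPU memory."""
    with torch.cuda.stream(offload_stream):
        for name, grad in gpu_grads.items():
            cpu_grad_buffer[name].copy_(grad, non_blocking=True)

def wait_for_prefetch():
    """Make the compute stream wait only when a prefetched layer is actually needed."""
    compute_stream.wait_stream(prefetch_stream)
```

Issued this way, the copies for a later layer run in the background while the current layer computes, which is the overlap that the layer-packet sizing in the claims is designed to guarantee.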
Inventors
- DING YUQUAN
- SHAO JIE
Assignees
- University of Electronic Science and Technology of China (电子科技大学)
Dates
- Publication Date: 20260512
- Application Date: 20260413
Claims (9)
- 1. An efficient heterogeneous pipeline parallel training method for a commercial server and a large language model, characterized by comprising the following steps: S1, offline performance data acquisition: splitting the large language model into units of its original layers and acquiring the performance parameters of each original layer on the GPU, the performance parameters comprising forward computation time, backward computation time, and the data transfer time from an external storage medium to GPU memory; S2, offline model partitioning: determining the layer-packet size with a prefetch-aware layer-packet partitioning strategy based on the performance data acquired in step S1, and combining several consecutive original layers into a layer packet of that size to serve as the basic computation unit during training; S3, memory management configuration: selecting one of the following two configurations to manage model parameter storage, namely a first configuration that stores the complete model parameters in CPU memory and a second configuration that stores the complete model parameters on a solid-state drive; S4, training initialization: allocating GPU memory according to the layer-packet size determined in step S2, and registering, for each layer of the model, hook functions that trigger parameter prefetching, memory reclamation, and gradient offloading; S5, pipeline execution: during training, sequentially executing forward computation, recomputation, and backward computation on the GPU in units of layer packets, wherein, during the computation of the current layer packet, prefetching of the parameters of the next layer packet from the storage location corresponding to the configuration selected in step S3 into GPU memory is triggered asynchronously; S6, asynchronous gradient offloading: as soon as the GPU completes the backward computation of the i-th original layer in a layer packet and produces its gradient, asynchronously offloading that layer's gradient to CPU memory; and S7, asynchronous parameter updating: after CPU memory receives a gradient, asynchronously performing the parameter update of the corresponding original layer.
- 2. The method of claim 1, wherein the prefetch-aware layer-packet partitioning strategy in step S2 comprises dynamically setting the layer-packet size based on the collected per-layer performance data, the layer-packet size being set such that, in the ideal case, the forward computation time of the current layer packet on the GPU is not less than the time required to transfer the parameters of the next layer packet from the external storage medium to GPU memory, so that computation overlaps with the prefetch data transfer.
- 3. The method of claim 2, wherein dynamically setting the layer-packet size and combining the plurality of consecutive original layers into layer packets according to that size in step S2 comprises: based on the forward computation times of all original layers acquired in step S1, calculating the average forward computation time of a single original layer; based on the data transfer times of the original layers acquired in step S1, calculating the average transfer time of a single original layer from the external storage medium to GPU memory; calculating the prefetch overhead ratio K, i.e. the ratio of the average per-layer transfer time to the average per-layer forward computation time; and setting the layer-packet size to the larger of a preset minimum value and K, and combining the original layers into layer packets according to that size, wherein the minimum value is 2 (an illustrative sizing sketch is given after the claims).
- 4. The method of claim 3, wherein the large language model is a model based on the Transformer architecture, the original layer is a Transformer layer, and the Transformer layers are combined uniformly according to the set layer-packet size.
- 5. The method according to claim 1, characterized in that, in step S3: when the first configuration is selected, the complete model parameters are stored in CPU shared memory and are accessed by all training processes; when the second configuration is selected, the complete model parameters are stored on the solid-state drive; and in both configurations, a layer buffer for temporarily holding prefetched parameters and a gradient buffer for temporarily holding offloaded gradients are created in page-locked (pinned) CPU memory.
- 6. The method according to claim 1 or 5, wherein the GPU memory space allocated in step S4 is capable of accommodating at least one layer packet being computed and one layer packet being prefetched simultaneously.
- 7. The method of claim 5, wherein the hook functions registered in step S4 comprise: a forward pre-hook function, triggered before the forward computation or recomputation of an original layer, which prefetches the parameters of an original layer in the next layer packet to be executed from the layer buffer into GPU memory; a forward hook function, triggered after the forward computation of an original layer completes, which releases the GPU memory occupied by that layer's parameters; and a backward hook function, triggered after the backward computation of an original layer completes, which triggers the asynchronous offloading, described in step S6, of the gradient of the original layer whose backward computation has completed.
- 8. The method of claim 7, wherein, during recomputation, after an original layer's parameters have been prefetched into GPU memory, its buffer space in the layer buffer is retained to receive that layer's gradient data offloaded from the GPU after backward computation.
- 9. The method of claim 1, wherein the asynchronous parameter update in step S7 is performed in parallel with the recomputation and backward computation of other layer packets on the GPU, and step S7 further comprises, when the second configuration is currently selected, writing the updated parameters back to the solid-state drive after the parameter update completes and reclaiming the buffer space occupied by that layer's parameters in the layer buffer.
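As a reading aid, the following is a minimal Python sketch of the prefetch-aware layer-packet sizing referenced in claim 3, under the assumption that the prefetch overhead ratio K is the average per-layer transfer time divided by the average per-layer forward computation time, rounded up to an integer. The function names and the profiling lists are illustrative, not part of the patent.

```python
# Illustrative sketch of claim 3's prefetch-aware layer-packet sizing (steps S1-S2).
# Assumption: K = ceil(average transfer time / average forward time).
import math

def layer_packet_size(fwd_times, trans_times, min_size=2):
    """Derive the layer-packet size from per-layer profiling data (step S1)."""
    t_fwd = sum(fwd_times) / len(fwd_times)        # average forward compute time per layer
    t_trans = sum(trans_times) / len(trans_times)  # average CPU/SSD -> GPU transfer time per layer
    k = math.ceil(t_trans / t_fwd)                 # assumed prefetch overhead ratio K
    return max(min_size, k)                        # claim 3: larger of preset minimum (2) and K

def group_layers(layers, packet_size):
    """Combine consecutive original (Transformer) layers into layer packets (step S2)."""
    return [layers[i:i + packet_size] for i in range(0, len(layers), packet_size)]

# Example: if loading a layer's parameters takes ~3x as long as its forward pass,
# the sizing yields packets of 3 layers, so a packet's compute can hide the next packet's loads.
packets = group_layers(list(range(24)), layer_packet_size([1.0] * 24, [3.0] * 24))
assert len(packets) == 8 and len(packets[0]) == 3
```

Claims 5-7 then drive prefetching and offloading through per-layer hooks; the stream-based sketch after the abstract shows one way such asynchronous copies can be issued so that they overlap with compute.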
Description
Efficient heterogeneous pipeline parallel training method for commercial server and large language model

Technical Field

The invention belongs to the field of parallel training of large language models, and particularly relates to an efficient heterogeneous pipeline parallel training method for commercial servers and large language models.

Background

With the advent of large language models (LLMs) based on the Transformer structure, model parameter counts have grown sharply: well-known large language models such as GPT-3 and LLaMA have reached tens to hundreds of billions of parameters, so the memory required for model training far exceeds the memory capacity of a single GPU. Although LLMs show remarkable performance, only large enterprises equipped with expensive computing-center-grade servers are currently able to explore and develop LLMs; for example, the capital required to train GPT-3 runs to millions of dollars. Most researchers have access only to commercial servers consisting of a small number of low-cost graphics cards and limited CPU memory, which makes training or fine-tuning an LLM difficult.

In response to this problem, some methods relieve GPU memory pressure by offloading data from the GPU to CPU memory or to a solid-state drive (SSD). However, most of these works are based on observations of convolutional neural networks and were not designed with the characteristics of LLMs in mind. For example, the virtualized deep neural network (vDNN) selectively offloads the outputs of convolutional layers because activations account for the largest share of memory in convolutional neural network training. As another example, the SuperNeurons technique chooses to offload convolutional layer outputs rather than recompute them, again based on the large computational overhead of convolutional layers. More importantly, these methods are designed for single-GPU scenarios and cannot fully exploit the parallel computing capacity and aggregated GPU memory of multiple GPUs.

Parallel training methods use the memory aggregated across multiple GPUs to share the memory overhead of training. For example, pipeline parallelism and tensor parallelism employ inter-layer and intra-layer parameter partitioning, respectively, allowing a larger trainable model because each GPU stores only part of the model parameters. However, in LLM training, model parameters and optimizer states are the main consumers of GPU memory. In particular, in a commercial server with only a few low-cost graphics cards, even the combined memory of all GPUs cannot meet the enormous GPU memory requirements of an LLM; taking fine-tuning of GPT-3 as an example, the task requires the computational support of 9 DGX-2 servers. Thus, the cost of training or fine-tuning an LLM relying solely on aggregated GPU memory remains unacceptable to most data scientists.

Heterogeneous data-parallel methods, such as ZeRO-Offload and ZeRO-Infinity, combine data parallelism with external storage techniques to further expand the trainable model scale. ZeRO-Offload offloads the optimizer state to CPU memory, but the training scale it can support is still limited by GPU memory, since each GPU must store a complete set of model parameters.
Building on this, ZeRO-Infinity further partitions the parameters of each network layer and offloads the partitioned parameters to the SSD. However, because the partitioned parameters must be restored to their complete form before use, and the parameter prefetching operation adds further memory on top, its peak GPU memory occupation is not noticeably lower than that of ZeRO-Offload. Furthermore, the above methods all require frequent collective communication of parameters and gradients, which leads to poor performance on commercial servers that rely solely on PCIe (Peripheral Component Interconnect Express) for inter-device communication. The heterogeneous tensor-parallel method Stronghold also introduces external storage devices; after parameters have been used in computation, they are offloaded back to CPU memory. By partitioning the parameters of each model layer, this method effectively relieves GPU memory pressure, so that each GPU needs to store only part of each layer's parameters. However, because the parameters are partitioned within each layer, the results computed independently on each GPU still have to be combined through frequent collective communication to produce complete activations or gradients, which significantly reduces training efficiency. The heterogeneous pipeline parallel method WA-Pipe only needs to transmit an activation value between two adjace