
CN-122021770-A - Global budget constraint-based large language model heterogeneous structure unit joint structured pruning method, device, equipment and storage medium

CN122021770A

Abstract

The application relates to the technical field of deep learning, and in particular to a large language model heterogeneous structural unit joint structured pruning method, device, equipment and storage medium based on global budget constraint. The method comprises: determining candidate structural units, wherein the candidate structural units comprise feedforward network structural units and attention mechanism structural units; constructing a unified cost model and setting a global resource budget constraint; inputting unlabeled natural language text sequences into a pretrained large language model and training preset learnable gating score variables; after training converges, fixing the binary masks obtained by training; and pruning the pretrained large language model based on the trained binary masks to obtain the model to be deployed. The application jointly selects feedforward network structural units and attention mechanism structural units under a single unified budget constraint, so that, compared with prior-art pruned models under the same budget, the structured pruning model achieves higher accuracy on downstream tasks.

Inventors

  • QIN JIANBIN
  • WANG ZIYANG

Assignees

  • Shenzhen University (深圳大学)

Dates

Publication Date
2026-05-12
Application Date
2026-04-09

Claims (10)

  1. A global budget constraint-based large language model heterogeneous structural unit joint structured pruning method, the method comprising: determining candidate structural units in a pretrained large language model to be pruned, wherein the candidate structural units comprise feedforward network structural units and attention mechanism structural units; constructing a unified cost model for all candidate structural units, and setting a global resource budget constraint corresponding to the cost model; acquiring an unlabeled natural language text sequence, inputting it into the pretrained large language model, keeping the weights of the pretrained large language model unchanged, and training preset learnable gating score variables based on the cost model and the global resource budget constraint; and, after training converges, fixing the binary mask obtained by training for each candidate structural unit, pruning the pretrained large language model based on the trained binary masks to obtain a model to be deployed, and predicting an output text sequence from an input text sequence based on the model to be deployed.
  2. The global budget constraint-based large language model heterogeneous structural unit joint structured pruning method according to claim 1, wherein acquiring an unlabeled natural language text sequence, inputting it into the pretrained large language model, keeping the weights of the pretrained large language model unchanged, and training the preset learnable gating score variables based on the cost model and the global resource budget constraint comprises: acquiring an unlabeled natural language text sequence and inputting it into the pretrained large language model; in each training iteration of the preset learnable gating score variables, keeping the weights of the pretrained large language model unchanged, and determining the current gating score of each candidate structural unit from the learnable gating score variables; generating binary masks for all candidate structural units based on the current gating scores, the cost model and the global resource budget constraint; determining the current model structure of the pretrained large language model based on the binary masks, and performing a forward computation with the current model structure to obtain a token probability distribution; acquiring the target token sequence in the current natural language text sequence, and computing the cross-entropy loss between the token probability distribution and the target token sequence; and, based on the cross-entropy loss, computing gradients by back-propagation and updating the learnable gating score variables according to the gradients (an end-to-end sketch of this training loop follows the description).
  3. The global budget constraint-based large language model heterogeneous structural unit joint structured pruning method according to claim 2, wherein generating binary masks for all of the candidate structural units based on the current gating scores, the cost model and the global resource budget constraint comprises: calculating the structural unit cost of each candidate structural unit using the cost model; sorting all candidate structural units by their current gating scores to obtain a sorted order; and accumulating the structural unit costs item by item in the sorted order, generating binary masks that retain the candidate structural units satisfying the global resource budget constraint (a code sketch for this claim follows the claims list).
  4. The global budget constraint-based large language model heterogeneous structural unit joint structured pruning method according to claim 2, wherein computing gradients by back-propagation based on the cross-entropy loss and updating the learnable gating score variables according to the gradients comprises: based on the cross-entropy loss, back-propagating with a straight-through estimator to obtain approximate gradients; and passing the gradients directly through the preset discrete mask path to the learnable gating score variables, so as to guide the training of the learnable gating score variables (a code sketch for this claim follows the claims list).
  5. The global budget constraint-based large language model heterogeneous structural unit joint structured pruning method according to claim 1, wherein pruning the pretrained large language model based on the trained binary masks to obtain the model to be deployed comprises: physically clipping the pretrained large language model based on the trained binary masks to obtain a dense sub-model; and performing scaling calibration on the dense sub-model to obtain the model to be deployed.
  6. The global budget constraint-based large language model heterogeneous structural unit joint structured pruning method according to claim 5, wherein physically clipping the pretrained large language model based on the trained binary masks to obtain a dense sub-model comprises: determining, based on the trained binary masks, the feedforward network structural units and attention mechanism structural units to be retained in the pretrained large language model; and physically removing the non-retained feedforward network structural units and attention mechanism structural units along the weight tensor dimensions to obtain the dense sub-model (a code sketch for this claim follows the claims list).
  7. The global budget constraint-based large language model heterogeneous structural unit joint structured pruning method according to claim 5, wherein performing scaling calibration on the dense sub-model to obtain the model to be deployed comprises: keeping the trained binary masks unchanged, and applying preset scaling factors to the feedforward network structural units and attention mechanism structural units retained in the dense sub-model; optimizing the scaling factors using an unlabeled natural language text sequence to compensate for the scale change; and folding the optimized scaling factors into the weights of the dense sub-model to obtain the model to be deployed (a code sketch for this claim follows the claims list).
  8. A global budget constraint-based large language model heterogeneous structural unit joint structured pruning device, the device comprising: a structural unit determination module for determining candidate structural units in a pretrained large language model to be pruned, wherein the candidate structural units comprise feedforward network structural units and attention mechanism structural units; a construction module for constructing a unified cost model for all candidate structural units and setting a global resource budget constraint corresponding to the cost model; a model training module for acquiring an unlabeled natural language text sequence, inputting it into the pretrained large language model, keeping the weights of the pretrained large language model unchanged, and training preset learnable gating score variables based on the cost model and the global resource budget constraint; and a model determination module for fixing, after training converges, the binary mask obtained by training for each candidate structural unit, pruning the pretrained large language model based on the trained binary masks to obtain a model to be deployed, and predicting an output text sequence from an input text sequence based on the model to be deployed.
  9. An apparatus comprising a memory, a processor, and a global budget constraint-based large language model heterogeneous structural unit joint structured pruning program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the global budget constraint-based large language model heterogeneous structural unit joint structured pruning method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed, implements the steps of the global budget constraint-based large language model heterogeneous structural unit joint structured pruning method according to any one of claims 1 to 7.
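
The greedy selection of claim 3 can be read as a single knapsack-style pass over all units at once. Below is a minimal Python sketch; the unit granularity (single FFN hidden channels and whole attention heads), the parameter-count cost model, and the 60% budget are illustrative assumptions, as are the helper names unit_costs and select_units — the claim does not fix these choices.

```python
import random

def unit_costs(d_model, d_ffn, n_heads, head_dim, n_layers):
    """Unified cost model (assumption: cost = parameters a unit contributes)."""
    costs = {}
    for layer in range(n_layers):
        for ch in range(d_ffn):          # one FFN unit = one hidden channel
            costs[("ffn", layer, ch)] = 2 * d_model
        for h in range(n_heads):         # one attention unit = one whole head
            costs[("head", layer, h)] = 4 * d_model * head_dim
    return costs

def select_units(scores, costs, budget):
    """Sort every candidate unit by gating score, then accumulate costs
    item by item, keeping a unit only while the budget allows it (units
    that no longer fit are skipped — an assumption; the claim only
    requires that the final mask meet the budget)."""
    mask, spent = {u: 0 for u in scores}, 0
    for unit in sorted(scores, key=scores.get, reverse=True):
        if spent + costs[unit] <= budget:
            mask[unit] = 1
            spent += costs[unit]
    return mask

costs = unit_costs(d_model=64, d_ffn=256, n_heads=8, head_dim=8, n_layers=2)
scores = {u: random.random() for u in costs}          # stand-in gating scores
mask = select_units(scores, costs, budget=0.6 * sum(costs.values()))
print(sum(mask.values()), "of", len(mask), "units kept")
```

Because FFN channels and attention heads enter the same ranking with costs in the same currency, cheap but useful channels can displace expensive heads and vice versa, which is the point of the joint selection under a unified budget.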
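The straight-through estimator of claim 4 is a standard construction: the forward pass sees the hard 0/1 mask, while the backward pass routes the gradient onto the continuous gating scores as if the mask were differentiable. A minimal PyTorch sketch, with a top-k selection standing in for the budget-constrained selection of claim 3 (hard_mask_ste is a hypothetical helper name):

```python
import torch

def hard_mask_ste(gate_scores: torch.Tensor, k: int) -> torch.Tensor:
    hard = torch.zeros_like(gate_scores)          # non-differentiable 0/1 mask
    hard[gate_scores.topk(k).indices] = 1.0
    soft = torch.sigmoid(gate_scores)             # differentiable surrogate
    # Forward value is exactly `hard`; gradients flow only through `soft`.
    return hard + soft - soft.detach()

gates = torch.randn(10, requires_grad=True)
mask = hard_mask_ste(gates, k=6)
loss = (mask * torch.arange(10.0)).sum()          # any loss consuming the mask
loss.backward()
print(gates.grad)   # non-zero: the hard mask did not block the gradient
```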
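The physical removal of claim 6 along the weight-tensor dimensions amounts to materializing smaller dense layers from the retained indices, so the pruned model needs no mask at inference time. A minimal PyTorch sketch for a two-layer FFN (the helper clip_ffn and the layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

def clip_ffn(up: nn.Linear, down: nn.Linear, keep: torch.Tensor):
    """Build a dense sub-FFN containing only the channels where keep == True."""
    idx = keep.nonzero(as_tuple=True)[0]
    new_up = nn.Linear(up.in_features, len(idx))
    new_down = nn.Linear(len(idx), down.out_features)
    with torch.no_grad():
        new_up.weight.copy_(up.weight[idx])         # rows = hidden channels
        new_up.bias.copy_(up.bias[idx])
        new_down.weight.copy_(down.weight[:, idx])  # columns = hidden channels
        new_down.bias.copy_(down.bias)
    return new_up, new_down

up, down = nn.Linear(64, 256), nn.Linear(256, 64)
keep = torch.rand(256) > 0.4                       # stand-in binary mask
new_up, new_down = clip_ffn(up, down, keep)
x = torch.randn(2, 64)
masked = down(torch.relu(up(x)) * keep.float())    # masked original model
clipped = new_down(torch.relu(new_up(x)))          # physically clipped model
print(torch.allclose(masked, clipped, atol=1e-5))  # True: outputs match
```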
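The scaling calibration of claim 7 can be sketched as one learnable scale per retained FFN channel, optimized on a frozen model and then folded into the down-projection weights so the deployed model carries no extra parameters. The per-channel granularity and the mean-squared objective against a reference output are assumptions; the claim only states that scales are optimized on unlabeled text and folded into the weights.

```python
import torch
import torch.nn as nn

class PrunedFFN(nn.Module):
    def __init__(self, d_model=64, d_kept=96):
        super().__init__()
        self.up = nn.Linear(d_model, d_kept)
        self.down = nn.Linear(d_kept, d_model)
        self.scale = nn.Parameter(torch.ones(d_kept))   # one scale per channel
        for name, p in self.named_parameters():
            p.requires_grad_(name == "scale")           # weights stay frozen

    def forward(self, x):
        return self.down(torch.relu(self.up(x)) * self.scale)

    def fold_scale(self):
        # Absorb the scale into the down-projection columns and reset it to 1;
        # the output is unchanged but the extra parameter disappears.
        with torch.no_grad():
            self.down.weight.mul_(self.scale)
            self.scale.fill_(1.0)

ffn = PrunedFFN()
opt = torch.optim.Adam([ffn.scale], lr=1e-2)
x = torch.randn(32, 64)                 # activations from unlabeled text
reference = torch.randn(32, 64)         # stand-in for the unpruned output
for _ in range(200):
    opt.zero_grad()
    loss = (ffn(x) - reference).pow(2).mean()
    loss.backward()
    opt.step()
before = ffn(x)
ffn.fold_scale()
print(torch.allclose(before, ffn(x), atol=1e-5))  # True: folding is lossless
```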

Description

Global budget constraint-based large language model heterogeneous structural unit joint structured pruning method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of deep learning, and in particular to a large language model heterogeneous structural unit joint structured pruning method, device, equipment and storage medium based on global budget constraint.

Background

As large language models enter real applications and engineering deployments, the goal of model compression has shifted from merely being able to run a model to being able to serve it stably at a controlled cost. Unlike earlier models, the inference cost of a large language model is not determined by parameter count alone; it is also shaped by weight-read bandwidth, the tensor shapes of matrix multiplication operations, and the GPU-memory footprint of the key-value cache during autoregressive decoding. In a typical GPU inference environment, model weights must reside in GPU memory and be read frequently, so when the system is bandwidth-bound, throughput is directly governed by weight volume. Long-context inputs significantly amplify the intermediate tensors and activation reads and writes associated with attention computation, and autoregressive decoding must maintain a per-layer key-value cache to avoid repeated computation; this cache grows linearly with context length and with the number of concurrent requests, which determines the key tension between concurrent throughput and GPU-memory capacity.

Therefore, a compression method that only reduces parameter count, without also reducing the structural overhead of the inference stage, is hard to convert reliably into end-to-end gains. By contrast, structured pruning that directly changes the dimensional shapes of matrix multiplications while also narrowing the key-value cache is far more likely to relieve compute, memory and bandwidth pressure simultaneously.

Existing pruning techniques have generally evolved from deleting individual weights to deleting structural units. Early work centered on unstructured weight pruning, typified by classical pruning based on second-order sensitivity information and by later magnitude pruning with iterative fine-tuning; but without efficient sparse compute kernels and compiler support, unstructured sparsity rarely yields end-to-end acceleration. To improve deployability, research gradually turned to structured pruning that aligns the pruned objects with tensor dimensions and operator shapes, such as channel pruning, sparsity-inducing regularization methods, and head pruning oriented to attention. With the rise of large language models, low-cost post-training pruning has become the mainstream demand: methods have emerged that construct importance measures from a small amount of calibration data, and that reduce error propagation through block-level approximate updates to achieve one-shot, large-ratio pruning, while channel-level and layer-level clipping pipelines tailored to large language model structures have been widely explored. Such methods can quickly produce a smaller dense model without full fine-tuning, lowering the deployment threshold.
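
To make the linear growth of the key-value cache described above concrete, here is a back-of-envelope calculation; the model shape (32 layers, 32 KV heads of dimension 128, fp16) is a hypothetical 7B-class configuration, not taken from the application.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # One K and one V tensor per layer, each (batch, n_kv_heads, seq_len, head_dim).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

gib = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                     seq_len=4096, batch=8) / 2**30
print(f"{gib:.0f} GiB")  # 16 GiB; doubling context length or batch doubles it
```

Pruning whole attention heads shrinks n_kv_heads, which is why narrowing the key-value cache width pays off directly in concurrency and memory headroom.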
However, although existing structured pruning methods improve execution speed and reduce hardware requirements, they implicitly assume that structural units contribute additively and independently of one another. Because functions in a large language model are often carried cooperatively by multiple structures, unit-level ranking easily deletes a critical member of a complementary combination by mistake, triggering nonlinear performance degradation at high compression ratios and leaving the structured pruning model with low accuracy on downstream tasks. For example, when a text generation task is executed with a model obtained by an existing structured pruning method, the accuracy of the predicted text is low. Accordingly, the prior art has drawbacks and needs improvement and development.

Disclosure of Invention

The application provides a large language model heterogeneous structural unit joint structured pruning method, device, equipment and storage medium based on global budget constraint, to solve the technical problem in the related art that structured pruning models have low accuracy when executing downstream tasks. To achieve the above purpose, the present application adopts the following technical scheme: a large language model heterogeneous structural unit joint structured pruning method based on global budget constraint.
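
Pulling the claimed steps together, the following is a minimal end-to-end sketch of the gate-training loop of claim 2 in PyTorch. MaskedLM is a toy stand-in for the frozen pretrained model, the top-k mask approximates the budget-constrained selection of claim 3, and all names and dimensions are illustrative rather than taken from the application.

```python
import torch
import torch.nn.functional as F

def hard_mask_ste(g, k):                       # straight-through mask (claim 4)
    hard = torch.zeros_like(g)
    hard[g.topk(k).indices] = 1.0
    soft = torch.sigmoid(g)
    return hard + soft - soft.detach()

class MaskedLM(torch.nn.Module):
    """Toy frozen LM whose FFN hidden channels are gated by a binary mask."""
    def __init__(self, vocab=100, d_model=32, d_ffn=64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, d_model)
        self.up = torch.nn.Linear(d_model, d_ffn)
        self.down = torch.nn.Linear(d_ffn, d_model)
        self.head = torch.nn.Linear(d_model, vocab)
        for p in self.parameters():
            p.requires_grad_(False)            # pretrained weights stay fixed

    def forward(self, tokens, mask):
        h = self.emb(tokens)
        h = h + self.down(torch.relu(self.up(h)) * mask)
        return self.head(h)

model = MaskedLM()
gates = (0.01 * torch.randn(64)).requires_grad_()  # learnable gating scores
opt = torch.optim.Adam([gates], lr=1e-2)
text = torch.randint(0, 100, (4, 17))          # unlabeled token sequences
inputs, targets = text[:, :-1], text[:, 1:]    # next tokens are the targets

for step in range(100):
    mask = hard_mask_ste(gates, k=40)          # current structure (claims 2-3)
    logits = model(inputs, mask)
    loss = F.cross_entropy(logits.reshape(-1, 100), targets.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

final_mask = hard_mask_ste(gates, k=40).detach()   # fixed after convergence
print("kept channels:", int(final_mask.sum()))
```

After convergence, the fixed mask would drive the physical clipping of claims 5 and 6 and the scaling calibration of claim 7 sketched earlier.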