CN-121996258-A - Large model rapid deployment method and system integrating structure pruning and quantization compression

CN121996258A

Abstract

The invention provides a rapid deployment method and system for a large model that fuses structured pruning and quantization compression, and relates to the technical field of model lightweighting. Aiming at the insufficient coordination between pruning and quantization and the reliance on manual parameter tuning that existing approaches exhibit on vertical-domain tasks, the invention provides a collaborative optimization framework fusing structured pruning and quantization compression. Through a single end-to-end joint training run, the framework automatically and collaboratively determines the optimal pruning structure and quantization strategy of the model and strictly executes structured pruning, thereby achieving efficient compression and rapid edge-side deployment of a large model while keeping task accuracy controllable.

Inventors

  • BAI XUE
  • FU FANGBIN
  • XU MING
  • QI XUGUANG
  • WANG XIAOYI
  • DING JIXIN
  • CHEN ZHAOYUE

Assignees

  • Beihang University (北京航空航天大学)

Dates

Publication Date
2026-05-08
Application Date
2026-01-23

Claims (10)

  1. A rapid deployment method for a large model fusing structured pruning and quantization compression, characterized by comprising the following steps: performing initial fine-tuning of a base model on a task-specific data set of the target vertical domain to obtain an initial domain-adapted large model; performing optimization training on the initial domain-adapted large model with the vertical-domain task data set, wherein during training a collaborative optimization framework fusing structured pruning and quantization compression carries out joint training of automated pruning decisions and adaptive allocation of mixed-precision bit widths on the domain-adapted large model, yielding a trained domain-adapted large model; and exporting the trained domain-adapted large model, performing hard pruning and hard quantization on it to obtain a lightweight large model, performing edge-device hardware adaptation on the lightweight model to generate a hardware-compatible model file, and deploying the model file to the target edge device to complete rapid deployment of the large model.
  2. The rapid deployment method for a large model fusing structured pruning and quantization compression according to claim 1, wherein the collaborative optimization framework fusing structured pruning and quantization compression comprises a learnable structure mask and a differentiable bit-width selection; wherein the structure mask means that, instead of applying a sparsity constraint directly to the weights, a learnable structure mask factor m_k is inserted after each structural unit of the model: in the forward pass the unit's output is scaled by m_k, during training m_k is driven toward sparsity, and when m_k approaches 0 the corresponding structure is logically switched off; the differentiable bit-width selection means that a set of candidate bit widths is preset for each layer, a learnable bit-width probability parameter π_l is introduced, and the Gumbel-Softmax trick is used to sample from π_l, realizing a softened, differentiable form of the discrete bit-width selection (see the first sketch following the claims).
  3. The rapid deployment method for a large model fusing structured pruning and quantization compression according to claim 2, wherein, in the optimization training of the initial domain-adapted large model on the vertical-domain task data set, the loss function used during training is: L_total = L_task + λ1·L_sparse + λ2·L_res, with L_res = Σ_{l=1}^{N} b_l·s_l·C_l; wherein L_total denotes the total loss function; L_task denotes the task loss function; L_sparse denotes the structure-sparsity loss function, a regularization of the structure masks m_k; L_res denotes the resource-aware loss function; l denotes the index of a model layer or structural unit, l = 1, …, N; N denotes the total number of layers or structural units in the base model participating in the collaborative optimization; b_l denotes the resource coefficient corresponding to the quantization bit width selected by the l-th layer; s_l denotes the structure resource coefficient corresponding to the pruning structure retained by the l-th layer; C_l denotes the comprehensive resource overhead of the l-th layer on the target hardware; and λ1 and λ2 are weight parameters (see the loss sketch following the claims).
  4. The rapid deployment method for a large model fusing structured pruning and quantization compression according to claim 3, wherein the joint training of automated pruning decisions and adaptive allocation of mixed-precision bit widths on the domain-adapted large model through the collaborative optimization framework, yielding a trained domain-adapted large model, comprises: training the initial domain-adapted large model on the vertical-domain task data set, and during training updating by gradient the weights W, the structure masks m_k, the quantization parameters and the bit-width probability parameters π_l, thereby realizing the joint training of automated pruning decisions and adaptive allocation of mixed-precision bit widths (see the training-step sketch following the claims).
  5. The rapid deployment method for a large model fusing structured pruning and quantization compression according to claim 4, wherein the automated pruning decision comprises: for each structural unit, the structure mask m_k corresponding to the k-th structural unit is subject to two opposing gradients, a downward pressure gradient and an upward support gradient; the downward pressure gradient means that the resource-aware loss function L_res and the structure-sparsity loss function L_sparse produce gradients pushing m_k toward 0; the upward support gradient means that, if the k-th structural unit is critical to model accuracy, the back-propagated gradient generated by the task loss L_task increases significantly and keeps m_k at 1.
  6. The rapid deployment method for a large model fusing structured pruning and quantization compression according to claim 4, wherein the adaptive allocation of mixed-precision bit widths comprises: sampling the bit-width probability parameter π_l of each layer using Gumbel-Softmax, wherein the sampling follows the rule that a low bit width is selected if the l-th layer is insensitive to quantization noise, and a high bit width is selected if the l-th layer is highly sensitive to quantization noise.
  7. The rapid deployment method for a large model fusing structured pruning and quantization compression according to any one of claims 3-6, wherein the task loss function is a cross-entropy loss function.
  8. A rapid deployment system for a large model fusing structured pruning and quantization compression, characterized in that the system is configured to execute the rapid deployment method for a large model fusing structured pruning and quantization compression according to any one of claims 1-7.
  9. A computer-readable storage medium storing a computer program for rapid deployment of a large model fusing structured pruning and quantization compression, wherein the computer program causes a computer to execute the rapid deployment method for a large model fusing structured pruning and quantization compression according to any one of claims 1-7.
  10. An electronic device, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for executing the rapid deployment method for a large model fusing structured pruning and quantization compression according to any one of claims 1-7.
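
The learnable structure mask and differentiable bit-width selection of claim 2 can be illustrated in code. The following is a minimal PyTorch sketch, not the patent's implementation: the sigmoid gating, the candidate set BIT_CANDIDATES, the class name MaskedQuantLinear and the straight-through fake quantizer are illustrative assumptions layered on the claim's description of a per-structure mask factor m_k and a Gumbel-Softmax-sampled bit-width probability π_l.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BIT_CANDIDATES = [2, 4, 8]  # assumed per-layer candidate bit widths

class MaskedQuantLinear(nn.Module):
    """Linear layer with a learnable structure mask m_k (claim 2) and
    Gumbel-Softmax bit-width selection over pi_l (claims 2 and 6)."""

    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02)
        # One mask logit per output channel; sigmoid keeps m_k in (0, 1).
        self.mask_logit = nn.Parameter(torch.full((out_f, 1), 2.0))
        # Learnable bit-width logits pi_l over the candidate set.
        self.bit_logits = nn.Parameter(torch.zeros(len(BIT_CANDIDATES)))

    def fake_quant(self, w, bits):
        # Symmetric uniform fake quantization with a straight-through estimator.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
        return w + (q - w).detach()  # forward quantized, backward identity

    def forward(self, x):
        m = torch.sigmoid(self.mask_logit)  # soft structure mask m_k
        # Softened discrete choice: a Gumbel-Softmax mixture of candidates.
        probs = F.gumbel_softmax(self.bit_logits, tau=1.0, hard=False)
        w_q = sum(p * self.fake_quant(self.weight, b)
                  for p, b in zip(probs, BIT_CANDIDATES))
        return F.linear(x, m * w_q)
```

A channel whose mask logit is driven strongly negative contributes almost nothing to the output, which is the "logically switched off" state of claim 2; annealing the temperature tau sharpens the bit-width mixture toward a single candidate.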
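Claim 3's total loss can then be written down directly. A sketch under explicit assumptions: the sparsity term is taken here as an L1 penalty on the soft masks, b_l as the expected bit width, s_l as the retained-structure ratio, and C_l as a per-layer hardware cost supplied by the caller; the patent does not spell out these exact forms.

```python
def total_loss(task_loss, layers, hw_cost, lam1=1e-4, lam2=1e-4):
    """L_total = L_task + lam1 * L_sparse + lam2 * L_res (claim 3).

    `layers` are MaskedQuantLinear modules from the previous sketch;
    `hw_cost[l]` stands in for C_l, the comprehensive overhead of
    layer l on the target hardware (an assumed, measured constant).
    """
    l_sparse = sum(torch.sigmoid(lay.mask_logit).sum() for lay in layers)
    l_res = 0.0
    for l, lay in enumerate(layers):
        probs = F.softmax(lay.bit_logits, dim=-1)
        b_l = sum(p * b for p, b in zip(probs, BIT_CANDIDATES))  # expected bit width
        s_l = torch.sigmoid(lay.mask_logit).mean()               # retained-structure ratio
        l_res = l_res + b_l * s_l * hw_cost[l]
    return task_loss + lam1 * l_sparse + lam2 * l_res
```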
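Finally, the joint update of claim 4, the opposing gradients of claim 5, and the export-time hardening of claim 1 fit in a few lines. Nothing here is patent-specified beyond the claims' wording: the 0.5 keep threshold, the argmax bit-width choice and the helper names are illustrative. The "downward pressure" (from L_sparse and L_res) and "upward support" (from L_task) of claim 5 are simply the two components of the gradient reaching each mask logit in a single backward pass.

```python
def train_step(layers, model_forward, batch, targets, hw_cost, opt):
    """One joint gradient update of weights, masks, quantization and
    bit-width parameters (claim 4), using cross-entropy as the task
    loss (claim 7)."""
    opt.zero_grad()
    loss = total_loss(F.cross_entropy(model_forward(batch), targets),
                      layers, hw_cost)
    loss.backward()  # claim 5's two opposing gradients both flow here
    opt.step()

@torch.no_grad()
def harden(layer, keep_thresh=0.5):
    """Export-time hard pruning and hard quantization (claim 1).
    keep_thresh is an illustrative cutoff, not fixed by the patent;
    a real exporter would store integer weights plus a scale."""
    keep = (torch.sigmoid(layer.mask_logit) > keep_thresh).squeeze(1)
    bits = BIT_CANDIDATES[int(layer.bit_logits.argmax())]
    pruned = layer.weight[keep]               # physically drop pruned rows
    return layer.fake_quant(pruned, bits), keep, bits
```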

Description

Large model rapid deployment method and system integrating structure pruning and quantization compression

Technical Field

The invention relates to the technical field of model lightweighting, in particular to a rapid deployment method and system for a large model fusing structured pruning and quantization compression.

Background

In recent years, large pre-trained models represented by BERT, GPT and the like have achieved breakthrough progress in natural language processing, computer vision and other fields, greatly advancing the industrial application of artificial intelligence technology. However, a large pre-trained model usually contains billions or even hundreds of billions of parameters, so its deployment and application face three core problems: high storage cost, high computational complexity and high energy consumption. At the storage level, model files commonly reach the GB or even TB scale and are difficult to fit onto resource-constrained terminal devices; at the computation level, inference requires massive floating-point operations, so inference latency is high and real-time requirements cannot be met; at the energy level, the huge amount of computation brings extremely high power consumption, severely limiting application scenarios on embedded or mobile devices. Meanwhile, the development of "edge intelligence" requires pushing intelligent computing power to the data source, i.e., running AI models directly on embedded terminals such as smartphones, autonomous-driving units and industrial IoT devices. This places strict requirements on model lightweighting and has driven academia and industry to continuously explore efficient model compression techniques to realize efficient deployment of large models on edge devices.

To address the deployment difficulties of large models, the prior art offers a number of model compression schemes, mainly built on four core techniques: parameter pruning, parameter quantization, knowledge distillation and parameter sharing. Parameter pruning removes redundant or unimportant weights from the model and can be divided into unstructured pruning, which operates on individual weights, and structured pruning of whole network channels, attention heads or entire network layers. Parameter quantization reduces the data-representation precision of model parameters and activation values (for example from 32-bit floating point to 8-bit integer or lower), cutting storage and computation cost. Knowledge distillation uses a large teacher model to guide the training of a lightweight student model, so that the student reproduces the teacher's performance while remaining lightweight. Parameter sharing reduces the total parameter count by letting multiple neurons share the same group of parameters, and is especially widely used in the attention mechanism of the Transformer architecture.
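
As a concrete illustration of the parameter quantization described above (and not of the patent's own scheme), a minimal symmetric float32-to-int8 mapping of a weight tensor looks like this:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Map float32 weights to int8 values plus a scale: w ~= scale * q."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.round(w / scale).clamp(-128, 127).to(torch.int8)
    return q, scale

w = torch.randn(4, 4)
q, scale = quantize_int8(w)
print((w - q.float() * scale).abs().max())  # worst-case rounding error
```

Each weight now occupies 8 bits instead of 32, a 4x storage reduction, at the cost of the printed rounding error.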
However, existing model compression techniques generally treat pruning and quantization as independent steps, and this serial execution has obvious drawbacks. In the prune-first-then-quantize mode, pruning changes the distribution and dynamic range of the model parameters, which increases the difficulty of the subsequent quantization, causes abrupt changes in quantization sensitivity and leads to severe accuracy loss. In the quantize-first-then-prune mode, the information loss introduced by quantization can lead to erroneous pruning decisions, so that model structures critical to accuracy are wrongly deleted and the accuracy is difficult to recover. Although there has been research attempting joint optimization of pruning and quantization, the pruning objects of existing joint methods are unstructured and lack automated decision-making: existing joint pruning-quantization schemes focus on parameter-level pruning, and the resulting sparse matrices are difficult to map onto edge-device hardware (such as CPU/GPU/NPU), so no effective inference acceleration is achieved; even when layer-level pruning is adopted, important structures are easily deleted by mistake, and automated decision capability is absent.

Disclosure of the Invention

(I) Technical problems to be solved

Aiming at the defects of the prior art, the invention provides a rapid deployment method and system for a large model fusing structured pruning and quantization compression, which solve the technical problems that the pruning objects of existing joint pruning-quantization optimization schemes are unstructured and lack automated decision-making.

(II) Technical scheme

In order to achieve the a