CN-122021767-A - 2-Bit quantization method of large language model based on residual refinement
Abstract
The invention discloses a 2-bit quantization method for large language models based on residual refinement, applied to a model inference system driven by a computer processor. The method loads a pre-trained large language model to be quantized and prepares a calibration dataset or synthesized data for quantization-aware training. It coarsely approximates the original weights with a first 1-bit kernel, computes the residual, applies a second 1-bit refinement quantization to the residual, and reconstructs the weights as a linear combination of the two scaled 1-bit kernels. The gradient of the quantization function is approximated with a straight-through estimator, and the model parameters are fine-tuned through quantization-aware training until the model converges. The resulting adaptive quantization points flexibly fit the non-uniform distribution of the weights, markedly improving the model's inference accuracy, training stability, and convergence speed while maintaining an extremely high compression rate.
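To make the construction concrete, below is a brief numerical sketch, an illustration assuming the sign/mean-of-absolute-values binarization defined in the claims; the group values, size, and variable names are invented for this example:

```python
import numpy as np

w = np.array([-1.2, -0.1, 0.05, 0.4, 0.9, -0.3])  # one weight group (G = 6)
G = w.size
sign = lambda x: np.where(x >= 0, 1.0, -1.0)      # sign function, mapping 0 -> +1

# Stage 1: coarse 1-bit approximation of the group
b1 = sign(w)
alpha1 = np.abs(w).sum() / G                      # alpha1 = ||w||_1 / G

# Stage 2: 1-bit refinement of the residual
r = w - alpha1 * b1
b2 = sign(r)
alpha2 = np.abs(r).sum() / G                      # alpha2 = ||r||_1 / G

w_q = alpha1 * b1 + alpha2 * b2                   # 2-bit reconstruction
codebook = sorted(alpha1 * s1 + alpha2 * s2 for s1 in (-1, 1) for s2 in (-1, 1))
print("codebook:", codebook)                      # four non-uniform levels
print("max error:", np.max(np.abs(w - w_q)))
```

With these numbers the four levels come out non-uniformly spaced and clustered where the weights actually lie, which is the adaptive-codebook effect the abstract describes.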
Inventors
- SHI JIEQI
- CHEN JIAYI
- HUO JING
- GAO YANG
Assignees
- Nanjing University (南京大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-27
Claims (10)
- 1. A 2-bit quantization method for a large language model based on residual refinement, applied to a model inference system driven by a computer processor, characterized by comprising the following steps: (1) loading a pre-trained large language model to be quantized, and preparing a calibration dataset or synthesized data for quantization-aware training, wherein the data meets the input format requirements of quantized model training; (2) extracting the full-precision original weight matrix $W$ of the pre-trained large language model to be quantized, and dividing the weight matrix $W$ into a plurality of mutually non-overlapping weight groups of a preset size, wherein the number of elements in each weight group is a preset fixed value $G$; (3) performing first-stage 1-bit coarse-grained quantization on each weight group $w$, computing a first binary direction vector $b_1$ and a first scaling factor $\alpha_1$, and constructing the coarse approximate weights $\alpha_1 b_1$; (4) computing the residual $r$ between the original weights and the coarse approximate weights; (5) performing second-stage 1-bit refinement quantization on the residual $r$, computing a second binary direction vector $b_2$ and a second scaling factor $\alpha_2$; (6) combining the quantization results of the first and second stages to construct the final 2-bit quantized weights $w_q$ used for forward propagation of the network; (7) approximating the gradient of the quantization function with a straight-through estimator, and fine-tuning the model parameters through quantization-aware training until the model converges.
- 2. The 2-bit quantization method of a large language model based on residual refinement according to claim 1, wherein the pre-trained large language model is a large language model of the Transformer architecture, including the Llama, OPT, and Qwen series of models.
- 3. The 2-bit quantization method of a large language model based on residual refinement according to claim 1, wherein the calibration dataset comprises general-task datasets or task-specific customized datasets in the field of natural language processing, and the synthesized data is obtained by applying data augmentation to real corpora.
- 4. The method of 2-bit quantization of a large language model based on residual refinement according to claim 1, wherein the first-stage 1-bit coarse-grained quantization of step (3) is obtained by solving the following optimization problem: $\min_{\alpha_1, b_1} \| w - \alpha_1 b_1 \|_F^2$, subject to $b_1 \in \{-1, +1\}^G$, wherein $\|\cdot\|_F$ denotes the Frobenius norm.
- 5. The method for 2-bit quantization of a large language model based on residual refinement according to claim 1, wherein the first binary direction vector $b_1$ of step (3) is calculated as: $b_1 = \mathrm{sign}(w)$, wherein $\mathrm{sign}(\cdot)$ is the sign function; the first scaling factor $\alpha_1$ is calculated as: $\alpha_1 = \frac{1}{G}\|w\|_1$, wherein $G$ is the size of the weight group and $\|\cdot\|_1$ is the L1 norm.
- 6. The method for 2-bit quantization of a large language model based on residual refinement according to claim 1, wherein the residual $r$ of step (4) is obtained as: $r = w - \alpha_1 b_1$.
- 7. The method for 2-bit quantization of a large language model based on residual refinement according to claim 1, wherein the second binary direction vector $b_2$ and the second scaling factor $\alpha_2$ of step (5) are obtained as: $b_2 = \mathrm{sign}(r)$ and $\alpha_2 = \frac{1}{G}\|r\|_1$, wherein $\mathrm{sign}(\cdot)$ is the sign function and $G$ is the size of the weight group.
- 8. The method of 2-bit quantization of a large language model based on residual refinement according to claim 1, wherein the final 2-bit quantized weights $w_q$ of step (6) are obtained as: $w_q = \alpha_1 b_1 + \alpha_2 b_2$, wherein the elements of $b_1$ and $b_2$ take values in $\{-1, +1\}$ and represent the directions of the coarse-grained approximation and the fine-grained residual, respectively; $\alpha_1$ and $\alpha_2$ are positive real scaling factors that control the amplitudes of the two components; this constructs the adaptive quantization codebook $\{-\alpha_1 - \alpha_2,\ -\alpha_1 + \alpha_2,\ \alpha_1 - \alpha_2,\ \alpha_1 + \alpha_2\}$ (a numerical sketch of this construction is given after the claims).
- 9. The method according to claim 1, wherein step (7) approximates the gradient of the quantization function with a straight-through estimator, and the formula for gradient back-propagation is: $\frac{\partial \mathcal{L}}{\partial w} \approx \frac{\partial \mathcal{L}}{\partial w_q} \cdot I$, wherein $I$ is the identity matrix.
- 10. The 2-bit quantization method of a large language model based on residual refinement according to claim 1, wherein the quantization-aware training of step (7) adopts a knowledge distillation framework, takes the BF16-precision full model as the teacher model and the 2-bit quantized model as the student model, and optimizes by minimizing the KL divergence between the output distributions of the two models, calculated as: $D_{\mathrm{KL}}(p_T \,\|\, p_S) = \sum_i p_T(y_i \mid x) \log \frac{p_T(y_i \mid x)}{p_S(y_i \mid x)}$, wherein $p_T(y_i \mid x)$ and $p_S(y_i \mid x)$ denote the probabilities that the teacher model and the student model, respectively, predict the $i$-th category for input $x$; the optimization objective is to make the output distribution of the student model $p_S$ as close as possible to the output distribution of the teacher model $p_T$ (a training-loop sketch is given after the claims).
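Below is a minimal sketch of the grouped two-stage quantizer defined in claims 4-8, a NumPy illustration assuming the weight count is divisible by the group size $G$; the function name quantize_2bit_residual is invented for this example, not taken from the patent:

```python
import numpy as np

def sign(x):
    """Sign function with sign(0) = +1 so every element stays strictly 1-bit."""
    return np.where(x >= 0, 1.0, -1.0)

def quantize_2bit_residual(W, G=128):
    """2-bit quantization of W via two scaled 1-bit kernels (claims 4-8)."""
    groups = W.reshape(-1, G)                        # non-overlapping groups of size G
    b1 = sign(groups)                                # first binary direction vector
    a1 = np.abs(groups).mean(axis=1, keepdims=True)  # alpha1 = ||w||_1 / G
    r = groups - a1 * b1                             # residual of the coarse stage
    b2 = sign(r)                                     # second binary direction vector
    a2 = np.abs(r).mean(axis=1, keepdims=True)       # alpha2 = ||r||_1 / G
    return (a1 * b1 + a2 * b2).reshape(W.shape)      # w_q = a1*b1 + a2*b2

W = np.random.randn(4, 256).astype(np.float32)
print("MSE:", np.mean((W - quantize_2bit_residual(W)) ** 2))
```

Per claim 8, each group's four representable values are exactly $\{\pm\alpha_1 \pm \alpha_2\}$, so the codebook adapts group by group instead of being fixed globally.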
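And a hedged PyTorch sketch of the quantization-aware training step of claims 9-10, combining the straight-through estimator with KL-divergence distillation from a BF16 teacher; class and function names are invented for illustration, and a real implementation would apply the quantizer per weight group inside each linear layer:

```python
import torch
import torch.nn.functional as F

class STEQuantize(torch.autograd.Function):
    """2-bit residual quantization with a straight-through gradient (claim 9)."""

    @staticmethod
    def forward(ctx, w):
        # Two-stage residual binarization (per-tensor here for brevity;
        # the claims apply it per weight group of size G).
        b1 = torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))
        a1 = w.abs().mean()
        r = w - a1 * b1
        b2 = torch.where(r >= 0, torch.ones_like(r), -torch.ones_like(r))
        a2 = r.abs().mean()
        return a1 * b1 + a2 * b2

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient through unchanged,
        # i.e. dL/dw ~ dL/dw_q * I.
        return grad_output

def distill_step(student, teacher, x, optimizer):
    """One QAT step minimizing KL(teacher || student) as in claim 10."""
    with torch.no_grad():
        p_teacher = F.softmax(teacher(x), dim=-1)
    log_p_student = F.log_softmax(student(x), dim=-1)
    loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()   # gradients flow through STEQuantize in the student
    optimizer.step()
    return loss.item()
```

In use, each quantized layer would call STEQuantize.apply(self.weight) in its forward pass, so the full-precision latent weights receive the straight-through gradients during distill_step.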
Description
2-Bit quantization method of large language model based on residual refinement
Technical Field
The invention belongs to the technical field of deep learning model compression and quantization, and particularly relates to an extremely-low-bit (2-bit) compression and quantization method for large language models (LLMs).
Background
Large language models have driven revolutionary progress in natural language processing, exhibiting excellent performance in tasks such as text generation and complex reasoning. However, as model parameter counts grow exponentially (from billions to hundreds of billions of parameters), their enormous memory footprint and computational cost pose a significant challenge to practical deployment, especially on resource-constrained edge devices (e.g., smartphones, AR/VR headsets, drones). To alleviate this problem, model quantization techniques have been developed to compress model size by reducing numerical precision. Although 8-bit (INT8, FP8) and 4-bit (INT4, NF4) quantization are relatively mature and widely used, 2-bit quantization is a current research hotspot driven by extreme compression requirements. However, 2-bit quantization can lead to catastrophic degradation of model accuracy, and maintaining model performance at very low bit widths is a critical issue in this field.
Existing quantization methods can be classified along two dimensions, quantization format and quantization strategy:
(1) In terms of quantization format, methods mainly divide into (i) 8-bit quantization, such as LLM.int8(), which uses vector-wise quantization to enable INT8 inference, and the FP8 format, which performs well in both training and inference; (ii) 4-bit quantization, such as the NormalFloat 4 (NF4) format introduced by QLoRA, combined with a double-quantization technique; (iii) extremely-low-bit (1-bit/ternary) quantization, which uses indicator and sign functions to perform binary or ternary mapping; and (iv) 2-bit quantization, a compromise between 1 bit and 4 bits.
(2) In terms of quantization strategy, methods mainly divide into (i) Post-Training Quantization (PTQ), which requires no retraining, e.g., GPTQ, which exploits Hessian matrix information, and AWQ and SmoothQuant, which use activation-aware scaling; and (ii) Quantization-Aware Training (QAT), which incorporates quantization into training, e.g., LLM-QAT, which uses knowledge distillation.
The prior art has made progress in model compression, but each class of methods shows clear shortcomings under extreme compression requirements. (1) In terms of quantization format, 8-bit and 4-bit quantization perform stably, but their compression rates remain insufficient under the severe storage limits of edge devices; binary/ternary (1/1.58-bit) quantization achieves extremely high compression rates, but often fails to preserve the model's original capability and usually requires training from scratch at very high cost. (2) In terms of quantization strategy, post-training quantization (PTQ) performs well at 4 bits and above, but its performance drops sharply at very low bit widths such as 2 bits; quantization-aware training (QAT) performs better but suffers from large resource consumption.
Furthermore, existing core techniques for 2-bit quantization suffer from substantial drawbacks that cause a catastrophic drop in model accuracy. Most existing 2-bit methods directly inherit the integer format of higher-bit quantization and rely mainly on the Round-To-Nearest (RTN) strategy. This strategy uses a linear mapping based on static quantization points, i.e., the points are determined by a global scale and offset. Such static points perform well at 4 bits and above but face serious problems at 2 bits, where only four discrete values are available. The weight distribution of large language models is typically a non-uniform bell shape (highly concentrated around zero with long tails), which the uniform quantization points provided by RTN cannot accommodate. As a result, most weights are forcibly merged onto the few quantization points near the center, while the points at the edges receive almost no weights, wasting the already limited expressive power and severely degrading model accuracy. To solve the above problems, the present invention proposes a quantization method that adaptively fits non-uniform weight distributions and maintains high accuracy even at very low bit widths. The scheme adopts a quantization strategy based on residual refinement and breaks through the limitation of traditional static quantization points.
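To illustrate the point numerically, the following is an illustrative comparison, a constructed example rather than an experiment from the patent, between four uniform RTN-style levels and the adaptive $\{\pm\alpha_1 \pm \alpha_2\}$ codebook on synthetic bell-shaped weights:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)                  # bell-shaped, zero-centred weights

# RTN-style 2-bit: four uniform levels spanning [min, max]
levels = np.linspace(w.min(), w.max(), 4)
rtn = levels[np.abs(w[:, None] - levels[None, :]).argmin(axis=1)]

# Residual refinement: adaptive, data-driven levels {+-a1 +- a2}
b1 = np.where(w >= 0, 1.0, -1.0)
a1 = np.abs(w).mean()
r = w - a1 * b1
a2 = np.abs(r).mean()
res = a1 * b1 + a2 * np.where(r >= 0, 1.0, -1.0)

print("uniform RTN MSE:", np.mean((w - rtn) ** 2))   # levels wasted on the tails
print("residual MSE:   ", np.mean((w - res) ** 2))   # levels follow the mass
```

On such a distribution the uniform levels sit far out in the tails where almost no weights fall, while the residual-refinement levels settle where the probability mass is, yielding a visibly lower reconstruction error.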