CN-121981266-A - Large-model quantization method based on rotation matrices and error compensation
Abstract
The invention provides a large-model quantization method based on rotation matrices and error compensation, belonging to the technical field of large language model quantization. The method comprises: obtaining statistical characteristics of a large language model to be quantized; identifying layers containing extreme outliers as layers to be rotated, constructing a rotation matrix, and smoothing the layers via a hierarchical selection function to obtain layers with smooth distributions; quantizing the distribution-smoothed layers with a unified quantization operator; performing error compensation via a structured block-reconstruction compensation mechanism based on a Hessian approximation, solving for an optimal compensation amount to obtain updated layers; and absorbing the rotation matrix into the updated layers to obtain the deployment weight parameters of the large language model to be quantized. The invention addresses several problems of existing large-model quantization methods: quantization instability caused by severely skewed activation and weight distributions under low-bit quantization, quantization-interval compression caused by extreme outliers, excessive discretization of weights, and layer-by-layer accumulation and amplification of errors.
Inventors
- GUO JINYANG
- WU JIAJUN
- YANG JIAN
Assignees
- Beihang University (北京航空航天大学)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-01-19
Claims (9)
- 1. A large-model quantization method based on rotation matrices and error compensation, characterized by comprising the following steps: S1, acquiring a calibration set based on a large language model to be quantized, and obtaining statistical characteristics from the input activations of each layer of the large language model; S2, identifying, from the statistical characteristics of each layer, the layers containing extreme outliers as layers to be rotated, constructing a rotation matrix for the layers to be rotated, and smoothing the layers to be rotated by constructing a hierarchical selection function over the calibration set, to obtain distribution-smoothed layers; S3, quantizing the distribution-smoothed layers with a unified quantization operator and computing channel-level scaling factors, to obtain low-bit-quantized layers; S4, performing error compensation on the low-bit-quantized layers by solving for an optimal compensation amount with a structured block-reconstruction compensation mechanism based on a Hessian approximation, to obtain updated layers, and absorbing the rotation matrix into the updated layers to obtain the deployment weight parameters of the large language model to be quantized, thereby completing large-model quantization.
- 2. The large-model quantization method based on rotation matrices and error compensation according to claim 1, wherein S1 comprises the following steps: S101, acquiring the calibration set by selecting preset representative input samples for the large language model to be quantized; S102, performing forward inference on the large language model to be quantized over the calibration set, and computing statistics comprising the mean, variance, skewness and kurtosis of the input activations of each layer; S103, obtaining the statistical characteristics by defining a long-tail ratio, setting an empirical threshold, and combining these with the statistics.
- 3. The large-model quantization method based on rotation matrices and error compensation according to claim 2, wherein the long-tail ratio is expressed as:

  $$\mathrm{LTR}_l = \frac{1}{N d_l} \sum_{n=1}^{N} \sum_{j=1}^{d_l} \mathbb{1}\!\left( |x_{n,j}| > \kappa \sigma_l \right)$$

  wherein $\mathbb{1}(\cdot)$ denotes the counting (indicator) operator, $\kappa$ denotes the empirical coefficient, $\sigma_l$ denotes the standard deviation, $N$ denotes the number of samples or activation vectors on the calibration set involved in the statistics, $d_l$ denotes the channel dimension of the $l$-th layer of the large model, and $\mathrm{LTR}_l$ denotes the long-tail ratio.
- 4. The large-model quantization method based on rotation matrices and error compensation according to claim 2, wherein S2 comprises the following steps: S201, according to the kurtosis and the empirical threshold in the statistical characteristics of each layer, taking the layers whose kurtosis far exceeds the empirical threshold and whose long-tail ratio is highly concentrated on a small number of channels as the layers to be rotated containing extreme outliers; S202, constructing the rotation matrix of the layers to be rotated from their input activations and weights, by selecting a normalized Hadamard matrix as the orthogonal matrix and introducing a block-rotation form; S203, constructing the hierarchical selection function by introducing a dynamic insertion mechanism and performing a grid search over the calibration set to obtain the index weights; S204, computing the hierarchy score of each layer to be rotated with the hierarchical selection function; S205, determining the rotation strategy from the hierarchy scores by applying a decision rule; S206, based on the rotation strategy, smoothing the layers to be rotated with the rotation matrix, applying a linear orthogonal transformation to the input activations and weights to obtain the input activations and weights to be quantized, thereby obtaining the distribution-smoothed layers.
- 5. The large-model quantization method based on rotation matrices and error compensation according to claim 4, wherein the hierarchical selection function is expressed as:

  $$S_l = F(x_l) = \alpha_1 \left| \gamma_l \right| + \alpha_2 \tilde{\kappa}_l + \alpha_3 \mathrm{LTR}_l + \alpha_4 \rho_l$$

  wherein $S_l$ denotes the hierarchy score, $F(\cdot)$ denotes the hierarchical selection function, $\gamma_l$ denotes the skewness, $\kappa_l$ denotes the kurtosis, $\tilde{\kappa}_l$ denotes the truncated kurtosis, $\mathrm{LTR}_l$ denotes the long-tail ratio, $\rho_l$ denotes the sequence autocorrelation coefficient, $\alpha_1$ denotes the weight coefficient of the skewness, $\alpha_2$ denotes the weight coefficient of the kurtosis, $\alpha_3$ denotes the weight coefficient of the long-tail ratio, and $\alpha_4$ denotes the weight coefficient of the sequence autocorrelation coefficient.
- 6. The large-model quantization method based on rotation matrices and error compensation according to claim 4, wherein S3 comprises the following steps: S301, using the unified quantization operator, computing the channel-level scaling factors with a preset mixing strategy, introducing local correction coefficients, and quantizing the weights to be quantized in the distribution-smoothed layers to obtain quantized weights; S302, quantizing the input activations to be quantized in the distribution-smoothed layers by symmetric or asymmetric dynamic quantization, mapping them to a preset low-bit numerical format to obtain quantized input activations; S303, obtaining the low-bit-quantized layers from the quantized weights and the quantized input activations.
- 7. The large-model quantization method based on rotation matrices and error compensation according to claim 6, wherein S4 comprises the following steps: S401, obtaining the weight error, and obtaining the output error from the linear transformation of the low-bit-quantized layer combined with the input feature matrix; S402, measuring the post-quantization performance degradation via the squared loss of the output error, using a Hessian approximation matrix and the Hessian matrix, to obtain an optimization problem; S403, partitioning the weights block-wise into subsets to obtain weight blocks, and reducing the optimization problem to a quadratic optimization problem; S404, adopting an approximate-substitution strategy based on weight differences, using scaling factors to approximately substitute the effective gradient term of each weight block, to obtain an approximate gradient term; S405, substituting the approximate gradient term into the quadratic optimization problem, performing a Cholesky decomposition of the Hessian matrix, and solving the resulting lower-triangular linear systems to obtain the optimal compensation amount; S406, performing error compensation on the low-bit-quantized layer with the optimal compensation amount to obtain updated weights and thus updated layers; S407, absorbing the rotation matrix into the updated layers and rewriting the updated weights to obtain the deployment weight parameters of the large language model to be quantized, thereby completing large-model quantization.
- 8. The large-model quantization method based on rotation matrices and error compensation according to claim 7, wherein the quadratic optimization problem is expressed as:

  $$\min_{\Delta W_b} \; \operatorname{tr}\!\left( G_b^{\top} \Delta W_b \right) + \frac{1}{2} \operatorname{tr}\!\left( \Delta W_b H_b \Delta W_b^{\top} \right)$$

  wherein $W_b$ denotes the weight block of the $b$-th block and $\Delta W_b$ its compensation, $\operatorname{tr}(\cdot)$ denotes the trace operation of a matrix, $H_b$ denotes the Hessian matrix of the $b$-th weight block, $G_b$ denotes the effective gradient term characterizing the sensitivity direction of the quantization error with respect to the loss, and $b$ denotes the block index.
- 9. The large-model quantization method based on rotation matrices and error compensation according to claim 7, wherein the updated weights are expressed as:

  $$\widehat{W}_b = W_b^{q} + \Delta W_b^{*}$$

  wherein $\widehat{W}_b$ denotes the updated weights, $W_b^{q}$ denotes the quantized weights of the $b$-th weight block, and $\Delta W_b^{*}$ denotes the optimal compensation amount.
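As an editorial illustration of the statistics collection in claims 2 and 3, the following minimal Python sketch computes the per-layer statistics and the long-tail ratio in the indicator-count form reconstructed above. The function name, the use of NumPy/SciPy, and the default value of the empirical coefficient κ are illustrative assumptions, not part of the patent.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def layer_statistics(acts: np.ndarray, kappa: float = 3.0):
    """Per-layer calibration statistics (claims 2-3).

    acts:  (N, d) input activations of one layer collected over the
           calibration set (N samples/activation vectors, d channels).
    kappa: empirical coefficient for the long-tail threshold (illustrative).
    """
    sigma = acts.std()                           # standard deviation sigma_l
    stats = {
        "mean": acts.mean(),
        "var": acts.var(),
        "skew": skew(acts, axis=None),
        "kurtosis": kurtosis(acts, axis=None, fisher=False),
    }
    # Long-tail ratio: fraction of entries whose magnitude exceeds kappa*sigma.
    N, d = acts.shape
    stats["ltr"] = np.count_nonzero(np.abs(acts) > kappa * sigma) / (N * d)
    return stats
```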
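Claims 4 and 5 score each layer and smooth the selected layers with a normalized Hadamard rotation. The sketch below assumes the additive four-term score reconstructed in claim 5, a quantile-clipping definition of truncated kurtosis, and a lag-1 sequence autocorrelation; the weight values and helper names are hypothetical.

```python
import numpy as np
from scipy.linalg import hadamard
from scipy.stats import kurtosis, skew

def hierarchy_score(acts, ltr, alphas=(1.0, 1.0, 1.0, 1.0), clip_q=0.99):
    """Weighted hierarchy score S_l (claim 5); weights are illustrative."""
    a1, a2, a3, a4 = alphas
    x = acts.ravel()
    # Truncated kurtosis: kurtosis after quantile clipping (assumption).
    clipped = np.clip(x, *np.quantile(x, [1 - clip_q, clip_q]))
    rho = np.corrcoef(x[:-1], x[1:])[0, 1]       # lag-1 autocorrelation
    return (a1 * abs(skew(x)) + a2 * kurtosis(clipped, fisher=False)
            + a3 * ltr + a4 * abs(rho))

def smooth_with_hadamard(W, X):
    """Normalized Hadamard rotation R (claim 4, S202/S206).

    X -> X R and W -> W R leave the layer output X W^T unchanged
    because R is orthogonal, while spreading outlier energy across channels.
    """
    d = W.shape[1]
    R = hadamard(d) / np.sqrt(d)                 # requires d = power of two
    return W @ R, X @ R
```

A layer would be rotated only when its score exceeds the decision rule's threshold (claim 4, S205), which the patent determines via grid search over the calibration set.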
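For the unified quantization operator of claim 6, a minimal sketch of per-channel symmetric weight quantization with channel-level scaling factors and dynamic activation quantization. The preset mixing strategy and local correction coefficients of S301 are omitted, and the bit widths are illustrative.

```python
import numpy as np

def quantize_weights(W, bits=4):
    """Per-channel symmetric weight quantization (claim 6, S301).
    Returns integer codes and the channel-level scaling factors."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax  # one scale per row
    Wq = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return Wq.astype(np.int8), scale

def quantize_activations(X, bits=8):
    """Dynamic symmetric activation quantization (claim 6, S302)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(X).max() / qmax
    Xq = np.clip(np.round(X / scale), -qmax - 1, qmax)
    return Xq.astype(np.int8), scale

# Dequantized forward pass of the low-bit layer (S303):
#   y ~= (Xq * sx) @ (Wq * sw).T
```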
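For the compensation mechanism of claims 7-9, the sketch below solves the quadratic problem reconstructed in claim 8 via Cholesky factorization and triangular solves (S405). The Hessian proxy H_b = X_b^T X_b with damping is a common PTQ surrogate assumed here, not stated in the patent, and the effective gradient term G_b is taken as given from the approximate-substitution step of S404.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def optimal_compensation(G, H):
    """Minimize  tr(G^T dW) + 1/2 tr(dW H dW^T)  over dW (claim 8).

    Setting the gradient G + dW H to zero gives dW* = -G H^{-1}, solved
    here by Cholesky factorization plus triangular solves (S405).
    """
    c, low = cho_factor(H)
    return -cho_solve((c, low), G.T).T           # dW_b^*

def update_block(Wq_b, G_b, X_b, damp=1e-2):
    """Error-compensated update of one quantized weight block (claim 9)."""
    # Hessian proxy from the calibration activations feeding the block
    # (assumption: H_b = X_b^T X_b plus damping for numerical stability).
    H = X_b.T @ X_b
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    dW = optimal_compensation(G_b, H)
    return Wq_b + dW                             # W_hat_b = W_b^q + dW_b^*
```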
Description
Large-model quantization method based on rotation matrices and error compensation

Technical Field

The invention belongs to the technical field of large language model quantization, and in particular relates to a large-model quantization method based on rotation matrices and error compensation.

Background

In recent years, large language models (LLMs) represented by the LLaMA, GPT and Qwen series have become the core foundation of applications such as natural language processing, knowledge reasoning, intelligent search and intelligent customer service, with parameter counts often reaching billions or even tens of billions. However, high-precision computation (e.g. FP32/FP16) for such large models places extremely high demands on GPU memory bandwidth and computing power, so their deployment in resource-constrained environments such as edge devices, mobile terminals and general-purpose GPUs is severely limited. To reduce the computational overhead and device load during inference, Post-Training Quantization (PTQ) has gradually become the mainstream technical route in the field of model compression.

In PTQ, converting model weights and activations from floating-point formats to lower precision (such as INT8, INT4, or even INT3 and FP4) markedly reduces memory usage and yields 1.5-3x inference speedups on some hardware. However, when the effective bit width drops to 4 bits and below, quantization errors are greatly amplified by the activation "outliers" that are widespread in such models, and these become the key bottleneck limiting quantization quality. The outliers include not only "normal outliers", which appear with moderately elevated magnitudes across most tokens, but also "extreme outliers", which are concentrated on a few tokens and take values far beyond the normal distribution range. The latter often force the activation scaling factor to an abnormally high level, so that most activation values are compressed into the lower end of the fixed-point quantization interval, causing information loss.
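To make this bottleneck concrete, a small illustrative example (the numbers are invented, not taken from the patent): one extreme outlier inflates a symmetric INT4 scaling factor until ordinary activations collapse onto a handful of quantization levels.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(0, 1, 4096)          # typical activations
acts[0] = 120.0                        # one extreme outlier on a single token

qmax = 7                               # symmetric INT4: levels -8..7
scale = np.abs(acts).max() / qmax      # the outlier dictates the scale
q = np.clip(np.round(acts / scale), -8, qmax)

# Nearly all ordinary values land on the lowest levels {-1, 0, 1}:
print(scale)                           # ~17.1
print(np.mean(np.abs(q[1:]) <= 1))     # ~1.0 -> massive information loss
```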
In recent years, methods such as SmoothQuant and OmniQuant have attempted to alleviate this problem by smoothing activation distributions, adjusting scaling functions, or introducing multi-directional calibration strategies. However, these methods generally rely on fixed parameter-optimization strategies and struggle to provide targeted suppression of the heterogeneous activation distributions across layers, especially when facing extreme outliers. Rotation matrices (such as Hadamard matrices) show a natural advantage for distribution smoothing because they can redistribute activation energy through orthogonal transformations, and methods such as QuaRot and SpinQuant use them to preprocess data before quantization. Existing rotation methods, however, mostly adopt fixed insertion positions and uniform tiling, lack level-wise adaptivity, and cannot adjust dynamically to differences in model depth, semantic distribution, or activation statistics. Moreover, new rounding errors accumulate in the quantization stage once a rotation matrix has been introduced, yet existing PTQ pipelines mostly adopt a single-stage static quantization strategy and do not consider how to efficiently compensate the rotated weight errors through higher-order structured reconstruction.

In practical engineering, large language models are already widely deployed on edge computing devices, mobile terminals and localized intelligent systems, and in application scenarios with strict requirements on data privacy and real-time response, such as local intelligent customer-service terminals, offline speech and text understanding systems, intelligent operation-and-maintenance systems for industrial equipment, in-vehicle intelligent interaction systems, and dedicated local inference systems in fields such as medicine and finance. In these scenarios, the model usually has to run on hardware platforms with limited memory capacity, computing power or power budget, and often cannot rely on high-performance cloud computing resources to complete inference tasks. Taking an edge device or a local terminal as an example, its available memory and computing resources generally support only low-bit fixed-point or mixed-precision operation, and a large-scale language model is difficult to deploy directly without compression, so post-training quantization becomes a key technical route for deploying large models on such platforms.
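As a quick illustration of why orthogonal rotations help (in the spirit of the rotation methods cited above, not the patent's exact construction), rotating a channel vector that contains one extreme outlier by a normalized Hadamard matrix preserves its norm but shrinks its maximum magnitude, and hence the quantization scale, by roughly a factor of sqrt(d):

```python
import numpy as np
from scipy.linalg import hadamard

d = 1024                               # must be a power of two for hadamard()
x = np.random.default_rng(0).normal(0, 1, d)
x[0] = 120.0                           # extreme outlier in one channel

R = hadamard(d) / np.sqrt(d)           # orthogonal: R @ R.T == I
y = x @ R                              # rotate the activations

print(np.linalg.norm(x), np.linalg.norm(y))  # norms match (orthogonality)
print(np.abs(x).max(), np.abs(y).max())      # max magnitude shrinks ~sqrt(d)
# Because R is orthogonal it can be absorbed into the adjacent weights,
# so the layer output is unchanged while the activations quantize better.
```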