CN-121997987-A - Vision Transformer model compression method based on two-step mixed quantization strategy
Abstract
The invention discloses a Vision Transformer model compression method based on a two-step mixed quantization strategy. The method first performs quantization sensitivity analysis on a pre-trained Vision Transformer model using a two-step mixed quantization strategy: each linear layer and convolution layer in the model is pre-screened by cosine similarity to rapidly locate the network layers sensitive to low-bit quantization, and the sensitive layers are then subjected to quantization error assessment by a fine analysis method based on Hessian second-order information, so that differentiated quantization bit widths are allocated to different network layers and mixed-precision quantization is realized. Quantization parameters are calculated in a channel-by-channel asymmetric quantization mode, and customized quantization schemes are designed for the special operators in the Vision Transformer model. On the premise that the model inference accuracy remains acceptable, the method significantly reduces the storage cost and computational complexity of Vision Transformer models, is suitable for efficient deployment on resource-limited edge devices, and is applicable to post-training quantization of large-scale Transformer models that use the self-attention mechanism.
Inventors
- Zhan Jinyu
- Zhang Tangjie
- Quan Zhen
- Jiang Wei
- Yang Kun
- Zhang Liang
- Wang Peng
- Shi Chenqi
- He Yongkang
- Wei Tongou
Assignees
- University of Electronic Science and Technology of China (电子科技大学)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-03-03
Claims (5)
- 1. A Vision Transformer model compression method based on a two-step mixed quantization strategy, comprising the following specific steps:
S1, preparing a pre-trained model and a calibration data set. First, prepare a full-precision Vision Transformer model trained on a large-scale dataset; the Vision Transformer model comprises an embedding layer, a multi-layer Transformer encoding module, and an output head module, and includes variants such as ViT-Base, ViT-Large, DeiT-Base, and SAM-Huge. Then, confirm the model structure and initialize the quantization configuration parameters: analyze the types and parameter scales of all layers in the model, identify all Linear, Conv2d, LayerNorm, and Softmax layers, and explicitly determine the core structure of the loaded model, which comprises a Patch Embedding layer, a plurality of Transformer encoding modules, and an output head module. Finally, sample the calibration data set: sample several batches from a subset of the pre-training model's training set according to actual conditions, and apply to the sampled calibration data the same preprocessing as in the model training stage; the preprocessing comprises resizing, normalization, and Patch Embedding-based image block division.
S2, based on step S1, executing the two-step mixed quantization strategy. First, pre-screen each target network layer in the model by cosine similarity: compute the cosine similarity between each layer's pseudo-quantized weight under a reference low-bit quantization and its full-precision weight, mark the network layers whose similarity falls below a preset threshold as sensitive to low-bit quantization, and quickly construct a candidate set of quantization-sensitive layers. Then, apply a Hessian second-order information analysis method based on Hutchinson random estimation to the sensitive-layer candidate set to finely evaluate the loss sensitivity of each layer under quantization perturbation, and allocate differentiated mixed quantization bit widths to the different network layers accordingly, i.e. generate a mixed-precision bit-width configuration table.
S3, calculating quantization parameters channel by channel based on step S2. According to the quantization bit width of each layer determined in step S2, compute independent quantization scaling factors and zero offsets for the weight parameters and the activation channels at output-channel granularity, so as to mitigate the influence of inter-channel numerical distribution differences on quantization accuracy; i.e., for each Linear and Conv2d layer, compute the quantization parameters scale and zero_point at output-channel granularity, and for the weight $W \in \mathbb{R}^{C_{\mathrm{out}} \times C_{\mathrm{in}}}$ compute individual quantization parameters per channel, where $\mathbb{R}$ denotes the real number domain and $C_{\mathrm{out}}$ and $C_{\mathrm{in}}$ denote the numbers of output and input channels, respectively.
S4, carrying out special-operator quantization processing based on step S3. For the special operators in the Vision Transformer model, design customized quantization strategies according to their numerical distribution characteristics; this comprises applying a scaling quantization method based on power-of-two factors to the LayerNorm operators and replacing multiplication operations with shift operations.
S5, generating and evaluating a quantized model based on step S4. Based on the quantization configuration results of steps S2 to S4, perform quantization replacement on the original Vision Transformer model to generate a quantized model, completing quantization-layer replacement and format conversion of the model.
S6, storing and deploying the model that passes evaluation based on step S5. Uniformly store the weights, quantization parameters, and operator configuration of the qualified quantized model, configure a cooperative framework for the FPGA/embedded hardware, and realize data interaction through the AXI protocol to complete deployment preparation.
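The preparation in step S1 can be sketched as follows. This is a minimal PyTorch illustration, not the claimed implementation: it assumes a timm-style pre-trained ViT, and all function and parameter names (e.g. `sample_calibration_batches`, `num_batches`) are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import transforms
import timm  # assumed source of the pre-trained ViT

# Load a full-precision pre-trained ViT and enumerate quantizable layers.
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
quant_targets = [(name, m) for name, m in model.named_modules()
                 if isinstance(m, (nn.Linear, nn.Conv2d))]

# Preprocessing consistent with the training stage: resize and normalize;
# patch division happens inside the model's patch-embedding layer.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])

def sample_calibration_batches(dataset, num_batches=8, batch_size=32):
    """Sample a few batches from a training-set subset as calibration data."""
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    batches = []
    for images, _ in loader:
        batches.append(images)
        if len(batches) == num_batches:
            break
    return batches
```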
- 2. The Vision Transformer model compression method based on the two-step mixed quantization strategy according to claim 1, wherein step S2 is specifically as follows:
S21, loading the pre-trained model prepared in step S1 and the calibration data set.
S22, setting a reference quantization bit width based on step S21. First, initialize a unified quantization configuration: uniformly set a symmetric quantization reference bit width $k$ and an initial cosine similarity threshold $\tau$ for all layers to be quantized, the layers to be quantized comprising the Linear layers and the Conv2d layers. Set the number of Hessian-trace sampling iterations $m$ according to the variance theory of Hutchinson estimation, and initialize the sensitivity classification thresholds $T_1$ and $T_2$ ($T_1 > T_2$), by means of which the high-sensitivity, medium-sensitivity, and low-sensitivity layers are each assigned specific bit widths.
S23, performing pseudo-quantization on the Linear/Conv layers based on step S22. First, perform a quantize-dequantize operation on the weights at the reference bit width $k$ to simulate real quantization noise. For the $l$-th layer of the model with weight $W_l \in \mathbb{R}^{C_{\mathrm{out}} \times C_{\mathrm{in}}}$, perform symmetric pseudo-quantization at the reference bit width and determine the quantization scale factor as $s_l = \max(|W_l|) / (2^{k-1} - 1)$, where $\mathbb{R}$ denotes the real number domain and $C_{\mathrm{out}}$ and $C_{\mathrm{in}}$ denote the numbers of output and input channels of the layer, respectively. Then perform the quantization operation to map the full-precision weight into the k-bit integer domain: $W_l^{\mathrm{int}} = \mathrm{clamp}\big(\lfloor W_l / s_l \rceil,\, -2^{k-1},\, 2^{k-1}-1\big)$, where $k$ denotes the quantization bit width, $W_l^{\mathrm{int}}$ denotes the quantized weight of the $l$-th layer, $\lfloor\cdot\rceil$ denotes rounding to the nearest integer, and the $\mathrm{clamp}(\cdot)$ operation limits overflow values to the k-bit signed integer range $[-2^{k-1},\, 2^{k-1}-1]$. Finally, dequantize back to the floating-point domain to obtain the pseudo-quantized weight under k-bit quantization: $\hat{W}_l = s_l \cdot W_l^{\mathrm{int}}$, where $\hat{W}_l$ denotes the pseudo-quantized weight of the $l$-th layer.
S24, calculating the layer-wise quantization similarity based on step S23. The cosine similarity measures the vector angle between the original weight and the quantized weight and identifies the layers sensitive to quantization-direction perturbation. Compute layer by layer the cosine similarity between the original weight $W_l$ and the pseudo-quantized weight $\hat{W}_l$: $\cos(W_l, \hat{W}_l) = \frac{\sum_{i,j} W_l^{(i,j)} \hat{W}_l^{(i,j)}}{\lVert W_l \rVert_2 \, \lVert \hat{W}_l \rVert_2}$, where $W_l^{(i,j)} \hat{W}_l^{(i,j)}$ denotes the product of the full-precision weight and the pseudo-quantized weight at row $i$, column $j$, and $\lVert W_l \rVert_2$ and $\lVert \hat{W}_l \rVert_2$ denote the L2 norms of the original weight $W_l$ and the pseudo-quantized weight $\hat{W}_l$, respectively. The quantization process is defined by $W_l^{\mathrm{int}} = \mathrm{clamp}\big(\lfloor W_l / s_l \rceil + z,\, -2^{k-1},\, 2^{k-1}-1\big)$ and $\hat{W}_l = s_l \,(W_l^{\mathrm{int}} - z)$, where $s_l$ denotes the quantization scale, $z$ denotes the zero_point ($z = 0$ for symmetric quantization), and $W_l^{\mathrm{int}}$ denotes the post-quantization weight; the scale is computed as $s_l = \max(|W_l|) / (2^{k-1} - 1)$, where $|W_l|$ denotes taking the element-wise absolute value of the weight tensor $W_l$. Judge whether the cosine similarity falls below the threshold $\tau$: if $\cos(W_l, \hat{W}_l) < \tau$, the layer is sensitive to quantization perturbation, and the method moves to step S25 to add the $l$-th layer to the sensitive-layer candidate set $\mathcal{S}$; if $\cos(W_l, \hat{W}_l) \ge \tau$, the layer is non-sensitive, and the method turns to step S27 to allocate a low bit width.
S25, based on step S24, adding the layer indices $l$ of all layers satisfying $\cos(W_l, \hat{W}_l) < \tau$ to the sensitive-layer candidate set $\mathcal{S}$.
S26, performing Hessian-based fine sensitivity evaluation based on step S25. For a candidate layer $l \in \mathcal{S}$, taking the cross-entropy loss function $\mathcal{L}$ as the objective function, its Hessian matrix $H_l$ is defined as the matrix of second partial derivatives of the loss function with respect to the layer parameters: $H_l = \partial^2 \mathcal{L} / \partial \theta_l^2$, where $\theta_l$ denotes the parameters of the $l$-th candidate layer. Then, for the sensitive-layer candidate set, the Hessian trace is calculated using the Hutchinson random estimation method, the trace of the Hessian matrix being defined as $\mathrm{Tr}(H_l) = \sum_i \lambda_i^{(l)} = \sum_i \partial^2 \mathcal{L} / \partial \theta_{l,i}^2$, where $\lambda_i^{(l)}$ denotes the $i$-th eigenvalue of the Hessian matrix $H_l$ of the $l$-th layer and $\partial^2 \mathcal{L} / \partial \theta_{l,i}^2$ denotes the second partial derivative of the cross-entropy loss function $\mathcal{L}$ with respect to the $i$-th element of the parameter vector $\theta_l$ of the $l$-th layer. The Hutchinson random estimation algorithm completes the computation efficiently; based on the Hutchinson estimator with $m$ random samples, the approximation is $\mathrm{Tr}(H_l) \approx \frac{1}{m} \sum_{j=1}^{m} v_j^{\top} H_l v_j$, where $m$ denotes the number of samples, $j$ denotes the index variable with value range $j \in \{1, \ldots, m\}$, $v_j$ denotes the random vector obtained at the $j$-th sampling with $\mathbb{E}[v_j v_j^{\top}] = I$, and $(\cdot)^{\top}$ denotes the transpose operation.
S27, generating the mixed-precision bit-width configuration table based on step S26. After the Hessian trace is calculated in step S26, bit widths are allocated, i.e. a three-level bit-width mapping is executed using the preset thresholds: $b_l = b_{\mathrm{high}}$ if $\mathrm{Tr}(H_l) \ge T_1$; $b_l = b_{\mathrm{mid}}$ if $T_2 \le \mathrm{Tr}(H_l) < T_1$; $b_l = b_{\mathrm{low}}$ if $\mathrm{Tr}(H_l) < T_2$, where $b_l$ denotes the quantization bit width allocated to the $l$-th layer.
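The two screening steps above can be sketched as follows. This is a minimal PyTorch sketch under assumed placeholder values for the reference bit width $k$, the threshold $\tau$, and the trace thresholds; it is an illustration of the technique, not the claimed implementation.

```python
import torch

def pseudo_quantize(w: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Symmetric quantize-dequantize at k bits to simulate quantization noise."""
    qmax = 2 ** (k - 1) - 1
    scale = w.abs().max().clamp_min(1e-12) / qmax
    w_int = torch.clamp(torch.round(w / scale), -(qmax + 1), qmax)
    return w_int * scale

def cosine_similarity(w: torch.Tensor, w_hat: torch.Tensor) -> float:
    """Cosine similarity between full-precision and pseudo-quantized weights."""
    return ((w * w_hat).sum() / (w.norm() * w_hat.norm() + 1e-12)).item()

def hutchinson_trace(loss_fn, params, num_samples: int = 16) -> float:
    """Estimate Tr(H) = E[v^T H v] with Rademacher probe vectors v."""
    trace = 0.0
    for _ in range(num_samples):
        loss = loss_fn()  # cross-entropy on a calibration batch
        grads = torch.autograd.grad(loss, params, create_graph=True)
        v = [torch.randint_like(p, 0, 2) * 2.0 - 1.0 for p in params]  # entries ±1
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params)  # Hessian-vector product H v
        trace += sum((vi * hvi).sum().item() for vi, hvi in zip(v, hv))
    return trace / num_samples

def assign_bitwidth(trace, T1=1e3, T2=1e1, b_high=8, b_mid=6, b_low=4):
    """Three-level mapping: more sensitive (larger trace) gets more bits."""
    return b_high if trace >= T1 else (b_mid if trace >= T2 else b_low)
```

Layers with `cosine_similarity(w, pseudo_quantize(w)) < tau` would enter the candidate set and be ranked by `hutchinson_trace` (called with only that layer's parameters); the rest would receive the low bit width directly.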
- 3. The Vision Transformer model compression method based on the two-step mixed quantization strategy according to claim 1, wherein step S3 is specifically as follows:
S31, calculating the weight-channel quantization parameters. Slice the weight tensor $W_l \in \mathbb{R}^{C_{\mathrm{out}} \times C_{\mathrm{in}}}$ of the $l$-th layer along the output-channel dimension into $C_{\mathrm{out}}$ sub-vectors $W_{l,c}$ and compute the scale factor channel by channel: $s_{l,c} = \max(|W_{l,c}|) / (2^{b_l - 1} - 1)$, where $b_l$ denotes the layer quantization bit width allocated in step S2 and $s_{l,c}$ denotes the weight scale factor of the $c$-th output channel; the zero offsets are uniformly set to zero.
S32, calculating the quantization parameters of the activation channels. Using per-channel quantization of the activations, propagate the calibration data set forward and collect channel-by-channel statistics of the output activation values $X \in \mathbb{R}^{B \times C \times H \times W}$ of each layer, i.e. accumulate Min-Max activation statistics for each channel $c$ over several batches: $x_{\min,c} \leftarrow \min\big(x_{\min,c},\, \min(X_{:,c,:,:})\big)$ and $x_{\max,c} \leftarrow \max\big(x_{\max,c},\, \max(X_{:,c,:,:})\big)$, where $B$ denotes the batch size, $C$ denotes the number of channels, $H$ denotes the feature-map height, and $W$ denotes the feature-map width. Then, symmetric quantization is adopted to avoid the hardware overhead caused by computing zero-point offsets, the symmetric quantization parameters being calculated as $s_c^{\mathrm{act}} = \max\big(|x_{\min,c}|,\, |x_{\max,c}|\big) / (2^{b_l - 1} - 1)$ and $z_c = 0$, where $s_c^{\mathrm{act}}$ denotes the scale factor of the activation values of the $c$-th output channel and $z_c$ denotes the zero offset of the activation values of the $c$-th output channel.
S33, storing the weight and activation quantization parameters. Bind the computed $s_{l,c}$ and $s_c^{\mathrm{act}}$, together with their zero offsets, to the corresponding channels and update them into the quantization parameter dictionary for batch reading.
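The per-channel parameter computation in step S3 can be sketched as below; a minimal sketch assuming Linear/Conv2d weight layouts with the output channel first (the observer class name and its interface are hypothetical).

```python
import torch

def weight_channel_scales(w: torch.Tensor, k: int) -> torch.Tensor:
    """One symmetric scale per output channel; trailing dims are flattened
    so the same code covers Linear [C_out, C_in] and Conv2d [C_out, C_in, kH, kW]."""
    qmax = 2 ** (k - 1) - 1
    return w.reshape(w.shape[0], -1).abs().amax(dim=1) / qmax  # shape [C_out]

class ActRangeObserver:
    """Accumulate per-channel activation min/max across calibration batches."""
    def __init__(self, num_channels: int):
        self.min = torch.full((num_channels,), float("inf"))
        self.max = torch.full((num_channels,), float("-inf"))

    def update(self, x: torch.Tensor):
        # Accepts [B, C, H, W] feature maps or [B, N, C] token activations.
        flat = (x.transpose(1, -1).reshape(-1, x.shape[1]) if x.dim() == 4
                else x.reshape(-1, x.shape[-1]))
        self.min = torch.minimum(self.min, flat.min(dim=0).values)
        self.max = torch.maximum(self.max, flat.max(dim=0).values)

    def scales(self, k: int) -> torch.Tensor:
        # Symmetric scheme: zero point fixed at 0, scale from the larger magnitude.
        qmax = 2 ** (k - 1) - 1
        return torch.maximum(self.min.abs(), self.max.abs()) / qmax
```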
- 4. The Vision Transformer model compression method based on the two-step mixed quantization strategy according to claim 1, wherein step S4 is specifically as follows:
S41, adopting a scaling quantization method based on power-of-two factors for the LayerNorm operators and replacing multiplication operations with shift operations.
S411, loading the LN-layer input activation tensor: load the input activation tensor $X_l$ of the $l$-th LayerNorm layer.
S412, calculating the basic quantization parameters of the input feature channels. Perform channel-by-channel basic quantization preprocessing on the input activation tensor $X_l$: first compute the mean $\mu_c$ and standard deviation $\sigma_c$ of each channel $c$, then compute the basic quantization parameters for the normalized features and quantize them into an integer tensor $\hat{X}_l^{\mathrm{int}}$; the basic quantization parameters comprise the scaling factor $s$ and the zero point $z$.
S413, determining the target quantization bit width $b$ and the hyper-parameter $p$: set the target quantization bit width $b$ according to the target quantization accuracy requirement, and set the shift-constraint hyper-parameter $p$, which defines the value range of the exponent.
S414, calculating the power exponents channel by channel. For the learnable scaling parameter $\gamma_c$ of the $l$-th LayerNorm layer, search channel by channel for the nearest power-of-two approximation; for each channel $c$ the exponent is $\alpha_c = \lfloor \log_2 |\gamma_c| \rceil$, where $\lfloor\cdot\rceil$ denotes rounding to the nearest integer and $\alpha_c$ denotes the power exponent corresponding to the $c$-th channel of the $l$-th LayerNorm layer. Combined with the shift-constraint hyper-parameter $p$, the exponent is range-constrained as $\alpha_c \leftarrow \mathrm{clamp}(\alpha_c, -p, p)$.
S415, generating the channel-by-channel shift factor table $T$. Based on the constrained channel-by-channel power exponents, construct the channel-by-channel shift factor table $T = \{\alpha_c\}$, where $\alpha_c$ denotes the shift factor corresponding to the $c$-th channel of the $l$-th LayerNorm layer, all shift factors together forming the table $T$.
S416, executing the LayerNorm quantization strategy based on the shift factor table generated in step S415. Combine the basic quantization parameters of the input feature channels with the shift factor table $T$: take the shift factor $\alpha_c$ of the current channel $c$ out of the table $T$ and perform the LayerNorm computation in the integer domain.
S417, replacing multiplication operations with channel-by-channel shift operations. The original LayerNorm computation flow is $y = \gamma \cdot \frac{x - \mu}{\sigma} + \beta$, where $x$ denotes the input feature vector of the LayerNorm layer, $\mu$ and $\sigma$ respectively denote the mean and standard deviation of the feature vector, $\gamma$ and $\beta$ denote the learnable scaling and bias parameters, and $y$ denotes the output feature after normalization, scaling, and biasing. The original floating-point multiplication is then replaced by an integer shift operation, the integer-version LayerNorm computation flow being $y^{\mathrm{int}} = \big(\hat{x}^{\mathrm{int}} \gg (-\alpha_c)\big) + \beta^{\mathrm{int}}$, where $y^{\mathrm{int}}$ denotes the integer output result of LayerNorm, $\beta^{\mathrm{int}}$ denotes the quantized bias parameter, $\gg$ denotes the arithmetic right shift, and $\alpha_c$ denotes the shift factor of the current channel $c$ looked up from the table to realize fast computation (for $\alpha_c \ge 0$ the scaling by $2^{\alpha_c}$ is correspondingly a left shift).
S418, outputting the quantized LayerNorm result $Y_l^{\mathrm{int}}$: the generated integer-domain LayerNorm output $Y_l^{\mathrm{int}}$ is used for the computation of the Transformer layer.
S42, adopting a quantization method based on integer approximation in the logarithmic domain for the Softmax operator, realizing approximate exponential computation by means of an integer lookup table.
S421, quantizing the $Q$, $K$, $V$ feature matrices: perform fixed-point quantization on the original floating-point query matrix $Q$, key matrix $K$, and value matrix $V$ of the attention module respectively to obtain the corresponding INTk representations $Q^{\mathrm{int}}$, $K^{\mathrm{int}}$, $V^{\mathrm{int}}$.
S422, calculating the attention score matrix based on integer operations. First, the quantized attention score matrix is computed as $A^{\mathrm{int}} = Q^{\mathrm{int}} (K^{\mathrm{int}})^{\top}$, where $A^{\mathrm{int}}$ denotes the quantized attention score matrix, $Q^{\mathrm{int}}$ denotes the quantized query matrix, and $(K^{\mathrm{int}})^{\top}$ denotes the transpose of the quantized key matrix, the transpose operation being used to match the dimensions of the matrix multiplication.
S423, constructing the exponential-approximation lookup table. Construct offline an exponential-approximation lookup table $\mathrm{LUT}$ for storing fixed-point approximations of the exponential function: $\mathrm{LUT}[i] = \big\lfloor 2^{n} \cdot e^{(i - c_0)\, s_e} \big\rceil$, where $i$ denotes the lookup-table index, $c_0$ denotes an exponential offset constant, $s_e$ denotes a scale parameter, and $n$ denotes the index bit width; the lookup-table entries are stored in fixed-point format and cover the exponential approximation over the required input range.
S424, performing the exponential-approximation calculation based on the lookup table. The corresponding exponential approximation is obtained through the lookup table according to the Softmax integer input. The original Softmax is defined as $\mathrm{softmax}(a_i) = \frac{e^{a_i}}{\sum_j e^{a_j}}$, where $a_i$ denotes the $i$-th correlation score value of a query vector in the attention score matrix and $a_j$ denotes the other correlation score values corresponding to the same query vector. Taking the natural logarithm of numerator and denominator converts the exponential sum into a subtraction in the logarithmic domain: $\ln \mathrm{softmax}(a_i) = a_i - \ln \sum_j e^{a_j}$. The attention score $a_i$ is quantized to a k-bit integer $a_i^{\mathrm{int}}$ as the input index of the lookup table; at model inference time the integer exponential approximation is obtained through the lookup table as $e^{a_i} \approx \mathrm{LUT}[a_i^{\mathrm{int}}]$.
S425, normalization calculation based on shift operations. A shift-based normalization operation is performed on the exponential approximations: $P_i^{\mathrm{int}} = \big(\mathrm{LUT}[a_i^{\mathrm{int}}] \ll r\big) \big/ \sum_j \mathrm{LUT}[a_j^{\mathrm{int}}]$, where $r$ denotes the number of normalization shift bits and $P^{\mathrm{int}}$ denotes the attention weight matrix in the integer domain, quantized to k-bit integers.
S426, calculating the weighted sum of the output matrix based on $P^{\mathrm{int}}$. The normalized attention weights and the value matrix $V^{\mathrm{int}}$ are combined by integer multiply-accumulate operations: $O^{\mathrm{int}} = P^{\mathrm{int}} \cdot V^{\mathrm{int}}$, where $O^{\mathrm{int}}$ denotes the attention computation result at k-bit integer precision.
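The LayerNorm shift trick of steps S414-S417 can be sketched as follows: each channel's learned scale $\gamma_c$ is snapped to the nearest power of two so the per-channel multiply becomes a shift. This is a minimal sketch under assumed integer formats; function names are illustrative, not the claimed implementation.

```python
import torch

def shift_factor_table(gamma: torch.Tensor, p: int = 15) -> torch.Tensor:
    """Per-channel exponent alpha_c = round(log2|gamma_c|), clamped to [-p, p]."""
    alpha = torch.round(torch.log2(gamma.abs().clamp_min(1e-12)))
    return alpha.clamp(-p, p).to(torch.int32)

def layernorm_int(x_int: torch.Tensor, alpha: torch.Tensor, beta_int: torch.Tensor):
    """Integer LayerNorm body: x_int is the quantized normalized activation
    [..., C]; scaling by gamma_c ~ 2^alpha_c is done with shifts, not multiplies."""
    left = x_int << alpha.clamp_min(0)           # alpha_c >= 0: left shift
    right = x_int >> (-alpha).clamp_min(0)       # alpha_c < 0: arithmetic right shift
    return torch.where(alpha >= 0, left, right) + beta_int
```

A Softmax lookup table in the same spirit would precompute fixed-point values of the exponential over the k-bit score range and index it with the quantized attention scores, as claim 4 describes.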
- 5. The Vision Transformer model compression method based on the two-step mixed quantization strategy according to claim 1, wherein step S5 is specifically as follows:
S51, constructing the quantized model. Based on the full-precision Vision Transformer model and the quantization configuration, replace the target network layers with quantization layers, integrate the LayerNorm shift quantization and the integer computation logic of the Softmax logarithmic-domain-approximation special operators, and generate an initial quantized model.
S52, exporting the quantized model and converting its format. Convert the constructed quantized model into a format that the hardware can deploy directly, ensuring that the model contains the complete quantization parameters and operator logic and is adapted to the computing architecture of the target hardware.
S53, performing calibration-set performance evaluation based on the quantized model obtained in steps S51-S52. Perform inference evaluation on the quantized model using the calibration data set to obtain multi-dimensional performance indexes, including: (1) accuracy indexes, namely computing the Top-1/Top-5 accuracy and cosine similarity to evaluate the accuracy loss of the quantized model; (2) performance indexes, namely collecting statistics on inference latency, memory occupation, and hardware energy consumption to evaluate the deployment efficiency of the model; (3) stability verification, namely testing the inference stability of the model under different batch sizes and input resolutions.
S54, performing quantized-model verification and iterative optimization based on the multi-dimensional performance indexes obtained in step S53. Compare the indexes obtained by the evaluation in step S53 with the preset target thresholds; if the indexes meet the requirements, enter step S6; otherwise return to steps S3-S4 to adjust the quantization configuration and then return to step S5 to reconstruct the quantized model, until the indexes meet the standard.
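The accuracy part of the S53 evaluation can be sketched as below: Top-1/Top-5 accuracy of the quantized model plus cosine similarity between its logits and the full-precision logits. A minimal sketch; the harness shape (paired batch/label lists) is an assumption.

```python
import torch

@torch.no_grad()
def evaluate(fp_model, q_model, batches, labels):
    """Returns (top1, top5, mean logit cosine similarity) over calibration batches."""
    top1 = top5 = n = 0
    cos_sum = 0.0
    for x, y in zip(batches, labels):
        fp_logits, q_logits = fp_model(x), q_model(x)
        cos_sum += torch.nn.functional.cosine_similarity(
            fp_logits, q_logits, dim=1).mean().item()
        topk = q_logits.topk(5, dim=1).indices
        top1 += (topk[:, 0] == y).sum().item()
        top5 += (topk == y.unsqueeze(1)).any(dim=1).sum().item()
        n += y.numel()
    return top1 / n, top5 / n, cos_sum / len(batches)
```

Latency, memory, and energy statistics (index group 2) and the batch-size/resolution stability sweep (group 3) would wrap the same inference loop with the target hardware's profiling tools.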
Description
Vision Transformer model compression method based on two-step mixed quantization strategy
Technical Field
The invention belongs to the technical field of computer vision and neural network compression, and particularly relates to a Vision Transformer model compression method based on a two-step mixed quantization strategy.
Background
With the wide application of Vision Transformer (ViT) models in the field of computer vision, these models have achieved excellent performance in tasks such as image classification and image segmentation. However, ViT has a huge number of parameters (for example, ViT-Base contains 86M parameters and ViT-Huge exceeds 600M parameters), so it is difficult to deploy on resource-limited edge devices, which severely restricts its use in mobile terminals, embedded systems, and real-time applications. Model quantization is an effective model compression technique: by converting model weights and activation values from high-precision floating-point numbers (such as FP32) to low-precision integers (such as INT8 or INT4), it can significantly reduce model size and energy consumption and accelerate inference. Existing quantization methods have the following problems when applied to Vision Transformer models:
1. Large inter-channel differences cause serious accuracy loss. The self-attention mechanism and LayerNorm operations of Vision Transformer produce significant numerical distribution differences among channels, so the conventional per-tensor quantization method, which uses uniform quantization parameters, introduces serious quantization errors and hence serious accuracy loss.
2. Sensitive-layer identification is inefficient. Most existing methods use a single index (such as only the weight gradient or only the Hessian matrix) to identify quantization-sensitive layers, but a single index has limited identification accuracy: gradient-based methods are simple but not accurate enough, while Hessian-based methods are accurate but computationally expensive and infeasible for the quantization deployment of large-scale models.
3. Special operators are difficult to quantize. Vision Transformer models contain many special operators (such as Softmax, GELU, and LayerNorm) whose output distributions are complex (for example, the Softmax output is bimodally distributed), and existing uniform quantization methods cannot handle these distribution characteristics effectively, so the special operators require special treatment.
In view of the foregoing, there is a need for a Vision Transformer quantization method that can efficiently identify sensitive layers, accommodate inter-channel differences, handle special operators, and offer good hardware compatibility.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a Vision Transformer model compression method based on a two-step mixed quantization strategy, which efficiently identifies sensitive layers through a two-step strategy combining rapid cosine-similarity pre-screening with fine Hessian analysis, adapts to the inter-channel differences of Vision Transformer through channel-by-channel quantization, and integrates several advanced quantization techniques to handle special operators, thereby reducing the memory occupation of the model and accelerating its inference while keeping high accuracy. The technical scheme adopted by the invention, a Vision Transformer model compression method based on a two-step mixed quantization strategy, comprises the following specific steps: S1, preparing a pre-trained model and a calibration data set. A full-precision Vision Transformer model trained on a large-scale dataset is first prepared, the Vision Transformer model comprising an embedding layer, a multi-layer Transformer encoding module, and an output head module; the Vision Transformer model comprises ViT-Base, ViT-Large, DeiT-Base, and SAM-Huge. Then, the model structure is confirmed and the quantization configuration parameters are initialized: the types and parameter scales of all layers in the model are analyzed, all Linear, Conv2d, LayerNorm, and Softmax layers are identified, and the core structure of the loaded model, comprising a Patch Embedding layer, a plurality of Transformer encoding modules, and an output head module, is explicitly determined. Finally, the calibration data set is sampled: several batches are sampled from a subset of the pre-training model's training set according to actual conditions, and preprocessing consistent with the model training stage is applied to the sampled calibration data, wherein the preprocessing comprises resizing, normalization, and Patch Embedding-based image block division.