CN-121981279-A - Large language model quantization method, device, electronic equipment and storage medium

Abstract

The invention discloses a quantization method and device for a large language model, an electronic device, and a storage medium. The method comprises: adaptively determining a respective target quantization granularity for each quantizable layer of a specified large language model; performing group quantization on the large language model at a first quantization precision based on each target quantization granularity, to obtain a first quantized model; verifying the inference accuracy of the first quantized model on a target hardware device using a preset validation data set; if the inference accuracy does not reach a preset threshold, identifying a sensitive layer and raising its quantization precision from the first quantization precision to a second quantization precision, to obtain a second quantized model; and verifying the inference accuracy of the second quantized model, iteratively performing the sensitive-layer identification and quantization-precision adjustment operations according to the verification result until the inference accuracy reaches the preset threshold, and outputting a final quantized model. The invention balances the inference accuracy and hardware efficiency of a large language model in actual deployment.

Inventors

  • ZHANG HAO

Assignees

  • 江苏清微智能科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-04-08

Claims (10)

  1. A method for quantizing a large language model, comprising: adaptively determining a respective target quantization granularity for each quantizable layer of a specified large language model; performing group quantization on the large language model at a first quantization precision based on the target quantization granularity adapted to each quantizable layer, to obtain a first quantized model; verifying the inference accuracy of the first quantized model on a target hardware device using a preset validation data set; if the inference accuracy does not reach a preset threshold, identifying a sensitive layer and adjusting its quantization precision from the first quantization precision to a second quantization precision, to obtain a second quantized model; and verifying the inference accuracy of the second quantized model, iteratively performing the sensitive-layer identification and quantization-precision adjustment operations according to the verification result until the inference accuracy reaches the preset threshold, and outputting a final quantized model.
  2. The large language model quantization method according to claim 1, wherein adaptively determining a respective target quantization granularity for each quantizable layer of the specified large language model comprises: traversing a plurality of preset candidate quantization granularities for any quantizable layer; for each candidate quantization granularity, performing group quantization on the quantizable layer at the first quantization precision and calculating a corresponding accuracy-efficiency joint optimization index AETS; and selecting a candidate quantization granularity whose AETS satisfies a preset AETS condition as the target quantization granularity for that quantizable layer.
  3. The large language model quantization method according to claim 2, wherein performing group quantization on the quantizable layer at the first quantization precision for each candidate quantization granularity and calculating the corresponding AETS comprises: performing group quantization on the quantizable layer at the first quantization precision for the candidate quantization granularity, to obtain a quantized model at that candidate quantization granularity; calculating the inference precision loss and hardware overhead corresponding to the candidate quantization granularity based on a preset verification data set and the quantized model at that candidate quantization granularity; and determining the AETS corresponding to the candidate quantization granularity based on the inference precision loss and the hardware overhead.
  4. The large language model quantization method according to claim 3, wherein performing group quantization on the quantizable layer at the first quantization precision for the candidate quantization granularity, to obtain a quantized model at that candidate quantization granularity, comprises: dividing the weight matrix corresponding to the quantizable layer into a plurality of groups by columns based on the candidate quantization granularity; for each group, calculating the maximum absolute value of the weight elements in the group as a scaling factor; and quantizing the weights in each group at the first quantization precision using each group's scaling factor, to obtain the quantized model at the candidate quantization granularity; wherein calculating the precision loss corresponding to the candidate quantization granularity based on the preset verification data set and the quantized model at the candidate quantization granularity comprises: performing forward inference with the quantized model at the candidate quantization granularity and with the specified large language model, respectively, using a preset calibration data set, to obtain a corresponding quantized output distribution and an original output distribution; calculating the KL divergence between the quantized output distribution and the original output distribution; and determining the precision loss based on the KL divergence; and wherein calculating the hardware overhead corresponding to the candidate quantization granularity based on the preset verification data set and the quantized model at the candidate quantization granularity comprises: determining the total number of scaling factors required for the quantizable layer according to the candidate quantization granularity; obtaining a kernel efficiency penalty factor for running the candidate quantization granularity on the target hardware; and determining the hardware overhead based on a weighted sum of the total number of scaling factors and the kernel efficiency penalty factor.
  5. The large language model quantization method according to claim 1, wherein verifying the inference accuracy of the second quantized model, iteratively performing the sensitive-layer identification and quantization-precision adjustment operations according to the verification result until the inference accuracy reaches the preset threshold, and outputting the final quantized model comprises: if the inference accuracy of the second quantized model does not reach the preset threshold and the number of layers whose quantization precision has been raised does not exceed a preset layer-count upper limit, continuing to identify a new sensitive layer and raising its quantization precision, generating an updated second quantized model, and verifying its inference accuracy again; and if the inference accuracy of the second quantized model reaches the preset threshold or the number of layers whose quantization precision has been raised exceeds the layer-count upper limit, outputting the second quantized model as the final quantized model.
  6. The large language model quantization method according to any one of claims 1 to 5, wherein identifying a sensitive layer comprises: calculating a weight-aware calibration ratio WACR for each quantizable layer based on its weight matrix; and sorting the quantizable layers by WACR and selecting a preset number or preset proportion of quantizable layers with the smallest WACR values as the sensitive layers; and wherein adjusting the quantization precision of the sensitive layer from the first quantization precision to the second quantization precision comprises: raising the quantization precision of the sensitive layer from the first quantization precision to the second quantization precision, wherein the second quantization precision is a multiple of the first quantization precision.
  7. A large language model quantization apparatus, comprising: a selection module configured to adaptively determine a respective target quantization granularity for each quantizable layer of a specified large language model; a quantization module configured to perform group quantization on the large language model at a first quantization precision based on the target quantization granularity adapted to each quantizable layer, to obtain a first quantized model; a verification module configured to verify the inference accuracy of the first quantized model on a target hardware device using a preset validation data set; an identification module configured to, if the inference accuracy does not reach a preset threshold, identify a sensitive layer and adjust its quantization precision from the first quantization precision to a second quantization precision, to obtain a second quantized model; and an output module configured to verify the inference accuracy of the second quantized model, iteratively perform the sensitive-layer identification and quantization-precision adjustment operations according to the verification result until the inference accuracy reaches the preset threshold, and output a final quantized model.
  8. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 6.
  9. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 6.
  10. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 6.
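The sensitive-layer identification of claim 6 ranks quantizable layers by a weight-aware calibration ratio (WACR) and picks the layers with the smallest values. The patent does not give a closed-form WACR definition, so the sketch below assumes a simple proxy, the ratio of mean to maximum absolute weight: a small ratio means a few outliers dominate the group range, which is exactly where low-bit group quantization loses the most information. Both the formula and the `layers` mapping are assumptions for illustration.

```python
import numpy as np

def wacr(weight: np.ndarray) -> float:
    """Weight-aware calibration ratio (assumed proxy): mean|w| / max|w|.

    A small ratio indicates that a few large outliers dominate the
    dynamic range, so this layer is more sensitive to low-bit
    group quantization.
    """
    a = np.abs(weight)
    return float(a.mean() / (a.max() + 1e-12))

def identify_sensitive_layers(layers: dict, k: int) -> list:
    """Sort quantizable layers by WACR ascending and return the k
    smallest (per claim 6, these are treated as sensitive layers).

    layers: mapping from layer name to its 2-D weight matrix.
    """
    ranked = sorted(layers, key=lambda name: wacr(layers[name]))
    return ranked[:k]
```

A preset proportion instead of a fixed count `k` would simply be `k = int(len(layers) * ratio)`.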

Description

Large language model quantization method, device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method and apparatus for quantizing a large language model, an electronic device, and a storage medium.

Background

As the parameter scale of large language models continues to grow, inference deployment faces the dual challenges of the "GPU memory wall" and "accuracy collapse". To address these problems, the industry generally applies quantization to large language models. Currently, the quantizable layers of a large language model are typically quantized at the same quantization precision with a fixed quantization granularity. However, when all quantizable layers share one quantization precision or one fixed granularity, some layers lose inference accuracy because their grouping is too coarse, while others waste hardware resources because their grouping is too fine, resulting in low hardware efficiency and an imbalance between inference accuracy and hardware efficiency in actual deployment.

Disclosure of Invention

The invention provides a large language model quantization method and apparatus, an electronic device, and a storage medium. Its main aim is to adaptively select the quantization granularity layer by layer and dynamically adjust the quantization precision, so that both the inference accuracy and the hardware efficiency of a large language model in actual deployment are effectively taken into account.
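The iterative flow described above (quantize everything at a low precision, then promote sensitive layers until validation accuracy passes a threshold or a layer budget is exhausted) can be sketched as follows. This is a minimal sketch: `accuracy_fn` stands in for on-device validation of the quantized model, `sensitivity_order` for a most-sensitive-first ranking (e.g. by ascending WACR), and the 4-bit/8-bit defaults are illustrative, not taken from the patent.

```python
def adaptive_precision_schedule(layer_names, sensitivity_order, accuracy_fn,
                                threshold, max_promoted,
                                low_bits=4, high_bits=8):
    """Return a per-layer bit-width map for the iterative adjustment loop.

    Start with every layer at low_bits; while validation accuracy is
    below `threshold` and fewer than `max_promoted` layers have been
    promoted, raise the next most-sensitive layer to high_bits and
    re-validate.
    """
    bits = {name: low_bits for name in layer_names}
    promoted = 0
    while accuracy_fn(bits) < threshold and promoted < max_promoted:
        bits[sensitivity_order[promoted]] = high_bits  # promote sensitive layer
        promoted += 1
    return bits
```

In practice `accuracy_fn` would quantize the model with the given bit map and run the validation data set on the target hardware; here it is a callback so the control flow can be tested in isolation.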
According to a first aspect of the present invention, there is provided a large language model quantization method, comprising: adaptively determining a respective target quantization granularity for each quantizable layer of a specified large language model; performing group quantization on the large language model at a first quantization precision based on the target quantization granularities, to obtain a first quantized model; verifying the inference accuracy of the first quantized model on a target hardware device using a preset validation data set; if the inference accuracy does not reach a preset threshold, identifying a sensitive layer and raising its quantization precision to a precision grade higher than the first quantization precision, to obtain a second quantized model; and verifying the inference accuracy of the second quantized model, iteratively performing the sensitive-layer identification and quantization-precision adjustment operations according to the verification result until the inference accuracy reaches the preset threshold, and outputting a final quantized model.

Optionally, adaptively determining a respective target quantization granularity for each quantizable layer of the specified large language model comprises: traversing a plurality of preset candidate quantization granularities for any quantizable layer; for each candidate quantization granularity, performing group quantization on the quantizable layer at the first quantization precision and calculating a corresponding accuracy-efficiency joint optimization index AETS; and selecting a candidate quantization granularity whose AETS satisfies a preset AETS condition as the target quantization granularity for that quantizable layer.
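The per-layer granularity search can be sketched as below. The patent determines AETS from precision loss and hardware overhead, with overhead a weighted sum of the scaling-factor count and a kernel efficiency penalty; the exact combination (here `loss + beta * overhead`, minimized over candidates) and the weights `alpha`, `beta` are assumptions, as are the callbacks `loss_fn` (KL-divergence-based precision loss at group size `g`) and `kernel_penalty`.

```python
import math

def select_granularity(num_cols, candidates, loss_fn, kernel_penalty,
                       alpha=1.0, beta=1e-3):
    """Pick the candidate group size minimizing an assumed AETS score.

    AETS = loss_fn(g) + beta * overhead, where
    overhead = (#scaling factors) + alpha * kernel_penalty(g):
    finer groups need more scaling factors (one per column group),
    coarser groups typically cost more accuracy.
    """
    best, best_score = None, math.inf
    for g in candidates:
        n_scales = math.ceil(num_cols / g)          # one scale per column group
        overhead = n_scales + alpha * kernel_penalty(g)
        score = loss_fn(g) + beta * overhead
        if score < best_score:
            best, best_score = g, score
    return best
```

With a loss that grows in the group size and an overhead that shrinks in it, the minimum lands at an intermediate granularity, which is the trade-off the AETS condition is meant to capture.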
Optionally, for any candidate quantization granularity, performing group quantization on the quantizable layer at the first quantization precision and calculating the corresponding accuracy-efficiency joint optimization index AETS comprises: performing group quantization on the quantizable layer at the first quantization precision for the candidate quantization granularity, to obtain a quantized model at that candidate quantization granularity; calculating the precision loss and hardware overhead corresponding to the candidate quantization granularity based on a preset verification data set and the quantized model at that candidate quantization granularity; and determining the AETS corresponding to the candidate quantization granularity based on the precision loss and the hardware overhead.

Optionally, for any candidate quantization granularity, performing group quantization on the quantizable layer at the first quantization precision to obtain a quantized model at that candidate quantization granularity comprises: dividing the weight matrix corresponding to the quantizable layer into a plurality of groups by columns based on the candidate quantization granularity; for each group, calculating the maximum absolute value of the weight elements in the group as a scaling factor; and quantizing the weights in each group at the first quantization precision using each group's scaling factor, to obtain the quantized model at the candidate quantization granularity.
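The column-grouped absmax quantization above, and the KL-divergence precision-loss measure described in claim 4, can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: symmetric rounding to signed integers, the quantized weights returned dequantized (to simulate accuracy), and the distributions passed to `kl_divergence` assumed non-negative (e.g. softmax outputs on calibration data).

```python
import numpy as np

def group_absmax_quantize(w, group_size, bits=4):
    """Column-grouped symmetric absmax quantization of a 2-D weight matrix.

    Splits columns into groups of `group_size`, uses each group's max
    absolute value as the scaling factor, rounds to `bits`-bit signed
    integers, and returns the dequantized result.
    """
    qmax = 2 ** (bits - 1) - 1
    out = np.empty_like(w, dtype=np.float64)
    for start in range(0, w.shape[1], group_size):
        g = w[:, start:start + group_size]
        scale = max(np.abs(g).max(), 1e-12) / qmax      # per-group absmax scale
        out[:, start:start + group_size] = np.clip(
            np.round(g / scale), -qmax, qmax) * scale
    return out

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between the original and quantized output
    distributions, after normalizing each to sum to 1."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

At 8 bits the dequantized weights stay close to the originals, and the KL divergence between the two models' output distributions shrinks toward zero; coarser groups or fewer bits grow both, which feeds the precision-loss term of AETS.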