CN-121981168-A - Model parameter compression method, device and equipment for large language model and storage medium

CN121981168A

Abstract

An embodiment of the application provides a model parameter compression method, apparatus, device, and storage medium for a large language model. The method comprises: acquiring verification data and inputting it into the large language model to obtain the input activation tensor received by each network layer; acquiring the inference perplexity of the large language model on the verification data and, taking minimization of the inference perplexity as the optimization target, performing an iterative clipping decision to obtain clipping decision vectors; performing outlier clipping on the input activation tensors of the network layers based on the clipping decision vectors to obtain target activation tensors; for each network layer, calculating importance scores of the model parameters based on the target activation tensor and pruning the initial model parameter tensor according to those scores to obtain an intermediate model parameter tensor; and quantizing the intermediate model parameter tensors of the network layers to obtain the target large language model. The storage and computational complexity of the large language model can thereby be reduced while its performance is maintained.

Inventors

  • Jia Wanyi
  • Ma Zhengyu
  • Yu Liutao
  • Zhou Huihui

Assignees

  • Peng Cheng Laboratory

Dates

Publication Date
2026-05-05
Application Date
2025-12-31

Claims (10)

  1. A method for compressing model parameters of a large language model, the method comprising: acquiring verification data and inputting the verification data into a preset large language model for forward propagation, so as to acquire the input activation tensor received by each network layer in the large language model; acquiring the inference perplexity of the large language model on the verification data, and performing an iterative clipping decision over the input activation tensors of the network layers with minimization of the inference perplexity as the optimization target, so as to obtain clipping decision vectors corresponding to a plurality of network layers; performing, based on the clipping decision vectors, outlier clipping processing on the input activation tensors of the corresponding network layers in the large language model, so as to obtain the processed target activation tensor of each network layer; for each network layer of the large language model, acquiring an initial model parameter tensor, calculating an importance score of each model parameter in the initial model parameter tensor based on the corresponding target activation tensor, and pruning the initial model parameter tensor according to the importance scores of the model parameters, so as to obtain an intermediate model parameter tensor; and quantizing the intermediate model parameter tensor corresponding to each network layer of the large language model, so as to obtain a target large language model.
  2. The method for compressing model parameters of a large language model according to claim 1, wherein performing the iterative clipping decision over the input activation tensors of the network layers with minimization of the inference perplexity as the optimization target to obtain the clipping decision vectors corresponding to the plurality of network layers comprises: constructing initial clipping decision vectors corresponding to the plurality of network layers of the large language model; clipping the input activation tensors of the corresponding network layers based on the initial clipping decision vectors, and then running the large language model to obtain its inference perplexity on the verification data; with minimization of the inference perplexity as the optimization target, updating the initial clipping decision vectors based on the inference perplexity to obtain updated clipping decision vectors; clipping the input activation tensors of the plurality of network layers according to the updated clipping decision vectors, running the large language model to obtain the inference perplexity, and performing an update based on that inference perplexity to obtain further updated clipping decision vectors; and repeatedly executing the steps of clipping the input activation tensors according to the updated clipping decision vectors, running the large language model to obtain the inference perplexity, and updating the clipping decision vectors based on the inference perplexity, until a preset number of iterations is reached, so as to obtain the clipping decision vectors corresponding to the plurality of network layers.
  3. The method for compressing model parameters of a large language model according to claim 2, wherein updating the initial clipping decision vectors based on the inference perplexity to obtain updated clipping decision vectors comprises: acquiring an initial velocity vector corresponding to the initial clipping decision vector; treating the initial clipping decision vector as a single particle in a particle swarm, and calculating the current individual historical best decision vector of the single particle and the global historical best decision vector of the particle swarm based on the inference perplexities of the plurality of particles in the swarm; updating the initial velocity vector of the single particle based on the individual historical best decision vector and the global historical best decision vector to obtain a next velocity vector; and performing probability mapping based on the next velocity vector to obtain the updated clipping decision vector.
  4. The method for compressing model parameters of a large language model according to claim 1, wherein performing, based on the clipping decision vectors, outlier clipping processing on the input activation tensors of the corresponding network layers in the large language model to obtain the processed target activation tensor of each network layer comprises: determining the network layers to be clipped from the large language model based on the clipping decision vectors; acquiring, for each network layer to be clipped, a preset lower-bound proportion threshold and a preset upper-bound proportion threshold for the activation values; determining, for the input activation tensor of each network layer to be clipped, a corresponding activation-value lower-bound threshold based on the lower-bound proportion threshold and a corresponding activation-value upper-bound threshold based on the upper-bound proportion threshold; and, for the input activation tensor of each network layer to be clipped, adjusting each first activation value smaller than the activation-value lower-bound threshold to that lower-bound threshold and each second activation value larger than the activation-value upper-bound threshold to that upper-bound threshold, so as to obtain the processed target activation tensor of each network layer.
  5. The method for compressing model parameters of a large language model according to claim 1, wherein calculating the importance score of each model parameter in the initial model parameter tensor based on the corresponding target activation tensor comprises: determining, based on the target activation tensor, the activation value set corresponding to each input channel of the current network layer and the target norm value corresponding to that activation value set; calculating a corresponding squared parameter value for each model parameter contained in the initial model parameter tensor of the network layer; and, for each model parameter in the initial model parameter tensor, determining the corresponding target input channel and calculating the corresponding importance score based on the squared parameter value and the target norm value corresponding to that target input channel.
  6. The method for compressing model parameters of a large language model according to claim 1, wherein quantizing the intermediate model parameter tensor corresponding to each network layer of the large language model to obtain the target large language model comprises: determining the output feature dimension of the intermediate model parameter tensor corresponding to each network layer of the large language model; dividing the intermediate model parameter tensor into a plurality of vector blocks along the output feature dimension; and quantizing each vector block separately in each network layer of the large language model, so as to obtain the target large language model.
  7. The method for compressing model parameters of a large language model according to claim 6, wherein quantizing each vector block separately in each network layer of the large language model to obtain the target large language model comprises: in each network layer of the large language model, determining a corresponding maximum floating-point value and a corresponding minimum floating-point value for each vector block of the contained intermediate model parameter tensor, and calculating a corresponding scale factor based on the maximum floating-point value and the minimum floating-point value; calculating the ratio between each floating-point value in each vector block and the corresponding scale factor to obtain a target ratio; acquiring the zero point corresponding to each vector block, and performing a quantization conversion calculation on the corresponding floating-point value based on the target ratio and the zero point to obtain an initial quantized value; processing the initial quantized value through a preset clipping function to obtain the target quantized value corresponding to each floating-point value; and obtaining the target large language model after each floating-point value of each vector block contained in each network layer of the large language model has been quantized.
  8. A model parameter compression apparatus for a large language model, the apparatus comprising: an acquisition module, configured to acquire verification data and input the verification data into a preset large language model for forward propagation, so as to acquire the input activation tensor received by each network layer in the large language model; a decision module, configured to acquire the inference perplexity of the large language model on the verification data and perform an iterative clipping decision over the input activation tensors of the network layers with minimization of the inference perplexity as the optimization target, so as to obtain clipping decision vectors corresponding to a plurality of network layers; a clipping module, configured to perform outlier clipping processing on the input activation tensors of the corresponding network layers in the large language model based on the clipping decision vectors, so as to obtain the processed target activation tensor of each network layer; a pruning module, configured to acquire, for each network layer of the large language model, an initial model parameter tensor, calculate an importance score of each model parameter in the initial model parameter tensor based on the corresponding target activation tensor, and prune the initial model parameter tensor according to the importance scores of the model parameters, so as to obtain an intermediate model parameter tensor; and a quantization module, configured to quantize the intermediate model parameter tensor corresponding to each network layer of the large language model, so as to obtain the target large language model.
  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the method for compressing model parameters of a large language model according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for compressing model parameters of a large language model according to any one of claims 1 to 7.
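
Read procedurally, claims 2 and 3 describe a binary particle-swarm search over per-layer clipping decisions, driven by the model's perplexity on the verification data. The following is a minimal sketch of that loop, assuming a hypothetical `perplexity(decision)` callback that clips the cached input activations according to a 0/1 decision vector, runs the model on the verification data, and returns its perplexity; the swarm size, iteration budget, and PSO coefficients are illustrative choices, not values fixed by the patent.

```python
# Minimal binary-PSO sketch of the iterative clipping decision (claims 2-3).
# `perplexity` is a hypothetical callback, not defined by the patent text.
import numpy as np

rng = np.random.default_rng(0)

def binary_pso(perplexity, n_layers, n_particles=8, n_iters=20,
               w=0.7, c1=1.5, c2=1.5):
    # Initial 0/1 clipping decision vectors and velocities (claim 2).
    x = rng.integers(0, 2, size=(n_particles, n_layers)).astype(float)
    v = rng.normal(0.0, 1.0, size=(n_particles, n_layers))
    pbest = x.copy()                                  # individual bests
    pbest_ppl = np.array([perplexity(p) for p in x])
    g = pbest[pbest_ppl.argmin()].copy()              # global best vector
    for _ in range(n_iters):                          # preset iteration count
        r1, r2 = rng.random(v.shape), rng.random(v.shape)
        # Velocity update from individual and global bests (claim 3).
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        # Probability mapping: sigmoid of velocity gives P(layer is clipped).
        x = (rng.random(v.shape) < 1.0 / (1.0 + np.exp(-v))).astype(float)
        ppl = np.array([perplexity(p) for p in x])
        better = ppl < pbest_ppl                      # minimise perplexity
        pbest[better], pbest_ppl[better] = x[better], ppl[better]
        g = pbest[pbest_ppl.argmin()].copy()
    return g
```

A practical implementation would cache each layer's input activations once and re-run only the forward pass inside `perplexity`, since every particle evaluation otherwise repeats the full calibration pass.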

Description

Model parameter compression method, device and equipment for large language model and storage medium

Technical Field

The present application relates to the field of data compression technologies, and in particular to a method, an apparatus, a device, and a storage medium for compressing model parameters of a large language model.

Background

In recent years, large language models have shown excellent performance in natural language understanding and generation tasks, but their huge parameter counts and computation volumes also bring significant storage overhead and inference latency. To achieve efficient deployment in resource-constrained environments, the model must be compressed so as to reduce storage requirements and computational complexity while preserving its performance as far as possible. In the related art, compression is generally achieved by sparsifying and then quantizing the model parameters: part of the network connections are pruned according to the magnitude of their weights, and the remaining parameters are then approximated by integers at a uniform scale to obtain a compressed model. However, because the two stages are independent of each other and lack synergy, the pruning stage may remove parameters that are critical to the subsequent representation conversion, so that the quantization stage cannot accurately adapt to the distribution of the retained parameters. This introduces a large approximation error and ultimately causes a significant degradation in the performance of the compressed model.

Disclosure of Invention

The application provides a model parameter compression method, apparatus, device, and storage medium for a large language model, which can reduce the storage and computational complexity of the large language model while maintaining its performance.
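
As a concrete illustration of the synergy argued for above, the sketch below clips activation outliers before scoring, so that the importance scores, and therefore the weights passed on to quantization, are computed from an outlier-free activation distribution. It assumes PyTorch; the percentile bounds and the 50% per-row sparsity are illustrative assumptions, not values fixed by the patent.

```python
# Minimal clip-then-prune sketch for one layer: `weight` is the
# (out_features, in_features) parameter tensor, `acts` the cached
# (tokens, in_features) input activations, and `clip_layer` this layer's
# entry in the clipping decision vector. Thresholds are assumptions.
import torch

def clip_and_prune(weight, acts, clip_layer, lo_q=0.01, hi_q=0.99,
                   sparsity=0.5):
    if clip_layer:
        # Outlier clipping: clamp activations into percentile bounds.
        lo = torch.quantile(acts.float(), lo_q)
        hi = torch.quantile(acts.float(), hi_q)
        acts = acts.clamp(lo, hi)
    # Importance score: squared weight times the L2 norm of the clipped
    # activations on the matching input channel.
    score = weight.pow(2) * acts.norm(dim=0)
    # Prune the lowest-scoring fraction of weights in each output row.
    k = int(weight.shape[1] * sparsity)
    idx = score.argsort(dim=1)[:, :k]
    pruned = weight.clone()
    pruned.scatter_(1, idx, 0.0)
    return pruned
```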
To achieve the above object, a first aspect of the embodiments of the present application provides a method for compressing model parameters of a large language model, the method comprising: acquiring verification data and inputting the verification data into a preset large language model for forward propagation, so as to acquire the input activation tensor received by each network layer in the large language model; acquiring the inference perplexity of the large language model on the verification data, and performing an iterative clipping decision over the input activation tensors of the network layers with minimization of the inference perplexity as the optimization target, so as to obtain clipping decision vectors corresponding to a plurality of network layers; performing, based on the clipping decision vectors, outlier clipping processing on the input activation tensors of the corresponding network layers in the large language model, so as to obtain the processed target activation tensor of each network layer; for each network layer of the large language model, acquiring an initial model parameter tensor, calculating an importance score of each model parameter in the initial model parameter tensor based on the corresponding target activation tensor, and pruning the initial model parameter tensor according to the importance scores of the model parameters, so as to obtain an intermediate model parameter tensor; and quantizing the intermediate model parameter tensor corresponding to each network layer of the large language model, so as to obtain a target large language model.

Accordingly, a second aspect of the embodiments of the present application provides a model parameter compression apparatus for a large language model, the apparatus comprising: an acquisition module, configured to acquire verification data and input the verification data into a preset large language model for forward propagation, so as to acquire the input activation tensor received by each network layer in the large language model; a decision module, configured to acquire the inference perplexity of the large language model on the verification data and perform an iterative clipping decision over the input activation tensors of the network layers with minimization of the inference perplexity as the optimization target, so as to obtain clipping decision vectors corresponding to a plurality of network layers; a clipping module, configured to perform outlier clipping processing on the input activation tensors of the corresponding network layers in the large language model based on the clipping decision vectors, so as to obtain the processed target activation tensor of each network layer; a pruning module, configured to acquire, for each network layer of the large language model, an initial model parameter tensor, calculate an importance score of each model parameter in the initial model parameter tensor based on the corresponding target activation tensor, and prune the initial model parameter tensor according to the importance scores of the model parameters, so as to obtain an intermediate model parameter tensor; and a quantization module, configured to quantize the intermediate model parameter tensor corresponding to each network layer of the large language model, so as to obtain the target large language model.
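
The quantization stage of the first aspect (detailed in claims 6 and 7) divides each intermediate parameter tensor into vector blocks along the output feature dimension and quantizes each block from its own floating-point range. The sketch below assumes PyTorch, one vector block per output-feature row, and an illustrative 4-bit width; none of these specifics are fixed by the patent text.

```python
# Minimal block-wise asymmetric quantization sketch (claims 6-7):
# each output-feature row of `weight` is treated as one vector block.
import torch

def quantize_rows(weight, bits=4):
    qmax = 2 ** bits - 1
    w_min = weight.min(dim=1, keepdim=True).values   # per-block min float
    w_max = weight.max(dim=1, keepdim=True).values   # per-block max float
    scale = (w_max - w_min).clamp(min=1e-8) / qmax   # scale factor
    zero = (-w_min / scale).round()                  # per-block zero point
    ratio = weight / scale                           # target ratio
    q = (ratio + zero).round().clamp(0, qmax)        # clipping function
    return q, scale, zero
```

De-quantizing with `(q - zero) * scale` recovers an approximation of each block, which is what the compressed target model computes with at inference time.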