CN-122021745-A - Quantization method, system, medium, and device for a mixture-of-experts large language model
Abstract
The application provides a quantization method, system, medium, and device for a mixture-of-experts (MoE) large language model. The method comprises: performing forward propagation through the MoE large language model to be quantized and determining the activation values of each layer; determining the Fisher information of each expert weight from the activation values of each layer; performing linear weighted fusion of the expert weights according to their Fisher information to determine a shared base weight; taking the difference between each expert weight and the shared base weight as that expert's residual weight; performing weight binarization on the shared base weight using a preset alternating refinement binarization strategy; performing weight binarization on each expert's residual weight using a combination of a sign matrix and scaling factors; and thereby determining the quantized MoE large language model. The application achieves effective separation and compression of the experts' common and individual information, effectively balances compression accuracy against compression ratio, and reduces computation and deployment costs.
Inventors
- Zhang Yulun
- Zhang Tianao
- Kong Linghe
- Yang Xiaokang
Assignees
- Shanghai Jiao Tong University (上海交通大学)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-09
Claims (10)
- 1. A method for quantizing a mixture-of-experts large language model, comprising: inputting a preset number of input samples into the mixture-of-experts large language model to be quantized, performing forward propagation, and determining the activation values of each layer of the model; determining the Fisher information of each expert weight according to the activation values of each layer; performing linear weighted fusion of the expert weights according to their Fisher information to determine a shared base weight; taking the difference between each expert weight and the shared base weight as that expert's residual weight; and performing weight binarization on the shared base weight using a preset alternating refinement binarization strategy, performing weight binarization on each expert's residual weight using a combination of a sign matrix and scaling factors, and determining the quantized mixture-of-experts large language model (illustrative sketches of these steps follow the claims).
- 2. The method for quantizing a mixture-of-experts large language model according to claim 1, further comprising: performing an importance analysis of each expert's residual weight according to the Fisher information of each expert weight, and determining the importance of each expert's residual weight; and allocating a binarization order to the residual weights according to their importance, determining the binarization order of each expert's residual weight.
- 3. The method for quantizing a mixture-of-experts large language model according to claim 1, wherein determining the Fisher information of each expert weight according to the activation values of each layer comprises: constructing a gradient approximation of each expert weight from the layer's input activations and the expert's output activations; taking the element-wise square of the gradient approximation as a preliminary weight-sensitivity estimate for each expert weight; normalizing the preliminary estimate by the overall magnitude of the layer's input activations to determine the weight-sensitivity estimate of each expert weight; and averaging the weight-sensitivity estimates to determine the Fisher information of each expert weight (see the Fisher-information sketch after the claims).
- 4. The method for quantizing a mixture-of-experts large language model according to claim 1, wherein performing weight binarization on the shared base weight using a preset alternating refinement binarization strategy comprises: performing an initial binarization of the shared base weight to generate a first sign matrix, an initial first row scaling factor, and an initial first column scaling factor; determining an initial binarized approximation of the shared base weight from the first sign matrix, the initial first row scaling factor, and the initial first column scaling factor; constructing a first residual term from the reconstruction error between the shared base weight and the initial binarized approximation; in each subsequent refinement round, taking the first residual term of the previous round as the optimization target, using a greedy parameter-update strategy to select the update direction that most reduces the first residual term, and updating the first sign matrix and the first row and column scaling factors to determine a new first sign matrix and new first row and column scaling factors; determining a new binarized approximation of the shared base weight in that round from the new sign matrix and scaling factors; constructing a new first residual term from the reconstruction error between the shared base weight and the new binarized approximation; and repeating the alternating refinement rounds until the difference between the binarized approximations of two adjacent rounds is smaller than a preset first threshold or a preset first iteration count is reached, thereby determining the binarization result of the shared base weight (see the alternating-refinement sketch after the claims).
- 5. The method for quantizing a mixture-of-experts large language model according to claim 2, wherein performing weight binarization on each expert's residual weight by combining a sign matrix with scaling factors comprises: performing an initial binarization of each expert's residual weight to generate a second sign matrix, an initial second row scaling factor, and an initial second column scaling factor; determining an initial binarized approximation of each expert's residual weight from the second sign matrix, the initial second row scaling factor, and the initial second column scaling factor; constructing a second residual term from the reconstruction error between the residual weight and its initial binarized approximation; in each subsequent refinement round, taking the second residual term of the previous round as the optimization target, using a greedy parameter-update strategy to select the update direction that most reduces the second residual term, and updating the second sign matrix and the second row and column scaling factors to determine a new second sign matrix and new second row and column scaling factors; determining a new binarized approximation of each expert's residual weight in that round from the new sign matrix and scaling factors; constructing a new second residual term from the reconstruction error between the residual weight and its new binarized approximation; and repeating the alternating refinement rounds until the difference between the binarized approximations of two adjacent rounds is smaller than a preset second threshold or a preset second iteration count is reached, thereby determining the binarization result of each expert's residual weight (the alternating-refinement sketch after the claims applies here as well).
- 6. A quantization system for a mixture-of-experts large language model, comprising: an activation value acquisition module for inputting a preset number of input samples into the mixture-of-experts large language model to be quantized, performing forward propagation, and determining the activation values of each layer of the model; a Fisher information extraction module for determining the Fisher information of each expert weight according to the activation values of each layer; a shared base weight determination module for performing linear weighted fusion of the expert weights according to their Fisher information to determine the shared base weight; a residual weight determination module for taking the difference between each expert weight and the shared base weight as that expert's residual weight; and a weight binarization processing module for performing weight binarization on the shared base weight using a preset alternating refinement binarization strategy, performing weight binarization on each expert's residual weight using a combination of a sign matrix and scaling factors, and determining the quantized mixture-of-experts large language model.
- 7. A method for generating text based on a mixture-of-experts large language model, comprising: acquiring a natural language prompt, a context sequence, an instruction text, a historical dialogue, and a text generation model to be quantized; quantizing the text generation model using the quantization method for a mixture-of-experts large language model according to any one of claims 1-5, and determining a quantized text generation model; and inputting the natural language prompt, the context sequence, the instruction text, and the historical dialogue into the quantized text generation model to determine the text generation result.
- 8. A knowledge extraction method based on a mixture-of-experts large language model, comprising: acquiring a text to be analyzed, a factual question, an entity-relation query, a retrieval-type prompt, a context text, and a knowledge extraction model; quantizing the knowledge extraction model using the quantization method for a mixture-of-experts large language model according to any one of claims 1-5, and determining a quantized knowledge extraction model; and inputting the text to be analyzed, the factual question, the entity-relation query, the retrieval-type prompt, and the context text into the quantized knowledge extraction model to determine the extracted knowledge.
- 9. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-8.
- 10. An electronic device, comprising: a memory having a computer program stored thereon; and a processor for executing the computer program in the memory to implement the steps of the method according to any one of claims 1-8.
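The claims above describe the pipeline in prose; the sketches below illustrate its main steps. First, a minimal numpy sketch of the Fisher-information estimate of claim 3. The rank-1 gradient approximation, the normalization by the squared input norm, and all names (`expert_fisher_information`, `layer_inputs`, `expert_outputs`) are illustrative assumptions, not the patent's exact formulation.

```python
import numpy as np

def expert_fisher_information(layer_inputs, expert_outputs):
    """Illustrative diagonal-Fisher estimate for one expert weight (claim 3).

    layer_inputs   : list of (d_in,) input activations of the layer
    expert_outputs : list of (d_out,) output activations of the expert
    Returns a (d_out, d_in) sensitivity map averaged over the samples.
    """
    fisher = np.zeros((expert_outputs[0].size, layer_inputs[0].size))
    for x, y in zip(layer_inputs, expert_outputs):
        grad_approx = np.outer(y, x)                   # gradient approximation
        sensitivity = grad_approx ** 2                 # element-wise square
        sensitivity /= np.linalg.norm(x) ** 2 + 1e-12  # normalize by input magnitude
        fisher += sensitivity
    return fisher / len(layer_inputs)                  # average over samples
```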
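Next, a sketch of the Fisher-weighted fusion and residual decomposition of claim 1. It assumes scalar per-expert fusion coefficients derived from mean Fisher information; an element-wise weighting would be an equally plausible reading of the claim.

```python
import numpy as np

def fuse_and_decompose(expert_weights, expert_fisher):
    """Shared base weight via Fisher-weighted linear fusion (claim 1 sketch).

    expert_weights : list of (d_out, d_in) full-precision expert weights
    expert_fisher  : list of matching (d_out, d_in) Fisher estimates
    """
    # Hypothetical scalar fusion coefficients: experts with higher mean
    # Fisher information contribute more to the shared base weight.
    scores = np.array([f.mean() for f in expert_fisher])
    coeffs = scores / scores.sum()
    shared_base = sum(a * w for a, w in zip(coeffs, expert_weights))
    # Each expert keeps only its deviation from the shared base weight.
    residuals = [w - shared_base for w in expert_weights]
    return shared_base, residuals
```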
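Finally, a sketch of the alternating refinement of claims 4 and 5 for a single matrix, approximating W by a sign matrix scaled by row and column factors. Closed-form least-squares refits stand in for the greedy largest-descent update named in the claims, and the stopping rule mirrors the threshold/iteration-count criterion; all names are illustrative.

```python
import numpy as np

def alt_refine_binarize(W, max_iters=20, tol=1e-6):
    """Alternating-refinement binarization of one matrix (claims 4/5 sketch).

    Approximates W by (r c^T) * B with B in {-1, +1}; the row factor r and
    column factor c are refined alternately until the reconstruction changes
    by less than `tol` or `max_iters` rounds are reached.
    """
    B = np.where(W >= 0, 1.0, -1.0)          # initial sign matrix
    r = np.abs(W).mean(axis=1)               # initial row scaling factor
    c = np.ones(W.shape[1])                  # initial column scaling factor
    prev = r[:, None] * c[None, :] * B
    for _ in range(max_iters):
        # Closed-form least-squares refits; the patent's greedy strategy
        # would instead pick the single update that most reduces the
        # reconstruction-error residual term in each round.
        r = (W * B * c[None, :]).sum(axis=1) / ((c ** 2).sum() + 1e-12)
        c = (W * B * r[:, None]).sum(axis=0) / ((r ** 2).sum() + 1e-12)
        B = np.where(W * r[:, None] * c[None, :] >= 0, 1.0, -1.0)
        approx = r[:, None] * c[None, :] * B
        if np.abs(approx - prev).max() < tol:  # adjacent-round difference
            break
        prev = approx
    return B, r, c
```

Applying the same routine to the shared base weight and to each expert residual, optionally in the importance order of claim 2, yields the binarized components of the quantized model.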
Description
Quantization method, system, medium, and device for a mixture-of-experts large language model
Technical Field
The application relates to the technical field of model quantization, and in particular to a quantization method, system, medium, and device for a mixture-of-experts large language model.
Background
In recent years, with the development and continuous evolution of the Transformer architecture, large language models (LLMs) have made breakthroughs in the field of natural language processing (NLP). Representative models, from the BERT series to LLaMA, have rapidly scaled from hundreds of millions to billions of parameters, raising the performance ceiling on tasks such as text generation, question answering, translation, and code generation. This model-driven paradigm shift benefits from the strong ability of very-large-scale parameters to fit linguistic knowledge and to understand context. However, the jump in performance comes with significant computational and resource costs. Mainstream LLMs place very high demands on memory and computing power during training and inference. For example, LLaMA-70B requires more than 150 GB of memory at inference time, far exceeding the capacity of most devices. As model depth, hidden dimensions, and context lengths continue to grow, LLMs expose significant deployment bottlenecks in multi-round interaction, long-text generation, low-latency response, and similar tasks. On mobile terminals, edge devices, and in low-power environments in particular, the model's storage footprint and computation cost become the core barriers to large-scale adoption. Therefore, how to compress the storage and inference cost of LLMs while preserving as much of their expressive power as possible has become one of the important issues in current research.
To alleviate these challenges, researchers have proposed a variety of model-compression routes, including weight quantization, low-rank decomposition, structured pruning, and knowledge distillation. Binarization, as an extreme form of quantization, compresses model parameters to 1 bit, greatly reducing storage and computation costs and theoretically yielding a 32× compression ratio. Compared with quantization methods that require retraining, post-training quantization (PTQ) is currently mainstream because it is efficient, lightweight, and requires no back-propagation. Recent methods such as BiLLM and ARB-LLM have significantly narrowed the performance gap between binarized and full-precision models through strategies such as mixed precision, alternating optimization, and distribution alignment. For example, patent CN118036661B ("Mixed-precision quantization method and device for a large language model, electronic device, and medium") separates outliers from normal data, assigns an appropriate bit width according to weight sensitivity, and improves model accuracy and hardware utilization through mixed-precision quantization, thereby reducing the gap between binarized and full-precision models.
Meanwhile, the mixture-of-experts (MoE) architecture has become one of the key designs for improving model capacity and computational efficiency. By introducing multiple expert sub-networks and activating only a small number of experts in each forward pass, an MoE model achieves the modeling capability of a very large parameter space at a lower computational cost. This sparse activation mechanism lets an MoE model keep nearly linear computational overhead while expanding its parameter count. Typical MoE models, such as Switch Transformer, Mixtral, DeepSeek-MoE, and Phi-MoE, have made significant breakthroughs in language modeling, dialogue generation, and other tasks. However, although the MoE architecture is sparse at the inference-computation level, its overall parameter volume is still enormous. Each expert sub-network is essentially a complete feed-forward module, and all expert parameters must be preloaded or resident in GPU memory, resulting in high storage and loading costs. This parameter redundancy severely limits the adoption of MoE models in lightweight deployment and resource-constrained environments. For this reason, existing work has attempted to compress MoE models by combining two types of methods, including expert pruning. Pruning methods, such as MoE-I2 and NAEE, reduce the number of parameters