CN-119721236-B - Large model reasoning optimization method and system based on lightweight gating mechanism
Abstract
The invention discloses a large model reasoning optimization method and system based on a lightweight gating mechanism, which aim to improve the efficiency of large-scale pre-trained models on reasoning tasks and to markedly reduce computing-resource consumption. The method initializes a pre-trained model, introduces a dynamic routing mechanism into the reasoning process, and combines lightweight gating with threshold judgment to screen the key layer outputs during reasoning. The output of each layer is evaluated dynamically: a gating LSTM unit produces a gating output, and a sigmoid activation converts it into an importance score for the layer. If the score falls below a preset threshold, the layer's computation is skipped to reduce unnecessary overhead. This optimization strategy adaptively judges the importance of each layer's output, cutting redundant computation and raising reasoning speed while preserving the quality of the model's output. The method suits a wide range of task scenarios, including natural language processing and image generation, and therefore has broad application value.
Inventors
- CHENG CUIPING
- GAO XIANG
Assignees
- 之江实验室 (Zhejiang Lab)
Dates
- Publication Date
- 20260508
- Application Date
- 20241113
Claims (9)
- 1. A large model reasoning optimization method based on a lightweight gating mechanism, characterized by comprising the following steps: (1) pre-training the large model on a large-scale data set to generate a pre-trained model suitable for a reasoning task; (2) quantitatively evaluating the importance of each layer's output of the pre-trained model with a lightweight gating mechanism to obtain a confidence or importance score for each layer's output; (3) dynamically deciding whether to skip each layer's computation based on its confidence or importance score; (4) performing gating evaluation on each layer's output through a gating LSTM unit and screening the key output features, skipping the layer and subsequent partial computation if the output importance score is below a set threshold; the gating mechanism adopts a lightweight neural network structure and uses a lightweight gating LSTM to decide whether to skip the current layer's computation, generating from the input features a gating value between 0 and 1 processed through the torch.sigmoid() function; (5) executing the reasoning task according to the screened key output features, thereby improving reasoning speed.
- 2. The method of claim 1, wherein the pre-trained model is implemented as a causal language model with a transformer structure, including the BERT, GPT, and QWen model structures.
- 3. The method for optimizing large model inference based on a lightweight gating mechanism as claimed in claim 1, wherein step (3) comprises defining the specific layers processed by the dynamic routing mechanism and selecting the layers to be dynamically routed through the selected_layers parameter; the dynamic routing mechanism allows users or developers to choose specific processing layers at different levels of the model, so that during inference the model selectively skips certain layers according to task requirements, or retains specific layers, to ensure output accuracy.
- 4. The method for large model inference optimization based on a lightweight gating mechanism as claimed in claim 1, wherein in step (4) the layer's output is linearly transformed to adjust its dimension to the input requirement of the LSTM, specifically comprising: A. using a linear transformation to adjust the dimension of each layer's output to match the LSTM input dimension; B. realizing the linear transformation as a matrix multiplication, i.e. transforming the layer output through a weight matrix, whose parameters are updated automatically during training to ensure their suitability for the input.
- 5. The method of claim 1, wherein during layer-by-layer forward propagation the output of each layer is processed by a dynamic routing function; if a layer's output is judged to be of low importance, i.e. its gating value is smaller than a set threshold, that layer's computation is skipped and the previous layer's output is used instead, the threshold being dynamically adjusted according to the complexity of the reasoning task.
- 6. The method for optimizing large model reasoning based on a lightweight gating mechanism as claimed in claim 1, wherein, when the reasoning task is a text generation task, the generated text is evaluated for diversity and fluency according to the screening result of each layer.
- 7. A lightweight-gating-mechanism-based large model inference optimization system implementing the method of claim 1, the system comprising: a pre-training module for pre-training the large model on the large-scale data set; an importance evaluation module embedding the lightweight gating mechanism and quantitatively evaluating the importance of each layer's output; a dynamic layer-skipping module for dynamically skipping the computation of certain layers according to the output confidence or importance score; a gating module evaluating each layer's output through the lightweight gating mechanism and judging whether to skip that layer's computation; and a reasoning execution module for executing the reasoning task according to the screened important output information.
- 8. An electronic device comprising a memory and a processor, wherein the memory is coupled to the processor, wherein the memory is configured to store program data, and wherein the processor is configured to execute the program data to implement the lightweight gating mechanism-based large model inference optimization method of any of claims 1-6.
- 9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements a lightweight gating mechanism based large model inference optimization method as claimed in any one of claims 1-6.
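The gating evaluation described in claims 1 and 4 — a linear projection matching the layer output to the LSTM input dimension, a lightweight gating LSTM, and a torch.sigmoid() activation producing a gating value between 0 and 1 — can be sketched as follows. This is a minimal illustration under assumptions, not the patented implementation: the class name `LayerGate`, the hidden size `gate_dim`, and the threshold 0.5 are hypothetical.

```python
import torch
import torch.nn as nn

class LayerGate(nn.Module):
    """Lightweight gate: linear projection + gating LSTM + sigmoid score.

    Hypothetical sketch of the mechanism in claims 1 and 4; names and
    dimensions are assumed, not taken from the patent.
    """
    def __init__(self, layer_dim: int, gate_dim: int = 32):
        super().__init__()
        # A. linear transformation adjusts the layer output dimension
        #    to the LSTM input dimension (claim 4)
        self.proj = nn.Linear(layer_dim, gate_dim)
        # lightweight gating LSTM (claim 1, step 4)
        self.lstm = nn.LSTM(gate_dim, gate_dim, batch_first=True)
        self.score = nn.Linear(gate_dim, 1)

    def forward(self, layer_out: torch.Tensor) -> torch.Tensor:
        # layer_out: (batch, seq_len, layer_dim)
        x = self.proj(layer_out)
        _, (h, _) = self.lstm(x)           # h: (1, batch, gate_dim)
        # gating value in (0, 1) via torch.sigmoid(), as named in claim 1
        return torch.sigmoid(self.score(h[-1])).mean()

gate = LayerGate(layer_dim=768)
hidden = torch.randn(2, 16, 768)           # stand-in for a layer's output
g = gate(hidden)
skip = g.item() < 0.5                      # skip the layer if below threshold
```

In a full pipeline, `skip` would decide whether the layer's computation is bypassed and the previous output reused, as claim 5 describes.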
Description
Large model reasoning optimization method and system based on lightweight gating mechanism

Technical Field
The invention relates to the field of artificial intelligence, in particular to a large model reasoning optimization method and system based on a lightweight gating mechanism, which aims to improve the reasoning speed of large-scale pre-trained language models and reduce computing-resource consumption.

Background
Current large-scale language models (such as GPT and BERT) perform excellently on many tasks, but their large number of layers makes the computing-resource demand of model inference very high, so inference efficiency in practical applications is low and response speed suffers. Existing large model inference optimization techniques such as pruning and quantization can partially improve inference speed, but still suffer from large computation, long inference time, and performance degradation. How to raise reasoning speed without significantly affecting the quality of model generation has therefore become a hotspot of current research.

Disclosure of the Invention
The invention aims to overcome the defects of the prior art by providing a large model reasoning optimization method and system based on a lightweight gating mechanism. It uses a dynamic routing mechanism during inference to selectively skip certain network layers, effectively reducing the amount of computation and improving reasoning speed and resource utilization while maintaining model performance and accuracy.
In order to achieve the above purpose, the invention provides a large model reasoning optimization method based on a lightweight gating mechanism, comprising the following steps: (1) pre-training the large model on a large-scale data set to generate a pre-trained model suitable for a reasoning task; (2) quantitatively evaluating the importance of each layer's output of the pre-trained model with a lightweight gating mechanism to obtain a confidence or importance score for each layer's output; (3) dynamically deciding whether to skip each layer's computation based on its confidence or importance score; (4) performing gating evaluation on each layer's output through a gating LSTM unit and screening the key output features; (5) executing the reasoning task according to the screened key output features, thereby improving reasoning speed.

Further, the pre-trained model is implemented as a causal language model with a transformer structure, including the BERT, GPT, and QWen model structures.

Further, step (3) comprises defining the specific layers processed by the dynamic routing mechanism and selecting the layers to be dynamically routed through the selected_layers parameter, enabling users or developers to choose specific processing layers at different levels of the model; through the dynamic routing mechanism, the model can selectively skip certain layers according to task requirements during inference, or retain specific layers, to ensure output accuracy.

Further, in step (4), the layer's output is linearly transformed to adjust its dimension to the input requirement of the LSTM, specifically comprising: A. using a linear transformation to adjust the dimension of each layer's output to match the LSTM input dimension; B. realizing the linear transformation as a matrix multiplication, i.e. transforming the layer output through a weight matrix, whose parameters are updated automatically during training to ensure their suitability for the input.

Further, in step (4), the gating mechanism adopts a lightweight neural network structure and uses a lightweight gating LSTM to decide whether to skip the current layer's computation, generating from the input features a gating value between 0 and 1 processed through the torch.sigmoid() function; if the gating value is smaller than the set threshold, the layer's computation is skipped and None is returned, otherwise the layer's computation continues.

Further, during layer-by-layer forward propagation, the output of each layer is processed through a dynamic routing function; if a layer's output is judged to be of low importance, i.e. its gating value is smaller than the set threshold, that layer's computation is skipped and the previous layer's output is used instead, the threshold being dynamically adjusted according to the complexity of the reasoning task. Further, when the reaso
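The layer-by-layer forward propagation with dynamic routing described above can be sketched as below. This is a hedged toy illustration, not the patented implementation: the function name `dynamic_forward`, the `nn.Linear` stand-in layers, and the fixed gate value are assumptions introduced for demonstration.

```python
import torch
import torch.nn as nn

def dynamic_forward(hidden, layers, gates, selected_layers, threshold=0.5):
    """Forward pass that skips low-importance layers (hypothetical sketch).

    Layers listed in selected_layers are gated; when a layer's gating
    value (in (0, 1), produced via torch.sigmoid) falls below the
    threshold, that layer's computation is skipped and the previous
    layer's output is carried forward unchanged.
    """
    for i, layer in enumerate(layers):
        if i in selected_layers and gates[i](hidden).item() < threshold:
            continue  # low importance: reuse the previous layer's output
        hidden = layer(hidden)
    return hidden

# Toy demo: four layers; layer 2 is gated with sigmoid(-3) ~ 0.047 < 0.5,
# so its computation is skipped.
layers = [nn.Linear(8, 8) for _ in range(4)]
gates = {2: lambda h: torch.sigmoid(torch.tensor(-3.0))}
out = dynamic_forward(torch.randn(1, 8), layers, gates, selected_layers={2})
```

In practice the gate would be a learned module (such as the gating LSTM of step (4)) rather than a constant, and the threshold would be adjusted with task complexity as the description states.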