
CN-121981256-A - Large-model end-side inference optimization method, device, equipment and storage medium

CN121981256A

Abstract

The application relates to a large-model end-side inference optimization method, device, equipment, and storage medium. The method comprises: predicting a dynamic sparsity ratio according to the user input information of the current layer of a large model, and reconstructing the generated standard attention matrix based on the dynamic sparsity ratio to obtain a sparse attention matrix, wherein the standard attention matrix is generated based on the user input information; performing dynamic step-size quantization on the weight parameters of the sparse attention matrix based on back propagation, performing mixed-precision allocation of the weight parameters across different layers according to the quantization results, and performing large-model inference computation based on the precision-adjusted weight parameters to generate a key-value cache matrix; and compressing the key-value cache matrix in real time based on entropy coding. The end-side inference method provided by the application offers lower computational complexity, stronger adaptive capability, higher inference precision, and a smaller memory footprint.

Inventors

  • WU ZHENJIE

Assignees

  • Fibocom Wireless Inc. (深圳市广和通无线股份有限公司)

Dates

Publication Date
2026-05-05
Application Date
2025-12-26

Claims (10)

  1. A large-model end-side inference optimization method, characterized by comprising the following steps: predicting a dynamic sparsity ratio according to the user input information of the current layer of a large model, and reconstructing the generated standard attention matrix based on the dynamic sparsity ratio to obtain a sparse attention matrix, wherein the standard attention matrix is generated based on the user input information; performing dynamic step-size quantization on the weight parameters of the sparse attention matrix based on back propagation, performing mixed-precision allocation of the weight parameters across different layers according to the quantization results, and performing large-model inference computation based on the precision-adjusted weight parameters to generate a key-value cache matrix; and compressing the key-value cache matrix in real time based on entropy coding.
  2. The large-model end-side inference optimization method according to claim 1, wherein before predicting the dynamic sparsity ratio according to the user input information of the current layer of the large model and reconstructing the generated standard attention matrix based on the dynamic sparsity ratio, the method further comprises: acquiring the user input information of the large model, wherein the user input information comprises user input text; converting the user input text into a plurality of minimum semantic text units through a tokenizer, and performing vector representation of the minimum semantic text units to generate a user semantic input vector sequence, wherein the user semantic input vector sequence comprises query vectors, key vectors and value vectors; performing linear transformation on the user semantic input vector sequence to generate the standard attention matrix; and calculating the standard attention matrix according to a query vector matrix, a key vector matrix and the vector dimension, wherein the query vector matrix is constructed based on all the query vectors in the user semantic input vector sequence, and the key vector matrix is constructed based on all the key vectors in the user semantic input vector sequence.
  3. The large-model end-side inference optimization method according to claim 2, wherein predicting the dynamic sparsity ratio according to the user input information of the current layer of the large model and reconstructing the generated standard attention matrix based on the dynamic sparsity ratio to obtain the sparse attention matrix comprises: concatenating the features of the query vector matrix and the key vector matrix of the current layer of the large model to obtain concatenated features; inputting the concatenated features into a sparsity-ratio prediction model for calculation to obtain the dynamic sparsity ratio; calculating, based on the dynamic sparsity ratio, the Top-k maximum-value indices of each row in the calculated standard attention matrix to generate a binary sparse mask; and performing sparsification on the calculated standard attention matrix based on the binary sparse mask to obtain the sparse attention matrix.
  4. The large-model end-side inference optimization method according to claim 2, wherein performing dynamic step-size quantization on the weight parameters of the sparse attention matrix based on back propagation, performing mixed-precision allocation of the weight parameters across different layers according to the quantization results, performing large-model inference computation based on the precision-adjusted weight parameters, and generating the key-value cache matrix comprises: predefining a quantization function based on a quantization step size and the weight parameter, wherein the quantization function is used to perform precision conversion of the weight parameter through the quantization step size; dynamically adjusting the quantization step size under a differentiable constraint, and calculating, during back propagation, the gradient of the quantization step size according to the dynamic adjustment result of the quantization step size and the quantization function; updating the quantization step size according to its gradient based on the chain rule to obtain the quantization result; and performing mixed-precision allocation of the weight parameters across different layers according to the quantization result, and performing large-model inference computation based on a value vector matrix, the key vector matrix and the precision-adjusted weight parameters to generate the key-value cache matrix, wherein the value vector matrix is constructed based on all the value vectors.
  5. The large-model end-side inference optimization method according to claim 4, wherein performing mixed-precision allocation of the weight parameters across different layers according to the quantization result, performing large-model inference computation based on the value vector matrix, the key vector matrix and the precision-adjusted weight parameters, and generating the key-value cache matrix comprises: performing precision-reduction processing on the weight parameters according to the quantization result to obtain a first precision and a second precision of the weight parameters, wherein the first precision is higher than the second precision; assigning, at deployment time, the first-precision weight parameters to the residual connection layers of the large model and the second-precision weight parameters to the other layers of the large model; and performing large-model inference computation based on the key vector matrix, the value vector matrix and the precision-adjusted weight parameters to generate the key-value cache matrix, wherein the key-value cache matrix comprises the key vector, the value vector, or the combination of the key vector and the value vector of each minimum semantic text unit.
  6. The large-model end-side inference optimization method according to claim 1, wherein compressing the key-value cache matrix in real time based on entropy coding (with the data decompressed, upon decompression, based on a constructed scaling function) comprises: calculating the probability distribution of each row of the key-value cache matrix; and compressing each row of the key-value cache matrix through entropy coding based on the probability distribution to obtain the compressed data of the key-value cache matrix.
  7. The large-model end-side inference optimization method according to claim 6, further comprising, after compressing the key-value cache matrix in real time based on entropy coding: decoding the compressed data through entropy decoding when the large model decompresses the data, to obtain an approximation of the key-value cache matrix; and constructing a scaling function based on the approximation of the key-value cache matrix and a scaling factor, and performing scaled reconstruction of the original data on the approximation of the key-value cache matrix through the scaling function, wherein the scaling factor comprises the maximum value in the key-value cache matrix.
  8. A large-model end-side inference optimization device, characterized in that the device comprises: a sparsification module, configured to predict a dynamic sparsity ratio according to the user input information of the current layer of the large model, and to reconstruct the generated standard attention matrix based on the dynamic sparsity ratio to obtain a sparse attention matrix, wherein the standard attention matrix is generated based on the user input information; a weight adjustment module, configured to perform dynamic step-size quantization on the weight parameters of the sparse attention matrix based on back propagation, perform mixed-precision allocation of the weight parameters across different layers according to the quantization results, and perform large-model inference computation based on the precision-adjusted weight parameters to generate a key-value cache matrix; and a data compression module, configured to compress the key-value cache matrix in real time based on entropy coding.
  9. A computer device comprising a processor, a memory and a network interface, the memory storing machine-readable instructions executable by the processor, characterized in that when the computer device runs, the processor communicates with the memory via the network interface, and the processor executes the machine-readable instructions to perform the steps of the large-model end-side inference optimization method according to any one of claims 1 to 7.
  10. A computer-readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the steps of the large-model end-side inference optimization method according to any one of claims 1 to 7.
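
Claims 1 to 3 describe computing a standard attention matrix and sparsifying it with a binary mask built from the Top-k maximum-value indices of each row, where k follows a predicted dynamic sparsity ratio. The following is a minimal NumPy sketch of that masking step; the sparsity-ratio prediction model itself is omitted, and the function names and the ratio-to-k mapping are illustrative assumptions rather than the patent's implementation:

```python
import numpy as np

def standard_attention_scores(Q, K):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) per row."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def topk_sparse_mask(A, sparse_ratio):
    """Binary mask keeping the Top-k entries of each row of A.

    sparse_ratio is the fraction of entries zeroed out (assumed mapping)."""
    n = A.shape[-1]
    k = max(1, int(round(n * (1.0 - sparse_ratio))))  # entries kept per row
    idx = np.argpartition(-A, k - 1, axis=-1)[:, :k]  # Top-k column indices
    mask = np.zeros_like(A)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    return mask

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
A = standard_attention_scores(Q, K)                   # standard attention matrix
sparse_A = A * topk_sparse_mask(A, sparse_ratio=0.5)  # sparse attention matrix
```

With a sequence length of 4 and a 0.5 sparsity ratio, each row of the sparse matrix keeps exactly its two largest attention weights and zeroes the rest.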

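Claims 4 and 5 describe quantization with a step size that is adjusted under a differentiable constraint and updated from its back-propagated gradient via the chain rule. A hedged sketch in the style of learned-step-size quantization follows, using a straight-through estimator for the non-differentiable rounding; the INT4 range, learning rate, and all names are assumptions, not details taken from the patent:

```python
import numpy as np

def quantize(w, s, qmin=-8, qmax=7):
    """Uniform quantizer: round(clip(w / s)) * s (assumed INT4 range)."""
    return np.clip(np.round(w / s), qmin, qmax) * s

def step_gradient(w, s, qmin=-8, qmax=7):
    """d(quantize)/d(s) with a straight-through estimator for round():
    round(v) - v inside the range, the clip bound outside it."""
    v = w / s
    return np.where(v <= qmin, qmin,
           np.where(v >= qmax, qmax, np.round(v) - v))

def lsq_update(w, s, upstream_grad, lr=1e-2):
    """One chain-rule SGD step on the quantization step size s."""
    grad_s = np.sum(upstream_grad * step_gradient(w, s))
    return s - lr * grad_s
```

After such updates converge per layer, a mixed-precision policy like the one in claim 5 would assign the higher-precision parameters to residual connection layers and the lower-precision ones elsewhere; that allocation is a table lookup over layer names and is not shown here.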
Description

Large-model end-side inference optimization method, device, equipment and storage medium

Technical Field

The application relates to the technical field of large models, and in particular to a large-model end-side inference optimization method, device, equipment, and storage medium.

Background

In the field of large-model end-side inference, the prior art has proposed various schemes for the attention mechanism, model quantization, and KV Cache management. For attention computation, sparse-attention schemes such as Longformer reduce part of the complexity but adapt poorly to dynamic sequence lengths; later random-sparsification improvements introduce some dynamics, yet they still depend on predefined sparsity parameters at the inference stage and cannot adaptively adjust to the characteristics of the input data, which limits further gains in computational efficiency. For model quantization, static quantization cannot adapt to the dynamic input distributions of end-side devices, while dynamic quantization addresses this by computing scaling factors in real time but introduces extra computational overhead, making it difficult to meet the real-time requirements of end-side devices and to deploy in practice. For KV Cache management, existing compression methods obtain block-level pruning strategies via low-rank approximation, but they either sacrifice numerical accuracy or cannot flexibly adapt to the feature distribution of the input sequence, so the memory savings are limited. Therefore, the prior art of large-model end-side inference suffers from poor dynamic adaptability, low error-control precision, and low memory efficiency.
Disclosure of Invention

The application provides a large-model end-side inference optimization method, device, equipment, and storage medium, which are used to solve the technical problems of poor dynamic adaptability, low error-control precision, and low memory efficiency in the prior art of large-model end-side inference. According to one aspect of the embodiments of the application, a large-model end-side inference optimization method is provided, comprising: predicting a dynamic sparsity ratio according to the user input information of the current layer of a large model, and reconstructing the generated standard attention matrix based on the dynamic sparsity ratio to obtain a sparse attention matrix, wherein the standard attention matrix is generated based on the user input information; performing dynamic step-size quantization on the weight parameters of the sparse attention matrix based on back propagation, performing mixed-precision allocation of the weight parameters across different layers according to the quantization results, and performing large-model inference computation based on the precision-adjusted weight parameters to generate a key-value cache matrix; and compressing the key-value cache matrix in real time based on entropy coding.
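The real-time entropy-coding compression of the key-value cache, with decompression rebuilt through a scaling function whose factor is the matrix maximum, can be sketched as below. zlib's DEFLATE (which contains a Huffman entropy-coding stage) is used as a stand-in coder, since the document does not name a specific one; the 7-bit symbol width and all names are illustrative assumptions:

```python
import zlib
import numpy as np

def compress_kv(kv):
    """Entropy-code a KV cache matrix; returns compressed bytes plus metadata."""
    scale = float(np.abs(kv).max()) or 1.0               # scaling factor (matrix max)
    sym = np.round(kv / scale * 127).astype(np.int8)     # integer symbols in [-127, 127]
    # per-row probability distribution of symbols (a real arithmetic coder
    # would consume these; DEFLATE builds its own statistics internally)
    probs = [np.unique(row, return_counts=True)[1] / row.size for row in sym]
    blob = zlib.compress(sym.tobytes(), 9)               # Huffman-based entropy coding
    return blob, scale, kv.shape, probs

def decompress_kv(blob, scale, shape):
    """Decode the symbols and rebuild an approximation via the scaling function."""
    sym = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return sym.astype(np.float32) / 127.0 * scale        # scaled reconstruction

rng = np.random.default_rng(1)
kv = rng.standard_normal((4, 6)).astype(np.float32)      # toy KV cache matrix
blob, scale, shape, probs = compress_kv(kv)
rec = decompress_kv(blob, scale, shape)
```

The round trip is lossy only through the symbol quantization, so the reconstruction error is bounded by half a quantization step, i.e. about scale/254 per entry.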
Optionally, before the dynamic sparsity ratio is predicted according to the user input information of the current layer of the large model and the generated standard attention matrix is reconstructed based on the dynamic sparsity ratio, the method further comprises: acquiring the user input information of the large model, wherein the user input information comprises user input text; converting the user input text into a plurality of minimum semantic text units through a tokenizer, and performing vector representation of the minimum semantic text units to generate a user semantic input vector sequence, wherein the user semantic input vector sequence comprises query vectors, key vectors and value vectors; performing linear transformation on the user semantic input vector sequence to generate the standard attention matrix; and calculating the standard attention matrix according to a query vector matrix, a key vector matrix and the vector dimension, wherein the query vector matrix is constructed based on all the query vectors in the user semantic input vector sequence, and the key vector matrix is constructed based on all the key vectors in the user semantic input vector sequence.
Optionally, predicting the dynamic sparsity ratio according to the user input information of the current layer of the large model and reconstructing the generated standard attention matrix based on the dynamic sparsity ratio to obtain the sparse attention matrix comprises: concatenating the features of the query vector matrix and the key vector matrix of the current layer of the large model to obtain concatenated features; inputting the concatenated features into a sparsity-ratio prediction model for calculation to obtain the dynamic sparsity ratio; and calculating, based on the dynamic sparsity ratio, the Top-k maximum-value indices of each row in the standard attention matrix obtained