CN-122021624-A - Processing method and device of large language model and electronic equipment
Abstract
The invention provides a processing method and apparatus for a large language model, and an electronic device. The large language model comprises a plurality of target layers, the target layers being layers that process sequence data in the large language model architecture. The method comprises: obtaining a text to be processed, and segmenting the text to be processed to obtain at least one token; extracting features of the at least one token by means of the plurality of target layers to obtain feature information corresponding to each target layer, wherein the feature information comprises a structural dimension feature, a time dimension feature and a content dimension feature that respectively correspond to the key-value pair of the token; performing feature fusion on the features of each dimension in the feature information of each target layer to obtain a composite score, wherein the composite score characterizes the importance of the key-value pair corresponding to the text to be processed; and determining a target precision according to the composite score, quantizing the key-value pair corresponding to the token according to the target precision, and storing the quantized key-value pair in a key-value cache.
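The pipeline in the abstract (score each token's key-value pair, map the score to a precision, quantize, store in the cache) can be sketched minimally as follows. This is an illustration only: `score_fn` and `precision_fn` are toy stand-ins, since the abstract does not fix the exact scoring formulas or the score-to-bit-width mapping.

```python
import numpy as np

def process_token(kv, score_fn, precision_fn, cache):
    """Score a token's key-value pair, pick a bit width, quantize, and cache.

    score_fn and precision_fn stand in for the feature-fusion and
    precision-selection steps described in the abstract.
    """
    score = score_fn(kv)                      # composite importance score
    bits = precision_fn(score)                # target precision from the score
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.max(np.abs(kv))), 1e-12) / qmax
    quantized = np.round(kv / scale).astype(np.int32)  # symmetric quantization
    cache.append((quantized, scale, bits))    # store into the key-value cache
    return score, bits

# Toy stand-ins: importance = mean magnitude; score > 0.5 keeps 8 bits, else 4.
cache = []
score, bits = process_token(
    np.array([0.9, -0.7, 0.8], dtype=np.float32),
    score_fn=lambda kv: float(np.mean(np.abs(kv))),
    precision_fn=lambda s: 8 if s > 0.5 else 4,
    cache=cache,
)
```

The cache entry keeps the scale and bit width alongside the quantized tensor so the pair can later be restored before inference.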
Inventors
- LIU BING
- LI LEI
- LI YUBO
Assignees
- Lenovo (Beijing) Co., Ltd. (联想(北京)有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-24
Claims (10)
- 1. A processing method for a large language model, the large language model comprising a plurality of target layers, the target layers being layers that process sequence data in the large language model architecture, the method comprising: obtaining a text to be processed, and segmenting the text to be processed to obtain at least one token; extracting features of the at least one token by means of the plurality of target layers to obtain feature information corresponding to each target layer, wherein the feature information comprises a structural dimension feature, a time dimension feature and a content dimension feature that respectively correspond to the key-value pair of the token; performing feature fusion on the features of each dimension in the feature information of each target layer to obtain a composite score, wherein the composite score characterizes the importance of the key-value pair corresponding to the text to be processed; and determining a target precision according to the composite score, quantizing the key-value pair corresponding to the token according to the target precision, and storing the quantized key-value pair in a key-value cache.
- 2. The method of claim 1, wherein extracting features of the at least one token by means of the plurality of target layers to obtain feature information corresponding to each target layer comprises: after each processing layer generates the key-value pair corresponding to each token, calculating the variance of the L2 norms of each key-value pair as the structural dimension feature; determining the time dimension feature of each key-value pair according to a global timestamp and the generation time; generating, by each processing layer, the key-value pair corresponding to each token, and generating a key-value matrix from the obtained key-value pairs; and determining the information entropy of each key-value pair according to the key-value matrix, and determining the content dimension feature according to the information entropy; wherein the structural dimension feature, the time dimension feature and the content dimension feature reflect the importance of the key-value pair, and the more recent the generation time, the higher the importance reflected by the time dimension feature.
- 3. The method according to claim 1, wherein performing feature fusion on the features of each dimension in the feature information of each target layer comprises: obtaining a first weight, a second weight and a third weight respectively corresponding to the structural dimension feature, the time dimension feature and the content dimension feature; and weighting the structural dimension feature, the time dimension feature and the content dimension feature of each key-value pair according to the first weight, the second weight and the third weight to obtain the composite score.
- 4. The method according to claim 1 or 3, further comprising: normalizing the structural dimension feature, the time dimension feature and the content dimension feature respectively using a processing rule to obtain the normalized structural dimension feature, time dimension feature and content dimension feature, wherein the processing rule unifies the values of the structural dimension feature, the time dimension feature and the content dimension feature to a target interval; correspondingly, performing feature fusion on the features of each dimension in the feature information of each target layer comprises performing feature fusion on the normalized structural dimension feature, time dimension feature and content dimension feature to obtain the composite score.
- 5. The method according to claim 3, further comprising: determining the perplexity of the text to be processed under processing by the target layers, wherein the perplexity characterizes the uncertainty of the target layers' token prediction; and adjusting the second weight and the third weight according to the variation trend of the perplexity.
- 6. The method of claim 5, wherein adjusting the second weight and the third weight according to the variation trend of the perplexity comprises: if the perplexity is rising, increasing the third weight and decreasing the second weight; and if the perplexity is falling, decreasing the third weight and increasing the second weight.
- 7. The method of claim 1, wherein determining a target precision according to the composite score comprises: querying a precision allocation table according to the composite score to determine the target precision corresponding to the composite score, wherein the precision allocation table records the precision corresponding to each score.
- 8. The method of claim 1, further comprising: reading compressed data corresponding to the context of the text to be processed from the key-value cache; dequantizing the compressed data according to the target precision to obtain a dequantization result; and performing inference on the text to be processed using the dequantization result to obtain an inference result.
- 9. A processing apparatus for a large language model, the large language model comprising a plurality of target layers, the target layers being layers of the large language model architecture that process sequence data, the apparatus comprising: an acquisition module, configured to acquire a text to be processed and segment the text to be processed to obtain at least one token; a first processing module, configured to extract features of the at least one token by means of the plurality of target layers to obtain feature information corresponding to each target layer, wherein the feature information comprises a structural dimension feature, a time dimension feature and a content dimension feature that respectively correspond to the key-value pair of the token; a second processing module, configured to perform feature fusion on the features of each dimension in the feature information of each target layer to obtain a composite score; and a third processing module, configured to determine a target precision according to the composite score, quantize the key-value pair corresponding to the token according to the target precision, and store the quantized key-value pair in a key-value cache.
- 10. An electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform: obtaining a text to be processed, and segmenting the text to be processed to obtain at least one token; extracting features of the at least one token by means of a plurality of target layers to obtain feature information corresponding to each target layer, wherein the feature information comprises a structural dimension feature, a time dimension feature and a content dimension feature that respectively correspond to the key-value pair of the token; performing feature fusion on the features of each dimension in the feature information of each target layer to obtain a composite score; and determining a target precision according to the composite score, quantizing the key-value pair corresponding to the token according to the target precision, and storing the quantized key-value pair in a key-value cache.
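Claims 2 through 6 can be read together as one scoring routine: compute three per-pair features, normalize them to a common interval, fuse them with weights, and adjust two of the weights as perplexity trends up or down. The sketch below is an interpretation, not the claimed implementation: the entropy reading of the key-value matrix, the recency decay, the normalization interval, and the adjustment step size are all assumptions.

```python
import numpy as np

def structural_feature(kv_matrix):
    # Claim 2: variance of the L2 norms of the key-value vectors.
    return float(np.var(np.linalg.norm(kv_matrix, axis=-1)))

def time_feature(global_ts, gen_ts):
    # Claim 2: a more recent generation time yields a higher feature value
    # (a simple reciprocal decay is assumed here).
    return 1.0 / (1.0 + (global_ts - gen_ts))

def content_feature(kv_matrix):
    # Claim 2: information entropy of the key-value matrix; magnitudes are
    # normalized into a distribution, which is one possible reading.
    p = np.abs(kv_matrix).ravel()
    p = p / p.sum()
    return float(-np.sum(p * np.log2(p + 1e-12)))

def normalize(values, lo=0.0, hi=1.0):
    # Claim 4: unify feature values to a target interval, here [0, 1].
    v = np.asarray(values, dtype=np.float64)
    span = v.max() - v.min()
    return lo + (hi - lo) * (v - v.min()) / span if span else np.full_like(v, lo)

def composite_score(features, weights):
    # Claim 3: weighted fusion of the three normalized features.
    return float(np.dot(normalize(features), weights))

def adjust_weights(w_time, w_content, perplexities, step=0.05):
    # Claims 5-6: rising perplexity -> raise the content (third) weight and
    # lower the time (second) weight; falling perplexity -> the reverse.
    if len(perplexities) >= 2 and perplexities[-1] > perplexities[-2]:
        return w_time - step, w_content + step
    if len(perplexities) >= 2 and perplexities[-1] < perplexities[-2]:
        return w_time + step, w_content - step
    return w_time, w_content
```

The first weight (structural) is left untouched by `adjust_weights`, matching claims 5 and 6, which only adjust the second and third weights.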
Description
Processing method and device of large language model and electronic equipment

Technical Field

The disclosure relates to the technical field of artificial intelligence, and in particular to a processing method and apparatus for a large language model, and an electronic device.

Background

During autoregressive generation (inference) by a large language model, the key (Key) and value (Value) tensors computed at all previous time steps, i.e., the key-value cache (KV Cache), must be cached to avoid repeated computation. As the length of the generated sequence grows, the memory occupied by the KV Cache grows linearly or even quadratically and becomes the main bottleneck restricting deployment and application of the model on resource-constrained devices (such as a single-GPU machine). The extremely high memory occupation not only limits the batch size, but also causes a significant increase in inference latency due to memory bandwidth bottlenecks and memory swapping, and throughput drops sharply.

One approach is to uniformly quantize all data in the entire KV Cache to a lower bit width (e.g., FP16 -> INT8/INT4). This method is simple and compresses memory, but it is a one-size-fits-all strategy that ignores how differently individual data contribute to the model output. Over-quantizing important features introduces large errors and significantly degrades generation quality (accuracy, fluency). Alternatively, offline mixed-precision quantization fixes a quantization strategy before deployment (for example, allocating a high bit width to some layers of the model and a low bit width to others) by analyzing indicators such as model weight gradients; because the strategy is fixed before deployment and cannot be adjusted, this scheme lacks flexibility.
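The weakness of uniform quantization described above can be demonstrated directly: with per-tensor symmetric INT8, a single large-magnitude value stretches the shared scale and inflates the reconstruction error on every other value. This is a minimal sketch assuming symmetric per-tensor quantization, not any particular deployed scheme.

```python
import numpy as np

def uniform_int8_roundtrip(x):
    # One-size-fits-all symmetric INT8: a single scale for the whole tensor.
    scale = float(np.max(np.abs(x))) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q.astype(np.float32) * scale  # dequantized reconstruction

rng = np.random.default_rng(0)
kv = rng.normal(0.0, 1.0, size=4096).astype(np.float32)
base_err = float(np.mean(np.abs(uniform_int8_roundtrip(kv) - kv)))

kv_outlier = kv.copy()
kv_outlier[0] = 60.0  # one "important" feature with a large magnitude
outlier_err = float(np.mean(np.abs(uniform_int8_roundtrip(kv_outlier) - kv_outlier)))
# The outlier stretches the shared scale, so the mean quantization error on
# the remaining values grows substantially.
```

This is exactly the failure mode the disclosure targets: data that matter differently to the output should not share one quantization budget.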
A third approach dynamically adjusts the quantization strategy based on the characteristics of a single input. Although this can optimize quantization for a particular input, only a single dimension is considered: if the distribution of the input data differs greatly from that of typical data, the quantization range may be distorted, degrading model performance.

Disclosure of Invention

The disclosure provides a processing method and apparatus for a large language model, and an electronic device.

In a first aspect, an embodiment of the present disclosure provides a processing method for a large language model, wherein the large language model comprises a plurality of target layers, the target layers being layers that process sequence data in the large language model architecture, and the method comprises: obtaining a text to be processed, and segmenting the text to be processed to obtain at least one token; extracting features of the at least one token by means of the plurality of target layers to obtain feature information corresponding to each target layer, wherein the feature information comprises a structural dimension feature, a time dimension feature and a content dimension feature that respectively correspond to the key-value pair of the token; performing feature fusion on the features of each dimension in the feature information of each target layer to obtain a composite score, wherein the composite score characterizes the importance of the key-value pair corresponding to the text to be processed; and determining a target precision according to the composite score, quantizing the key-value pair corresponding to the token according to the target precision, and storing the quantized key-value pair in a key-value cache.
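The first aspect ends with precision selection and quantized storage; claims 7 and 8 add the table lookup and the dequantization used before inference. The sketch below assumes a hypothetical three-entry precision allocation table and symmetric quantization; the actual table contents and quantizer are not specified by the disclosure.

```python
import numpy as np

# Hypothetical precision allocation table (claim 7): score threshold -> bit width.
PRECISION_TABLE = [(0.7, 16), (0.4, 8), (0.0, 4)]

def target_precision(score):
    # The highest threshold not exceeding the score determines the bit width.
    for threshold, bits in PRECISION_TABLE:
        if score >= threshold:
            return bits
    return PRECISION_TABLE[-1][1]

def quantize(x, bits):
    # Symmetric quantization to the chosen bit width; 16 bits stays float16.
    if bits >= 16:
        return x.astype(np.float16), 1.0
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.max(np.abs(x))) / qmax
    return np.round(x / scale).astype(np.int32), scale

def dequantize(q, scale, bits):
    # Claim 8: restore the cached pair before inference on the context.
    if bits >= 16:
        return q.astype(np.float32)
    return q.astype(np.float32) * scale

kv = np.array([0.5, -1.0, 0.25], dtype=np.float32)
bits = target_precision(0.85)                    # high-importance pair -> 16
q, scale = quantize(kv, target_precision(0.2))   # low score -> 4 bits
restored = dequantize(q, scale, 4)
```

The round trip through 4 bits bounds the per-element error by the quantization step, which is why low-importance pairs can tolerate the coarser precision.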
In a second aspect, an embodiment of the present disclosure provides a processing apparatus for a large language model, wherein the large language model comprises a plurality of target layers, the target layers being layers that process sequence data in the large language model architecture, and the apparatus comprises: an acquisition module, configured to acquire a text to be processed and segment the text to be processed to obtain at least one token; a first processing module, configured to extract features of the at least one token by means of the plurality of target layers to obtain feature information corresponding to each target layer, wherein the feature information comprises a structural dimension feature, a time dimension feature and a content dimension feature that respectively correspond to the key-value pair of the token; a second processing module, configured to perform feature fusion on the features of each dimension in the feature information of each target layer to obtain a composite score; and a third processing module, configured to determine a target precision according to the composite score, quantize the key-value pair corresponding to the token according to the target precision, and store the quantized key-value pair in a key-value cache.