
CN-122020132-A - Data management method oriented to large language model


Abstract

The invention relates to the field of data processing, and in particular to a data management method oriented to large language models, characterized in that the complete KV cache sequence is divided into a plurality of contiguous sub-blocks of equal width, and two channel features are extracted for each sub-block: a dispersion degree characterizing the local information density and an association concentration characterizing the global importance. A consistency ratio of the two channels is then calculated and fused with them to generate an adaptive retention weight for each sub-block; finally, the cache quota is distributed non-uniformly across the sub-blocks, and the tokens inside each sub-block are filtered, compressed, and reorganized according to their attention values. The invention effectively reduces the GPU memory consumption of long-text inference, suppresses the misjudgments caused by single-feature evaluation, and achieves differentiated, accurate quota allocation. The method removes redundant information on a large scale while retaining high-value semantic anchors to the greatest extent, so that the quality of autoregressive decoding is guaranteed under limited hardware resources.

Inventors

  • XU QI

Assignees

  • Zhongnan Information Technology (Shenzhen) Co., Ltd. (中南信息科技(深圳)有限公司)
  • Yuzhou Zhongnan Information Technology Co., Ltd. (禹州市中南信息科技有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-04-13

Claims (10)

  1. A data management method oriented to a large language model, characterized by comprising the steps of: dividing a key-value cache sequence generated during large language model inference into a plurality of contiguous sub-blocks along the token position dimension; for each sub-block, calculating a dispersion coefficient characterizing the local information density level based on the statistical dispersion of the Euclidean distances of each token's value vector relative to the sub-block mean vector; for each sub-block, calculating an association concentration characterizing the concentration of global semantic association based on the inequality of the distribution, within the sub-block, of each token's accumulated global attention in the attention weight matrix generated during inference; calculating a consistency ratio based on the degree of numerical agreement between the dispersion coefficient and the association concentration, and applying gated modulation, driven by the consistency ratio, to the fused value of the dispersion coefficient and the association concentration to generate a retention weight for each sub-block; and distributing the cache retention quota available to the system non-uniformly across the sub-blocks according to the retention weights, and filtering tokens within each sub-block by attention value to generate a compressed key-value cache sequence.
  2. The data management method oriented to a large language model according to claim 1, wherein the dispersion coefficient is calculated by: performing element-wise mean aggregation over the value vectors of all tokens in the sub-block to obtain a sub-block mean vector; calculating the Euclidean distance between each token's value vector and the sub-block mean vector; computing the arithmetic mean distance and the distance standard deviation over all Euclidean distance values in the sub-block; and taking the ratio of the distance standard deviation to the arithmetic mean distance as the dispersion coefficient of the sub-block.
  3. The data management method oriented to a large language model according to claim 1, wherein the association concentration is calculated by: averaging the attention weight matrix along the attention-head dimension to obtain a mean attention matrix; summing the mean attention matrix along the column direction to obtain an attention value for each token position; and sorting the attention values within each sub-block in ascending order and calculating the sub-block's association concentration from the ordered attention values via a concentration-coefficient formula.
  4. The data management method oriented to a large language model according to claim 1, wherein the consistency ratio is calculated by: taking the smaller of the dispersion coefficient and the association concentration as the numerator and the larger of the two as the denominator, and taking the ratio of the numerator to the denominator as the consistency ratio of the sub-block.
  5. The data management method oriented to a large language model according to claim 1, wherein the retention weight is generated by: characterizing the composite importance level of a sub-block by the arithmetic mean of its dispersion coefficient and association concentration; confidence-weighting the composite importance level with the consistency ratio as a gating coefficient; and introducing a global normalization term so that the retention weights of all sub-blocks sum to the total number of sub-blocks, ensuring that the retention weights only change the quota allocation proportions among the sub-blocks.
  6. The data management method oriented to a large language model according to claim 1, wherein distributing the cache retention quota available to the system non-uniformly across the sub-blocks according to the retention weights comprises: determining a target total number of retained tokens from the currently available GPU memory capacity and the memory footprint of a single token's key-value pair; dividing the target total number of retained tokens by the total number of sub-blocks to obtain a base retention amount; and multiplying each sub-block's retention weight by the base retention amount and rounding the result to obtain that sub-block's retained token count.
  7. The data management method oriented to a large language model according to claim 6, further comprising: in response to a deviation between the sum of the rounded retained token counts of all sub-blocks and the target total number of retained tokens, performing compensation for the deviation, sequentially deducting a positive deviation from the sub-blocks with the lowest retention weights, or sequentially supplementing a negative deviation to the sub-blocks with the highest retention weights, until the total retained count is strictly equal to the target total number of retained tokens.
  8. The data management method oriented to a large language model according to claim 6, wherein filtering tokens by attention value is performed inside each sub-block, comprising: sorting all tokens in the sub-block by attention value from high to low; retaining the key-value cache data of the tokens ranked within the sub-block's retained token count; and rearranging and concatenating the key-value data retained by each sub-block in ascending order of position index in the original sequence to form the compressed key-value cache sequence.
  9. The data management method oriented to a large language model according to claim 1, further comprising: after the dispersion coefficients of all sub-blocks have been calculated, performing outlier detection, and clamping any dispersion coefficient lying outside the range of three standard deviations around the mean of all sub-blocks' dispersion coefficients to the boundary value of that range.
  10. The data management method oriented to a large language model according to claim 1, further comprising: in response to detecting that the length of the sequence formed by the dispersion coefficients is inconsistent with the length of the sequence formed by the association concentrations, triggering degradation logic, discarding the dual-channel feature data of the current batch, and falling back to a degraded mode that performs cache compression on each sub-block with uniform weights.
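By way of illustration, the following is a minimal Python/NumPy sketch of the dispersion coefficient of claim 2, assuming the value vectors of a single sub-block are available as an array of shape (tokens, dim). The function name and the epsilon guard against division by zero are illustrative additions, not part of the claims.

```python
import numpy as np

def dispersion_coefficient(values: np.ndarray, eps: float = 1e-8) -> float:
    """Coefficient of variation of the Euclidean distances from each
    token's value vector to the sub-block mean vector (claim 2)."""
    mean_vec = values.mean(axis=0)                      # element-wise mean over tokens
    dists = np.linalg.norm(values - mean_vec, axis=1)   # per-token Euclidean distance
    return float(dists.std() / (dists.mean() + eps))    # std / mean
```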
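A sketch of the association concentration of claim 3. The claim names only "a coefficient formula based on the ordered attention values"; the Gini coefficient used below is one standard measure of distribution inequality and is an assumption, as are the array shapes.

```python
import numpy as np

def association_concentration(attn: np.ndarray, block: slice,
                              eps: float = 1e-8) -> float:
    """Concentration of accumulated global attention inside one sub-block
    (claim 3). `attn` has shape (heads, seq, seq)."""
    mean_attn = attn.mean(axis=0)          # average over the attention-head dimension
    token_scores = mean_attn.sum(axis=0)   # column-wise sum: attention received per position
    x = np.sort(token_scores[block])       # ascending order inside the sub-block
    n = x.size
    idx = np.arange(1, n + 1)
    # Gini coefficient over the ordered values (assumed concentration formula)
    return float(((2 * idx - n - 1) @ x) / (n * x.sum() + eps))
```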
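A sketch combining the consistency ratio of claim 4 with the retention-weight generation of claim 5; the vectorized form and the epsilon guard are illustrative.

```python
import numpy as np

def retention_weights(disp: np.ndarray, conc: np.ndarray,
                      eps: float = 1e-8) -> np.ndarray:
    """Consistency-gated retention weights (claims 4 and 5);
    `disp` and `conc` hold one value per sub-block."""
    ratio = np.minimum(disp, conc) / (np.maximum(disp, conc) + eps)  # claim 4
    importance = 0.5 * (disp + conc)   # composite importance: arithmetic mean
    gated = ratio * importance         # gating modulation by the consistency ratio
    # Normalize so the weights sum to the number of sub-blocks: they only
    # change the allocation proportions, not the total quota (claim 5).
    return gated * (gated.size / (gated.sum() + eps))
```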
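A sketch of the quota allocation of claim 6 and the rounding compensation of claim 7. How the target total would be derived from free memory is noted in the docstring; the cycling loop is one way to realize the "deduct from lowest / supplement highest" rule.

```python
import numpy as np

def allocate_quota(weights: np.ndarray, total_budget: int) -> np.ndarray:
    """Non-uniform quota allocation (claim 6) with rounding compensation
    (claim 7). `total_budget` is the target total number of retained
    tokens, e.g. free_memory_bytes // bytes_per_token_kv_pair."""
    base = total_budget / weights.size            # base retention amount
    quota = np.rint(weights * base).astype(int)   # per-block rounded quota
    deviation = int(quota.sum()) - total_budget
    order = np.argsort(weights)                   # ascending: lowest weight first
    i = 0
    while deviation > 0:                          # over budget: deduct from lowest weights
        j = order[i % order.size]
        if quota[j] > 0:
            quota[j] -= 1
            deviation -= 1
        i += 1
    i = 0
    while deviation < 0:                          # under budget: top up highest weights
        quota[order[-(i % order.size) - 1]] += 1
        deviation += 1
        i += 1
    return quota
```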
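A sketch of the per-sub-block token filtering of claim 8; all parameter names are illustrative. Concatenating the per-block results in ascending order of original position index yields the compressed key-value cache sequence.

```python
import numpy as np

def compress_block(keys: np.ndarray, values: np.ndarray,
                   scores: np.ndarray, start: int, quota: int):
    """Keep the `quota` highest-attention tokens of one sub-block and
    return their original positions plus the retained KV data (claim 8).
    `start` is the sub-block's offset in the full sequence."""
    top = np.argsort(scores)[::-1][:quota]   # rank tokens by attention, high to low
    keep = np.sort(top)                      # restore ascending position order
    return keep + start, keys[keep], values[keep]
```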
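A sketch of the robustness measures of claims 9 and 10: the three-standard-deviation clamp on dispersion coefficients and the length check that triggers the uniform-weight fallback. The function split is an illustrative choice.

```python
import numpy as np

def clamp_outliers(disp: np.ndarray) -> np.ndarray:
    """Clamp dispersion coefficients lying outside mean +/- 3*std to the
    boundary of that range (claim 9)."""
    lo = disp.mean() - 3 * disp.std()
    hi = disp.mean() + 3 * disp.std()
    return np.clip(disp, lo, hi)

def features_usable(disp: np.ndarray, conc: np.ndarray) -> bool:
    """Length check of claim 10: on mismatch the caller discards the
    dual-channel features and compresses with uniform weights."""
    return disp.size == conc.size
```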

Description

Data management method oriented to large language model

Technical Field

The invention relates to the field of data processing, and in particular to a data management method oriented to large language models.

Background

In recent years, Large Language Models (LLMs) have exhibited excellent performance across a wide range of natural language processing tasks. In the autoregressive decoding process of a large language model, a key-value caching (KV Cache) mechanism is generally introduced to avoid repeated computation: the key vectors and value vectors of historical tokens are stored in GPU memory for lookup and reuse in subsequent generation steps. This mechanism greatly improves inference speed and is a basic data management component of current large-model inference engines.

As large-model application scenarios develop, the length of user-supplied input has grown dramatically, for example when processing long legal contracts, technical documents mixing large amounts of code and parameters, or very long business dialogue records. In these long-text inference scenarios, the volume of KV cache data expands rapidly as the context sequence lengthens. Because the memory capacity of a server Graphics Processing Unit (GPU) is extremely limited, a vast KV cache can quickly exhaust the available memory, causing memory overflow or forcing the system to reduce concurrent throughput. Filtering and compressing the KV cache under limited memory resources, so that long-text inference can continue, is therefore a necessary requirement in such scenarios.

However, when facing complex long documents in actual business, existing cache-eviction schemes that adopt uniform compression or a single evaluation metric easily cause serious business failures. For example, when a user asks a large model to analyze a technical document that mixes lengthy background exposition with dense core technical parameters, a single metric is prone to misjudgment: under the influence of position bias, the exposition at the beginning of the document may be retained as important, or a redundant passage containing many irrelevant proper nouns may be misidentified as important information. Such single-dimensional misjudgment leads the system to waste valuable memory quota on low-value redundant information while evicting the core terms or parameters that actually play a key role. When the large model then answers user questions about those core details, severe forgetting of key information results, i.e., the 'hallucination' phenomenon, rendering long-text question answering and analysis services completely unusable.
Disclosure of Invention

Aiming at the problem that cache-eviction schemes adopting uniform compression or a single evaluation metric easily cause serious business failures, the invention provides a data management method oriented to large language models, comprising: dividing the key-value cache sequence generated during large language model inference into a plurality of contiguous sub-blocks along the token position dimension; for each sub-block, calculating a dispersion coefficient characterizing the local information density level from the statistical dispersion of the Euclidean distances of each token's value vector relative to the sub-block mean vector; for each sub-block, calculating an association concentration characterizing the concentration of global semantic association from the inequality of the distribution, within the sub-block, of each token's accumulated global attention in the attention weight matrix generated during inference; calculating a consistency ratio from the degree of numerical agreement between the dispersion coefficient and the association concentration, and applying gated modulation, driven by the consistency ratio, to the fused value of the two features to generate a retention weight for each sub-block; and distributing the cache retention quota available to the system non-uniformly across the sub-blocks according to the retention weights, then filtering, compressing, and reorganizing the tokens within each sub-block by attention value to obtain the compressed key-value cache sequence.

Compared with prior large language model inference, which loses key semantics through uniform eviction or relies on a single attention dimension, the method combines local information density with global semantic association concentration and uses the consistency ratio for gated modulation when distributing the retention quota non-uniformly, so that the key-value cache can be greatly compressed to reduce device GPU memory consumption while key-detail omission and the resulting global 'hallucination' failures are avoided.
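To make the disclosed pipeline concrete, the following end-to-end sketch chains the helper functions sketched after the claims. The block width of 64, all array shapes, and the function name are assumptions; quotas that exceed a block's length are simply capped by the slice rather than redistributed.

```python
import numpy as np

def compress_kv_cache(keys, values, attn, free_tokens, width=64):
    """End-to-end sketch: per-block features -> consistency-gated weights ->
    quota allocation -> per-block filtering -> position-ordered reassembly."""
    seq = keys.shape[0]
    blocks = [slice(s, min(s + width, seq)) for s in range(0, seq, width)]
    scores = attn.mean(axis=0).sum(axis=0)   # per-position accumulated attention
    disp = clamp_outliers(np.array(
        [dispersion_coefficient(values[b]) for b in blocks]))       # claims 2, 9
    conc = np.array([association_concentration(attn, b) for b in blocks])
    if features_usable(disp, conc):
        w = retention_weights(disp, conc)    # dual-channel, gated (claims 4-5)
    else:
        w = np.ones(len(blocks))             # claim 10: uniform-weight fallback
    quota = allocate_quota(w, free_tokens)   # claims 6-7
    parts = [compress_block(keys[b], values[b], scores[b], b.start, q)
             for b, q in zip(blocks, quota)] # claim 8
    idx = np.concatenate([p[0] for p in parts])
    order = np.argsort(idx)                  # ascending original position index
    return (np.concatenate([p[1] for p in parts])[order],
            np.concatenate([p[2] for p in parts])[order])
```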