CN-121981267-A - Caching method, device, equipment and medium for large language model key value
Abstract
The application relates to the technical field of natural language processing, the field of financial services, and the field of intelligent medical treatment, and in particular to a caching method, apparatus, device, and medium for large language model key values. The method comprises: acquiring the key value state vectors of N target layers of a large language model located after a preset initial layer; decomposing each vector into a corresponding direction component and amplitude component; dividing the target layers into at least one pair of adjacent target layers according to the hierarchical adjacency relation; interpolating and merging the two direction components of each pair of adjacent target layers to obtain a merged vector; and writing the shared state vectors of all adjacent target layers into a cache. By decomposing and interpolating the key values of the large language model, the method improves the efficiency of the key value cache. It can be applied to business fields such as financial services and intelligent medical treatment.
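The decomposition and interpolation described above can be illustrated with a minimal sketch. This is an interpretation, not the patent's reference implementation: the function names and the midpoint weight `t = 0.5` are assumptions, since the record does not fix an interpolation weight.

```python
import numpy as np

def decompose(kv: np.ndarray):
    """Split a key value state vector into its amplitude (norm) and unit direction."""
    amp = float(np.linalg.norm(kv))
    return amp, kv / amp

def slerp(u: np.ndarray, v: np.ndarray, t: float = 0.5) -> np.ndarray:
    """Spherical linear interpolation between unit vectors u and v along the
    shortest path on the unit sphere (the 'direction included angle' route)."""
    theta = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    if theta < 1e-8:          # nearly parallel directions: nothing to interpolate
        return u
    return (np.sin((1 - t) * theta) * u + np.sin(t * theta) * v) / np.sin(theta)

# Key value state vectors of a pair of adjacent target layers (toy data)
k_prev = np.array([1.0, 2.0, 2.0])
k_next = np.array([2.0, 2.0, 1.0])

m_prev, u_prev = decompose(k_prev)
m_next, u_next = decompose(k_next)

merged = slerp(u_prev, u_next)       # merged direction component
shared = (merged, m_prev, m_next)    # shared state vector written to the cache
```

Storing one merged direction plus the two scalar amplitudes, instead of two full vectors, is what saves cache space in this reading of the method.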
Inventors
- WANG JIANZONG
- ZHANG XULONG
- SHI JIAQI
Assignees
- 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-20
Claims (10)
- 1. A method for caching large language model key values, comprising: collecting the key value state vectors of N target layers of a large language model located after a preset initial layer, wherein N is an integer greater than 1; decomposing each key value state vector to obtain a corresponding direction component and amplitude component; dividing the N target layers into at least one pair of adjacent target layers according to a hierarchical adjacency relation, and, for any pair of adjacent target layers, interpolating and merging the two direction components of the adjacent target layers to obtain a merged vector corresponding to the adjacent target layers; and forming a shared state vector from the merged vector and the two amplitude components of the adjacent target layers, and writing the shared state vectors of all adjacent target layers into a cache.
- 2. The method for caching large language model key values according to claim 1, wherein interpolating and merging the two direction components of the adjacent target layers to obtain the merged vector corresponding to the adjacent target layers includes: calculating the direction included angle between the direction component of the preceding target layer and the direction component of the following target layer in the adjacent target layers; according to the direction included angle, interpolating the two direction components to obtain a rotation transformation matrix along the shortest path on the unit sphere; and determining the rotation transformation matrix as the merged vector corresponding to the adjacent target layers.
- 3. The method for caching large language model key values according to claim 1, further comprising, after writing the shared state vectors of all adjacent target layers into the cache: when use of the original key value state vectors is detected, recovering the shared state vectors in the cache, wherein the recovery operation includes: performing an inverse transformation on the merged vector in the shared state vector to obtain a first direction component of the preceding target layer and a second direction component of the following target layer; rescaling the first direction component using the amplitude component of the preceding target layer in the shared state vector to obtain the restored key value state vector of the preceding target layer; and rescaling the second direction component using the amplitude component of the following target layer in the shared state vector to obtain the restored key value state vector of the following target layer.
- 4. A method for caching large language model key values according to any one of claims 1 to 3, further comprising, before interpolating and merging the two direction components of the adjacent target layers to obtain the merged vector corresponding to the adjacent target layers: for any token, calculating the degree of difference between the two key value state vectors of the token in the adjacent target layers; and, if the degree of difference meets a preset condition, interpolating and merging the two direction components of the adjacent target layers to obtain the merged vector corresponding to the adjacent target layers.
- 5. The method of claim 4, wherein calculating the degree of difference between the two key value state vectors of the token in the adjacent target layers comprises: calculating the angular distance between the two key value state vectors of the token in the adjacent target layers; comparing the angular distance with a preset distance threshold to obtain a comparison result; and determining the degree of difference according to the comparison result, wherein if the comparison result shows that the angular distance is greater than the preset distance threshold, it is determined that the degree of difference does not meet the preset condition, and if the comparison result shows that the angular distance is not greater than the preset distance threshold, it is determined that the degree of difference meets the preset condition.
- 6. The method for caching large language model key values according to claim 4, further comprising, after calculating the degree of difference between the two key value state vectors of the token in the adjacent target layers: if the degree of difference does not meet the preset condition, constructing an index of the token and of its two key value state vectors in the adjacent target layers, and not performing the merging operation on the two key value state vectors, wherein the index is used for writing the corresponding key value state vectors into the restored state vectors when the shared state vectors in the cache are restored; and performing the merging operation on other adjacent target layers or other tokens.
- 7. The method for caching large language model key values as defined in claim 6, further comprising, after writing the shared state vectors of all adjacent target layers into the cache: when use of the original key value state vectors is detected, performing the recovery operation on the shared state vector of each token in the cache along the token sequence to obtain a recovered state vector sequence; and writing the key value state vectors for which the merging operation was not performed into the recovered state vector sequence according to the index to obtain a final state vector sequence.
- 8. A caching apparatus for large language model key values, comprising: a vector acquisition module for acquiring the key value state vectors of N target layers of a large language model located after a preset initial layer, wherein N is an integer greater than 1; a vector decomposition module for decomposing each key value state vector to obtain a corresponding direction component and amplitude component; a vector merging module for dividing the N target layers into at least one pair of adjacent target layers according to the hierarchical adjacency relation, and, for any pair of adjacent target layers, interpolating and merging the two direction components of the adjacent target layers to obtain a merged vector corresponding to the adjacent target layers; and a merging and caching module for forming a shared state vector from the merged vector and the two amplitude components of the adjacent target layers, and writing the shared state vectors of all adjacent target layers into a cache.
- 9. A computer device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the caching method for large language model key values according to any one of claims 1 to 7 when executing the computer program.
- 10. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the caching method for large language model key values according to any one of claims 1 to 7.
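The gating and recovery steps of claims 3 to 7 can be sketched as follows. This is a hedged reading, not the patent's implementation: the threshold value is illustrative, and recovery is approximated here by rescaling the shared direction with each layer's stored amplitude, since the record does not fully specify the inverse transformation of the rotation matrix.

```python
import numpy as np

ANGLE_THRESHOLD = 0.3  # preset distance threshold in radians (illustrative value)

def angular_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Angle between two key value state vectors (their direction included angle)."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def slerp(u: np.ndarray, v: np.ndarray, t: float = 0.5) -> np.ndarray:
    """Shortest-path spherical interpolation between unit vectors u and v."""
    theta = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    if theta < 1e-8:
        return u
    return (np.sin((1 - t) * theta) * u + np.sin(t * theta) * v) / np.sin(theta)

def merge_or_index(k_prev: np.ndarray, k_next: np.ndarray) -> dict:
    """Claims 4-6: merge the adjacent-layer vectors of a token only when their
    angular distance does not exceed the preset threshold; otherwise build an
    index holding both vectors and skip the merging operation."""
    if angular_distance(k_prev, k_next) > ANGLE_THRESHOLD:
        return {"merged": False, "index": (k_prev.copy(), k_next.copy())}
    m_prev = float(np.linalg.norm(k_prev))
    m_next = float(np.linalg.norm(k_next))
    merged_dir = slerp(k_prev / m_prev, k_next / m_next)
    return {"merged": True, "shared": (merged_dir, m_prev, m_next)}

def restore(entry: dict):
    """Claims 3 and 7: recover per-layer vectors from a cache entry. Merged
    entries are rescaled from the shared direction; unmerged entries are
    written back from their index unchanged."""
    if not entry["merged"]:
        return entry["index"]
    merged_dir, m_prev, m_next = entry["shared"]
    return m_prev * merged_dir, m_next * merged_dir

# Two nearly aligned adjacent-layer vectors for one token pass the gate
entry = merge_or_index(np.array([1.0, 0.1, 0.0]), np.array([1.0, 0.0, 0.1]))
k_prev_hat, k_next_hat = restore(entry)  # approximate recovery when merged
```

Note that rescaling preserves each layer's amplitude exactly; only the direction is approximated when the merge is performed, which is why the angular-distance gate of claim 5 bounds the error.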
Description
Caching method, device, equipment and medium for large language model key value

Technical Field

The application relates to the technical field of natural language processing, the field of financial services, and the field of intelligent medical treatment, and in particular to a caching method, apparatus, device, and medium for large language model key values.

Background

Large language models have made remarkable progress in the field of natural language processing and are widely applied in financial services and intelligent medical scenarios, such as financial analysis text generation and medical question answering systems. As the scale of large language models continues to grow, the key value cache avoids repeated calculation by storing the model's intermediate states (key value state vectors) during inference, thereby effectively reducing computation and improving inference speed. However, existing key value caching methods still have several defects.

First, most current key value storage methods use full key value caching, which directly stores the complete key value state vectors of all layers of the large language model in the cache. However, since a large language model has many layers and the key value state vector of each layer is high-dimensional, full storage occupies a large amount of storage space. As model sizes continue to expand, this storage requirement becomes burdensome, increasing hardware costs and management difficulty.

Second, existing full key value caching methods store a large number of rarely used key value state vectors, resulting in low cache hit rates. For example, in a medical assistance setting, the condition of each patient is unique; even when symptoms are similar, the specific etiology and stage of progression may vary. The same input therefore has a low probability of recurring, and a full key value cache may store many key value state vectors that are rarely used, so the cache hit rate is low. In practical applications, the system still needs to perform calculations frequently, and the inference speed cannot be effectively improved.

Moreover, the existing fixed-layer caching approach cannot dynamically adjust the cached layers according to a specific scenario, so the caching effect is poor and accurate inference suggestions cannot be provided. For example, in a financial services scenario, different product categories require information from different layers for accurate analysis. Predicting a product's running trend may require more attention to underlying information such as market trends and company fundamentals, and a fixed-layer cache cannot dynamically adjust the cached layers to the specific investment scenario, so the caching effect is poor and accurate product adjustment suggestions are difficult to provide. Therefore, how to decompose and interpolate the key values of a large language model to improve the efficiency of the key value cache is a problem to be solved.

Disclosure of Invention

In view of this, embodiments of the present application provide a method, apparatus, device, and medium for caching large language model key values, so as to solve the problem of how to decompose and interpolate large language model key values to improve the efficiency of key value caching.
In a first aspect, an embodiment of the present application provides a method for caching large language model key values, including: collecting the key value state vectors of N target layers of a large language model located after a preset initial layer, wherein N is an integer greater than 1; decomposing each key value state vector to obtain a corresponding direction component and amplitude component; dividing the N target layers into at least one pair of adjacent target layers according to a hierarchical adjacency relation, and, for any pair of adjacent target layers, interpolating and merging the two direction components of the adjacent target layers to obtain a merged vector corresponding to the adjacent target layers; and forming a shared state vector from the merged vector and the two amplitude components of the adjacent target layers, and writing the shared state vectors of all adjacent target layers into a cache.

In a second aspect, an embodiment of the present application provides a caching apparatus for large language model key values, including: a vector acquisition module for acquiring the key value state vectors of N target layers of the large language model located after a preset initial layer, wherein N is an integer greater than 1; a vector decomposition module for decomposing each key value state vector to obtain a corresponding direction component and amplitude component; the vec