CN-122019209-A - KV cache management method and device, electronic equipment and storage medium

CN122019209A

Abstract

The invention relates to the technical field of artificial intelligence and provides a KV cache management method, device, electronic device, and storage medium. The method comprises: in response to a large-model inference request, acquiring a key-value (KV) block to be managed; collecting characteristic attributes of the KV block, the attributes comprising at least a recalculation cost index, an access heat index, and a life cycle index; calculating a retention score for the KV block based on these attributes, where a higher recalculation cost index yields a higher retention score; monitoring the resource occupation state of the current KV cache storage area; when the resource occupation state satisfies an eviction condition, selecting KV blocks whose retention scores meet a preset elimination rule from the KV cache storage area as target KV blocks; and performing a hierarchical processing operation on the target KV blocks. By factoring recalculation cost into eviction decisions, the invention prevents KV blocks that are expensive to recompute from being evicted from the KV cache prematurely, improving the KV cache hit rate.

Inventors

  • WANG FEI
  • JIANG JIELONG
  • LIN DELONG
  • ZHOU QIFENG
  • LI DAOYUAN
  • YANG TIANYU

Assignees

  • GRG Banking Equipment Co., Ltd. (广电运通集团股份有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-01-28

Claims (10)

  1. A KV cache management method, characterized by comprising the following steps: in response to a large-model inference request, acquiring a key-value (KV) block to be managed; collecting characteristic attributes of the KV block, the characteristic attributes comprising at least a recalculation cost index, an access heat index, and a life cycle index; calculating a retention score of the KV block based on the characteristic attributes, wherein a higher recalculation cost index yields a higher retention score; monitoring the resource occupation state of the current KV cache storage area; when the resource occupation state satisfies an eviction condition, selecting, based on the retention scores, KV blocks whose retention scores meet a preset elimination rule from the KV cache storage area as target KV blocks; and performing a hierarchical processing operation on the target KV blocks.
  2. The KV cache management method according to claim 1, wherein the recalculation cost index is determined in at least one of the following manners: based on the length of time required to recalculate the KV block; based on the computing resources required to recalculate the KV block; or based on a weighted sum of the time and the computing resources required to recalculate the KV block.
  3. The KV cache management method according to claim 1, wherein the KV cache storage area comprises a hot layer and a cold layer, and performing the hierarchical processing operation on the target KV block comprises: when the retention score of a KV block located in the hot layer is smaller than a first score threshold, migrating that KV block, as a target KV block, to the cold layer and storing it in compressed form; when a KV block located in the cold layer is accessed and its updated retention score is larger than a second score threshold, migrating that KV block from the cold layer to the hot layer and storing it in decompressed form, wherein the first score threshold is smaller than or equal to the second score threshold; and when space must be released, selecting the N KV blocks with the smallest retention scores from the cold layer for release, wherein N is larger than 0.
  4. The KV cache management method according to claim 3, wherein the first score threshold is smaller than the second score threshold; migrating the KV block, as a target KV block, to the cold layer and storing it in compressed form comprises: judging whether a KV block located in the hot-layer storage area has resided for at least a preset minimum residence time and has had retention scores smaller than the first score threshold over multiple consecutive calculations, and if so, performing the migration and compressed storage; and migrating the KV block from the cold layer to the hot layer and storing it in decompressed form comprises: judging whether a KV block located in the cold-layer storage area has resided for at least the preset minimum residence time and has had retention scores larger than the second score threshold over multiple consecutive calculations, and if so, performing the migration and decompressed storage.
  5. The KV cache management method according to claim 1, wherein performing the hierarchical processing operation on the target KV block comprises: generating fingerprint data of the target KV block; storing the fingerprint data in an auxiliary cache region distinct from the KV cache storage area; and when a KV block corresponding to the auxiliary cache region is accessed, determining, based on a preset admission rule, whether to add the accessed KV block back into the KV cache storage area.
  6. The KV cache management method according to claim 1, wherein collecting the characteristic attributes of the KV block further comprises acquiring the identity of the tenant to which the current inference request belongs; and the method further comprises: generating an isolated namespace or an encryption key based on the identity of the tenant, storing the KV block bound to the isolated namespace or encryption key, and, when processing a cache query request, checking whether the identity of the querier matches the isolated namespace or encryption key bound to the KV block.
  7. The KV cache management method according to any one of claims 1 to 6, wherein calculating the retention score of the KV block based on the characteristic attributes comprises computing a weighted sum of the characteristic attributes and their corresponding attribute weights; and the method further comprises: acquiring performance index values of the most recent batch of inference requests collected over a period of time; substituting the performance index values into an objective function pre-constructed over those performance indexes, wherein the constraint condition of the objective function is that each weight is greater than or equal to 0 and the weights sum to 1, and the objective function represents the overall inference efficiency and/or the cache hit rate of the cache system; computing the gradient of the objective function and updating the attribute weights along that gradient; and applying a simplex projection or non-negative projection to the updated attribute weights, thereby obtaining the attribute weights at which the objective function attains its minimum.
  8. A KV cache management device, characterized by comprising: a KV block acquisition module, configured to acquire, in response to a large-model inference request, a key-value (KV) block to be managed; a characteristic attribute collection module, configured to collect characteristic attributes of the KV block, the characteristic attributes comprising at least a recalculation cost index, an access heat index, and a life cycle index; a retention score calculation module, configured to calculate the retention score of the KV block based on the characteristic attributes, wherein a higher recalculation cost index yields a higher retention score; a KV cache state monitoring module, configured to monitor the resource occupation state of the current KV cache storage area; a target KV block selection module, configured to select, based on the retention scores and when the resource occupation state satisfies an eviction condition, KV blocks whose retention scores meet a preset elimination rule from the KV cache storage area as target KV blocks; and a hierarchical processing module, configured to perform a hierarchical processing operation on the target KV blocks.
  9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the KV cache management method according to any one of claims 1 to 7 when executing the computer program.
  10. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the KV cache management method according to any one of claims 1 to 7.
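As an illustration of the weight-update scheme in claim 7 (a gradient step followed by a simplex/non-negative projection), the sketch below uses the standard Euclidean projection onto the probability simplex. The function names, learning rate, and the toy gradient are assumptions for illustration, not terms defined by the patent.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {w : w_i >= 0, sum(w) = 1} -- the constraint set of claim 7."""
    u = sorted(v, reverse=True)            # sort components descending
    css = 0.0
    rho, css_rho = 0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        # rho is the largest i for which the shifted component stays positive
        if ui + (1.0 - css) / i > 0:
            rho, css_rho = i, css
    theta = (css_rho - 1.0) / rho          # uniform shift that enforces sum = 1
    return [max(x - theta, 0.0) for x in v]

def projected_gradient_step(weights, grad, lr=0.1):
    """One projected-gradient update: descend the objective, then project
    back onto the simplex so the weights stay non-negative and sum to 1."""
    stepped = [w - lr * g for w, g in zip(weights, grad)]
    return project_to_simplex(stepped)
```

In practice the gradient would come from differentiating (or finite-differencing) the objective built from the measured inference-efficiency and hit-rate indexes; the projection guarantees the constraint set is never left, whatever the step does.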

Description

KV cache management method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of artificial intelligence technologies, and in particular to a KV cache management method, apparatus, electronic device, and storage medium.

Background

With the popularity of large language models (LLMs) in dialog systems, retrieval-augmented generation (RAG), code generation, and multi-modal applications, the computational cost and latency of the inference stage have become key factors limiting online service experience and compute costs. To reduce duplicate computation, the industry commonly caches the intermediate key-value (KV) states of self-attention computed during inference, reusing cached KV when identical or similar prefixes occur, thereby shortening the time to first token (TTFT) and improving throughput. Such KV caches may reside in GPU/CPU memory and are often also placed in general-purpose key-value systems (e.g., distributed in-memory or persistent KV stores) for reuse across processes and across nodes. Current KV cache management schemes have the following problems:

  1. Simple replacement strategies have limited effectiveness: with only least recently used (LRU) and/or least frequently used (LFU) cache-cleaning strategies, it is difficult to account for differences in KV block recalculation cost, and entries that are costly to recompute are evicted prematurely.
  2. Performance is unstable and capacity is uncontrollable: in long-context and high-concurrency scenarios, the KV cache hit rate fluctuates, TTFT and P95/P99 tail-latency jitter are pronounced, the occupancy of external KV stores (such as an in-memory database) keeps growing, and once their limits are reached, write failures or large-scale eviction churn occur.
  3. Synchronous cleanup blocks the main path: eviction/reclamation runs synchronously with the inference threads, causing request jitter and reduced throughput.
  4. Distributed/cross-node coordination is insufficient: each node decides in isolation, and the lack of a global or sharded hotness view leads to hot-spot mismatch and ineffective migration.
  5. Multi-tenant isolation is weak: when cache keys are not bound to tenant context, or content deduplication/sharing spans different tenants, cross-tenant recall and information leakage can occur; and if hotness summaries are exchanged without desensitization and encryption, usage patterns risk being exposed.

Disclosure of Invention

The invention provides a KV cache management method, device, electronic device, and storage medium to solve at least one of the technical problems in the prior art. The KV cache management method provided by the invention comprises the following steps: in response to a large-model inference request, acquiring a key-value (KV) block to be managed; collecting characteristic attributes of the KV block, the characteristic attributes comprising at least a recalculation cost index, an access heat index, and a life cycle index; calculating a retention score of the KV block based on the characteristic attributes, wherein a higher recalculation cost index yields a higher retention score; monitoring the resource occupation state of the current KV cache storage area; when the resource occupation state satisfies an eviction condition, selecting, based on the retention scores, KV blocks whose retention scores meet a preset elimination rule from the KV cache storage area as target KV blocks; and performing a hierarchical processing operation on the target KV blocks.
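The scoring and eviction flow just described might be sketched as follows. The specific weighting formula, the attribute normalization, and all names are illustrative assumptions; the patent does not fix a concrete formula beyond "higher recalculation cost yields a higher retention score".

```python
import heapq
from dataclasses import dataclass

@dataclass
class KVBlock:
    block_id: str
    recompute_cost: float  # recalculation cost index, assumed normalized to [0, 1]
    access_heat: float     # access heat index, assumed normalized to [0, 1]
    age: float             # life cycle index: seconds since the block was created

def retention_score(block, w_cost=0.5, w_heat=0.3, w_life=0.2):
    """Weighted retention score: a higher recompute cost raises the score,
    so blocks that are expensive to rebuild are kept longer (claim 1)."""
    life_term = 1.0 / (1.0 + block.age)  # younger blocks score higher
    return (w_cost * block.recompute_cost
            + w_heat * block.access_heat
            + w_life * life_term)

def select_eviction_targets(blocks, n):
    """When the eviction condition fires, pick the n blocks with the
    lowest retention scores as target KV blocks."""
    return heapq.nsmallest(n, blocks, key=retention_score)
```

The attribute weights here are fixed constants for clarity; claim 7 instead adapts them online from measured performance indexes.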
According to the KV cache management method provided by the invention, the recalculation cost index is determined in at least one of the following manners: based on the length of time required to recalculate the KV block; based on the computing resources required to recalculate the KV block; or based on a weighted sum of the time and the computing resources required to recalculate the KV block. According to the KV cache management method provided by the invention, the KV cache storage area comprises a hot layer and a cold layer, and performing the hierarchical processing operation on the target KV block comprises: when the retention score of a KV block located in the hot layer is smaller than a first score threshold, migrating that KV block, as a target KV block, to the cold layer and storing it in compressed form; when a KV block located in the cold layer is accessed and its updated retention score is larger than a second score threshold, migrating that KV block from the cold layer to the hot layer and storing it in decompressed form, wherein the first score threshold is smaller than or equal to the second score threshold; and when space must be released, selecting the N KV blocks with the smallest retention scores from the cold layer for release, wherein N is larger than 0.
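A minimal sketch of the hot/cold tiering with hysteresis described above (compressed cold layer, plus the score-streak and minimum-residence-time gates of claims 3 and 4). The concrete thresholds, the streak length, and the use of zlib for the cold layer are assumptions for illustration only.

```python
import zlib

HOT_THRESHOLD = 0.4    # first score threshold: demote below this
COLD_THRESHOLD = 0.6   # second score threshold: promote above this
MIN_RESIDENCE_S = 30.0 # minimum residence time before any migration
STREAK = 3             # consecutive score evaluations required

class TieredKVCache:
    """Two-tier cache: the hot layer holds raw bytes, the cold layer holds
    zlib-compressed bytes. Migration requires a sustained score streak plus
    a minimum residence time, which damps ping-ponging between tiers."""

    def __init__(self):
        self.hot = {}   # block_id -> (raw_bytes, entered_at, low_score_streak)
        self.cold = {}  # block_id -> (compressed_bytes, entered_at, high_score_streak)

    def observe_hot(self, block_id, score, now):
        """Re-score a hot-layer block; demote it once the score has stayed
        below the first threshold for STREAK evaluations after MIN_RESIDENCE_S."""
        raw, entered, streak = self.hot[block_id]
        streak = streak + 1 if score < HOT_THRESHOLD else 0
        if streak >= STREAK and now - entered >= MIN_RESIDENCE_S:
            del self.hot[block_id]
            self.cold[block_id] = (zlib.compress(raw), now, 0)
        else:
            self.hot[block_id] = (raw, entered, streak)

    def observe_cold(self, block_id, score, now):
        """Re-score a cold-layer block on access; promote it once the score has
        stayed above the second threshold for STREAK evaluations."""
        comp, entered, streak = self.cold[block_id]
        streak = streak + 1 if score > COLD_THRESHOLD else 0
        if streak >= STREAK and now - entered >= MIN_RESIDENCE_S:
            del self.cold[block_id]
            self.hot[block_id] = (zlib.decompress(comp), now, 0)
        else:
            self.cold[block_id] = (comp, entered, streak)
```

Keeping HOT_THRESHOLD below COLD_THRESHOLD creates the hysteresis band the claims require: a block whose score hovers between the two thresholds stays where it is rather than oscillating between tiers.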