CN-122019408-A - Key-value cache scheduling method and system for language model inference
Abstract
The invention provides a key-value cache scheduling method and system for language model inference, applicable to GPU-memory management and key-value cache reuse in large-scale language model inference engines. The method comprises: obtaining an input sequence and extracting candidate reusable segments from it; computing segment hash values and querying a segment index table for the corresponding historical logical-block hash list; generating a current logical-block hash list for the segments during prefill; mapping the current logical-block hashes, through a prefix hash table, to the physical addresses of historical physical key-value cache blocks, thereby establishing logical-hash aliases; and scheduling the key-value cache data in underlying GPU memory to participate in autoregressive decoding.
Inventors
- Li Zhiyu
- Zhang Quqing
- Chen Kai
- Xiong Feiyu
Assignees
- 记忆张量(上海)科技有限公司
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-04-14
Claims (10)
- 1. A key-value cache scheduling method for language model inference, applied to computer equipment, characterized by comprising the following steps: acquiring an input sequence of a current inference request, and extracting candidate reusable segments from the input sequence; computing a segment hash value of each candidate reusable segment, and querying the segment hash value in a segment index table to obtain the historical logical-block hash list corresponding to the segment hash value, wherein the segment index table records the mapping between semantically independent segment hash values and their corresponding logical-block hash lists; while the language model performs the prefill computation, generating a corresponding current logical-block hash list for the candidate reusable segment according to the current context of the input sequence; in a prefix hash table, updating the physical address pointer corresponding to the current logical-block hash list to the physical address of the historical physical key-value cache block pointed to by the historical logical-block hash list, so as to establish a logical-hash alias mapping; and scheduling, according to the updated physical address pointer in the prefix hash table, the key-value cache data of the historical physical key-value cache block in underlying GPU memory, so that the key-value cache data participates in the autoregressive decoding computation of the language model.
- 2. The method of claim 1, wherein extracting candidate reusable segments from the input sequence comprises: extracting text substrings by parsing explicit boundary tags carried in the input sequence, and taking the text substrings as the candidate reusable segments; or identifying an upstream output text sequence referenced in the input sequence according to a default workflow rule, and taking the upstream output text sequence as the candidate reusable segment.
- 3. The method of claim 1, further comprising, before computing the segment hash value of the candidate reusable segment: performing text normalization on the candidate reusable segments, the text normalization comprising at least one of whitespace removal, punctuation unification, and case conversion; or converting the candidate reusable segments into a token sequence with the tokenizer of the language model, and computing the segment hash value over the token sequence.
- 4. The method of claim 1, wherein updating the physical address pointer corresponding to the current logical-block hash list to the physical address of the historical physical key-value cache block pointed to by the historical logical-block hash list comprises: traversing each current logical-block hash in the current logical-block hash list; and replacing, in the prefix hash table, the value of the key-value pair indexed by the current logical-block hash with the underlying GPU-memory offset mapped by the historical logical-block hash at the corresponding sequence position in the historical logical-block hash list.
- 5. The method of claim 1, further comprising, after establishing the logical-hash alias mapping: comparing the current position-encoding information of the candidate reusable segment in the current context with its historical position-encoding information in the historical context; and triggering a position-encoding offset calibration mechanism when the current position-encoding information is inconsistent with the historical position-encoding information.
- 6. The method of claim 1, wherein updating the physical address pointer corresponding to the current logical-block hash list to the physical address of the historical physical key-value cache block pointed to by the historical logical-block hash list comprises: performing, in the underlying GPU-memory management pool of the language model, an atomic increment on the reference counter of the historical physical key-value cache block, so as to prevent concurrent inference requests from asynchronously releasing, or dirty-writing to, the historical physical key-value cache block.
- 7. The method of claim 1, further comprising, after completing the autoregressive decoding computation and generating an output sequence: splitting the output sequence into target product segments according to one of an agent role identifier, a tool-call instruction boundary, and a maximum token-length threshold.
- 8. The method of claim 1, wherein the mappings in the segment index table are bound to at least one of a task-domain isolation identifier and a tenant-permission identifier.
- 9. The method of claim 1, wherein computing the segment hash value of the candidate reusable segment and querying the segment index table for the segment hash value comprises: performing an approximate-match query in the segment index table using a locality-sensitive hash value to obtain a target mapping entry; when the similarity of the target mapping entry exceeds a preset threshold, extracting the complete sequence features of the candidate reusable segment and performing a secondary exact-hash anti-collision comparison; and when the secondary exact-hash anti-collision comparison passes, obtaining the historical logical-block hash list corresponding to the segment hash value.
- 10. A key-value cache scheduling system for language model inference, comprising: an input analysis module configured to acquire an input sequence of a current inference request and extract candidate reusable segments from the input sequence; an index addressing module configured to compute a segment hash value of the candidate reusable segment, query the segment hash value in a segment index table, and obtain the historical logical-block hash list corresponding to the segment hash value, wherein the segment index table records the mapping between semantically independent segment hash values and their corresponding logical-block hash lists; a context distribution module configured to generate, while the language model performs the prefill computation, a corresponding current logical-block hash list for the candidate reusable segment according to the current context of the input sequence; an address redirection module configured to update, in a prefix hash table, the physical address pointer corresponding to the current logical-block hash list to the physical address of the historical physical key-value cache block pointed to by the historical logical-block hash list, so as to establish a logical-hash alias mapping; and a decoding execution module configured to schedule, according to the updated physical address pointer in the prefix hash table, the key-value cache data of the historical physical key-value cache block in underlying GPU memory, so that the key-value cache data participates in the autoregressive decoding computation of the language model.
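The data structures recited in claims 1, 4, and 6 can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names (`KVCacheScheduler`, `register_history`, `alias_map`), the choice of SHA-256, and the lock standing in for hardware atomics are all hypothetical, and integer block ids stand in for GPU-memory addresses.

```python
import hashlib
import threading

def _h(data: str) -> str:
    # Stable hash for segments and logical blocks (assumed: SHA-256 hex).
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

class KVCacheScheduler:
    """Sketch of the claimed structures.

    segment_index: segment hash -> historical logical-block hash list (claim 1).
    prefix_table:  logical-block hash -> physical block id (the prefix hash table).
    refcount:      physical block id -> reference count (claim 6; a lock stands
                   in for the atomic increment of a real engine).
    """
    def __init__(self):
        self.segment_index = {}
        self.prefix_table = {}
        self.refcount = {}
        self._lock = threading.Lock()

    def register_history(self, segment: str, block_texts, physical_ids):
        # Record a finished request: segment -> logical blocks, and each
        # logical block -> the physical KV block that holds its tensors.
        logical = [_h(b) for b in block_texts]
        self.segment_index[_h(segment)] = logical
        for lh, pid in zip(logical, physical_ids):
            self.prefix_table[lh] = pid
            self.refcount[pid] = self.refcount.get(pid, 0) + 1

    def alias_map(self, segment: str, current_context: str, block_texts):
        # Claim 1's steps: query the segment hash, derive the context-dependent
        # current logical hashes, then overwrite their physical pointers with
        # the historical blocks (zero-copy reuse via logical-hash aliasing).
        hist = self.segment_index.get(_h(segment))
        if hist is None:
            return None  # cache miss: fall back to a normal prefill
        current = [_h(current_context + b) for b in block_texts]
        with self._lock:  # claim 6: pin blocks before aliasing them
            for lh_cur, lh_hist in zip(current, hist):
                pid = self.prefix_table[lh_hist]
                self.refcount[pid] += 1       # guards against use-after-free
                self.prefix_table[lh_cur] = pid  # the logical-hash alias
        return [self.prefix_table[lh] for lh in current]
```

A second request with a different context prefix thus resolves its current logical hashes to the same physical block ids as the first request, which is the zero-copy cross-context reuse the claims describe.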
Description
Key-value cache scheduling method and system for language model inference

Technical Field
The invention relates to the intersection of artificial intelligence and computer architecture, and in particular to a key-value cache scheduling method and system for language model inference.

Background
During large language model inference, the key-value cache stores the key and value tensors computed by the attention mechanism, forming short-term working memory while an agent executes its task. The prior art commonly adopts a prefix-caching strategy: the corresponding key-value cache can be reused only when the input-sequence prefix of a new request is exactly identical to a previously computed sequence prefix. This strategy relies on a prefix-tree structure for cache matching and is markedly limited in multi-agent collaboration scenarios. The system prompt templates, role settings, and context organization used by the individual agents are highly heterogeneous, so even when input sequences contain exactly the same semantic segments, cache reuse fails to trigger purely because of prefix differences. The result is a large amount of repeated prefill computation, which significantly increases time-to-first-token and exacerbates GPU memory occupancy. In addition, rotary position encoding deeply couples the key-value tensors to absolute positions, so directly reusing physical cache blocks across contexts misaligns position information and destabilizes model output. Meanwhile, when multiple agents concurrently access the same historical key-value cache block, the lack of a coordinated lifecycle-management mechanism for physical GPU-memory blocks easily leads to system-level errors such as dirty writes or use-after-free.
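The background's point about prefix-only matching can be made concrete with a toy example. The helper below is purely illustrative (integer stand-ins for token ids, an arbitrary block size of 4): it counts how many whole KV blocks a prefix cache could reuse between two requests.

```python
def prefix_reusable_blocks(old_tokens, new_tokens, block_size=4):
    # Under prefix caching, only whole blocks that lie inside the longest
    # common *prefix* of the two token sequences can be reused.
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n // block_size

# Two agents embed the same document segment after different system prompts.
doc = list(range(100, 140))     # 40 shared "tokens"
req_a = [1, 2, 3] + doc         # agent A's prompt
req_b = [7, 8, 9] + doc         # agent B's prompt
```

Although the two requests share 40 tokens (10 full blocks at this block size), their common prefix is empty, so a pure prefix cache reuses nothing; the segment index of the invention is aimed at recovering exactly this lost reuse.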
Therefore, a key-value cache scheduling method is needed that breaks through the prefix constraint, supports reuse of semantic segments at arbitrary positions, is compatible with position-encoding calibration, and guarantees concurrency safety. It should be noted that the information disclosed in the foregoing background section is only for enhancing understanding of the background of the invention and may therefore include information that does not constitute prior art already known to those of ordinary skill in the art.

Disclosure of Invention
In view of the above, the invention provides a key-value cache scheduling method and system for language model inference. It aims to solve the technical problems of the prior art: key-value cache reuse granularity is too coarse, cross-context semantic segments cannot be reused, working memory cannot be shared among multiple agents, position-encoding misalignment degrades generation quality, and physical memory management is unsafe under concurrent access. To this end, the invention constructs an architecture that decouples the semantic segment index from logical-hash aliasing, performs low-level pointer overwriting in a prefix hash table to achieve zero-copy cross-context reuse, and introduces reference counting based on atomic operations to manage memory lifecycles. This supports efficient, safe, and high-quality KV-cache reuse of repeated semantic segments at arbitrary positions, significantly reduces time-to-first-token (TTFT), lowers peak GPU memory occupancy, and guarantees system robustness under multi-agent concurrent inference.
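The position-encoding calibration mentioned above (claim 5) rests on a property of rotary position embedding (RoPE): the encoding is a per-pair plane rotation, and rotations compose additively, so a cached key encoded at position p can be moved to position p' by one extra rotation through the delta p' - p, without redoing the key projection. A minimal NumPy sketch; the function names and the base of 10000 are conventional RoPE choices, not taken from the patent text.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    # Apply rotary position embedding to row vectors x of shape (n, d), d even:
    # each adjacent pair of features is rotated by angle position * inv_freq.
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    ang = np.outer(positions, inv_freq)            # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # standard 2-D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def recalibrate_keys(cached_keys, old_positions, new_positions):
    # Offset calibration: rotating the cached (already-encoded) keys through
    # the position delta re-targets them to the new positions, because
    # R(old + delta) = R(delta) @ R(old) for plane rotations.
    delta = np.asarray(new_positions) - np.asarray(old_positions)
    return rope_rotate(cached_keys, delta)
```

The additive composition R(a)R(b) = R(a+b) is what makes the cached block reusable in place: `recalibrate_keys(rope_rotate(k, old), old, new)` equals `rope_rotate(k, new)` up to floating-point error.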
The embodiment of the invention provides a key-value cache scheduling method for language model inference, applied to computer equipment and comprising the following steps: acquiring an input sequence of a current inference request, and extracting candidate reusable segments from the input sequence; computing segment hash values of the candidate reusable segments, querying the segment hash values in a segment index table, and obtaining the historical logical-block hash lists corresponding to the segment hash values, wherein the segment index table records the mapping between semantically independent segment hash values and their corresponding logical-block hash lists; while the language model performs the prefill computation, generating a corresponding current logical-block hash list for the candidate reusable segment according to the current context of the input sequence; in the prefix hash table, updating the physical address pointer corresponding to the current logical-block hash list to the physical address of the historical physical key-value cache block pointed to by the historical logical-block hash list, so as to establish a logical-hash alias mapping; and scheduling, according to the updated physical address pointer in the prefix hash table, the key-value cache data of the historical physical key-value cache blocks in underlying GPU memory, so that the key-value cache data participates in the autoregressive decoding computation of the language model.
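Claim 9's two-stage lookup (an approximate locality-sensitive match followed by an exact anti-collision check) can be sketched as follows. The SimHash over character trigrams, the MD5-derived 64-bit gram hashes, and the flat-list index are illustrative assumptions only; a production index would use banded LSH buckets rather than a linear scan.

```python
import hashlib

def simhash(text, bits=64):
    # Toy locality-sensitive hash over character 3-grams: similar texts
    # produce fingerprints with small Hamming distance.
    v = [0] * bits
    for i in range(max(len(text) - 2, 1)):
        h = int.from_bytes(hashlib.md5(text[i:i+3].encode()).digest()[:8], "big")
        for b in range(bits):
            v[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if v[b] > 0)

def hamming_sim(a, b, bits=64):
    # Similarity in [0, 1]: fraction of fingerprint bits that agree.
    return 1.0 - bin(a ^ b).count("1") / bits

def lookup(index, segment, threshold=0.9):
    # Stage 1 (approximate): LSH similarity against each indexed entry.
    # Stage 2 (exact): a full-sequence hash comparison rules out collisions
    # before any cached blocks are reused.
    q = simhash(segment)
    exact = hashlib.sha256(segment.encode()).hexdigest()
    for lsh, exact_hash, logical_blocks in index:
        if hamming_sim(q, lsh) >= threshold and exact_hash == exact:
            return logical_blocks
    return None
```

The exact second stage is what prevents a near-duplicate segment, which may pass the similarity threshold, from being wrongly aliased onto another segment's KV blocks.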