
CN-122019686-A - KVCache multiplexing acceleration model pre-filling method and system


Abstract

The invention discloses a KVCache-multiplexing method and system for accelerating model prefill. The method comprises an offline preprocessing stage and an online reasoning stage. In the offline preprocessing stage, for each determined high-frequency text segment, a set number of similar texts is retrieved from a knowledge base based on similarity; cross-segment preprocessing and similarity-guided cross-attention fusion are performed to generate an enhanced KVCache containing cross-attention information between the text segment and its similar texts, and the enhanced KVCaches are used to build a KVCache database. In the online reasoning stage, a step-by-step matching strategy matches the prefix of a new user request, and the selected key tokens are recomputed with an attention-score-based dynamic key-token selection method, so that a result responding to the user request is output. The invention combines output quality with speed and has good generality.

Inventors

  • WANG JIAHAO
  • AI ZHIYUAN
  • CHEN XIANGLIN
  • HU YONGSEN

Assignees

  • Beijing Qujing Technology Co., Ltd. (北京趋境科技有限责任公司)

Dates

Publication Date
2026-05-12
Application Date
2026-01-29

Claims (9)

  1. A KVCache multiplexing acceleration model pre-filling method, comprising the following steps: in an offline preprocessing stage, for each determined high-frequency text segment, retrieving a set number of similar texts from a knowledge base based on similarity, performing cross-segment preprocessing and similarity-guided cross-attention fusion, and generating an enhanced KVCache that contains cross-attention information between the text segment and its similar texts, the enhanced KVCache being used to build a KVCache database; and, in an online reasoning stage, matching the prefix of a new user request using a step-by-step matching strategy, and recomputing the selected key tokens with an attention-score-based dynamic key-token selection method, so as to output a result responding to the user request.
  2. The method of claim 1, wherein the enhanced KVCache is computed as follows: for each retrieved text segment, retrieve from the knowledge base the top-k most similar segments and load their corresponding pre-computed KVCaches; splice the system prompt, the KVCaches of the similar segments, and the original segment in sequence as input; and re-run the prefill computation on the spliced input, the newly generated KVCache serving as the enhanced KVCache, wherein the KVCache is initialized to an empty state and each recalled related text block contributes its own KVCache to the spliced input (a sketch of this construction follows the claims).
  3. The method according to claim 1, wherein the key tokens are obtained as follows: the Transformer model independently generates a final-layer query matrix Q from the user question; for each text segment, the final-layer key matrix K_i is taken and an attention score is computed by the attention mechanism as S_i = softmax(Q K_i^T / sqrt(d)), where d is the model hidden size and S_i is the attention score of the i-th text block; each column of S_i is summed to obtain a score vector whose length equals the number of tokens in the segment, the j-th entry being the key score of the j-th token in the i-th text block; and key tokens are then selected up to a set proportion threshold according to these scores (see the second sketch after the claims).
  4. The method of claim 2, wherein the top-k most similar texts are retrieved from the knowledge base using retrieval-augmented generation (RAG) or vector retrieval.
  5. The method of claim 1, wherein the step-by-step matching strategy comprises: first searching a global cache for the longest prefix that exactly matches the prefix of the current user request, and directly reusing that prefix's KVCache on a hit; and, if no exact prefix match exists, looking up substrings of the input request that match contiguously and reusing the offline-enhanced KVCache for them (see the third sketch after the claims).
  6. The method according to claim 3, wherein the proportion threshold is a preset fixed ratio or a dynamically adjusted ratio threshold.
  7. A KVCache multiplexing acceleration model pre-filling system, comprising: an offline preprocessing module for, for each determined high-frequency text segment, retrieving a set number of similar texts from a knowledge base based on similarity, performing cross-segment preprocessing and similarity-guided cross-attention fusion, and generating an enhanced KVCache that contains cross-attention information between the text segment and its similar texts, the enhanced KVCache being used to build a KVCache database; and an online reasoning module for matching the prefix of a new user request using a step-by-step matching strategy, and recomputing the selected key tokens with an attention-score-based dynamic key-token selection method, so as to output a result responding to the user request.
  8. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
  9. A computer device comprising a memory and a processor, the memory storing a computer program runnable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.
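
The construction in claim 2 can be pictured with a minimal sketch. All names here (retrieve_topk, prefill, cache_db) are hypothetical stand-ins rather than anything named in the patent, and the prefill stub merely fabricates arrays of plausible shape:

    from typing import Dict, List, Tuple
    import numpy as np

    KVCache = List[Tuple[np.ndarray, np.ndarray]]  # one (K, V) pair per layer

    def retrieve_topk(segment: str, knowledge_base: List[str], k: int) -> List[str]:
        # Toy similarity: rank knowledge-base texts by shared-word overlap.
        def overlap(a: str, b: str) -> int:
            return len(set(a.split()) & set(b.split()))
        return sorted(knowledge_base, key=lambda t: overlap(segment, t), reverse=True)[:k]

    def prefill(system_prompt: str, spliced_caches: List[KVCache], segment: str) -> KVCache:
        # Stand-in for a real model's prefill pass: fabricate one layer of
        # K/V whose length covers every token "seen" in the spliced input.
        n = (len(system_prompt.split()) + len(segment.split())
             + sum(kv[0][0].shape[0] for kv in spliced_caches))
        d = 8  # toy hidden size
        return [(np.random.randn(n, d), np.random.randn(n, d))]

    def build_enhanced_kv(segment: str, knowledge_base: List[str],
                          cache_db: Dict[str, KVCache], system_prompt: str,
                          k: int = 2) -> KVCache:
        similar = retrieve_topk(segment, knowledge_base, k)
        spliced = [cache_db[s] for s in similar]   # load pre-computed KVCaches
        # Splice system prompt + similar-segment caches + original segment,
        # then re-run prefill; the result is the "enhanced" KVCache.
        return prefill(system_prompt, spliced, segment)

    kb = ["alpha beta gamma", "alpha beta delta", "zeta eta"]
    db = {t: prefill("", [], t) for t in kb}       # pretend these were cached earlier
    enhanced = build_enhanced_kv("alpha beta", kb, db, "You are a helpful assistant.")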
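
The key-token selection of claim 3 reduces to a few lines of numpy under the formula stated there; the ratio value and tensor shapes below are illustrative assumptions, not values from the patent:

    import numpy as np

    def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def select_key_tokens(Q: np.ndarray, K: np.ndarray, ratio: float = 0.2) -> np.ndarray:
        # Q: (m, d) final-layer queries from the user question.
        # K: (L, d) final-layer keys of one cached text segment.
        d = Q.shape[-1]
        S = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (m, L) attention scores
        token_scores = S.sum(axis=0)                # column sum: one score per token
        n_keep = max(1, int(ratio * K.shape[0]))    # proportion threshold
        return np.argsort(token_scores)[::-1][:n_keep]  # indices of the key tokens

    # Example: 4 question tokens against a 10-token segment, hidden size 16.
    idx = select_key_tokens(np.random.randn(4, 16), np.random.randn(10, 16))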
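
Likewise, the step-by-step matching strategy of claim 5 can be sketched as a longest-prefix lookup with a substring fallback; the dictionary-keyed caches are a simplification of how real serving systems index KVCache entries:

    from typing import Dict, List, Tuple

    def stepwise_match(request: List[int],
                       global_prefix_cache: Dict[Tuple[int, ...], object],
                       enhanced_cache: Dict[Tuple[int, ...], object]):
        # Step 1: longest fully matching prefix in the global cache -> reuse directly.
        for end in range(len(request), 0, -1):
            prefix = tuple(request[:end])
            if prefix in global_prefix_cache:
                return "prefix-hit", prefix, global_prefix_cache[prefix]
        # Step 2: no exact prefix hit -> find a contiguous substring of the
        # request covered by an offline-enhanced KVCache and reuse that instead.
        for start in range(len(request)):
            for end in range(len(request), start, -1):
                sub = tuple(request[start:end])
                if sub in enhanced_cache:
                    return "substring-hit", sub, enhanced_cache[sub]
        return "miss", None, None  # fall back to a full prefill

Checking prefixes from longest to shortest guarantees the maximal reusable cache is found first; only when no exact prefix hit exists does the search fall back to the offline-enhanced substring caches.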

Description

KVCache multiplexing acceleration model pre-filling method and system

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a KVCache-multiplexing method and system for accelerating model prefill.

Background

RAG (Retrieval-Augmented Generation) is a hybrid reasoning framework that combines external knowledge retrieval with a generative model. Unlike conventional pure-generation or pure-retrieval systems, RAG introduces the cooperation of a Retriever and a Generator in the generation phase: the document fragments most relevant to an input query are first selected from a large-scale document library using vector retrieval or sparse retrieval, and those fragments are then fed into the model together with the original query as context, thereby supplying real-time external knowledge.

Attention is a computationally intensive neural network mechanism that, when processing sequence data, dynamically allocates computation based on the correlations between input positions. The features of each position are mapped into three representations, Query, Key, and Value; similarity scores between the query and all keys are computed, and the values are then weighted and summed with those scores to produce a new context representation. The attention calculation formula is:

    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

Large language models (Large Language Models, LLMs) are pre-trained models based on deep neural networks that acquire deep representations of language context through self-supervised learning over massive text data. Unlike conventional natural language processing systems built on rules or shallow feature engineering, LLMs typically have parameter counts on the order of billions or even trillions, enabling fast transfer to a variety of downstream tasks in zero-shot, few-shot, or fine-tuning settings. The core architecture of a large language model is the Transformer, which efficiently captures long-distance dependencies through multi-head self-attention (Multi-Head Self-Attention) and, by means of a layered encoder-decoder or decoder-only architecture, supports capabilities such as text generation, summarization, translation, and question answering. LLMs can be flexibly steered at inference time through prompt engineering (Prompt Engineering), completing new tasks without retraining, and can continuously absorb new knowledge through online learning and incremental updates, enhancing the model's sustained adaptability and extensibility.

Before a generation task begins, the model first performs a complete forward computation over the user-provided context text (the prompt), a process called Prefill (the fill phase).
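
Before turning to Prefill in detail, a worked numpy rendering of the attention formula above may help; the shapes and random inputs are illustrative only:

    import numpy as np

    def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                # query-key similarity
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)                  # softmax over keys
        return w @ V                                   # weighted sum of values

    # Five positions, hidden size 64: each row of the output is the new
    # context representation for one position.
    Q = K = V = np.random.randn(5, 64)
    ctx = attention(Q, K, V)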
The core effect of Prefill is to compute the self-attention key-value (KV) cache: for each existing token in the prompt, the Transformer computes the corresponding Key and Value at every layer through the self-attention module and stores them in the cache (KVCache), generating a context representation. At the same time, the model outputs the hidden states of each layer, providing rich context information for the subsequent generation stage. This is the most costly part of the one-pass computation: because prompts are often long, Prefill must perform a full attention calculation for every position, making it the most time-consuming step of inference. The processing speed of the Prefill stage therefore determines the time to first token (Time To First Token, TTFT). The Prefill-stage flow can be summarized as follows: given a text input X of length n with position index i, the Prefill stage processes X into the per-layer key-value pairs (K_i, V_i) that populate the KVCache.

Decoding stage: after Prefill completes, the model enters the decoding stage, i.e., the step-by-step generation of new tokens. Each time a new token is generated, the model only performs attention between the query at that position and the previously cached KV, without recomputing the whole context, so the per-step computation drops from O(n^2) to O(n) (n being the number of generated tokens). The newly generated token is converted into its corresponding KV and appended to the cache for the next decoding step, while the hidden state of the last layer is updated to track the evolving context. The Decode stage thus proceeds one time step at a time, as sketched at the end of this section.

Analysis shows that in high-knowledge-multiplexing scenarios such as RAG, a user's query often contains repeated or similar text fragments. The prior art attempts to speed up pre-filling by multiplexing KVCache
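
To make the Prefill/Decode split above concrete, the following numpy sketch builds the KV cache once during prefill and then, per decode step, projects only the new token and attends its single query against the cache. The projection matrices, shapes, and random data are all stand-ins, not code from the patent:

    import numpy as np

    d = 32
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = rng.standard_normal((3, d, d))  # stand-ins for trained weights

    def attend(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
        s = q @ K.T / np.sqrt(d)
        w = np.exp(s - s.max())                  # stable softmax over cached keys
        return (w / w.sum()) @ V

    # Prefill: one full pass over the embedded prompt X computes and caches
    # K/V for every existing position (the expensive full-attention step).
    X = rng.standard_normal((10, d))             # embedded prompt tokens
    K_cache, V_cache = X @ Wk, X @ Wv

    # Decode: each step projects only the newly generated token and attends
    # its single query against the cache -- O(n) per step instead of O(n^2).
    for _ in range(3):
        x_new = rng.standard_normal(d)           # embedding of the new token
        K_cache = np.vstack([K_cache, x_new @ Wk])   # append its Key
        V_cache = np.vstack([V_cache, x_new @ Wv])   # append its Value
        ctx = attend(x_new @ Wq, K_cache, V_cache)   # attend only this query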