CN-122019737-A - Dialogue method, device, equipment and medium based on KV cache and large language model
Abstract
The application discloses a dialogue method, device, equipment and medium based on a KV cache and a large language model, relating to the technical field of large models. The method comprises: acquiring and storing KV caches in real time, and generating semantic abstract vectors from the KV-cache screening results; determining a target score according to the occurrence records of semantic unit blocks, and deciding, based on the target score, whether to store the semantic unit blocks and semantic abstract vectors in a first database; storing a first KV cache in a second database; converting a user's new question into a query vector; retrieving in the first database the target semantic abstract vector whose semantic similarity with the query vector meets a target threshold; determining a second KV cache based on the target semantic abstract vector; splicing the second KV cache to obtain spliced data; adjusting the spliced data with a target lightweight neural network; and reasoning over the adjusted data with the large language model to obtain a response to the user's new question. Efficient, long-term reuse of conversation histories is thereby achieved.
Inventors
- YU JINJUN
- ZHANG YAN
- WANG JUN
Assignees
- 煜象科技(杭州)有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260415
Claims (11)
- 1. A dialogue method based on a KV cache and a large language model, characterized by comprising: acquiring and storing, in real time, the KV caches generated by each dialogue during large-language-model reasoning, screening the KV caches according to the attention weights of the large language model, and generating semantic abstract vectors from the corresponding screening results; determining a target score according to the occurrence records of semantic unit blocks in historical multi-turn dialogues, and determining, based on the target score, whether to store the semantic unit blocks and their corresponding semantic abstract vectors in a first database, wherein a semantic unit block represents a group of consecutive characters or words whose vector cosine similarity meets a preset threshold; storing a first KV cache that meets a preset condition into a second database, converting a user's new question into a query vector, retrieving in the first database a target semantic abstract vector whose semantic similarity with the query vector meets a target threshold, and determining a second KV cache based on the target semantic abstract vector, wherein the query vector has the same dimension as the semantic abstract vectors; and splicing the second KV cache to obtain corresponding spliced data, adjusting the spliced data with a target lightweight neural network, and reasoning over the adjusted data with the large language model to obtain a response to the user's new question.
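The four claimed steps can be pictured with a toy end-to-end sketch (not the patented implementation; the in-memory databases, the mean-pooled summary, and the similarity threshold below are simplified stand-ins for illustration only):

```python
import numpy as np

# Toy stand-ins for the first database (semantic abstract vectors) and the
# per-round KV store; a real system would use a vector index and a KV cache store.
kv_db, vec_db = {}, {}

def store_round(round_id, kv, attn, attn_threshold=0.5):
    """Screen a round's KV entries by attention weight and store a summary."""
    kept = kv[attn > attn_threshold]          # screening by attention weight
    kv_db[round_id] = kept
    vec_db[round_id] = kept.mean(axis=0)      # toy semantic abstract vector

def retrieve_spliced(query_vec, target_threshold=0.9):
    """Retrieve rounds whose summary matches the query, splice their KV."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    hits = [r for r, v in vec_db.items() if cos(query_vec, v) >= target_threshold]
    # The spliced data would next be adjusted by the lightweight network
    # and handed to the LLM decoder for reasoning.
    return np.concatenate([kv_db[r] for r in hits], axis=0)

store_round(0, np.array([[1.0, 0.0], [0.9, 0.1]]), np.array([0.9, 0.8]))
store_round(1, np.array([[0.0, 1.0], [0.1, 0.9]]), np.array([0.9, 0.8]))
spliced = retrieve_spliced(np.array([1.0, 0.0]))  # matches round 0 only
```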
- 2. The dialogue method based on a KV cache and a large language model according to claim 1, wherein acquiring and storing, in real time, the KV caches generated by each dialogue during large-language-model reasoning comprises: acquiring the KV caches of each output layer of the large language model in real time through a hook function; and serializing the KV caches into binary files keyed by dialogue round ID and output layer number, and storing the binary files in a memory buffer.
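As an illustrative sketch of this capture-and-store step: in a PyTorch-style deployment the callback below would be registered as a forward hook on each decoder layer (e.g., via `register_forward_hook`); here it is invoked directly, and only the (round ID, layer number) keying and binary serialization from the claim are shown. All names are hypothetical.

```python
import pickle
import numpy as np

class KVCacheBuffer:
    """In-memory buffer of serialized KV caches, keyed by
    (dialogue round ID, output layer number)."""

    def __init__(self):
        self._buf = {}

    def on_layer_output(self, round_id, layer_no, keys, values):
        # In a real system this method would run inside a forward hook
        # attached to the model's decoder layers.
        blob = pickle.dumps({"K": keys, "V": values})  # serialize to binary
        self._buf[(round_id, layer_no)] = blob

    def load(self, round_id, layer_no):
        data = pickle.loads(self._buf[(round_id, layer_no)])
        return data["K"], data["V"]

buf = KVCacheBuffer()
K = np.arange(8.0).reshape(2, 4)   # 2 tokens, head dim 4
V = -K
buf.on_layer_output(round_id=0, layer_no=3, keys=K, values=V)
K2, V2 = buf.load(0, 3)            # round-trips intact
```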
- 3. The dialogue method based on a KV cache and a large language model according to claim 1, wherein screening the KV caches according to the attention weights of the large language model and generating semantic abstract vectors from the corresponding screening results comprises: determining the attention weights from the attention matrix of the last decoder layer of the large language model; scoring the value of each token in the KV cache according to the attention weights to obtain a corresponding scoring result; screening the KV cache according to the relation between the scoring result and a first threshold to obtain a corresponding screening result; and flattening the screening result into a one-dimensional vector, compressing the one-dimensional vector into a low-dimensional vector through an autoencoder, and taking the low-dimensional vector as the semantic abstract vector.
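A sketch of this screening-and-compression step. A fixed random projection stands in for the claim's trained autoencoder, and the per-token scores would in practice be derived from the last decoder layer's attention matrix; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_abstract_vector(kv, attn_scores, first_threshold, out_dim=4):
    """Screen KV rows whose attention-based score exceeds the first
    threshold, flatten the result, and compress it to a low dimension."""
    screened = kv[attn_scores > first_threshold]    # screening result
    flat = screened.flatten()                       # one-dimensional vector
    # Stand-in for the autoencoder's encoder: a fixed linear projection.
    proj = rng.standard_normal((flat.size, out_dim))
    return screened, flat @ proj                    # semantic abstract vector

kv = rng.standard_normal((6, 3))                    # 6 tokens, dim 3
scores = np.array([0.9, 0.1, 0.8, 0.05, 0.7, 0.2])  # attention-based scores
screened, abstract_vec = make_abstract_vector(kv, scores, first_threshold=0.5)
```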
- 4. The dialogue method based on a KV cache and a large language model according to claim 1, wherein determining the target score according to the occurrence records of semantic unit blocks in historical multi-turn dialogues comprises: determining the occurrence frequency, weight mean and weight variance of a semantic unit block from its occurrence records in historical multi-turn dialogues, wherein the weight mean is the mean of the attention weights recorded at each past occurrence of the semantic unit block; determining a target sum of 1 and the weight variance; determining a target ratio of the weight mean to the target sum; and determining the product of the occurrence frequency and the target ratio as the target score; correspondingly, determining, based on the target score, whether to store the semantic unit block and its corresponding semantic abstract vector in the first database comprises: storing the semantic unit block and its corresponding semantic abstract vector in the first database if the target score is greater than the second threshold.
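The scoring rule in this claim reduces to a one-line formula, target score = frequency × mean / (1 + variance), so blocks that recur often with consistently high attention score highest. A minimal sketch (the second-threshold value is a hypothetical choice for illustration):

```python
import statistics

def target_score(occurrence_weights):
    """occurrence_weights: attention weights recorded each time the
    semantic unit block appeared in historical multi-turn dialogues."""
    freq = len(occurrence_weights)                  # occurrence frequency
    mean = statistics.fmean(occurrence_weights)     # weight mean
    var = statistics.pvariance(occurrence_weights)  # weight variance
    return freq * (mean / (1 + var))                # target score

SECOND_THRESHOLD = 2.0  # hypothetical value

def should_store(weights):
    return target_score(weights) > SECOND_THRESHOLD

stable = [0.8, 0.8, 0.8]    # frequent, consistently high attention
erratic = [0.8, 0.1, 0.9]   # same frequency, noisy attention
```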
- 5. The dialogue method based on a KV cache and a large language model according to claim 3, wherein storing the first KV cache that meets the preset condition into the second database comprises: storing, in a circular-queue structure, the first KV cache whose scoring result is greater than the first threshold among the KV caches of a target number of dialogue rounds into the second database.
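The circular-queue behavior can be sketched with `collections.deque(maxlen=N)`: once the target round count is exceeded, entries from the oldest dialogue round are evicted automatically (the round count and threshold below are hypothetical):

```python
from collections import deque

TARGET_ROUNDS = 3       # keep first-KV entries for the newest 3 rounds
FIRST_THRESHOLD = 0.5

second_db = deque(maxlen=TARGET_ROUNDS)   # circular-queue structure

def store_first_kv(round_id, kv_entries, scores):
    """Keep only entries whose scoring result exceeds the first threshold."""
    kept = [kv for kv, s in zip(kv_entries, scores) if s > FIRST_THRESHOLD]
    second_db.append((round_id, kept))    # oldest round drops off the queue

for r in range(5):
    store_first_kv(r, [f"kv{r}-a", f"kv{r}-b"], [0.9, 0.1])
```

After five rounds only rounds 2–4 remain, each holding the single entry that cleared the threshold.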
- 6. The dialogue method based on a KV cache and a large language model according to claim 1, wherein retrieving in the first database the target semantic abstract vector whose semantic similarity with the query vector meets the target threshold comprises: searching the first database, through an approximate nearest-neighbor search algorithm, for the target semantic abstract vector whose semantic similarity with the query vector meets the target threshold.
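In practice this retrieval would use an ANN index (e.g., HNSW- or IVF-based); for illustration, an exact cosine-similarity search over the stored semantic abstract vectors behaves identically on small data:

```python
import numpy as np

def retrieve(query_vec, abstract_vecs, target_threshold):
    """Return indices (and similarities) of the semantic abstract vectors
    whose cosine similarity with the query vector meets the target threshold."""
    q = query_vec / np.linalg.norm(query_vec)
    A = abstract_vecs / np.linalg.norm(abstract_vecs, axis=1, keepdims=True)
    sims = A @ q                                   # cosine similarities
    return np.flatnonzero(sims >= target_threshold), sims

abstract_vecs = np.array([[1.0, 0.0],
                          [0.0, 1.0],
                          [0.9, 0.1]])
idx, sims = retrieve(np.array([1.0, 0.0]), abstract_vecs, target_threshold=0.8)
```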
- 7. The dialogue method based on a KV cache and a large language model according to claim 1, wherein determining the second KV cache based on the target semantic abstract vector comprises: querying the associated screening-result ID in a metadata table based on the ID of the target semantic abstract vector, and determining the second KV cache from the first database or the second database based on the screening-result ID.
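A minimal sketch of the two-step lookup (the IDs and table layout are hypothetical): the target abstract vector's ID resolves through a metadata table to a screening-result ID, which locates the second KV cache in whichever database holds it.

```python
# Metadata table: semantic-abstract-vector ID -> screening-result ID.
metadata = {"vec-7": "scr-42", "vec-8": "scr-99"}

# Toy first and second databases, keyed by screening-result ID.
first_db = {"scr-99": ("K99", "V99")}
second_db = {"scr-42": ("K42", "V42")}

def lookup_second_kv(vector_id):
    """Resolve a target abstract vector's ID to its second KV cache."""
    scr_id = metadata[vector_id]
    if scr_id in first_db:
        return first_db[scr_id]
    return second_db[scr_id]
```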
- 8. The dialogue method based on a KV cache and a large language model according to any one of claims 1 to 7, wherein splicing the second KV cache to obtain the corresponding spliced data comprises: splicing the second KV cache according to its token order to obtain the corresponding spliced data; correspondingly, adjusting the spliced data with the target lightweight neural network and reasoning over the adjusted data with the large language model to obtain the response to the user's new question comprises: adjusting the spliced data with the target lightweight neural network, and storing the adjusted data in a decoder KV buffer of the large language model, so that the large language model generates the response to the user's new question based on the adjusted data and the tokens of the user's new question.
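A sketch of this final step: retrieved KV segments are spliced in token order and passed through a small adjustment network, with a single near-identity linear layer standing in for the target lightweight neural network (an assumption, not the patented architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4  # per-token KV dimension (illustrative)

# Stand-in for the target lightweight neural network: one linear layer
# initialized near the identity, so adjustment is a small perturbation.
W = np.eye(D) + 0.1 * rng.standard_normal((D, D))
b = np.zeros(D)

def splice_and_adjust(kv_segments):
    """kv_segments: list of (n_i, D) arrays, already in token order."""
    spliced = np.concatenate(kv_segments, axis=0)  # splice along token axis
    adjusted = spliced @ W + b                     # lightweight adjustment
    return adjusted  # would be written into the decoder's KV buffer

segments = [rng.standard_normal((2, D)), rng.standard_normal((3, D))]
adjusted = splice_and_adjust(segments)
```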
- 9. A dialogue device based on a KV cache and a large language model, characterized by comprising: a cache screening module for acquiring and storing, in real time, the KV caches generated by each dialogue during large-language-model reasoning, screening the KV caches according to the attention weights of the large language model, and generating semantic abstract vectors from the corresponding screening results; a vector storage module for determining a target score according to the occurrence records of semantic unit blocks in historical multi-turn dialogues, and determining, based on the target score, whether to store the semantic unit blocks and their corresponding semantic abstract vectors in a first database; a second-KV-cache determining module for storing a first KV cache that meets a preset condition into a second database, converting a user's new question into a query vector, retrieving in the first database a target semantic abstract vector whose semantic similarity with the query vector meets a target threshold, and determining a second KV cache based on the target semantic abstract vector, the query vector having the same dimension as the semantic abstract vectors; and a reasoning module for splicing the second KV cache to obtain corresponding spliced data, adjusting the spliced data with a target lightweight neural network, and reasoning over the adjusted data with the large language model to obtain a response to the user's new question.
- 10. An electronic device, comprising: a memory for storing a computer program; and a processor for executing the computer program to implement the steps of the dialogue method based on a KV cache and a large language model as claimed in any one of claims 1 to 8.
- 11. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the dialogue method based on a KV cache and a large language model according to any one of claims 1 to 8 are implemented.
Description
Dialogue method, device, equipment and medium based on KV cache and large language model
Technical Field
The invention relates to the technical field of large models, and in particular to a dialogue method, device, equipment and medium based on a KV cache and a large language model.
Background
During autoregressive reasoning of a large language model (LLM), such as the GPT (Generative Pre-trained Transformer) series or the LLaMA series, the model generates output token by token through a Transformer decoder. To improve efficiency and avoid recomputing history tokens, the model buffers the key vectors (Key) and value vectors (Value) of the generated tokens in memory, forming a "KV cache". As LLM application scenarios expand to long contexts (e.g., 100,000+ tokens), the memory occupation of KV caches (a single session can reach tens of gigabytes) and their computational redundancy become bottlenecks. Existing optimization techniques fall mainly into three categories: memory compression, i.e., reducing cache volume through quantization (e.g., FP16 → INT8), pruning (removing the KV entries of tokens with low attention weight) and sharing (multi-head attention KV dimension-reduction sharing); cache reuse, i.e., caching the KV of repeatedly occurring text prefixes (such as system prompts and fixed instructions) to avoid repeated computation ("prefix caching"); and state management, i.e., moving temporarily inactive caches to external storage and loading them on demand (offloading). However, current schemes stay at the level of "optimizing the KV cache associated with text symbols" and cannot solve problems such as "similar semantics but different texts", "long-context state redundancy" and "preserving state freshness across sessions".
Therefore, how to overcome the prior-art defects of operating on text symbols rather than on computation state, static summaries that ignore attention value, non-semantic long-term storage, and reuse that depends on complete caches, and finally achieve intelligent long-term reuse of computation state, is a problem to be solved urgently.
Disclosure of Invention
In view of the above, the present invention aims to provide a dialogue method, device, equipment and medium based on a KV cache and a large language model, which can overcome the prior-art defects of "operating on text symbols rather than computation state", "static summaries ignoring attention value", "non-semantic long-term storage" and "reuse depending on complete caches" by reconstructing the management object, storage strategy and reuse mode of the KV cache, finally achieving intelligent long-term reuse of computation state. The specific scheme is as follows.
In a first aspect, the application discloses a dialogue method based on a KV cache and a large language model, comprising the following steps: acquiring and storing, in real time, the KV caches generated by each dialogue during large-language-model reasoning, screening the KV caches according to the attention weights of the large language model, and generating semantic abstract vectors from the corresponding screening results; determining a target score according to the occurrence records of semantic unit blocks in historical multi-turn dialogues, and determining, based on the target score, whether to store the semantic unit blocks and their corresponding semantic abstract vectors in a first database, wherein a semantic unit block represents a group of consecutive characters or words whose vector cosine similarity meets a preset threshold; storing a first KV cache that meets a preset condition into a second database, converting a user's new question into a query vector, retrieving in the first database a target semantic abstract vector whose semantic similarity with the query vector meets a target threshold, and determining a second KV cache based on the target semantic abstract vector, wherein the query vector has the same dimension as the semantic abstract vectors; and splicing the second KV cache to obtain spliced data, adjusting the spliced data with a target lightweight neural network, and reasoning over the adjusted data with the large language model to obtain a response to the user's new question. Optionally, acquiring and storing, in real time, the KV caches generated by each dialogue during large-language-model reasoning includes: acquiring the KV caches of each output layer of the large language model in real time through a hook function; and serializing the KV caches into binary files keyed by dialogue round ID and output layer number, and storing the binary files in a memory buffer.