CN-121478678-B - Hybrid KV cache management method and device, electronic equipment and storage medium
Abstract
The application provides a hybrid KV cache management method, a hybrid KV cache management device, an electronic device, and a storage medium. The hybrid KV cache management method addresses the problem that video memory runs short during long-sequence inference and degrades inference efficiency; by jointly using storage media such as GPU video memory and CPU memory for a hybrid KV cache, it improves video-memory utilization, thereby optimizing video-memory usage during hybrid KV cache management and improving inference efficiency. A training set is constructed from access data with time-series characteristics of KV cache blocks during long-sequence inference of a large model, an LSTM model is trained on it, and the LSTM model's predictions then guide cache-management decisions (such as offloading and loading). The method can learn from historical data how to dynamically adjust the cache-management strategy, thereby optimizing video-memory usage during hybrid KV caching in long-sequence inference of large language models.
Inventors
- LIU XIAOYU
- KONG LIJUAN
- MEI FEI
- CHENG WEN
- Teng Huigang
- YU BO
- ZHU CHUNJIE
- HE SHUIBING
- ZENG LINGFANG
Assignees
- 之江实验室 (Zhejiang Lab)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-06
Claims (9)
- 1. A hybrid KV cache management method, characterized by comprising the following steps: extracting time-series features from historical data on accesses to each KV cache block over a historical time period during long-sequence inference of a large model, wherein the cache blocks are the KV cache units of the large model's inference framework in each time window; constructing a data set for an LSTM model based on the time-series features, wherein a label of the LSTM model indicates whether a cache block is accessed again in the next time window, a value of 1 indicating that the cache block is accessed in the next time window and a value of 0 indicating that it is not, the data set is time-series data comprising a training set and a verification set, and the time-series features comprise at least one of: the time at which the feature was collected, the access type, the amount of data involved in the current access, the logical distance between the current page and the last accessed page, the accumulated residence time of the current page in video memory, and a real-time system load index; training the LSTM model using the training set; deploying the trained LSTM model in the inference framework of the large model, where it determines whether KV data will be accessed in the next time period so as to guide long-sequence inference cache decisions; acquiring access data of a user having time-series characteristics, wherein the access data comprises access information of KV data corresponding to at least one token, and the access data comprises a long sequence formed by a document plus instructions and/or dialogue inference instructions; and inputting first KV data corresponding to part of the tokens into the trained LSTM model so that the LSTM model determines whether the first KV data will be accessed in the next time period, and, according to the label, storing the first KV data in video memory if it will be accessed or in CPU memory if it will not, such that, of the first KV data and second KV data used for the KV cache, the KV data that will be accessed in the next time period is preferentially stored in video memory.
- 2. The hybrid KV cache management method of claim 1, wherein evaluating the trained LSTM model using the verification set comprises: using the verification set to evaluate whether the access-prediction accuracy of the trained LSTM model on KV data reaches a target accuracy; and, if not, continuing to train the model until the target accuracy is reached, thereby obtaining the trained LSTM model.
- 3. The hybrid KV cache management method of claim 2, wherein using the verification set to evaluate whether the access-prediction accuracy of the trained LSTM model on KV data reaches the target accuracy comprises: predicting, with the LSTM model, the access status of the cache in the next time window from the historical data of a first time period within the historical time period, wherein a time window is one time step; and comparing the predicted access status with the actual access status over the same time period in the verification set to evaluate whether the access-prediction accuracy of the trained LSTM model on KV data reaches the target accuracy.
- 4. The hybrid KV cache management method of claim 1, wherein the access data comprises a long sequence formed by a document plus instructions.
- 5. The hybrid KV cache management method of claim 1, wherein the access data comprises a dialogue inference instruction.
- 6. The hybrid KV cache management method according to any one of claims 1 to 3, further comprising, before training the LSTM model using the training set: determining structural parameters of the LSTM model, wherein the structural parameters comprise the hidden-layer dimension, the number of stacked layers, the use of a bidirectional LSTM structure, the output layer, and the activation function; and determining hyperparameters of the LSTM model, wherein the hyperparameters comprise the loss function, the activation function, the optimizer, and the learning rate, the loss function being a binary cross-entropy loss function, the activation function being the sigmoid function, the optimizer being the Adam algorithm, and the learning rate being an exponentially decaying learning rate.
- 7. A hybrid KV cache management device, characterized by comprising: a data collection module, configured to extract time-series features from historical data on accesses to each KV cache block over a historical time period during long-sequence inference of a large model, wherein the cache blocks are the KV cache units of the large model's inference framework in each time window, and to construct a data set for an LSTM model based on the time-series features, wherein a label of the LSTM model is a classification value indicating whether a cache block is accessed again in the next time window, a value of 1 indicating that the cache block is accessed in the next time window and a value of 0 indicating that it is not, the data set is time-series data comprising a training set and a verification set, and the time-series features comprise at least one of: the time at which the feature was collected, the access type, the amount of data involved in the current access, the logical distance between the current page and the last accessed page, the accumulated residence time of the current page in video memory, and a real-time system load index; a training module, configured to train the LSTM model using the training set; a model evaluation module, configured to evaluate the trained LSTM model using the verification set to obtain the trained LSTM model, which is deployed in the inference framework of the large model to determine whether KV data will be accessed in the next time period and thereby guide long-sequence inference cache decisions; an access data collection module, configured to obtain access data of a user having time-series characteristics, wherein the access data comprises access information of KV data corresponding to at least one token; and a cache management module, configured to input first KV data corresponding to part of the tokens into the trained LSTM model so that the LSTM model determines whether the first KV data will be accessed in the next time period, and, according to the label, to store the first KV data in video memory if it will be accessed or in CPU memory if it will not, such that, of the first KV data and second KV data used for the KV cache, the KV data that will be accessed in the next time period is preferentially stored in video memory.
- 8. An electronic device comprising one or more processors configured to implement the hybrid KV cache management method of any of claims 1 to 6.
- 9. A computer-readable storage medium, having stored thereon a program which, when executed by a processor, implements the hybrid KV cache management method according to any one of claims 1 to 6.
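The cache-placement step recited in claims 1 and 7 can be illustrated with a minimal sketch. All names here are hypothetical: `predict` stands in for the trained LSTM's access-probability output, and plain dictionaries stand in for GPU video memory and CPU memory.

```python
def place_kv_blocks(blocks, predict, video_memory, cpu_memory, video_capacity):
    """Place each KV cache block according to the predicted probability that
    it will be accessed in the next time window: blocks predicted to be
    accessed are preferentially kept in (capacity-limited) video memory,
    while the rest are offloaded to CPU memory."""
    # Rank blocks so that those most likely to be accessed next are placed first.
    ranked = sorted(blocks, key=predict, reverse=True)
    for block_id in ranked:
        if predict(block_id) >= 0.5 and len(video_memory) < video_capacity:
            video_memory[block_id] = "GPU"   # keep hot KV data resident
            cpu_memory.pop(block_id, None)
        else:
            cpu_memory[block_id] = "CPU"     # offload cold KV data
            video_memory.pop(block_id, None)
    return video_memory, cpu_memory
```

In a real inference framework the dictionaries would be replaced by actual device/host buffers and the offload/load would be asynchronous copies; the ranking-plus-threshold logic above only illustrates the decision rule.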
Description
Hybrid KV cache management method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of storage management technologies, and in particular to a hybrid KV cache management method and device, an electronic device, and a storage medium.

Background

With the widespread use of large models, the inference process often faces bottlenecks in computing resources and storage bandwidth due to the enormous number of parameters and the associated computational demands. In large-model inference in the related art, particularly for long-sequence tasks, the required amount of key-value (KV) cache grows dramatically as the sequence length increases. However, because the video memory capacity of the GPU (Graphics Processing Unit) is limited, long-sequence inference often cannot load all data at once, resulting in insufficient video memory and reduced inference efficiency.

Disclosure of Invention

The application provides a hybrid KV cache management method, a hybrid KV cache management device, an electronic device, and a storage medium.
The application provides a hybrid KV cache management method comprising the following steps: acquiring, as a data set, historical data on accesses to each KV cache block over a historical time period during long-sequence inference of a large model, wherein the data set is time-series data comprising a training set and a verification set; training an LSTM model using the training set; evaluating the trained LSTM model using the verification set to obtain the trained LSTM model, which is deployed in the inference framework of the large model and determines whether KV data will be accessed in the next time period so as to guide long-sequence inference cache decisions; obtaining access data of a user having time-series characteristics, wherein the access data comprises access information of KV data corresponding to at least one token; and inputting first KV data corresponding to part of the tokens into the trained LSTM model so that the LSTM model determines whether the first KV data will be accessed in the next time period, and, according to the label, storing the first KV data in video memory if it will be accessed or in CPU memory if it will not, such that KV data accessed in the next time period is preferentially stored in video memory, improving video-memory utilization. Further, evaluating the trained LSTM model using the verification set to obtain the trained LSTM model comprises: using the verification set of the data set to evaluate whether the access-prediction accuracy of the trained LSTM model on KV data reaches a target accuracy; and, if not, continuing to train the model until the target accuracy is reached, thereby obtaining the trained LSTM model.
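The train-until-target-accuracy loop described above can be sketched as follows. This is a simplified illustration with hypothetical names: `train_one_epoch` and `predict_window` stand in for the LSTM training step and the model's next-window access prediction.

```python
def prediction_accuracy(predicted, actual):
    """Fraction of cache blocks whose predicted next-window access label
    (0/1) matches the actual access recorded in the verification set."""
    assert len(predicted) == len(actual) and actual
    correct = sum(int(p == a) for p, a in zip(predicted, actual))
    return correct / len(actual)

def train_until_target(model, train_one_epoch, predict_window,
                       verification_set, target_accuracy, max_epochs=100):
    """Keep training until the access-prediction accuracy on the
    verification set reaches the target accuracy."""
    for _ in range(max_epochs):
        predicted = [predict_window(model, features)
                     for features, _ in verification_set]
        actual = [label for _, label in verification_set]
        if prediction_accuracy(predicted, actual) >= target_accuracy:
            return model  # trained LSTM model
        model = train_one_epoch(model)
    return model
```

The `max_epochs` cap is an assumption added here so the loop always terminates; the patent only specifies "continue training until the target accuracy is reached".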
Further, using the verification set to evaluate whether the access-prediction accuracy of the trained LSTM model on KV data reaches the target accuracy comprises: predicting, with the LSTM model, the access status of the cache in the next time window from the historical data of a first time period within the historical time period; and comparing the predicted access status with the actual access status over the same time period in the verification set to evaluate whether the access-prediction accuracy of the trained LSTM model on KV data reaches the target accuracy. Further, obtaining the data set comprises: extracting time-series features from the historical data on accesses to each KV cache block over a historical time period during long-sequence inference of the large model, wherein a label of the LSTM model indicates whether a cache block is accessed again in the next time window; and constructing the data set of the LSTM model based on the time-series features. Further, the time-series features comprise at least one of: the time at which the feature was collected, the access type, the amount of data involved in the current access, the logical distance between the current page and the last accessed page, the accumulated residence time of the current page in video memory, and a real-time system load index. Further, the access data comprises a long sequence formed by a document plus instructions, and/or the access data comprises dialogue inference instructions. Further, before training the LSTM model using the training set, the method further comprises: determining structural parameters of the LSTM model, wherein the structural parameters comprise the hidden-layer dimension, the number of stacked layers, the use of a bidirectional LSTM structure, the output layer, and the activation function; The method comprises the step
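The training hyperparameters named in claim 6 (binary cross-entropy loss, sigmoid output activation, exponentially decaying learning rate for the Adam optimizer) can be written out explicitly. This is a generic sketch of those standard components, not the patented implementation:

```python
import math

def sigmoid(x):
    """Sigmoid activation: maps the LSTM output to an access probability."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_cross_entropy(p, y, eps=1e-12):
    """Binary (two-class) cross-entropy loss for a label y in {0, 1} and a
    predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp for numerical stability
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def exponential_decay_lr(base_lr, decay_rate, step, decay_steps):
    """Exponentially decaying learning rate schedule, as commonly paired
    with the Adam optimizer."""
    return base_lr * decay_rate ** (step / decay_steps)
```

For example, with `base_lr=0.01`, `decay_rate=0.5`, and `decay_steps=100`, the learning rate halves every 100 steps.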