CN-121979810-A - Key-value cache compression method and electronic device
Abstract
The application relates to the technical field of large language models and provides a key-value cache compression method and an electronic device. The key-value cache compression method comprises: determining the behavior mode, in the decoding stage, of each attention head in a trained multimodal large language model; determining a target cache budget for each attention head according to its behavior mode; and performing key-value cache compression on each attention head according to its target cache budget and behavior mode. With the key-value cache compression provided by the application, different budget-determination methods and different compression methods can be adopted for attention heads with different behavior modes, which improves the effect of key-value cache compression on the multimodal large language model and thereby improves the model's performance.
Inventors
- LI HUAN
- ZENG BOWEN
- REN FEIYANG
- ZHANG JUN
- LUO XINYUAN
- CHEN GANG
- CHEN KE
- SHOU LIDAN
Assignees
- Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security (杭州高新区(滨江)区块链与数据安全研究院)
- Zhejiang University (浙江大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-25
Claims (10)
- 1. A key-value cache compression method, characterized by comprising the following steps: determining the behavior mode, in the decoding stage, of each attention head in a trained multimodal large language model; determining, according to the behavior mode corresponding to each attention head, the target cache budget corresponding to that attention head; and, for each attention head, performing key-value cache compression on the attention head according to the target cache budget and the behavior mode corresponding to it.
- 2. The method of claim 1, wherein the behavior modes include a static mode and a dynamic mode, and wherein, for each attention head, performing key-value cache compression on the attention head according to its corresponding target cache budget and behavior mode comprises: for each attention head, if its behavior mode is the static mode, performing key-value cache compression on the attention head according to its target cache budget and a first key-value cache compression strategy corresponding to the static mode; and if its behavior mode is the dynamic mode, performing key-value cache compression on the attention head according to its target cache budget and a second key-value cache compression strategy corresponding to the dynamic mode.
- 3. The method of claim 2, wherein determining the behavior mode, in the decoding stage, of each attention head in the trained multimodal large language model comprises: for each attention head, determining the text-centric attention sparsity of the attention head in the pre-filling stage, and determining the behavior mode corresponding to the attention head according to that attention sparsity.
- 4. The method of claim 3, wherein determining the text-centric attention sparsity of the attention head in the pre-filling stage comprises: in the pre-filling stage, inputting a preset text to the attention head; taking the preset text as the query, computing the attention distribution of the attention head over each key vector of the preset text; and determining the attention sparsity of the attention head according to that attention distribution.
- 5. The method of claim 4, wherein determining the attention sparsity of the attention head based on its attention distribution over each key vector of the preset text comprises: for each key vector, screening out the N key vectors corresponding to that key vector according to the attention distribution, the N key vectors being those whose attention distribution scores rank in the top N among all key vectors other than that key vector, with N > 0; determining a second attention distribution score for that key vector according to the first attention distribution score of each of the N key vectors; and determining the attention sparsity of the attention head according to the second attention distribution scores of the key vectors.
- 6. The method of claim 3, wherein determining the target cache budget corresponding to each attention head according to its behavior mode comprises: determining an average cache budget per attention head according to a preset total cache budget and the number of attention heads in the multimodal large language model; determining a first cache budget sum for all first attention heads and a second cache budget sum for all second attention heads according to the total cache budget, the average cache budget, a sharing coefficient, and the number of first attention heads, wherein a first attention head is one whose behavior mode is the static mode and a second attention head is one whose behavior mode is the dynamic mode; for each first attention head, determining its target cache budget according to the first cache budget sum, the number of first attention heads, and its attention sparsity; and, for each second attention head, determining its target cache budget according to the second cache budget sum and the number of second attention heads.
- 7. The method according to any one of claims 2 to 6, wherein performing key-value cache compression on the attention head according to its target cache budget and the first key-value cache compression strategy corresponding to the static mode comprises: determining, for each key vector corresponding to the attention head, whether the key vector is a target key vector, wherein the target key vectors include any one or more of: all tokens in a preset observation window, all text tokens in the historical context, and the visual tokens whose relevance ranks in the top M in the historical context, with M > 0; and performing key-value cache compression on the attention head according to its target cache budget and the target key vectors.
- 8. The method of claim 7, wherein performing key-value cache compression on the attention head according to its target cache budget and the second key-value cache compression strategy corresponding to the dynamic mode comprises: dividing the key-value cache sequence of the attention head into a plurality of subsequences, determining the mean of all key vectors of each subsequence, and generating metadata for each subsequence from that mean; and migrating the key-value cache sequence of the attention head from a current processor to a preset target processor while storing an index for each subsequence on the current processor, wherein the current processor is the processor running the multimodal large language model, the target processor is the processor storing the key-value cache sequence, and the index of a subsequence is formed from its metadata.
- 9. The method of claim 8, wherein, after performing key-value cache compression on each attention head according to its target cache budget and behavior mode, the method further comprises: at each decoding step of the multimodal large language model, determining the first target subsequence with the highest relevance to the current query vector, and loading a plurality of second target subsequences, including the first target subsequence, from the target processor into the current processor according to the target cache budget of each attention head; and performing attention computation through the multimodal large language model on each target key vector and each second target subsequence to generate a target vector.
- 10. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the key value cache compression method of any one of claims 1 to 9.
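The chunked offloading of claims 8 and 9 can be sketched as follows. This is a minimal NumPy illustration for a single attention head, not the patent's implementation: the chunk size, the dot-product relevance score, and the function names (`chunk_kv_cache`, `load_relevant_chunks`) are assumptions for the sake of the example; the patent does not specify how chunk relevance is scored.

```python
import numpy as np

def chunk_kv_cache(keys, values, chunk_size):
    """Split one head's KV cache into subsequences and build per-chunk
    metadata (the mean of each chunk's key vectors), per claim 8."""
    chunks, index = [], []
    for start in range(0, len(keys), chunk_size):
        k = keys[start:start + chunk_size]
        v = values[start:start + chunk_size]
        chunks.append((k, v))          # would be migrated to the target processor
        index.append(k.mean(axis=0))   # metadata kept on the current processor
    return chunks, np.stack(index)

def load_relevant_chunks(query, chunks, index, budget_chunks):
    """At each decoding step, score chunks by query-metadata similarity
    (dot product here, an assumption) and load back the top-scoring
    subsequences within the head's budget, per claim 9."""
    scores = index @ query                   # relevance of each chunk
    top = np.argsort(scores)[::-1][:budget_chunks]
    return [chunks[i] for i in sorted(top)]  # preserve sequence order
```

In a real system the chunk list would live on the target processor (e.g., host memory) and only the small metadata index would stay in accelerator memory, so the per-step cost is one small matrix-vector product plus a bounded transfer.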
Description
Key-value cache compression method and electronic device
Technical Field
The application belongs to the technical field of large language models, and in particular relates to a key-value cache compression method and an electronic device.
Background
With the rapid development of multimodal large language models, significant progress has been made in their performance on inference tasks involving text, images, and video. However, when performing inference, a multimodal large language model faces challenges from the rapid growth of the key-value (KV) cache. Specifically, during inference each visual input (e.g., an image or a video) is expanded into thousands of tokens, so the key-value cache grows with the context and continuously occupies processor (e.g., graphics processor) memory during decoding. Some key-value cache compression methods have been proposed to address this problem; however, these methods only consider cache budget allocation at different granularities. They neither adopt different budget-determination methods for attention heads with different behavior modes nor apply different compression methods to such heads, so their compression effect is poor and the performance of the multimodal large language model is reduced.
Disclosure of Invention
In view of the above, embodiments of the application provide a key-value cache compression method and an electronic device to address the problem of the low performance of multimodal large language models in the prior art.
In a first aspect, an embodiment of the present application provides a key-value cache compression method, including: determining the behavior mode, in the decoding stage, of each attention head in a trained multimodal large language model; determining, according to the behavior mode corresponding to each attention head, the target cache budget corresponding to that attention head; and, for each attention head, performing key-value cache compression on the attention head according to its target cache budget and behavior mode.
Optionally, the behavior modes include a static mode and a dynamic mode, and performing key-value cache compression on each attention head according to its target cache budget and behavior mode includes: for each attention head, if its behavior mode is the static mode, performing key-value cache compression according to its target cache budget and a first key-value cache compression strategy corresponding to the static mode; and if its behavior mode is the dynamic mode, performing key-value cache compression according to its target cache budget and a second key-value cache compression strategy corresponding to the dynamic mode.
Optionally, determining the behavior mode of each attention head in the trained multimodal large language model in the decoding stage includes: for each attention head, determining its text-centric attention sparsity in the pre-filling stage, and determining its behavior mode according to that attention sparsity.
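The budget allocation described above (and detailed in claim 6) fixes a total budget, splits it between static-mode and dynamic-mode heads via a sharing coefficient, and then weights static heads by their attention sparsity. The patent excerpt does not give the formulas, so the sketch below is one plausible instantiation: the sharing coefficient `alpha`, the sparsity-proportional weighting for static heads, and the even split for dynamic heads are all assumptions for illustration.

```python
def allocate_budgets(total_budget, sparsities, is_static, alpha=0.5):
    """Illustrative reading of claim 6's budget split.

    sparsities[i] -- head i's text-centric attention sparsity
    is_static[i]  -- True for static-mode (first) heads
    alpha         -- stand-in for the patent's unspecified sharing
                     coefficient: each static head contributes
                     alpha * average_budget to the static pool.
    """
    n = len(sparsities)
    avg = total_budget / n                        # average cache budget
    n_static = sum(is_static)
    static_pool = alpha * avg * n_static          # first cache budget sum
    dynamic_pool = total_budget - static_pool     # second cache budget sum

    static_sparsity = sum(s for s, st in zip(sparsities, is_static) if st)
    n_dynamic = n - n_static
    budgets = []
    for s, st in zip(sparsities, is_static):
        if st:  # static head: sparsity-weighted share of the static pool
            budgets.append(static_pool * s / static_sparsity)
        else:   # dynamic head: even share of the dynamic pool
            budgets.append(dynamic_pool / n_dynamic)
    return budgets
```

Whatever the exact formulas, the per-head budgets sum back to the preset total, which is the invariant the claim relies on.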
Optionally, determining the text-centric attention sparsity of the attention head in the pre-filling stage includes: in the pre-filling stage, inputting a preset text to the attention head; taking the preset text as the query, computing the attention distribution of the attention head over each key vector of the preset text; and determining the attention sparsity of the attention head according to that attention distribution.
Optionally, determining the attention sparsity of the attention head according to its attention distribution over each key vector of the preset text includes: for each key vector, screening out the N key vectors corresponding to that key vector according to the attention distribution, and determining a second attention distribution score for that key vector according to the first attention distribution score of each of the N key vectors, the N key vectors being those whose attention distribution
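The top-N screening described above can be sketched as follows. This is a hedged illustration, not the patent's exact measure: it assumes the "first attention distribution score" of a key is its mean attention weight over the preset text's queries, that the second score is the sum of the top-N first scores among the other keys, and that the head's sparsity is the mean of the second scores (the excerpt does not specify the final aggregation).

```python
import numpy as np

def attention_sparsity(attn, n_top):
    """attn[i, j] is the attention weight from query position i to key j
    over the preset text; n_top is the N of the top-N screening."""
    first = attn.mean(axis=0)              # first attention score per key (assumed)
    second = []
    for j in range(len(first)):
        others = np.delete(first, j)       # all key vectors except key j
        top_n = np.sort(others)[::-1][:n_top]
        second.append(top_n.sum())         # second attention score for key j
    return float(np.mean(second))          # aggregation assumed: mean
```

Under this reading, a head whose attention mass concentrates on a few keys yields second scores dominated by those few keys, so the statistic separates sparse (static-mode) heads from diffuse (dynamic-mode) ones.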