CN-121980031-A - Sentence semantic cluster compression method based on heterogeneous KV cache, electronic equipment and program product

CN121980031A

Abstract

The invention discloses a sentence semantic cluster compression method based on heterogeneous KV cache, an electronic device, and a program product. The method comprises: for a given query, keeping the first T tokens on the GPU and segmenting the remaining tokens into S sentences; for each sentence, taking the mean of the Key vectors of its tokens as the sentence center, computing the similarity between each token's Key vector and the sentence center, computing each token's GSA weight from the similarity, and computing the sentence's semantic representation from the GSA weights; and clustering the semantic representations of the S sentences to obtain C cluster representations. Under a limited KV budget, the invention improves both the accuracy and the efficiency of long-context inference, reducing long-sequence inference latency and GPU memory pressure.

Inventors

  • XU HUI
  • PAN JUNBIN
  • ZHANG TAO
  • TAN HAO
  • WANG ANXIN
  • GAO YIXUAN
  • ZHU LUYUE
  • SHAO JIE

Assignees

  • Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China (电子科技大学(深圳)高等研究院)

Dates

Publication Date
2026-05-05
Application Date
2026-04-09

Claims (10)

  1. A sentence semantic cluster compression method based on heterogeneous KV cache, characterized by comprising the following steps: for a given query, keeping the first T tokens on the GPU and segmenting the remaining tokens into S sentences; for each sentence, taking the mean of the Key vectors of its tokens as the sentence center, computing the similarity between each token's Key vector and the sentence center, computing each token's GSA weight from the similarity, and computing the sentence's semantic representation from the GSA weights; and clustering the semantic representations of the S sentences to obtain C cluster representations.
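The per-sentence computation in claim 1 can be sketched as follows. This is a minimal NumPy sketch: the claim does not define the GSA weighting, so a softmax over each token's cosine similarity to the sentence center is assumed, and all function and variable names are illustrative rather than taken from the patent.

```python
import numpy as np

def sentence_representation(keys: np.ndarray) -> np.ndarray:
    """Compute one sentence's semantic representation from its tokens' Key vectors.

    keys: (n_tokens, d) Key vectors of one segmented sentence.
    The GSA weight is assumed (not specified in the claim) to be a softmax
    over each token's cosine similarity to the sentence center.
    """
    center = keys.mean(axis=0)                      # sentence center: mean Key vector
    sims = keys @ center / (
        np.linalg.norm(keys, axis=1) * np.linalg.norm(center) + 1e-8
    )                                               # cosine similarity to the center
    w = np.exp(sims - sims.max())
    w /= w.sum()                                    # assumed GSA weights (softmax)
    return w @ keys                                 # weighted sum = semantic representation
```

Tokens whose Keys align closely with the sentence center thus dominate the sentence's representation, which is what makes the later sentence-level clustering meaningful.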
  2. The sentence semantic cluster compression method based on heterogeneous KV cache according to claim 1, wherein when the remaining tokens are segmented, the starting boundary index and sequence ID of each segmented sentence are recorded.
  3. The sentence semantic cluster compression method based on heterogeneous KV cache according to claim 1, wherein clustering the semantic representations of the S sentences comprises: randomly sampling initial cluster centers from the semantic representations of the S sentences; computing the distance between each semantic representation and each initial center; assigning each semantic representation to its nearest center according to the distances; and updating each cluster center to the mean of the semantic representations assigned to it, then returning to the distance-computation step to iterate until the whole process converges.
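The procedure in claim 3 is essentially k-means over the S sentence representations. A minimal sketch, with illustrative names and a fixed iteration cap in place of the claim's unspecified convergence criterion:

```python
import numpy as np

def cluster_sentences(reps: np.ndarray, c: int, iters: int = 50, seed: int = 0):
    """k-means over S sentence representations, yielding C cluster representations.

    reps: (S, d) semantic representations.
    Returns (C, d) cluster centers and the (S,) cluster assignment.
    """
    rng = np.random.default_rng(seed)
    # random initialization: sample C distinct representations as centers
    centers = reps[rng.choice(len(reps), size=c, replace=False)].astype(float)
    assign = None
    for _ in range(iters):
        # distance from every representation to every center
        dists = np.linalg.norm(reps[:, None, :] - centers[None, :, :], axis=-1)
        new_assign = dists.argmin(axis=1)           # assign to nearest center
        if assign is not None and np.array_equal(new_assign, assign):
            break                                   # assignments stable: converged
        assign = new_assign
        for k in range(c):
            members = reps[assign == k]
            if len(members):                        # update center to members' mean
                centers[k] = members.mean(axis=0)
    return centers, assign
```

The returned centers play the role of the C cluster representations that claim 7 later scores against the sentence-level search vector.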
  4. The sentence semantic cluster compression method based on heterogeneous KV cache according to claim 1, wherein after clustering is finished, the cluster meta-information is kept on the GPU, and, except for the first T tokens, the KV of the remaining tokens is offloaded to CPU memory for storage.
  5. The sentence semantic cluster compression method based on heterogeneous KV cache according to claim 1, further comprising maintaining a sentence query buffer for accumulating the token queries generated during the current generation process until an end-of-sentence symbol is generated.
  6. The sentence semantic cluster compression method based on heterogeneous KV cache according to claim 5, wherein the queries in the query buffer are averaged to obtain a sentence-level search vector.
  7. The sentence semantic cluster compression method based on heterogeneous KV cache according to claim 6, wherein attention weights between the sentence-level search vector and each of the cluster representations are calculated, the cluster representations are traversed according to the attention weights, sentences are selected starting from the cluster representation with the highest attention weight, and the KV corresponding to the selected tokens is recalled without exceeding the token budget.
  8. The sentence semantic cluster compression method based on heterogeneous KV cache according to claim 7, wherein if the budget is exhausted within the last selected sentence, that sentence is truncated to obtain the attention context, wherein the attention context comprises the tokens always resident on the GPU, the retrieved tokens, and the tokens still remaining on the GPU since the last clustering.
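Claims 5 to 8 together describe the decode-time retrieval path. The sketch below assumes the attention weights are a softmax of dot products between the search vector and the cluster representations, and that the budget is counted in tokens; the data layout (`clusters` as per-cluster lists of `(sentence_id, n_tokens)` pairs) and all names are illustrative assumptions, not the patent's own structures.

```python
import numpy as np

def retrieve_sentences(query_buffer, cluster_reps, clusters, budget):
    """Recall sentence KV under a token budget.

    query_buffer: list of (d,) token queries accumulated for the current sentence.
    cluster_reps: (C, d) cluster representations.
    clusters: list of C lists of (sentence_id, n_tokens) pairs.
    Returns (sentence_id, tokens_taken) pairs; the last sentence may be
    truncated when it would exceed the budget (claim 8).
    """
    search = np.mean(query_buffer, axis=0)          # sentence-level search vector (claim 6)
    logits = cluster_reps @ search
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                        # attention weights over clusters
    selected, remaining = [], budget
    for c in np.argsort(-weights):                  # highest-weight cluster first (claim 7)
        for sent_id, n_tokens in clusters[c]:
            if remaining <= 0:
                return selected
            take = min(n_tokens, remaining)         # truncate the final sentence (claim 8)
            selected.append((sent_id, take))
            remaining -= take
    return selected
```

The recalled tokens would then be combined with the always-resident first T tokens and any tokens still on the GPU since the last clustering to form the attention context of claim 8.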
  9. An electronic device, comprising: at least one processor; and at least one memory for storing processor-executable instructions; wherein the at least one processor is configured to implement the sentence semantic cluster compression method based on heterogeneous KV cache according to any one of claims 1 to 8 by executing the executable instructions.
  10. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the sentence semantic cluster compression method based on heterogeneous KV cache according to any one of claims 1 to 8.

Description

Sentence semantic cluster compression method based on heterogeneous KV cache, electronic equipment and program product

Technical Field

The invention relates to the technical field of natural language processing, and in particular to a sentence semantic cluster compression method based on heterogeneous KV cache, an electronic device, and a program product.

Background

Large language models (LLMs) are widely used in complex real-world tasks such as code generation, question-answering systems, and long-text generation. As application scenarios grow more complex, such tasks often require models to handle longer inputs and exploit richer context information, such as multi-turn conversations, very long documents, or large-scale code repositories. Efficient long-context modeling and inference capability has therefore become one of the key requirements for LLMs in real deployments. Meanwhile, in highly specialized, strongly real-time application scenarios where multi-source heterogeneous information coexists, such as satellite communication, the model must not only understand multi-dimensional information such as communication protocols, link states, network topology, and task instructions, but also analyze and reason comprehensively in combination with service requirements in complex environments. Efficient long-context modeling and inference capability is thus one of the key requirements for LLMs in real deployment, and lays the foundation for applications in integrated space-ground networks, satellite link management, intelligent operation and maintenance, task decision support, and similar directions. In long-context inference scenarios, the problem is that the Key-Value (KV) cache grows linearly with sequence length, so that GPU memory occupation becomes excessive, and frequent reads of historical KV significantly increase inference latency.
A new solution is needed to solve the above problems.

Disclosure of Invention

The invention aims to provide a sentence semantic cluster compression method based on heterogeneous KV cache, an electronic device, and a program product, so as to solve the above technical problems in the prior art. Preferred versions of the technical solutions provided by the invention can produce the technical effects described below. To achieve the above purpose, the invention provides the following technical solutions:

The invention provides a sentence semantic cluster compression method based on heterogeneous KV cache, which comprises the following steps: for a given query, keeping the first T tokens on the GPU and segmenting the remaining tokens into S sentences; for each sentence, taking the mean of the Key vectors of its tokens as the sentence center, computing the similarity between each token's Key vector and the sentence center, computing each token's GSA weight from the similarity, and computing the sentence's semantic representation from the GSA weights; and clustering the semantic representations of the S sentences to obtain C cluster representations.

In one or more embodiments, when the remaining tokens are segmented, the starting boundary index and sequence ID of each segmented sentence are recorded.
In one or more embodiments, clustering the semantic representations of the S sentences comprises: randomly sampling initial cluster centers from the semantic representations of the S sentences; computing the distance between each semantic representation and each initial center; assigning each semantic representation to its nearest center according to the distances; and updating each cluster center to the mean of the semantic representations assigned to it, iterating until the whole process converges.

In one or more embodiments, after clustering is finished, the cluster meta-information is kept on the GPU, and, except for the first T tokens, the KV of the remaining tokens is offloaded to CPU memory for storage.

In one or more embodiments, the method further comprises maintaining a sentence query buffer for accumulating the token queries generated during the current generation process until an end-of-sentence symbol is generated.

In one or more embodiments, the queries in the query buffer are averaged to obtain a sentence-level search vector.

In one or more embodiments, attention weights between the sentence-level search vector and each of the cluster representations are calculated, the cluster representations are traversed according to the attention weights, sentences are selected starting from the cluster representation with the highest attention weight, and