
CN-121998014-A - Efficient token pruning in a transformer-based neural network


Abstract

The application relates to efficient token pruning in a transformer-based neural network. Key-value (KV) caching accelerates inference in Large Language Models (LLMs) by allowing attention operations to scale linearly (rather than quadratically) with the total sequence length. Because of the large context lengths in modern LLMs, KV cache sizes may exceed model sizes, which may negatively impact throughput. To address this problem, KVCrush is implemented, which provides KV cache size reduction using similarity in head behavior. KVCrush represents tokens using binary vectors, where the vectors indicate which attention heads focus on the token and which attention heads ignore the token. The binary vectors are used in a hardware-efficient, low-overhead process to produce a representation of unimportant tokens to be pruned, without resorting to k-means clustering techniques.
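As a rough, non-authoritative sketch of the binary-vector idea described above, the per-token head-behavior vector might be derived from a layer's attention weights as follows. The function name, array layout, and threshold are illustrative assumptions, not the patented implementation:

```python
import numpy as np

def head_binary_vector(attn_weights, threshold=0.5):
    """Build a per-token binary vector over attention heads.

    attn_weights: array of shape (num_heads, seq_len, seq_len), the
    softmax attention weights from one layer (hypothetical layout;
    the patent does not fix a concrete one).

    Returns an array of shape (seq_len, num_heads) where entry [t, h]
    is 1 if head h "focuses on" token t and 0 if it ignores it.
    """
    # Sum the attention each token receives from each head, then
    # normalize per head so values are comparable across heads
    # (claim 3 describes normalized sums of the attention matrix;
    # the exact orientation is an assumption here).
    scores = attn_weights.sum(axis=1)                     # (num_heads, seq_len)
    scores = scores / scores.max(axis=1, keepdims=True)
    # Binarize with a threshold (claim 3).
    return (scores.T >= threshold).astype(np.uint8)       # (seq_len, num_heads)
```

Two tokens that the same subset of heads attends to thus map to similar binary vectors, which is what makes them candidates for sharing one representative KV entry.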

Inventors

  • Gopi Krishna Jha
  • Sameh Gobriel
  • Nilesh Jain

Assignees

  • Intel Corporation

Dates

Publication Date
2026-05-08
Application Date
2025-09-30
Priority Date
2024-12-26

Claims (17)

  1. A method for managing a key-value cache, comprising: computing one or more binary vectors representing one or more tokens of a request to a neural network; assigning one or more selected tokens of the one or more tokens to a bucket based on the one or more binary vectors; selecting a representative token to represent the one or more selected tokens assigned to the bucket; storing in the key-value cache a representative key tensor and a representative value tensor computed for the representative token; and providing the representative key tensor and the representative value tensor from the key-value cache to an operation of the neural network operating on the one or more selected tokens assigned to the bucket.
  2. The method of claim 1, wherein computing the one or more binary vectors comprises computing the one or more binary vectors based on one or more attention weight matrices computed by one or more attention heads of the neural network.
  3. The method of claim 1 or 2, wherein computing the one or more binary vectors comprises: computing normalized row sum values of an attention weight matrix computed by the neural network; and binarizing the normalized row sum values using a threshold value.
  4. The method of claim 1 or 2, wherein a binary vector of the one or more binary vectors comprises a binary value for each of one or more attention heads of the neural network.
  5. The method of claim 4, wherein the binary value is a select bit indicating whether a given one of the neural network's attention heads focuses on a particular token or ignores the particular token.
  6. The method of claim 1 or 2, wherein assigning the one or more selected tokens to the bucket comprises: computing one or more distances of the one or more binary vectors representing the one or more tokens from an anchor vector; determining a distance range corresponding to the bucket; and assigning the one or more selected tokens to the bucket based on the distance range and the one or more distances.
  7. The method of claim 1 or 2, wherein assigning the one or more selected tokens to the bucket comprises: randomly determining one or more values of an anchor vector; and assigning the one or more selected tokens to the bucket based on one or more distances of the one or more binary vectors from the anchor vector.
  8. The method of claim 1 or 2, wherein assigning the one or more selected tokens to the bucket comprises: determining one or more values of an anchor vector based on a mean of the one or more binary vectors; and assigning the one or more selected tokens to the bucket based on one or more distances of the one or more binary vectors from the anchor vector.
  9. The method of any of claims 1-6, wherein assigning the one or more selected tokens to the bucket comprises: assigning the one or more selected tokens to the bucket based on one or more distances of the one or more binary vectors from an anchor vector having alternating zeros and ones.
  10. The method of claim 6, wherein the one or more distances comprise one or more Hamming distances.
  11. The method of claim 1 or 2, wherein selecting the representative token from the one or more selected tokens assigned to the bucket comprises: randomly selecting the representative token from the one or more selected tokens.
  12. The method of claim 1 or 2, wherein: the representative key tensor and the representative value tensor are computed for the representative token by an attention head of the neural network; and the representative key tensor and the representative value tensor represent one or more key tensors and one or more value tensors computed by the attention head of the neural network for the one or more selected tokens.
  13. The method of claim 1 or 2, further comprising: assigning one or more further selected tokens of the one or more tokens to a further bucket based on the one or more binary vectors; selecting a further representative token to represent the one or more further selected tokens assigned to the further bucket; storing in the key-value cache a further representative key tensor and a further representative value tensor computed for the further representative token; determining that a further token of the request is one of the one or more further selected tokens assigned to the further bucket; and providing the further representative key tensor and the further representative value tensor from the key-value cache for use in an operation of the neural network operating on the further token.
  14. The method of claim 1 or 2, further comprising: determining that the one or more selected tokens are to be pruned based on one or more importance values of the one or more selected tokens.
  15. The method of claim 1 or 2, further comprising: evicting from the key-value cache one or more key tensors and one or more value tensors computed for one or more remaining tokens assigned to the bucket that are not selected as the representative token.
  16. One or more non-transitory computer-readable media storing instructions executable by a processor to perform the method of any one of claims 1-15.
  17. An apparatus, comprising: machine-readable instructions; one or more memories storing a key-value cache; and at least one computer processor that, when executing the machine-readable instructions, performs the method of any one of claims 1-15.
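Claims 6, 9, and 11 together describe a concrete bucketing scheme: Hamming distance to an anchor vector of alternating zeros and ones, distance ranges as buckets, and a random representative per bucket. A minimal sketch under those claims follows; the function name, number of buckets, and equal-width distance ranges are assumptions for illustration, not the claimed implementation:

```python
import random

import numpy as np

def crush_buckets(binary_vectors, num_buckets, seed=0):
    """Group token binary vectors into buckets by Hamming distance to a
    fixed anchor, then pick one random representative per bucket.

    binary_vectors: (num_tokens, num_heads) array of 0/1 values.
    Returns {bucket_id: (representative_index, member_indices)}.
    """
    num_heads = binary_vectors.shape[1]
    anchor = np.arange(num_heads) % 2                 # 0,1,0,1,... (claim 9)
    dists = (binary_vectors != anchor).sum(axis=1)    # Hamming distances (claim 10)
    # Map each distance into one of num_buckets equal-width ranges (claim 6).
    edges = np.linspace(0, num_heads + 1, num_buckets + 1)
    bucket_ids = np.searchsorted(edges, dists, side="right") - 1

    rng = random.Random(seed)
    buckets = {}
    for b in np.unique(bucket_ids):
        members = np.flatnonzero(bucket_ids == b).tolist()
        rep = rng.choice(members)                     # random representative (claim 11)
        buckets[int(b)] = (rep, members)
    return buckets
```

Per claim 15, the KV tensors of the non-representative members of each bucket could then be evicted, leaving only the representative's key and value tensors in the cache.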

Description

Efficient token pruning in a transformer-based neural network

Priority

The present application is a non-provisional application claiming priority and/or benefit from U.S. provisional application No. 63/716,292, entitled "EFFICIENT TOKEN PRUNING IN TRANSFORMER-BASED NEURAL NETWORKS," filed in November 2024. The U.S. provisional application is incorporated herein by reference in its entirety.

Technical Field

The present disclosure relates to efficient token pruning in a transformer-based neural network.

Background

Deep Neural Networks (DNNs) are widely used in a variety of artificial intelligence fields, from computer vision to speech recognition and natural language processing, because of their ability to achieve high accuracy. However, this high accuracy comes at the cost of enormous computational demands, as there may be a large number of operations and a large amount of data to read and write.

Disclosure of Invention

According to an aspect of the present application, there is provided a method for managing a key-value cache, comprising: computing one or more binary vectors representing one or more tokens of a request to a neural network; assigning one or more selected tokens of the one or more tokens to a bucket based on the one or more binary vectors; selecting a representative token to represent the one or more selected tokens assigned to the bucket; storing in the key-value cache a representative key tensor and a representative value tensor computed for the representative token; and providing the representative key tensor and the representative value tensor from the key-value cache to an operation of the neural network operating on the one or more selected tokens assigned to the bucket.
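The linear (rather than quadratic) per-token cost that KV caching buys, as discussed in the Abstract, can be illustrated with a minimal single-head decode step. This is a sketch only; real implementations batch heads, layers, and requests:

```python
import numpy as np

def decode_step(q, k_new, v_new, kv_cache):
    """One autoregressive decode step with a KV cache.

    q, k_new, v_new: (d,) query/key/value vectors for the newly
    generated token. kv_cache: dict with "k" and "v" lists of past
    tensors. Because past keys and values are reused from the cache,
    each step costs O(seq_len * d) instead of recomputing the full
    O(seq_len^2 * d) attention for the whole sequence.
    """
    kv_cache["k"].append(k_new)
    kv_cache["v"].append(v_new)
    K = np.stack(kv_cache["k"])                  # (seq_len, d)
    V = np.stack(kv_cache["v"])                  # (seq_len, d)
    scores = K @ q / np.sqrt(q.shape[0])         # (seq_len,) scaled dot products
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                           # (d,) attention output
```

The cache grows by one key and one value tensor per generated token per layer and head, which is exactly why, at large context lengths, its footprint can rival or exceed the model weights and motivates pruning schemes such as KVCrush.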
According to another aspect of the application, there is provided an apparatus comprising machine-readable instructions, one or more memories storing a key-value cache, and at least one computer processor which, when executing the machine-readable instructions, performs the above method.

Drawings

The embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

Fig. 1 illustrates an exemplary large language model implemented as a transformer-based neural network, in accordance with some embodiments of the present disclosure. Fig. 2 illustrates a serial transformer block according to some embodiments of the present disclosure. Fig. 3 illustrates a parallel transformer block according to some embodiments of the present disclosure. Fig. 4 illustrates an attention layer of a transformer block according to some embodiments of the disclosure. Fig. 5 illustrates computations in a self-attention layer without key-value (KV) caching, according to some embodiments of the present disclosure. Fig. 6 illustrates computations in a self-attention layer with KV caching, according to some embodiments of the present disclosure. Fig. 7 illustrates a system with distributed workers to perform requests for a transformer-based neural network, according to some embodiments of the present disclosure. Fig. 8 illustrates a KV cache paging scheme according to some embodiments of the present disclosure. Fig. 9 illustrates a KV cache controller according to some embodiments of the present disclosure. Fig. 10 illustrates a solution flow for efficient KV cache management according to some embodiments of the present disclosure. Fig. 11 illustrates computing attention weights according to some embodiments of the present disclosure.
Fig. 12 illustrates computing binary vector representations according to some embodiments of the disclosure. Fig. 13 illustrates selecting a representative token according to some embodiments of the present disclosure. Fig. 14 is a flowchart illustrating a method for KV cache management according to some embodiments of the present disclosure. Fig. 15 is a flowchart illustrating a method for KV cache management according to some embodiments of the present disclosure. Fig. 16 is a block diagram of an exemplary computing device according to some embodiments of the present disclosure.

Detailed Description

Overview

The last decade has witnessed rapid development of Artificial Intelligence (AI) based data processing, in particular DNN-based data processing. DNNs are widely used in the fields of computer vision, speech recognition, and image and video processing, mainly because they can achieve beyond-human-level accuracy. DNNs typically comprise a series of layers. A DNN layer may include one or more deep learning operations (also referred to as "neural network operations"), such as convolution operations, matrix multiplication operations, layer normalization