
CN-121996577-A - Dynamic quantization and memory management for key value caches for servicing large language models

CN121996577A

Abstract

A key-value (KV) cache paging scheme may improve memory management for KV caches by storing KV cache pages, each holding the key tensors and value tensors for a fixed number of tokens, in fixed-size blocks in a worker's KV cache. To further improve memory management, such a scheme may be modified to support dynamic, variable quantization. The quantization level of a KV cache page may be set based on a runtime importance score of the KV cache page. In addition, the quantization level of KV cache pages may be set based on system load. The result is a scheme that can achieve a high compression ratio for KV cache pages in the KV cache. Fitting more KV cache pages into the KV cache can lead to higher inference throughput, higher system-level user capacity, and higher end-to-end service availability.
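The abstract describes the scheme only at a high level. The following is a minimal Python sketch, under stated assumptions, of how fixed-size KV cache pages with a per-page quantization level might be represented; the names (KVPage, quantize_page, TOKENS_PER_PAGE, HEAD_DIM) and the symmetric uniform quantizer are illustrative choices, not taken from the disclosure.

```python
# Illustrative sketch (not the patent's implementation) of a KV cache page that
# holds key/value tensors for a fixed number of tokens and carries its own
# quantization level.
from dataclasses import dataclass
import numpy as np

TOKENS_PER_PAGE = 16   # fixed number of tokens whose K/V tensors one page holds (assumed)
HEAD_DIM = 128         # per-attention-head embedding dimension (assumed)

@dataclass
class KVPage:
    keys: np.ndarray       # shape (TOKENS_PER_PAGE, HEAD_DIM)
    values: np.ndarray     # shape (TOKENS_PER_PAGE, HEAD_DIM)
    quant_bits: int = 16   # quantization level of this page (e.g., 16, 8, or 4 bits)
    scale: float = 1.0     # scale needed to dequantize the stored tensors

def quantize_page(page: KVPage, bits: int) -> KVPage:
    """Re-quantize a page to `bits` bits using symmetric uniform quantization
    (an illustrative choice; the disclosure does not fix the quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(page.keys).max(), np.abs(page.values).max()) / qmax
    dtype = np.int8 if bits <= 8 else np.int16
    quant = lambda t: np.clip(np.round(t / scale), -qmax, qmax).astype(dtype)
    return KVPage(keys=quant(page.keys), values=quant(page.values),
                  quant_bits=bits, scale=scale)
```

In such a layout, a page kept at 16 bits might later be re-quantized to 8 or 4 bits when its importance score drops or when system load rises, as the claims and description below elaborate.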

Inventors

  • S. Goblil
  • N. Jia Yin
  • V.S.Cai
  • J. P. Munoz
  • G. K. Jeha

Assignees

  • Intel Corporation

Dates

Publication Date
2026-05-08
Application Date
2025-09-30
Priority Date
2024-11-01

Claims (12)

  1. A method for memory management, comprising: determining an importance score for a key-value cache page based on one or more attention scores of one or more tokens, the key-value cache page having one or more key tensors and one or more value tensors computed for the one or more tokens, the one or more key tensors and the one or more value tensors computed by an attention head of a neural network; determining a quantization level based on the importance score; and storing the key-value cache page, quantized at the quantization level, in a key-value cache block of a key-value cache.
  2. The method of claim 1, further comprising: updating a memory pointer to reference the key-value cache page quantized at the quantization level.
  3. The method of claim 1 or 2, further comprising: quantizing the key-value cache page at different quantization levels; storing the key-value cache pages quantized at the different quantization levels in a memory; and retrieving from the memory the key-value cache page quantized at the quantization level.
  4. The method of claim 1 or 2, wherein determining the importance score is performed in response to a swap-out event of the key-value cache page.
  5. The method of claim 1 or 2, wherein determining the importance score is performed in response to one or more new tokens being added to the key-value cache page.
  6. The method of claim 1 or 2, wherein determining the quantization level is performed in response to a change in a number of outstanding requests to be performed by the neural network.
  7. The method of claim 1 or 2, wherein determining the importance score comprises: determining whether at least one of the one or more attention scores is greater than a threshold.
  8. The method of claim 1 or 2, wherein determining the importance score comprises: determining a ratio of a count of key tokens having an attention score greater than a threshold to a sum of the counts of key tokens of the key-value cache page and one or more other key-value cache pages.
  9. The method of claim 1 or 2, wherein determining the quantization level based on the importance score comprises: determining the quantization level from a set of quantization levels, the quantization levels corresponding to different ranges of importance scores.
  10. The method of claim 9, wherein the set of quantization levels is determined based on one or more of a maximum error margin, a total available memory of the key-value cache, a maximum concurrent load target, and a number of outstanding requests to be performed by the neural network.
  11. One or more non-transitory computer-readable media storing instructions executable by a computing processor to perform operations for memory management, the operations comprising any of the methods of claims 1-10.
  12. An apparatus, comprising: one or more processors configured to execute one or more instructions; and a memory for storing data and the one or more instructions, wherein the one or more instructions, when executed by the one or more processors, cause the one or more processors to implement any of the methods of claims 1-10.
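The claims state the scoring and level-selection steps only functionally. Below is a minimal Python sketch of one way claims 1, 8, and 9 could fit together; the attention-score threshold, the set of quantization levels, and the score ranges are assumptions for illustration, not values from the patent.

```python
# Illustrative sketch (not the patent's implementation) of a claim-8-style
# importance score and a claim-9-style quantization-level selection.
import numpy as np

def importance_score(page_attn: np.ndarray, total_key_tokens: int,
                     threshold: float = 0.01) -> float:
    """Count of this page's key tokens whose attention score exceeds `threshold`,
    divided by the total key-token count across this page and the other pages
    (the threshold value is an assumption)."""
    return float((page_attn > threshold).sum()) / total_key_tokens

def select_quant_level(score: float) -> int:
    """Map the importance score to a quantization level from a set of levels,
    each covering a range of scores. The ranges and bit widths are assumed."""
    if score >= 0.05:
        return 16   # most important pages kept at full precision
    if score >= 0.01:
        return 8
    return 4        # least important pages compressed the most

# Example: a page whose 16 key tokens received these peak attention scores,
# out of 256 key tokens cached across all pages.
page_attn = np.array([0.2, 0.03, 0.001] + [0.0005] * 13)
score = importance_score(page_attn, total_key_tokens=256)
bits = select_quant_level(score)   # -> 4 with these assumed ranges
```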

Description

Dynamic quantization and memory management for key-value caches for servicing large language models

Background

Deep Neural Networks (DNNs) are widely used in a variety of artificial intelligence applications, ranging from computer vision to speech recognition and natural language processing, due to their ability to achieve high accuracy. However, high accuracy comes at significant computational cost. DNNs have extremely high computational demands because there may be a large number of operations and a large amount of data to read and write.

Drawings

Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. For convenience of description, like reference numerals denote like structural elements. In the figures of the accompanying drawings, embodiments are shown by way of example and not by way of limitation.

Fig. 1 illustrates an exemplary large language model implemented as a transformer-based neural network, according to some embodiments of the present disclosure.
Fig. 2 illustrates a serial transformer block according to some embodiments of the present disclosure.
Fig. 3 illustrates a parallel transformer block according to some embodiments of the present disclosure.
Fig. 4 illustrates an attention layer of a transformer block according to some embodiments of the present disclosure.
Fig. 5 illustrates computation in a self-attention layer without key-value (KV) caching, according to some embodiments of the present disclosure.
Fig. 6 illustrates computation in a self-attention layer with KV caching, according to some embodiments of the present disclosure.
Fig. 7 illustrates a system with distributed workers that execute requests for a transformer-based neural network, according to some embodiments of the present disclosure.
Fig. 8 illustrates a KV cache paging scheme in accordance with some embodiments of the present disclosure.
Fig. 9 illustrates non-critical paths in the case of a KV cache swap-out event according to some embodiments of the present disclosure.
Fig. 10 illustrates a KV cache paging scheme with dynamically variable quantization, according to some embodiments of the present disclosure.
Fig. 11 illustrates a KV cache manager for implementing dynamic variable quantization in accordance with some embodiments of the present disclosure.
Fig. 12 is a flow chart illustrating a method for KV cache operation with dynamic variable quantization in accordance with some embodiments of the present disclosure.
Fig. 13 is a flow chart illustrating another method for KV cache operation with dynamic variable quantization in accordance with some embodiments of the present disclosure.
Fig. 14 is a block diagram of an exemplary computing device according to some embodiments of the present disclosure.

Detailed Description

Summary

The last decade has witnessed a rapid increase in Artificial Intelligence (AI)-based, and in particular DNN-based, data processing. DNNs are widely used in the fields of computer vision, speech recognition, and image and video processing, mainly because DNNs enable accuracy beyond the human level. DNNs typically comprise a sequence of layers.
A DNN layer may include one or more deep learning operations (also referred to as "neural network operations"), such as convolution operations, matrix multiplication operations, layer normalization operations, batch normalization operations, softmax operations, pooling operations, element-wise operations, linear operations, nonlinear operations, and the like. While DNNs are effective for analysis and prediction, this comes at the cost of significant computational power. A DNN may consume significant power and run time during training and inference.

A transformer-based neural network, or transformer-based model, is a DNN that can be used to drive a Large Language Model (LLM) or a computer vision model (referred to in the literature as a ViT). Transformer-based neural networks are used in services and applications such as natural language processing, speech processing, conversational AI assistants, image caption generation, object detection, video understanding, recommendation systems, bioinformatics, time series prediction, reinforcement learning, and generative models that produce text, images, or music. A cloud company may provide a transformer-based neural network as a hosted service, where the transformer-based neural network may be served by a number of distributed Graphics Processing Unit (GPU) workers, and the hosted service may serve a number of requests for a number of users.

For some LLMs or other machine learning models, an autoregressive, transformer-based neural network is used. The transformer-based neural network may generate one token at a time (e.g., generate one word at a time) based on the input prompt and the previous sequence of output tokens that has been generated.
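Where the description introduces KV caching (Figs. 5 and 6), the idea is that keys and values computed for earlier tokens are kept and reused, so each decoding step only computes the key and value for the newest token. The following is a minimal single-head sketch of that idea using standard scaled dot-product attention; the function and variable names are illustrative, not taken from the disclosure.

```python
# Illustrative sketch (not the patent's implementation) of autoregressive
# decoding with a KV cache: per step, only the new token's K/V are computed
# and appended, and attention runs over all cached keys/values.
import numpy as np

def attend(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention for one query token."""
    scores = (K @ q) / np.sqrt(q.shape[-1])   # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                         # (head_dim,)

def decode_step(x_new, Wq, Wk, Wv, kv_cache):
    """One autoregressive step: append the new token's K/V to the cache,
    then attend over all cached keys and values."""
    q = x_new @ Wq
    kv_cache["K"] = np.vstack([kv_cache["K"], x_new @ Wk])
    kv_cache["V"] = np.vstack([kv_cache["V"], x_new @ Wv])
    return attend(q, kv_cache["K"], kv_cache["V"])

# Usage: start with an empty cache and call decode_step once per generated token.
d_model, head_dim = 64, 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d_model, head_dim)) for _ in range(3))
kv_cache = {"K": np.empty((0, head_dim)), "V": np.empty((0, head_dim))}
out = decode_step(rng.standard_normal(d_model), Wq, Wk, Wv, kv_cache)
```

Without the cache (as in Fig. 5), the keys and values for every previous token would be recomputed at every step; caching them (as in Fig. 6) trades that recomputation for memory, which is what makes the paging and quantization scheme above relevant.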