
EP-4738200-A1 - DYNAMIC QUANTIZATION AND MEMORY MANAGEMENT OF KEY-VALUE CACHE FOR SERVING LARGE LANGUAGE MODELS


Abstract

Key-value (KV) cache paging schemes can improve memory management for KV caches by storing a KV cache page, which holds the key tensors and value tensors for a fixed number of tokens, in a fixed-size block in the KV cache of a worker. To further improve memory management, the schemes can be modified to implement dynamic variable quantization. The quantization level of a KV cache page can be set based on a runtime importance score of the page, and additionally based on the system load. The end result is a scheme that can achieve a high compression ratio for KV cache pages. Fitting more KV cache pages in the KV cache can lead to higher inference throughput, higher system-level user capacity, and higher end-to-end service availability.
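The scoring and level-selection logic in the abstract can be sketched as follows. This is a minimal illustration only: the thresholds, bit widths, load model, and function names are assumptions, not taken from the patent.

```python
# Illustrative sketch of importance-driven dynamic quantization.
# Thresholds, bit widths, and the load adjustment are assumed, not from the patent.

def importance_score(attention_scores, threshold=0.1):
    """Fraction of a page's tokens whose attention score exceeds the threshold."""
    hits = sum(1 for s in attention_scores if s > threshold)
    return hits / len(attention_scores)

def select_quant_level(score, system_load=0.0):
    """Map a page's importance score to a bit width: more important pages keep
    more bits, and heavier system load (in [0, 1]) pushes pages toward
    stronger compression."""
    effective = score * (1.0 - 0.5 * system_load)
    if effective > 0.5:
        return 16   # near-full precision for highly attended pages
    if effective > 0.25:
        return 8
    if effective > 0.1:
        return 4
    return 2        # aggressive compression for rarely attended pages
```

For example, a page whose tokens have attention scores `[0.05, 0.4, 0.6]` gets an importance score of 2/3, which selects the 16-bit level under light load; the same page under full load drops to 8 bits.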

Inventors

  • GOBRIEL, SAMEH
  • JAIN, NILESH
  • CHUA, VUI SENG
  • MUNOZ, JUAN PABLO
  • JHA, GOPI KRISHNA

Assignees

  • Intel Corporation

Dates

Publication Date
2026-05-06
Application Date
2025-10-01

Claims (15)

  1. An apparatus, comprising: one or more processors to execute one or more instructions; and a memory to store data and the one or more instructions, wherein the one or more instructions, when executed by the one or more processors, cause the one or more processors to: determine an importance score of a key-value cache page based on one or more attention scores of one or more tokens, the key-value cache page having one or more key tensors and one or more value tensors calculated for one or more tokens by an attention head of a neural network; determine a quantization level based on the importance score; retrieve the key-value cache page at the determined quantization level from the memory; and store the key-value cache page quantized at the quantization level in a key-value cache block of a key-value cache.
  2. The apparatus of claim 1, wherein the one or more instructions further cause the one or more processors to: update a memory pointer to reference the key-value cache page quantized at the quantization level.
  3. The apparatus of claim 1 or 2, wherein the one or more instructions further cause the one or more processors to: quantize the key-value cache page at different quantization levels; and store the key-value cache page quantized at the different quantization levels in the memory.
  4. The apparatus of any one of claims 1-3, wherein determining the importance score is performed in response to a swap-out event of the key-value cache page.
  5. The apparatus of any one of claims 1-4, wherein determining the importance score is performed in response to one or more new tokens being added to the key-value cache page.
  6. The apparatus of any one of claims 1-5, wherein determining the quantization level is performed in response to a change in a number of outstanding requests to be executed by the neural network.
  7. The apparatus of any one of claims 1-6, wherein determining the importance score comprises: determining whether at least one of the one or more attention scores is greater than a threshold.
  8. The apparatus of any one of claims 1-7, wherein determining the importance score comprises: determining a ratio of a count of pivotal tokens, whose attention scores are greater than a threshold, to a sum of counts of pivotal tokens of the key-value cache page and one or more further key-value cache pages.
  9. The apparatus of any one of claims 1-5, wherein determining the quantization level based on the importance score comprises: determining the quantization level according to a set of quantization levels, the set of quantization levels corresponding to different ranges of importance scores.
  10. The apparatus of claim 9, wherein the set of quantization levels is determined based on one or more of: a maximum error tolerance, a total available memory of the key-value cache, a maximum concurrent load target, and a number of outstanding requests to be executed by the neural network.
  11. A method for memory management, comprising: determining an importance score of a key-value cache page based on one or more attention scores of one or more tokens, the key-value cache page having one or more key tensors and one or more value tensors calculated for one or more tokens by an attention head of a neural network; determining a quantization level based on the importance score; and storing the key-value cache page quantized at the quantization level in a key-value cache block of a key-value cache.
  12. The method of claim 11, wherein determining the importance score is performed in response to a swap-out event of the key-value cache page.
  13. The method of claim 11 or 12, further comprising: quantizing the key-value cache page at different quantization levels; storing the key-value cache page quantized at the different quantization levels in a memory; and retrieving the key-value cache page quantized at the quantization level from the memory.
  14. The method of any one of claims 11-13, wherein determining the quantization level based on the importance score comprises: determining the quantization level according to a set of quantization levels, the set of quantization levels corresponding to different ranges of importance scores, wherein the set of quantization levels is determined based on one or more of: a maximum error tolerance, a total available memory of the key-value cache, a maximum concurrent load target, and a number of outstanding requests to be executed by the neural network.
  15. One or more non-transitory computer-readable media storing instructions executable by a computing processor to perform the method according to any one of claims 11-14.
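Taken together, claims 2, 3, 9, and 10 describe keeping a page in memory quantized at several levels, selecting a level from score ranges, and switching the active copy by updating a pointer. The following sketch illustrates that combination; the class, the toy quantizer, and the default bit widths are illustrative assumptions, not the patent's implementation.

```python
# Illustrative sketch of claims 2-3 and 9-10: each page is kept in memory at
# several quantization levels, and a pointer selects which copy the KV cache
# block references. All names and the toy quantizer are assumptions.

def quantize(values, bits):
    """Toy symmetric uniform quantization of a list of floats to `bits` bits.
    Returns the integer codes and the scale needed to dequantize them."""
    levels = (1 << (bits - 1)) - 1
    scale = max(abs(v) for v in values) / levels or 1.0
    return [round(v / scale) for v in values], scale

class PageStore:
    def __init__(self, page_values, bit_widths=(16, 8, 4)):
        # Claim 3: store the page quantized at different quantization levels.
        self.copies = {b: quantize(page_values, b) for b in bit_widths}
        self.active_bits = max(bit_widths)

    def set_quant_level(self, bits):
        # Claim 2: update the pointer to reference the chosen quantized copy.
        self.active_bits = bits

    def active_page(self):
        return self.copies[self.active_bits]
```

Retrieving a page at a new level (claim 13) then reduces to a pointer update rather than a re-quantization on the critical path.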

Description

Background

Deep neural networks (DNNs) are used extensively for a variety of artificial intelligence applications, ranging from computer vision to speech recognition and natural language processing, due to their ability to achieve high accuracy. However, the high accuracy comes at the expense of significant computation cost. DNNs have extremely high computing demands, as there can be a large number of operations as well as a large amount of data to read and write.

Brief Description of the Drawings

Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

FIG. 1 illustrates an exemplary large language model implemented as a transformer-based neural network, according to some embodiments of the disclosure.
FIG. 2 illustrates a serial transformer block, according to some embodiments of the disclosure.
FIG. 3 illustrates a parallel transformer block, according to some embodiments of the disclosure.
FIG. 4 illustrates an attention layer of a transformer block, according to some embodiments of the disclosure.
FIG. 5 illustrates computations in a self-attention layer without key-value (KV) caching, according to some embodiments of the disclosure.
FIG. 6 illustrates computations in a self-attention layer with KV caching, according to some embodiments of the disclosure.
FIG. 7 illustrates a system having distributed workers to execute requests of a transformer-based neural network, according to some embodiments of the disclosure.
FIG. 8 illustrates a KV cache paging scheme, according to some embodiments of the disclosure.
FIG. 9 illustrates a non-critical path in the case of a KV cache swap-out event, according to some embodiments of the disclosure.
FIG. 10 illustrates a KV cache paging scheme with dynamic variable quantization, according to some embodiments of the disclosure.
FIG. 11 illustrates a KV cache manager to implement dynamic variable quantization, according to some embodiments of the disclosure.
FIG. 12 is a flowchart illustrating a method for KV caching with dynamic variable quantization, according to some embodiments of the disclosure.
FIG. 13 is a flowchart illustrating another method for KV caching with dynamic variable quantization, according to some embodiments of the disclosure.
FIG. 14 is a block diagram of an exemplary computing device, according to some embodiments of the disclosure.

Detailed Description

Overview

The last decade has witnessed a rapid rise in artificial intelligence (AI) based data processing, particularly based on DNNs. DNNs are widely used in the domains of computer vision, speech recognition, and image and video processing, mainly due to their ability to achieve beyond human-level accuracy. A DNN typically includes a sequence of layers. A DNN layer may include one or more deep learning operations (also referred to as "neural network operations"), such as convolution, matrix multiplication, layer normalization, batch normalization, SoftMax, pooling, element-wise, linear, and non-linear operations. While DNNs are effective at analyzing and predicting, they come at a cost of immense computational power. DNNs can consume significant power and runtime during training and during inference. Transformer-based neural networks, or transformer-based models, are a type of DNN that can be used to power large language models (LLMs) and computer vision models (referred to in the literature as ViTs).
Transformer-based neural networks are used in services and applications such as natural language processing, speech processing, conversational AI assistants, image captioning, object detection, video understanding, recommendation systems, bioinformatics, time-series forecasting, reinforcement learning, and generative models that produce text, images, or music. Cloud companies can offer a transformer-based neural network as a hosted service, where the network is served by many distributed graphics processing unit (GPU) workers and the hosted service handles many requests for many users. For some LLMs or other machine learning models, an autoregressive transformer-based neural network is used. The transformer-based neural network generates one token at a time (e.g., one word at a time) based on an input prompt and the sequence of output tokens it has generated so far. The process of performing all the operations in the transformer-based neural network is repeated, token by token, until the network outputs a termination token. A key-value (KV) cache is introduced to avoid redundant computations when generating tokens one at a time. Specifically,