US-12626695-B1 - Cache techniques for large language model processing
Abstract
Techniques for cache management that reduce latency in large language model (LLM) inference are described. In some embodiments, a system caches encoded data for portions of a prompt so that the encoded data is available for use by the LLM across dialog turns of a dialog session. Within a dialog session, a portion of the LLM prompt may be the same across dialog turns; instead of recomputing the attention encodings for such portions, the LLM can use the cached encodings during processing. In some embodiments, user inputs for the dialog session may be routed to the same LLM container, and encoded data for the dialog session may be stored at the same cache associated with that LLM. In some embodiments, the system performs prompt encoding asynchronously while performing automatic speech recognition (ASR) processing.
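To make the described flow concrete, here is a minimal sketch, in Python, of the two mechanisms the abstract combines: pinning a dialog session to one LLM container, and re-encoding only the portion of a turn's prompt that is not already cached. All names (`LLMContainer`, `SessionRouter`, `encode_prompt`, and so on) are illustrative assumptions rather than identifiers from the patent, and the float lists merely stand in for the attention/key-value state a real LLM would compute.

```python
# Minimal sketch (assumed names, not from the patent): route each dialog
# turn to the LLM container holding that session's cached encodings, and
# re-encode only the portion of the prompt that is not yet cached.

from typing import Dict, List, Tuple

class LLMContainer:
    """An LLM instance plus its co-located cache of prompt encodings."""

    def __init__(self, container_id: str) -> None:
        self.container_id = container_id
        self.cache: Dict[Tuple[str, ...], List[float]] = {}

    def _encode(self, tokens: List[str]) -> List[float]:
        # Stand-in for the expensive per-token attention/KV computation.
        return [float(len(t)) for t in tokens]

    def encode_prompt(self, tokens: List[str]) -> List[float]:
        # Reuse the longest cached prefix; encode only the new suffix.
        for n in range(len(tokens), 0, -1):
            cached = self.cache.get(tuple(tokens[:n]))
            if cached is not None:
                encodings = cached + self._encode(tokens[n:])
                break
        else:
            encodings = self._encode(tokens)
        self.cache[tuple(tokens)] = encodings  # reusable on the next turn
        return encodings

class SessionRouter:
    """Pins each dialog session to the container that holds its cache."""

    def __init__(self, containers: List[LLMContainer]) -> None:
        self.containers = {c.container_id: c for c in containers}
        self.affinity: Dict[str, str] = {}  # session_id -> container_id

    def route(self, session_id: str) -> LLMContainer:
        cid = self.affinity.get(session_id)
        if cid is None:  # first turn of the session: pick a container
            cid = min(self.containers, key=lambda k: len(self.containers[k].cache))
            self.affinity[session_id] = cid  # record the association
        return self.containers[cid]

# Turn 1 encodes the whole prompt; turn 2 reuses turn 1's prefix.
router = SessionRouter([LLMContainer("c0"), LLMContainer("c1")])
turn1 = ["<context>", "user:", "play", "jazz"]
router.route("sess-1").encode_prompt(turn1)
turn2 = turn1 + ["assistant:", "ok", "user:", "louder"]
router.route("sess-1").encode_prompt(turn2)  # only the suffix is encoded
```

Pinning a session to a single container matters here because the cached encodings live with that specific model instance; routing a later turn of the same dialog elsewhere would force a full re-encode of the prompt.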
Inventors
- Kartik Balasubramaniam
- Venkata Siva Sai Krishna Balakavi
- Austin Doolittle
Assignees
- AMAZON TECHNOLOGIES, INC.
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-09-19
Claims (20)
- 1. A computer-implemented method comprising: receiving first audio data representing a first user input of a dialog session associated with a session identifier; determining, using the first audio data, a first transcript representing the first user input, the first transcript associated with the session identifier; determining first context data corresponding to the first user input; generating a first prompt including at least the first transcript and the first context data, the first prompt being a first input for a large language model (LLM) to determine a response to the first user input; selecting, from a group of LLMs, a first LLM to process the first prompt, the first LLM associated with a container identifier for a container that includes the first LLM and corresponding components to enable processing by the first LLM; determining, using the first LLM, first encoded representations corresponding to the first prompt; storing, at a cache associated with the first LLM, the first encoded representations; based on storing the first encoded representations at the cache, storing first data representing an association between the session identifier and the container identifier; determining, using the first LLM and the first encoded representations, a first response to the first user input; causing presentation of the first response; receiving second audio data representing a second user input of the dialog session, the second audio data associated with the session identifier; determining, using the second audio data, a second transcript representing the second user input, the second transcript associated with the session identifier; generating a second prompt including at least the first context data, the first transcript, the first response and the second transcript, the second prompt being a second input for an LLM to determine a response to the second user input; based on the first data and based on the second user input being associated with the session identifier, selecting, from the group of LLMs, the first LLM to process the second prompt; determining, from the cache associated with the first LLM, the first encoded representations corresponding to a first portion of the second prompt; determining, using the first LLM, second encoded representations corresponding to a second portion of the second prompt; storing, at the cache, the second encoded representations; determining, using the first LLM, the first encoded representations and the second encoded representations, a second response to the second user input; and causing presentation of the second response.
- 2. The computer-implemented method of claim 1, wherein determining the second encoded representations corresponding to the second prompt further comprises: determining, using the stored first encoded representations, that a first portion of the second prompt corresponds to a second portion of the first encoded representations, the first portion of the second prompt including the first context data and the first transcript; determining a third portion of the second prompt including the first response and the second transcript; determining, using the first LLM, third encoded representations corresponding to the third portion of the second prompt; and determining the second encoded representations to include the second portion of the first encoded representations and the third encoded representations.
- 3. The computer-implemented method of claim 1, further comprising: performing automatic speech recognition (ASR) processing using the first audio data to determine the first transcript; generating a first portion of the first prompt including the first context data; while performing ASR processing, determining, using the first LLM, a second portion of the first encoded representations corresponding to the first portion of the first prompt; after determining the first transcript, determining a third portion of the first prompt including the first transcript; and determining, using the first LLM, a fourth portion of the first encoded representations corresponding to the third portion of the first prompt. (An illustrative sketch of this asynchronous encoding appears after the claims.)
- 4. The computer-implemented method of claim 1, further comprising: performing, using the first LLM and the first encoded representations, a first iteration of processing to determine a response to the first user input, the first iteration of processing resulting in generation of first processing data; determining third encoded representations corresponding to the first processing data; storing, at the cache, the third encoded representations; performing, using the first LLM, the first encoded representations and the third encoded representations, a second iteration of processing to determine the first response corresponding to the first user input; based on the first LLM determining the first response, discarding, from the cache, the third encoded representations; determining fourth encoded representations corresponding to the first response; storing, at the cache, the fourth encoded representations; and determining the second encoded representations corresponding to the second prompt by: determining, from the cache, the first encoded representations and the fourth encoded representations; and generating, using the first LLM, fifth encoded representations corresponding to the second transcript. (An illustrative sketch of this per-iteration cache handling appears after the claims.)
- 5. A computer-implemented method comprising: receiving first input data corresponding to a first user input associated with a first session identifier; generating a first prompt including at least the first input data; determining, using a first LLM from a group of LLMs, first encoded data corresponding to the first prompt, the first LLM associated with a container identifier for a container that includes the first LLM and corresponding components to enable processing by the first LLM; storing, at a first cache associated with the first LLM, the first encoded data; based on storing the first encoded data at the first cache, storing first data representing an association between the first session identifier and the container identifier; receiving second input data corresponding to a second user input associated with the first session identifier; generating a second prompt including at least the first input data and the second input data; based on the first data and based on the second input data being associated with the first session identifier, selecting, from the group of LLMs, the first LLM to process the second prompt; determining, using the first LLM and the first cache, second encoded data corresponding to the second prompt; storing, at the first cache, at least a portion of the second encoded data; and using the first LLM and at least the second encoded data, determining first output data responsive to the second user input.
- 6. The computer-implemented method of claim 5, wherein determining the second encoded data corresponding to the second prompt further comprises: determining, using the stored first encoded data, that a first portion of the second prompt corresponds to a second portion of the first encoded data; determining, using the first LLM, third encoded data corresponding to a third portion of the second prompt different than the first portion of the second prompt; and determining the second encoded data to include the second portion of the first encoded data and the third encoded data.
- 7. The computer-implemented method of claim 5, further comprising: receiving audio data representing the first user input; in response to receiving the audio data, determining context data corresponding to the first user input; determining a first portion of the first prompt including at least the context data; performing automatic speech recognition (ASR) processing using the audio data to determine the first input data representing a transcript of the first user input; and while performing ASR processing, determining, using the first LLM, third encoded data corresponding to the first portion of the first prompt.
- 8. The computer-implemented method of claim 5, further comprising: performing, using the first LLM and the first encoded data, a first iteration of processing to determine a response to the first user input, the first iteration of processing resulting in generation of first processing data; determining third encoded data corresponding to the first processing data; storing, at the first cache, the third encoded data; and performing, using the first LLM, the first encoded data and the third encoded data, a second iteration of processing to determine second output data responsive to the first user input.
- 9. The computer-implemented method of claim 8, further comprising: based on the first LLM determining the second output data responsive to the first user input, discarding, from the first cache, the third encoded data; storing, in the first cache, fourth encoded data corresponding to the second output data; determining the second prompt to include the first input data, the second output data and the second input data; and determining the second encoded data corresponding to the second prompt by: determining, from the first cache, the first encoded data and the fourth encoded data; and generating, using the first LLM, fifth encoded data corresponding to the second input data.
- 10. The computer-implemented method of claim 5, further comprising: determining that a response to the first user input is generated; before determining the second prompt and based on the response to the first user input being generated, discarding, from the first cache, a first portion of the first encoded data; and determining the second encoded data based on a second portion of the first encoded data that is stored at the first cache.
- 11. The computer-implemented method of claim 5, further comprising: receiving third input data representing a third user input associated with a second session identifier; generating a third prompt including at least the third input data, the third prompt being a third input for an LLM to determine a response to the third user input; determining, using session data, that the second session identifier is unassociated with a container identifier for an LLM; based on determining that the second session identifier is unassociated with a container identifier for an LLM, selecting, from the group of LLMs, a second LLM to process the third prompt; determining, using the second LLM, third encoded data corresponding to the third prompt; and storing, at a second cache associated with the second LLM, the third encoded data. (An illustrative routing sketch appears after the claims.)
- 12. The computer-implemented method of claim 5, further comprising: receiving third input data representing a third user input; determining a third prompt including at least the third input data; determining that the third input data is similar to fourth input data representing a past user input; determining that the fourth input data was previously processed using a second LLM of the group of LLMs; and based on the third input data being similar to the fourth input data and the fourth input data being previously processed using the second LLM, selecting the second LLM to process the third prompt. (The routing sketch after the claims also illustrates this similarity-based selection.)
- 13. A system comprising: at least one processor; and at least one memory including instructions that, when executed by the at least one processor, cause the system to: receive first input data corresponding to a first user input of a first dialog session, the first input data associated with a first session identifier for the first dialog session; generate a first prompt including at least the first input data, the first prompt being a first input for a large language model (LLM) to determine a response to the first user input; determine, using a first LLM from a group of LLMs, first encoded data corresponding to the first prompt, the first LLM being associated with a first container identifier; store, at a first cache associated with the first LLM, the first encoded data; based on storing the first encoded data at the first cache, store first data representing a first association between the first dialog session and the first container identifier; receive second input data corresponding to a second user input of the first dialog session, the second input data associated with the first session identifier; generate a second prompt including at least the first input data and the second input data, the second prompt being a second input for an LLM to determine a response to the second user input; based on the first data indicating an association between the first session identifier and the first container identifier and based on the second input data being associated with the first session identifier, select, from the group of LLMs, the first LLM to process the second prompt; determine, using the first LLM and the first cache, second encoded data corresponding to the second prompt; store, at the first cache, at least a portion of the second encoded data; and determine, using the first LLM and at least the second encoded data, first output data responsive to the second user input.
- 14. The system of claim 13, wherein the instructions that cause the system to determine the second encoded data corresponding to the second prompt further cause the system to: determine, using the stored first encoded data, that a first portion of the second prompt corresponds to a second portion of the first encoded data; determine, using the first LLM, third encoded data corresponding to a third portion of the second prompt different than the first portion of the second prompt; and determine the second encoded data to include the second portion of the first encoded data and the third encoded data.
- 15. The system of claim 13, wherein the at least one memory includes further instructions that, when executed by the at least one processor, further cause the system to: receive audio data representing the first user input; in response to receiving the audio data, determine context data corresponding to the first user input; determine a first portion of the first prompt including at least the context data; perform automatic speech recognition (ASR) processing using the audio data to determine the first input data representing a transcript of the first user input; and while performing ASR processing, determine, using the first LLM, third encoded data corresponding to the first portion of the first prompt.
- 16. The system of claim 13, wherein the at least one memory includes further instructions that, when executed by the at least one processor, further cause the system to: perform, using the first LLM and the first encoded data, a first iteration of processing to determine a response to the first user input, the first iteration of processing resulting in generation of first processing data; determine third encoded data corresponding to the first processing data; store, at the first cache, the third encoded data; and perform, using the first LLM, the first encoded data and the third encoded data, a second iteration of processing to determine second output data responsive to the first user input.
- 17. The system of claim 16, wherein the at least one memory includes further instructions that, when executed by the at least one processor, further cause the system to: based on the first LLM determining the second output data responsive to the first user input, discard, from the first cache, the third encoded data; store, in the first cache, fourth encoded data corresponding to the second output data; determine the second prompt to include the first input data, the second output data and the second input data; and determine the second encoded data corresponding to the second prompt by: determining, from the first cache, the first encoded data and the fourth encoded data; and generating, using the first LLM, fifth encoded data corresponding to the second input data.
- 18. The system of claim 13, wherein the at least one memory includes further instructions that, when executed by the at least one processor, further cause the system to: determine that a response to the first user input is generated; before determining the second prompt and based on the response to the first user input being generated, discard, from the first cache, a first portion of the first encoded data; and determine the second encoded data based on a second portion of the first encoded data that is stored at the first cache.
- 19. The system of claim 13, wherein the at least one memory includes further instructions that, when executed by the at least one processor, further cause the system to: receive third input data representing a third user input of a second dialog session, the third input data associated with a second session identifier for the second dialog session; generate a third prompt including at least the third input data, the third prompt being a third input for an LLM to determine a response to the third user input; determine, using session data, that the second dialog session is unassociated with a container identifier for an LLM; based on determining that the second dialog session is unassociated with a container identifier for an LLM, select, from the group of LLMs, a second LLM to process the third prompt; determine, using the second LLM, third encoded data corresponding to the third prompt; and store, at a second cache associated with the second LLM, the third encoded data.
- 20. The system of claim 13, wherein the at least one memory includes further instructions that, when executed by the at least one processor, further cause the system to: receive third input data representing a third user input; determine a third prompt including at least the third input data; determine that the third input data is similar to fourth input data representing a past user input; determine that the fourth input data was previously processed using a second LLM of the group of LLMs; and based on the third input data being similar to the fourth input data and the fourth input data being previously processed using the second LLM, select the second LLM to process the third prompt.
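Claims 3, 7, and 15 describe encoding the context portion of a prompt while ASR is still transcribing the audio, so that only the transcript portion remains to be encoded once ASR finishes. The following is a minimal sketch of that overlap using a thread pool; `run_asr`, `encode`, and `handle_turn` are hypothetical stand-ins, not components named in the patent.

```python
# Minimal sketch (assumed names, not from the patent): encode the context
# portion of the prompt while ASR is still transcribing the audio, so only
# the transcript remains to be encoded when ASR finishes.

import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

def run_asr(audio: bytes) -> List[str]:
    """Stand-in for ASR: audio in, transcript tokens out."""
    time.sleep(0.2)  # pretend transcription takes a while
    return ["user:", "play", "jazz"]

def encode(tokens: List[str]) -> List[float]:
    """Stand-in for the LLM's per-token prompt encoding."""
    return [float(len(t)) for t in tokens]

def handle_turn(audio: bytes, context_tokens: List[str]) -> List[float]:
    with ThreadPoolExecutor(max_workers=2) as pool:
        asr_future = pool.submit(run_asr, audio)
        # Context data is available as soon as the audio arrives, so its
        # encodings can be computed while ASR is still running.
        context_encodings = pool.submit(encode, context_tokens).result()
        transcript = asr_future.result()
    # Only the transcript portion still needs encoding after ASR completes.
    return context_encodings + encode(transcript)

encodings = handle_turn(b"\x00", ["<device>", "<time>", "<history>"])
```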
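Claims 4, 8, 9, 10, and 16-18 describe the within-turn cache bookkeeping: encodings of intermediate processing data (for example, an API call the model emits on its first iteration) are cached so a later iteration can reuse them, then discarded once the final response is generated, while encodings of content that will reappear in the next turn's prompt (the prompt itself and the response) are kept. A minimal sketch of that bookkeeping, with hypothetical names throughout and a hard-coded "API call" standing in where a real model would generate output:

```python
# Minimal sketch (assumed names, not from the patent): cache encodings of
# intermediate data between model iterations within a turn, then discard
# them once the final response is ready, keeping only encodings the next
# turn's prompt can reuse (the prompt and the response).

from typing import Dict, List

cache: Dict[str, List[float]] = {}

def encode(tokens: List[str]) -> List[float]:
    return [float(len(t)) for t in tokens]  # stand-in for attention/KV state

def run_turn(prompt: List[str]) -> List[str]:
    cache["prompt"] = encode(prompt)

    # Iteration 1: the model emits intermediate processing data
    # (e.g., an API call to fulfill the request). Hard-coded here.
    intermediate = ["call:", "music_service(genre=jazz)"]
    cache["intermediate"] = encode(intermediate)

    # Iteration 2 consumes the cached prompt + intermediate encodings
    # and produces the final response.
    _ = cache["prompt"] + cache["intermediate"]
    response = ["assistant:", "playing", "jazz"]

    # The intermediate data will not appear in the next turn's prompt,
    # so its encodings are discarded; the response's encodings are kept.
    del cache["intermediate"]
    cache["response"] = encode(response)
    return response

run_turn(["<context>", "user:", "play", "jazz"])
print(sorted(cache))  # ['prompt', 'response'] remain for the next turn
```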
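Claims 11, 12, 19, and 20 concern selecting an LLM when a session has no container association yet: a fresh session is assigned a container, and an input similar to a past input may be routed to the LLM that processed that past input, since that LLM's cache may already hold reusable encodings. A minimal sketch, assuming `difflib`'s ratio as a stand-in similarity measure and hypothetical names throughout.

```python
# Minimal sketch (assumed names, not from the patent): when a session has
# no container affinity yet, prefer the container that processed a similar
# past input, since its cache may already hold reusable encodings.

from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

affinity: Dict[str, str] = {}        # session_id -> container_id
history: List[Tuple[str, str]] = []  # (past input text, container_id)
containers = ["c0", "c1"]

def most_similar_container(text: str, threshold: float = 0.6) -> Optional[str]:
    best = max(history, default=None,
               key=lambda h: SequenceMatcher(None, text, h[0]).ratio())
    if best and SequenceMatcher(None, text, best[0]).ratio() >= threshold:
        return best[1]
    return None

def route(session_id: str, text: str) -> str:
    cid = affinity.get(session_id)
    if cid is None:  # session not yet associated with a container
        cid = most_similar_container(text) or containers[0]
        affinity[session_id] = cid
    history.append((text, cid))
    return cid

print(route("sess-1", "play some jazz"))        # new session -> c0
print(route("sess-2", "play some jazz music"))  # similar input -> same container
```

The threshold and similarity measure here are arbitrary illustration choices; a production router would presumably also weigh container load when no similar past input exists.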
Description
BACKGROUND

Natural language processing systems have progressed to the point where humans can interact with computing devices using their voices and natural language textual input. Such systems employ techniques to identify the words spoken and written by a human user based on the various qualities of received input data. Speech recognition combined with natural language understanding processing techniques enables speech-based user control of computing devices to perform tasks based on the user's spoken inputs. Such processing may be used by computers, hand-held devices, telephone computer systems, kiosks, and a wide variety of other devices to improve human-computer interactions.

BRIEF DESCRIPTION OF DRAWINGS

For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.

- FIG. 1 is a conceptual diagram illustrating an example system for cache management for large language model inference, according to embodiments of the present disclosure.
- FIG. 2 is a conceptual diagram illustrating an example configuration of a system involving use of multiple LLM containers, according to embodiments of the present disclosure.
- FIG. 3 is a flowchart illustrating an example process for using cache data that matches a portion of a prompt, according to embodiments of the present disclosure.
- FIGS. 4A-4B are signal flow diagrams illustrating an example process of asynchronous prompt encoding and automatic speech recognition (ASR) processing, according to embodiments of the present disclosure.
- FIG. 5 conceptually illustrates tokens and data that are stored in a cache for dialog turns and model processing iterations, according to embodiments of the present disclosure.
- FIGS. 6A-6B show example user inputs and corresponding model processing data that may be stored in a cache, according to embodiments of the present disclosure.
- FIG. 6C is a flowchart illustrating an example process that may be performed to store data in a cache and discard data from the cache, according to embodiments of the present disclosure.
- FIG. 7 is a conceptual diagram illustrating example components and processing for determining one or more components configured to perform an action associated with a task, according to embodiments of the present disclosure.
- FIG. 8 is a conceptual diagram illustrating example components and processing of a plan generation component, according to embodiments of the present disclosure.
- FIG. 9 is a conceptual diagram illustrating example components and processing of an LLM shortlister, according to embodiments of the present disclosure.
- FIG. 10 is a conceptual diagram illustrating example components and processing of a response arbitration component, according to embodiments of the present disclosure.
- FIG. 11 is a conceptual diagram of components of the system, according to embodiments of the present disclosure.
- FIG. 12 is a conceptual diagram illustrating components that may be included in a device, according to embodiments of the present disclosure.
- FIG. 13 is a block diagram conceptually illustrating example components of a device, according to embodiments of the present disclosure.
- FIG. 14 is a block diagram conceptually illustrating example components of a system, according to embodiments of the present disclosure.
- FIG. 15 illustrates an example of a computer network for use with the speech processing system.
DETAILED DESCRIPTION

Automatic speech recognition (ASR) is a field of computer science, artificial intelligence, and linguistics concerned with transforming audio data associated with speech into a token or other textual representation of that speech. Similarly, natural language understanding (NLU) is a field of computer science, artificial intelligence, and linguistics concerned with enabling computers to derive meaning from natural language inputs (such as spoken inputs). ASR and NLU are often used together as part of a language processing component of a system. Text-to-speech (TTS) is a field of computer science concerned with transforming textual and/or other data into audio data that is synthesized to resemble human speech. Natural language generation (NLG) is a field of artificial intelligence concerned with automatically transforming data into natural language (e.g., English) content. Language modeling (LM) is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. LM can be used to perform various tasks, including generative tasks that involve generating data rather than discriminating between given classes. Language models analyze bodies of text data to provide a basis for their word predictions. The language models are generative models. In some embodiments, a language model may be a large language model (LLM). An LLM is an advanced artificial intelligence system designed to process, understand, and generate human-like text based on vast amounts of data.
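As a toy illustration of the sequence-probability idea above (an illustration only, not anything described in the patent), a bigram model approximates the chain-rule factorization P(w1..wn) = P(w1) * P(w2 | w1) * ... * P(wn | w1..wn-1) using counts from a small corpus:

```python
# Toy illustration (not from the patent): a language model scores a word
# sequence via the chain rule, here approximated with bigram statistics
# and add-one smoothing over a tiny corpus.

from collections import Counter
from math import prod
from typing import List

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(prev: str, word: str) -> float:
    """P(word | prev) with add-one smoothing over the corpus vocabulary."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

def sequence_probability(words: List[str]) -> float:
    p_first = unigrams[words[0]] / len(corpus)
    return p_first * prod(p_next(a, b) for a, b in zip(words, words[1:]))

print(sequence_probability("the cat sat".split()))  # likelier than the next line
print(sequence_probability("sat the cat".split()))
```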