US-12626698-B1 - Component shortlisting
Abstract
Techniques for identifying components (e.g., application program interfaces (APIs)) relevant for a large language model (LLM) prompt are described. The shortlister includes one or more proposers that each select n components for performing an instant task. The n components from each proposer may be merged and reranked to produce a ranked list of K components for inclusion in the LLM prompt. The shortlister proposers can include, among others, a request-based proposer, query-based proposer, and/or rule-based proposer. In particular, the query-based proposer may utilize an LLM to generate a query to search component descriptions to identify n components the proposer determines are usable to perform the instant task.
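As a rough illustration of the merge-and-rerank flow summarized above, the following sketch shows one way a shortlister could union the top-n proposals from several proposers and rerank them to K candidates. The proposer and scoring functions here are hypothetical placeholders, not part of the described system:

```python
from typing import Callable

# A component is identified here simply by its name; the described system
# would carry richer data such as API descriptions.
Component = str

def shortlist(
    task: str,
    proposers: list[Callable[[str], list[Component]]],  # each returns its top-n candidates
    rerank: Callable[[str, Component], float],          # higher score = more relevant
    k: int,
) -> list[Component]:
    """Merge candidates from every proposer, deduplicate, and keep the top-K."""
    merged: dict[Component, None] = {}
    for propose in proposers:
        for component in propose(task):
            merged.setdefault(component, None)  # union of all proposals, order-preserving
    # Rerank the merged pool against the task and keep the K best.
    return sorted(merged, key=lambda c: rerank(task, c), reverse=True)[:k]
```

In this sketch each proposer (request-based, query-based, rule-based, etc.) is an interchangeable callable, matching the disclosure's point that proposals from multiple proposers are merged before a single reranking step.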
Inventors
- Mohammad Kachuee
- Vaibhav Kumar
- Yibo Yao
- Saurabh Gupta
- Xiang Li
- Puyang Xu
- Xing Fan
- Corey Daniel Rogers
- Prakhar Bhardwaj
Assignees
- AMAZON TECHNOLOGIES, INC.
Dates
- Publication Date
- 20260512
- Application Date
- 20230919
Claims (20)
- 1 . A computer-implemented method comprising: receiving audio data corresponding to a first spoken user input associated with user profile data; processing the audio data to determine a transcript of the first spoken user input; generating a first prompt for a first search query to identify one or more application program interfaces (APIs) based on the transcript and the user profile data; processing, using a first large language model (LLM), the first prompt to generate the first search query to identify one or more APIs; using a first machine learning (ML) trained model, processing the first search query generated by the LLM and a plurality of API descriptions to determine a set of APIs capable of performing a task with respect to the first spoken user input; generating a second prompt to identify at least one API, from the set of APIs, to perform the task, the second prompt including the user profile data; processing, using a second LLM, the second prompt to select, from the set of APIs, a first API to perform the task; based on the second LLM selecting the first API, performing, using the first API, the task to determine output data responsive to the first spoken user input; and causing presentation of the output data.
- 2 . The computer-implemented method of claim 1 , further comprising: receiving metric data associated with the first API, wherein the metric data indicates at least one of: a number of times the first API was called and provided a response; a number of times the first API was called and resulted in positive user feedback; and a number of times the first API was called with respect to user inputs associated with the user profile data; and based at least in part on the metric data, including the first API in the set of APIs.
- 3 . The computer-implemented method of claim 1 , further comprising: determining a first location indicated in the user profile data; determining a second location associated with a second API; determining the second location is different from the first location; and excluding, from the plurality of API descriptions, an API description of the second API based on the second location being different from the first location.
- 4 . The computer-implemented method of claim 1 , further comprising: using a second ML trained model, processing the transcript and a stored second spoken user input for invoking a second API to determine the second API is configured to perform the task; and generating the second prompt to further indicate the second API.
- 5 . A computer-implemented method comprising: receiving first input data representing a first user input; processing, using a first large language model (LLM), the first input data to generate a first query to search component descriptions to identify one or more components corresponding to the first user input; using the first query generated by the first LLM, determining a set of components corresponding to the first user input; generating a first LLM prompt indicating the set of components; processing, using a second LLM, the first LLM prompt to select, from the set of components, a first component to perform a task corresponding to the first user input; generating, using at least the first component, output data responsive to the first user input; and presenting the output data.
- 6 . The computer-implemented method of claim 5 , further comprising: receiving metric data associated with the first component, wherein the metric data indicates at least one of: a number of times the first component was called and provided a response; a number of times the first component was called and resulted in positive user feedback; and a number of times the first component was called with respect to user inputs associated with user profile data associated with the first user input; and based at least in part on the metric data, generating the first LLM prompt to indicate the first component.
- 7 . The computer-implemented method of claim 5 , further comprising: determining a first location indicated in user profile data associated with the first user input; determining a second location associated with a second component; determining the second location is different from the first location; and based on the second location being different from the first location, excluding the second component from the set of components.
- 8 . The computer-implemented method of claim 5 , further comprising: based on a stored second user input for invoking a second component, determining the second component is configured to perform the task; and generating the first LLM prompt to further indicate the second component.
- 9 . The computer-implemented method of claim 8 , further comprising: receiving audio data corresponding to the first user input; performing automatic speech recognition (ASR) processing on the audio data to generate ASR results; determining the second component is configured to perform the task based on the stored second user input corresponding to the ASR results; and determining the set of components using the first query at least partially in parallel or after determining the second component is configured to perform the task.
- 10 . The computer-implemented method of claim 5 , further comprising: generating a second LLM prompt for the first query; and processing, using the first LLM, the second LLM prompt to determine the first query.
- 11 . The computer-implemented method of claim 5 , wherein the first user input is received from a first device, and wherein the computer-implemented method further comprises: determining a first device type of the first device; determining a second device type associated with a second component; determining the second device type is different from the first device type; and excluding the second component from the first LLM prompt based on the second device type being different from the first device type.
- 12 . The computer-implemented method of claim 5 , further comprising: storing first data indicating the first component was selected to perform the task with respect to the first user input; receiving second data representing a second user input; determining the second user input corresponds to the first user input; and based on the first data and the second user input corresponding to the first user input, determining the first component is configured to perform a task with respect to the second user input.
- 13 . A computing system comprising: at least one processor; and at least one memory comprising instructions that, when executed by the at least one processor, cause the computing system to: receive first input data representing a first user input; process, using a first large language model (LLM), the first input data to generate a first query to search component descriptions to identify one or more components corresponding to the first user input; using the first query generated by the first LLM, determine a set of components corresponding to the first user input; generate a first LLM prompt indicating the set of components; process, using a second LLM, the first LLM prompt to select, from the set of components, a first component to perform a task corresponding to the first user input; generate, using at least the first component, output data responsive to the first user input; and present the output data.
- 14 . The computing system of claim 13 , wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: receive metric data associated with the first component, wherein the metric data indicates at least one of: a number of times the first component was called and provided a response; a number of times the first component was called and resulted in positive user feedback; and a number of times the first component was called with respect to user inputs associated with user profile data associated with the first user input; and based at least in part on the metric data, generate the first LLM prompt to indicate the first component.
- 15 . The computing system of claim 13 , wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: determine a first location indicated in user profile data associated with the first user input; determine a second location associated with a second component; determine the second location is different from the first location; and based on the second location being different from the first location, exclude the second component from the set of components.
- 16 . The computing system of claim 13 , wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: based on a stored second user input for invoking a second component, determine the second component is configured to perform the task; and generate the first LLM prompt to further indicate the second component.
- 17 . The computing system of claim 16 , wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: receive audio data corresponding to the first user input; perform automatic speech recognition (ASR) processing on the audio data to generate ASR results; determine the second component is configured to perform the task based on the stored second user input corresponding to the ASR results; and determine the set of components using the first query at least partially in parallel or after determining the second component is configured to perform the task.
- 18 . The computing system of claim 13 , wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: generate a second LLM prompt for the first query; and process, using the first LLM, the second LLM prompt to determine the first query.
- 19 . The computing system of claim 13 , wherein the first user input is received from a first device, and wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: determine a first device type of the first device; determine a second device type associated with a second component; determine the second device type is different from the first device type; and exclude the second component from the first LLM prompt based on the second device type being different from the first device type.
- 20 . The computing system of claim 13 , wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: store first data indicating the first component was selected to perform the task with respect to the first user input; receive second data representing a second user input; determine the second user input corresponds to the first user input; and based on the first data and the second user input corresponding to the first user input, determine the first component is configured to perform a task with respect to the second user input.
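The end-to-end flow recited in claims 5 and 13 (a first LLM generating a search query, a search over component descriptions, and a second LLM selecting a component from the resulting set) can be sketched roughly as follows. The `llm` and `embed` callables, the cosine-similarity search, and all component names are illustrative assumptions, not the claimed implementation:

```python
from typing import Callable

def handle_user_input(
    user_input: str,
    components: dict[str, str],          # component name -> natural-language description
    llm: Callable[[str], str],           # stands in for both LLMs recited in the claims
    embed: Callable[[str], list[float]], # embedding model used to search descriptions
    top_n: int = 5,
) -> str:
    # (1) First LLM turns the user input into a search query over component descriptions.
    query = llm(f"Write a search query to find components for: {user_input}")

    # (2) Embedding-based search: rank component descriptions by similarity to the query.
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb or 1.0)

    q_vec = embed(query)
    ranked = sorted(components, key=lambda n: cosine(q_vec, embed(components[n])), reverse=True)
    shortlist = ranked[:top_n]

    # (3) Second LLM selects one component from the shortlist to perform the task.
    prompt = f"User input: {user_input}\nCandidates: {', '.join(shortlist)}\nPick one."
    return llm(prompt)
```

In the claimed method the selected component would then be invoked to generate output data responsive to the user input; this sketch stops at the selection step.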
Description
BACKGROUND

Natural language processing systems have progressed to the point where humans can interact with computing devices using their voices and natural language textual input. Such systems employ techniques to identify the words spoken and written by a human user based on the various qualities of received input data. Speech recognition combined with natural language understanding processing techniques enables speech-based user control of computing devices to perform tasks based on the user's spoken inputs. Such processing may be used by computers, hand-held devices, telephone computer systems, kiosks, and a wide variety of other devices to improve human-computer interactions.

BRIEF DESCRIPTION OF DRAWINGS

For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.

- FIG. 1A is a conceptual diagram illustrating a component shortlister, according to embodiments of the present disclosure.
- FIG. 1B is a conceptual diagram illustrating creation of a request index, according to embodiments of the present disclosure.
- FIG. 1C is a conceptual diagram illustrating a runtime embedding based search process, according to embodiments of the present disclosure.
- FIG. 1D is a conceptual diagram illustrating asynchronous calling of component proposers of the component shortlister, according to embodiments of the present disclosure.
- FIG. 1E is a conceptual diagram illustrating a process for caching outputs of component proposers for use with respect to a subsequent request(s), according to embodiments of the present disclosure.
- FIG. 2 is a conceptual diagram illustrating example components and processing for determining one or more components configured to perform an action associated with a task, according to embodiments of the present disclosure.
- FIG. 3 is a conceptual diagram illustrating example components and processing of a plan generation component, according to embodiments of the present disclosure.
- FIG. 4 is a conceptual diagram illustrating example components and processing to generate potential response data, according to embodiments of the present disclosure.
- FIG. 5 is a conceptual diagram illustrating example components and processing of a response arbitration component, according to embodiments of the present disclosure.
- FIG. 6 is a conceptual diagram of components of the system, according to embodiments of the present disclosure.
- FIG. 7 is a conceptual diagram illustrating example processing of an arbitrator component, according to embodiments of the present disclosure.
- FIG. 8 is a conceptual diagram illustrating components that may be included in a device, according to embodiments of the present disclosure.
- FIG. 9 is a block diagram conceptually illustrating example components of a user device, according to embodiments of the present disclosure.
- FIG. 10 is a block diagram conceptually illustrating example components of a system component, according to embodiments of the present disclosure.
- FIG. 11 illustrates an example of a computer network, according to embodiments of the present disclosure.

DETAILED DESCRIPTION

In-context learning enables large language models (LLMs) to access knowledge outside their training data. As used herein, “LLM” refers to an artificial intelligence model trained on a vast amount of text to understand existing context and generate original content. For example, an LLM may be a transformer-based seq2seq model involving an encoder-decoder architecture. In an encoder-decoder architecture, the encoder may produce a representation of an input text using a bidirectional encoding, and the decoder may use that representation to perform some task. As a further example, an LLM may use a decoder-only architecture. The decoder-only architecture may use left-to-right (i.e., unidirectional) encoding of the input text. Example LLMs of the present disclosure include, but are not limited to, the Alexa Teacher Model, Generative Pre-trained Transformer (GPT) models (such as GPT-3), BigScience Large Open-science Open-access Multilingual Language Model (BLOOM), Language Model for Dialogue Applications model (LaMDA), Bard, Large Language Model Meta AI (LLaMA), Titan Foundational Model, Bidirectional Auto-Regressive Transformers (BART), and T5.

The input to the LLM may be in the form of a prompt. A prompt may be a natural language input, for example an instruction, directing the LLM to generate an output according to the prompt. The output generated by the LLM may be a natural language output responsive to the prompt. In some embodiments, the output may be another type of data, such as audio, image, or video. The prompt, in some cases, can also include in-context learning information that may be used by the LLM to generate a response to the prompt. With in-context learning, the LLM can be directly instructed to use a set of candidate components (e.g., application program interfaces, or APIs), but it is limited b