EP-4735991-A1 - CAUSING PERFORMANCE OF AN ACTION BASED ON NATURAL LANGUAGE USER INPUT
Abstract
Techniques for generating a prompt for a language model to determine an action responsive to a user input are described. In some embodiments, the system receives a user input, determines one or more application programming interfaces (APIs) configured to perform actions relevant to the user input, and determines exemplars representing examples of using the APIs with respect to user inputs similar to the current user input. The system further determines device states of devices determined to be related to the user input, along with other contextual information (e.g., weather information, time of day, geographic location). The system generates a prompt including the user input, the APIs, the exemplars, the device states, and the other contextual information. A language model processes the prompt to determine an action responsive to the user input, and the system causes performance of the action.
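The pipeline the abstract describes (gather relevant APIs, exemplars, device states, and contextual information, then assemble them into one prompt) can be sketched as follows. This is a minimal illustration only; every class, field, and string layout here is an assumption, not terminology or format from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class Exemplar:
    example_input: str    # a past user input similar to the current one
    system_response: str  # the response generated for that example input

@dataclass
class PromptBuilder:
    user_input: str
    apis: list = field(default_factory=list)         # relevant API signatures
    exemplars: list = field(default_factory=list)    # Exemplar objects
    device_states: dict = field(default_factory=dict)
    context: dict = field(default_factory=dict)      # weather, time of day, location, ...

    def build(self) -> str:
        # Assemble the sections the abstract lists into a single prompt string.
        lines = [f"User input: {self.user_input}", "Available APIs:"]
        lines += [f"  - {api}" for api in self.apis]
        lines.append("Exemplars:")
        for ex in self.exemplars:
            lines.append(f"  Q: {ex.example_input} -> A: {ex.system_response}")
        lines.append("Device states:")
        lines += [f"  {d}: {s}" for d, s in self.device_states.items()]
        lines.append("Context:")
        lines += [f"  {k}: {v}" for k, v in self.context.items()]
        return "\n".join(lines)
```

A real system would pass the built string to its language model; the sections could equally be serialized as JSON or a chat transcript.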
Inventors
- WANG, Hann
- METALLINOU, Angeliki
- GENS, Melanie C. B.
- BISWAS, Arijit
- SHI, Ying
Assignees
- Amazon Technologies, Inc.
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2024-02-01
Claims (15)
- 1. A computer-implemented method comprising: receiving first natural language input data representing a first user input; based on the first natural language input data, determining a first set of actions available with respect to the first user input, the first set of actions including at least a first action; based on the first natural language input data and the first set of actions including the first action, determining first data including a first example user input similar to the first user input and a first system response to be generated in response to the first example user input; based on the first set of actions including the first action, determining second data associated with performing the first action; determining a first prompt including the first natural language input data, the first set of actions, the first data, and the second data; processing, using a language model, the first prompt to generate first output data indicating the first action is to be performed; and causing performance of the first action.
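As a rough illustration of claim 1's sequence (determine candidate actions, build a prompt, let a language model indicate an action, then cause its performance), the following sketch uses a scripted stand-in for the model. Every helper name, the action table, and the "ACTION: <name>" output convention are assumptions made for illustration, not details from the patent:

```python
# Toy action registry: action name -> callable that performs it.
ACTIONS = {
    "turn_on_light": lambda: "light is on",
    "speak": lambda: "spoken response queued",
}

def determine_actions(user_input: str) -> list:
    # Assumed retrieval step: pick candidate actions relevant to the input.
    return [name for name in ACTIONS if name.split("_")[-1] in user_input]

def mock_language_model(prompt: str) -> str:
    # Stand-in for the model: echoes the first listed candidate action.
    first = prompt.split("Actions: ")[1].split(",")[0]
    return f"ACTION: {first}"

def handle(user_input: str) -> str:
    actions = determine_actions(user_input)                       # first set of actions
    prompt = f"Input: {user_input}\nActions: {', '.join(actions)}"  # first prompt
    output = mock_language_model(prompt)                          # first output data
    chosen = output.removeprefix("ACTION: ")
    return ACTIONS[chosen]()                                      # cause performance
```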
- 2. The computer-implemented method of claim 1, further comprising: receiving first response data associated with performance of the first action; processing the first prompt and the first response data to determine a second prompt associated with the first user input and the first response data; processing, using the language model, the second prompt to generate second model output data indicating a first response is to be presented; and causing presentation of the first response.
- 3. The computer-implemented method of claim 1 or 2, wherein: the first action corresponds to outputting audio data corresponding to first natural language data included in the first output data, causing performance of the first action comprises causing a first component to perform the first action, the first component configured to perform text-to-speech processing, and the method further comprises: receiving, from the first component, first response data indicating performance of the first action; and based on receiving the first response data and the first action corresponding to outputting the audio data, ceasing further processing of the first user input by the language model.
- 4. The computer-implemented method of claim 1, 2, or 3, wherein the first user input corresponds to a first request to power on a first device, and the method further comprises: based on the first user input corresponding to the first request to power on the first device, determining third data representing a second request for: a set of actions associated with powering on a device, exemplar data associated with the set of actions, the exemplar data including an example user input similar to the first user input and a system response to be generated in response to the example user input, and a device state of the first device, wherein the first set of actions and the first data are determined further based on the third data.
- 5. The computer-implemented method of claim 1, 2, 3, or 4, wherein determining the second data further comprises: determining the first action is associated with performance of an action using a device; based on determining the first action is associated with performance of the action using the device: determining a user profile associated with the first user input, determining a first device associated with the user profile, wherein the first device is capable of performing a second action associated with the first user input, and determining third data representing a state of the first device, wherein the second data includes the third data; and determining fourth data representing a context associated with the first user input, wherein the second data includes the fourth data.
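Claim 5's device-state step (find the user profile tied to the input, locate a device under that profile capable of the needed action, then attach its state and other context to the prompt data) can be sketched as below. The in-memory tables and all names are hypothetical stand-ins for whatever profile and device registries a real system would query:

```python
# Hypothetical in-memory registries standing in for a profile/device service.
PROFILES = {"alice": ["kitchen_light", "thermostat"]}
DEVICE_STATES = {"kitchen_light": "off", "thermostat": "68F"}
CAPABILITIES = {"kitchen_light": {"power_on"}, "thermostat": {"set_temp"}}

def device_data(profile: str, needed_action: str, context: dict) -> dict:
    # Find a device under the profile capable of the needed action.
    for device in PROFILES.get(profile, []):
        if needed_action in CAPABILITIES.get(device, set()):
            return {
                "device": device,
                "state": DEVICE_STATES[device],  # "third data": the device state
                "context": context,              # "fourth data": e.g. time of day
            }
    return {"device": None, "state": None, "context": context}
```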
- 6. The computer-implemented method of claim 1, 2, 3, 4, or 5, further comprising: determining an action definition of the first action; determining that the action definition is semantically similar to the first user input; based on determining that the action definition is semantically similar to the first user input, including the first action in the first set of actions, wherein determining the first data is based on determining the first example user input is semantically similar to the first user input; and identifying, from amongst a plurality of system responses, the first system response based on the first system response being associated with the first action and the first example user input.
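Claim 6 includes an action in the candidate set when its definition is semantically similar to the user input. Production systems would compare learned embeddings; the sketch below substitutes a toy bag-of-words vector and cosine similarity so the example stays self-contained, with the threshold value chosen arbitrarily:

```python
import math

def embed(text: str) -> dict:
    # Toy bag-of-words "embedding": token -> count.
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_actions(user_input: str, definitions: dict, threshold: float = 0.3) -> list:
    # Keep actions whose definitions score above the similarity threshold.
    q = embed(user_input)
    return [name for name, desc in definitions.items()
            if cosine(q, embed(desc)) >= threshold]
```

The same similarity test would also drive the exemplar lookup the claim describes: retrieve stored example inputs whose embeddings are close to the current input's.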
- 7. The computer-implemented method of claim 1, 2, 3, 4, 5, or 6, further comprising: determining third data representing a stopping condition, wherein the language model is to cease processing of a user input in response to determining the stopping condition has been satisfied; processing the first output data to determine that the first action corresponds to the stopping condition; and ceasing further processing of the first user input, by the language model, based on the first action satisfying the stopping condition.
- 8. The computer-implemented method of claim 1, 2, 3, 4, 5, 6, or 7, further comprising: determining third data representing a stopping condition, wherein the language model is to cease processing of a user input in response to generating output data absent of an action to be performed; receiving first response data associated with performance of the first action; processing the first prompt and the first response data to determine a second prompt associated with the first user input and the first response data; processing, using the language model, the second prompt to generate second model output data; determining the second model output data corresponds to the stopping condition; and based on the stopping condition, ceasing further processing of the first user input by the language model.
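Claims 7 and 8 describe a loop with a stopping condition: after each action, its result is appended to the prompt and the model is re-invoked, and processing ceases when the output names no further action (or names a terminal one). A minimal sketch, with a scripted stand-in model and an assumed "ACTION:" convention:

```python
def run_until_stop(prompt: str, model, perform, max_turns: int = 5) -> list:
    performed = []
    for _ in range(max_turns):
        output = model(prompt)
        if not output.startswith("ACTION:"):  # stopping condition: no action named
            break
        action = output.split("ACTION:", 1)[1].strip()
        result = perform(action)
        performed.append(action)
        # Build the "second prompt" of claim 8: original prompt plus response data.
        prompt += f"\nResult of {action}: {result}"
    return performed

# Scripted model: first turn requests an action, second turn stops.
_turn = {"n": 0}
def scripted_model(prompt: str) -> str:
    _turn["n"] += 1
    return "ACTION: power_on" if _turn["n"] == 1 else "Done."
```

`max_turns` is a safety bound the claims do not mention; it guards against a model that never satisfies the stopping condition.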
- 9. A computing system comprising: at least one processor; and at least one memory including instructions that, when executed by the at least one processor, cause the computing system to: receive first natural language input data representing a first user input; based on the first natural language input data, determine a first set of actions available with respect to the first user input, the first set of actions including at least a first action; based on the first natural language input data, determine at least a first example user input similar to the first user input; determine first data including the first example user input and a first system response to be generated in response to the first example user input; based on the first set of actions including the first action, determine second data associated with performing the first action; determine a first prompt including the first natural language input data, the first set of actions, the first data, and the second data; process, using a language model, the first prompt to generate first output data indicating the first action is to be performed; and cause performance of the first action.
- 10. The computing system of claim 9, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: receive first response data associated with performance of the first action; process the first prompt and the first response data to determine a second prompt associated with the first user input and the first response data; process, using the language model, the second prompt to generate second model output data indicating a first response is to be presented; and cause presentation of the first response.
- 11. The computing system of claim 9 or 10, wherein: the first action corresponds to outputting audio data corresponding to first natural language data included in the first output data, the instructions that cause the computing system to cause performance of the first action comprise further instructions that, when executed by the at least one processor, further cause the computing system to cause a first component to perform the first action, the first component configured to perform text-to-speech processing, and the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: receive, from the first component, first response data indicating performance of the first action; and based on receiving the first response data and the first action corresponding to outputting the audio data, cease further processing of the first user input by the language model.
- 12. The computing system of claim 9, 10, or 11, wherein the first user input corresponds to a first request to power on a first device, and wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: based on the first user input corresponding to the first request to power on the first device, determine third data representing a second request for: a set of actions associated with powering on a device, exemplar data associated with the set of actions, the exemplar data including an example user input similar to the first user input and a system response to be generated in response to the example user input, and a device state of the first device, wherein the first set of actions and the first data are determined further based on the third data.
- 13. The computing system of claim 9, 10, 11, or 12, wherein the instructions that cause the computing system to determine the second data comprise further instructions that, when executed by the at least one processor, further cause the computing system to: determine the first action is associated with performance of an action using a device; based on determining the first action is associated with performance of the action using the device: determine a user profile associated with the first user input, determine a first device associated with the user profile, wherein the first device is capable of performing a second action associated with the first user input, and determine third data representing a state of the first device, wherein the second data includes the third data; and determine fourth data representing a context associated with the first user input, wherein the second data includes the fourth data.
- 14. The computing system of claim 9, 10, 11, 12, or 13, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: determine an action definition of the first action; determine that the action definition is semantically similar to the first user input; based on determining that the action definition is semantically similar to the first user input, include the first action in the first set of actions, wherein determining the first data is based on determining the first example user input is semantically similar to the first user input; and identify, from amongst a plurality of system responses, the first system response based on the first system response being associated with the first action and the first example user input.
- 15. The computing system of claim 9, 10, 11, 12, 13, or 14, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: determine third data representing a stopping condition, wherein the language model is to cease processing of a user input in response to determining the stopping condition has been satisfied; process the first output data to determine that the first action corresponds to the stopping condition; and cease further processing of the first user input, by the language model, based on the first action satisfying the stopping condition.
Description
CAUSING PERFORMANCE OF AN ACTION BASED ON NATURAL LANGUAGE USER INPUT

CROSS-REFERENCE TO RELATED APPLICATION DATA

[0001] This application claims the benefit of and priority to U.S. Patent Application No. 18/345,455, filed June 30, 2023, and entitled “NATURAL LANGUAGE GENERATION,” in the names of Hann Wang, et al. The above patent application is herein incorporated by reference in its entirety.

BACKGROUND

[0002] Natural language processing systems have progressed to the point where humans can interact with computing devices using their voices and natural language textual input. Such systems employ techniques to identify the words spoken and written by a human user based on the various qualities of received input data. Speech recognition combined with natural language understanding processing techniques enable speech-based user control of computing devices to perform tasks based on the user’s spoken inputs. Speech recognition and natural language understanding processing techniques may be referred to collectively or separately herein as spoken language understanding (SLU) processing. SLU processing may be used by computers, hand-held devices, telephone computer systems, kiosks, and a wide variety of other devices to improve human-computer interactions.

BRIEF DESCRIPTION OF DRAWINGS

[0003] For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.

[0004] FIG. 1 is a conceptual diagram illustrating example components and processing of a system for generating a prompt usable by a language model to determine an action responsive to a user input, according to embodiments of the present disclosure.

[0005] FIG. 2 is a conceptual diagram illustrating example processing of a knowledge provider component, according to embodiments of the present disclosure.

[0006] FIG. 3 is a conceptual diagram illustrating further example components and processing of the system for generating the prompt, according to embodiments of the present disclosure.

[0007] FIG. 4 is a conceptual diagram illustrating example processing performed by the system to cause performance of the action responsive to the user input, according to embodiments of the present disclosure.

[0008] FIG. 5 is a conceptual diagram illustrating example processing of an action provider component, according to embodiments of the present disclosure.

[0009] FIG. 6 is a conceptual diagram of components of the system, according to embodiments of the present disclosure.

[0010] FIG. 7 is a conceptual diagram illustrating components that may be included in a device, according to embodiments of the present disclosure.

[0011] FIG. 8 is a block diagram conceptually illustrating example components of a device, according to embodiments of the present disclosure.

[0012] FIG. 9 is a block diagram conceptually illustrating example components of a system, according to embodiments of the present disclosure.

[0013] FIG. 10 illustrates an example of a network for use with the overall system, according to embodiments of the present disclosure.

DETAILED DESCRIPTION

[0014] Automatic speech recognition (ASR) is a field of computer science, artificial intelligence, and linguistics concerned with transforming audio data associated with speech into a token or other textual representation of that speech. Similarly, natural language understanding (NLU) is a field of computer science, artificial intelligence, and linguistics concerned with enabling computers to derive meaning from natural language inputs (such as spoken inputs). ASR and NLU are often used together as part of a language processing component of a system. Text-to-speech (TTS) is a field of computer science concerning transforming textual and/or other data into audio data that is synthesized to resemble human speech.
Natural language generation (NLG) is a field of artificial intelligence concerned with automatically transforming data into natural language (e.g., English) content. In some embodiments, NLU processing and NLG processing may be logical subcomponents of natural language processing (NLP).

[0015] Certain systems may be configured to respond to natural language (e.g., spoken or typed) user inputs. For example, in response to the user input “what is today’s weather,” the system may output weather information for the user’s geographic location. As another example, in response to the user input “what are today’s top stories,” the system may output one or more news stories. For further example, in response to the user input “tell me a joke,” the system may output a joke to the user. As another example, in response to the user input “book me a flight to Seattle,” the system may book a flight to Seattle and output information of the booked flight. For further example, in response to the user input “lock the front door,” the system may actuate a “front door” smart lock to a locked position.

[0016] A system may receive a user