US-20260127431-A1 - AUTOREGRESSIVELY GENERATING SEQUENCES OF DATA ELEMENTS DEFINING ACTIONS TO BE PERFORMED BY AN AGENT

US 20260127431 A1

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for selecting actions to be performed by an agent to interact with an environment using an action selection neural network. In one aspect, a method comprises, at each time step in a sequence of time steps: generating a current representation of a state of a task being performed by the agent in the environment as of the current time step as a sequence of data elements; autoregressively generating a sequence of data elements representing a current action to be performed by the agent at the current time step; and after autoregressively generating the sequence of data elements representing the current action, causing the agent to perform the current action at the current time step.

Inventors

  • Scott Ellison Reed
  • Sergio Gomez
  • Ashley Deloris Edwards
  • Jacob Bruce
  • Gabriel Barth-Maron
  • Konrad Zolna
  • Emilio Parisotto
  • Tom Erez
  • Alexander Novikov
  • Jack William Rae
  • Misha Man Ray Denil
  • Joao Ferdinando Gomes de Freitas
  • Oriol Vinyals

Assignees

  • GDM HOLDING LLC

Dates

Publication Date
2026-05-07
Application Date
2025-11-18

Claims (20)

1.-26. (canceled)

27. A method performed by one or more computers for selecting actions to be performed by an agent to interact with an environment using an action selection neural network, the method comprising: receiving a prompt defining a task to be performed by the agent; initializing a current representation of a state of the task to be performed by the agent as a sequence of data elements based on the prompt; and at each time step in a sequence of time steps: autoregressively generating a sequence of data elements representing a current action to be performed by the agent at the current time step, comprising, for each position starting from a first position in the sequence of data elements representing the current action: processing the current representation of the state of the task using the action selection neural network to generate a score distribution over a set of possible data elements; selecting a data element for the position in the sequence of data elements representing the current action in accordance with the score distribution; and updating the current representation of the state of the task by concatenating the selected data element for the position to the current representation of the state of the task; and after autoregressively generating the sequence of data elements representing the current action, causing the agent to perform the current action at the current time step.

28. The method of claim 27, wherein initializing a current representation of a state of the task to be performed by the agent as a sequence of data elements based on the prompt comprises: generating a representation of the prompt as a sequence of data elements; and concatenating the representation of the prompt to a representation of a current observation characterizing a state of the environment.

29. The method of claim 28, wherein the prompt includes a sequence of text and generating the representation of the prompt as a sequence of data elements comprises: generating a representation of the sequence of text as a sequence of tokens from a predefined set of tokens; and mapping each token in the sequence of tokens to a corresponding numerical value in accordance with a predefined mapping.

30. The method of claim 27, wherein the prompt comprises one or more of: a demonstration of the task, a goal observation characterizing a goal state of the environment, or a sequence of text in a natural language that provides instructions related to the task.

31. The method of claim 28, wherein the prompt comprises a demonstration of the task and generating the representation of the prompt as a sequence of data elements comprises combining a return achieved by performing the task during the task demonstration with the representation of the prompt as a sequence of data elements, wherein the return defines a cumulative measure of rewards achieved as a result of performing the task during the task demonstration.

32. The method of claim 27, wherein the prompt comprises a sequence of natural language text and the task comprises generating one or more sequences of natural language text responsive to the prompt.

33. The method of claim 27, wherein the prompt comprises a natural language description of desired computer code, and the task comprises generating a sequence of computer code that fits the natural language description.

34. The method of claim 27, wherein the prompt defines an input sequence of computer code, and the task comprises generating an output sequence of computer code that is a completion of the input sequence of computer code.

35. A system comprising: one or more computers and one or more storage devices on which are stored instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: receiving a prompt defining a task to be performed by the agent; initializing a current representation of a state of the task to be performed by the agent as a sequence of data elements based on the prompt; and at each time step in a sequence of time steps: autoregressively generating a sequence of data elements representing a current action to be performed by the agent at the current time step, comprising, for each position starting from a first position in the sequence of data elements representing the current action: processing the current representation of the state of the task using the action selection neural network to generate a score distribution over a set of possible data elements; selecting a data element for the position in the sequence of data elements representing the current action in accordance with the score distribution; and updating the current representation of the state of the task by concatenating the selected data element for the position to the current representation of the state of the task; and after autoregressively generating the sequence of data elements representing the current action, causing the agent to perform the current action at the current time step.

36. The system of claim 35, wherein initializing a current representation of a state of the task to be performed by the agent as a sequence of data elements based on the prompt comprises: generating a representation of the prompt as a sequence of data elements; and concatenating the representation of the prompt to a representation of a current observation characterizing a state of the environment.

37. The system of claim 36, wherein the prompt includes a sequence of text and generating the representation of the prompt as a sequence of data elements comprises: generating a representation of the sequence of text as a sequence of tokens from a predefined set of tokens; and mapping each token in the sequence of tokens to a corresponding numerical value in accordance with a predefined mapping.

38. The system of claim 35, wherein the prompt comprises one or more of: a demonstration of the task, a goal observation characterizing a goal state of the environment, or a sequence of text in a natural language that provides instructions related to the task.

39. The system of claim 36, wherein the prompt comprises a demonstration of the task and generating the representation of the prompt as a sequence of data elements comprises combining a return achieved by performing the task during the task demonstration with the representation of the prompt as a sequence of data elements, wherein the return defines a cumulative measure of rewards achieved as a result of performing the task during the task demonstration.

40. The system of claim 35, wherein the prompt comprises a sequence of natural language text and the task comprises generating one or more sequences of natural language text responsive to the prompt.

41. The system of claim 35, wherein the prompt comprises a natural language description of desired computer code, and the task comprises generating a sequence of computer code that fits the natural language description.

42. The system of claim 35, wherein the prompt defines an input sequence of computer code, and the task comprises generating an output sequence of computer code that is a completion of the input sequence of computer code.

43. One or more non-transitory computer storage media encoded with computer program instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: receiving a prompt defining a task to be performed by the agent; initializing a current representation of a state of the task to be performed by the agent as a sequence of data elements based on the prompt; and at each time step in a sequence of time steps: autoregressively generating a sequence of data elements representing a current action to be performed by the agent at the current time step, comprising, for each position starting from a first position in the sequence of data elements representing the current action: processing the current representation of the state of the task using the action selection neural network to generate a score distribution over a set of possible data elements; selecting a data element for the position in the sequence of data elements representing the current action in accordance with the score distribution; and updating the current representation of the state of the task by concatenating the selected data element for the position to the current representation of the state of the task; and after autoregressively generating the sequence of data elements representing the current action, causing the agent to perform the current action at the current time step.

44. The one or more non-transitory computer storage media of claim 43, wherein initializing a current representation of a state of the task to be performed by the agent as a sequence of data elements based on the prompt comprises: generating a representation of the prompt as a sequence of data elements; and concatenating the representation of the prompt to a representation of a current observation characterizing a state of the environment.

45. The one or more non-transitory computer storage media of claim 44, wherein the prompt includes a sequence of text and generating the representation of the prompt as a sequence of data elements comprises: generating a representation of the sequence of text as a sequence of tokens from a predefined set of tokens; and mapping each token in the sequence of tokens to a corresponding numerical value in accordance with a predefined mapping.
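Claims 29, 37, and 45 describe representing a text prompt as tokens from a predefined set and then mapping each token to a numerical value under a predefined mapping. The following is a minimal illustrative sketch of that two-step encoding; the token set, the unknown-token convention, and the index-based mapping are assumptions for illustration, not details taken from the specification.

```python
# Hypothetical sketch of the prompt-tokenization step in claims 29/37/45:
# text -> tokens from a predefined set -> numerical values via a
# predefined mapping. Token set and mapping here are illustrative only.

PREDEFINED_TOKENS = ["stack", "the", "red", "block", "on", "blue", "<unk>"]
TOKEN_TO_VALUE = {tok: i for i, tok in enumerate(PREDEFINED_TOKENS)}

def tokenize(text: str) -> list[str]:
    """Represent the text as a sequence of tokens from the predefined set."""
    return [w if w in TOKEN_TO_VALUE else "<unk>" for w in text.lower().split()]

def encode_prompt(text: str) -> list[int]:
    """Map each token to its numerical value under the predefined mapping."""
    return [TOKEN_TO_VALUE[tok] for tok in tokenize(text)]

print(encode_prompt("Stack the red block on the blue block"))
# → [0, 1, 2, 3, 4, 1, 5, 3]
```

In a real system the predefined token set would come from a trained tokenizer vocabulary rather than a hand-written word list, but the claimed structure, a fixed token set plus a fixed token-to-value mapping, is the same.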

Description

CLAIM OF PRIORITY

This application is a continuation of U.S. Ser. No. 18/292,165, filed Jan. 25, 2024, which is a U.S. National Application of PCT/EP2022/072731, filed Aug. 12, 2022, which claims priority to U.S. Provisional Application Ser. No. 63/341,343, filed on May 12, 2022, the entire contents of which are hereby incorporated by reference.

BACKGROUND

This specification relates to processing data using machine learning models. Machine learning models receive an input and generate an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models and generate the output based on the received input and on values of the parameters of the model. Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers that each apply a non-linear transformation to a received input to generate an output.

SUMMARY

This specification describes an action selection system, implemented as computer programs on one or more computers in one or more locations, for controlling an agent interacting with an environment to perform a task. Throughout this specification, a "data element" can refer to, e.g., a numerical value (e.g., an integer or floating point numerical value) or an embedding. An embedding refers to an ordered collection of numerical values, e.g., a vector, matrix, or other tensor of numerical values. According to a first aspect there is provided a method performed by one or more computers for selecting actions to be performed by an agent to interact with an environment using an action selection neural network, in particular a trained action selection neural network.
The method comprises, at each time step in a sequence of time steps: generating a current representation of a state of a task being performed by the agent in the environment as of the current time step as a (first) sequence of data elements, e.g., from a current observation characterizing a state of the environment. The method also comprises autoregressively generating a (second) sequence of data elements representing a current action to be performed by the agent at the current time step. For example, the (second) sequence of data elements can comprise a plurality of action data elements that collectively represent the action to be performed by the agent. In implementations, autoregressively generating the (second) sequence of data elements comprises, for each position (in the second sequence of data elements) starting from a first position in the sequence of data elements representing the current action: processing the current representation of the state of the task using the action selection neural network to generate a score distribution over a set of possible data elements; selecting a data element for the position in the sequence of data elements representing the current action in accordance with the score distribution; and updating the current representation of the state of the task by concatenating the selected (action) data element for the position to the current representation of the state of the task. That is, the current representation of the state of the task, i.e., the (first) sequence of data elements, is updated during the autoregressive generation of the (second) sequence of data elements, in particular so that the current (now updated) representation of the state of the task is processed to select the (action) data element for the next position. After autoregressively generating the sequence of data elements representing the current action, the method causes the agent to perform the current action at the current time step.
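The autoregressive loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: `score_distribution` is a hypothetical stub standing in for the trained action selection neural network, the set of possible data elements is a small integer range, and `ACTION_LENGTH` (the number of data elements per action) is chosen arbitrarily; none of these specifics come from the specification.

```python
import random

# Sketch of autoregressive action generation: at each position, process the
# current task-state representation to get a score distribution, select a
# data element in accordance with it, and concatenate that element back
# onto the state representation before selecting the next element.

POSSIBLE_ELEMENTS = list(range(8))   # the set of possible data elements (assumed)
ACTION_LENGTH = 3                    # data elements per action (assumed)

def score_distribution(state: list[int]) -> list[float]:
    """Hypothetical stand-in for the action selection neural network.

    Returns uniform scores; a real network would condition on `state`.
    """
    return [1.0 / len(POSSIBLE_ELEMENTS)] * len(POSSIBLE_ELEMENTS)

def generate_action(state: list[int], rng: random.Random) -> tuple[list[int], list[int]]:
    """Autoregressively generate the data elements of one action."""
    action = []
    for _ in range(ACTION_LENGTH):
        # Process the current representation of the state of the task.
        scores = score_distribution(state)
        # Select a data element in accordance with the score distribution.
        element = rng.choices(POSSIBLE_ELEMENTS, weights=scores)[0]
        action.append(element)
        # Concatenate the selected element to the state representation, so
        # the next position is selected conditioned on it.
        state = state + [element]
    return action, state

rng = random.Random(0)
state = [1, 4, 2]  # task-state representation, e.g. initialized from a prompt
action, state = generate_action(state, rng)
print(action, state)
```

After the loop completes, the agent would be caused to perform the decoded action, and the (now extended) state representation carries the selected action elements forward into the next time step.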
The method may then update the current representation of the state of the task using the current observation for the next time step. In some implementations, for each time step in the sequence of time steps, generating the current representation of the state of the task as of the current time step comprises: receiving a current observation characterizing a state of the environment at the current time step; generating a representation of the current observation as a sequence of data elements; and including the representation of the current observation as a sequence of data elements in the current representation of the state of the task as of the current time step, e.g., by concatenating the (first) sequence of data elements representing the current state of the task and the representation of the current observation as a sequence of data elements. In some implementations, the current observation is defined by a collection of numerical values, and generating the representation of the current observation as a sequence of data elements comprises: concatenating each numerical value in the collection of numerical values defining the current observation into a sequence of numerical values in a predefined