CN-116957055-B - Selecting actions using multimodal input

CN 116957055 B

Abstract

A method performed by one or more computers for selecting actions to be performed by an agent interacting with an environment includes, at each of a plurality of time steps, receiving a current text string in natural language that expresses a current task being performed by the agent, receiving a current observation that characterizes a current state of the environment, processing an input including the current text string and the current observation using a policy neural network to generate an action selection output, and selecting an action to be performed by the agent at the time step based on the action selection output, wherein the policy neural network has been trained from end to end using reinforcement learning.
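The per-time-step loop in the abstract (receive an instruction and an observation, run the policy network, select an action) might be sketched as follows. This is a minimal stand-in, not the patented implementation: the hash-style text features, mean-pooled image features, random weights, and four-action space are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_network(text_string, observation):
    """Stand-in for the trained policy neural network: maps the current
    instruction and observation to scores over possible actions."""
    # Hypothetical hand-crafted features so the sketch is self-contained.
    text_feat = np.array([len(text_string) % 7, text_string.count(" ")],
                         dtype=float)
    obs_feat = observation.mean(axis=(0, 1))          # pool image channels
    combined = np.concatenate([text_feat, obs_feat])  # combined embedding
    weights = rng.standard_normal((4, combined.size)) # 4 possible actions
    return weights @ combined                         # action selection output

def select_action(scores):
    return int(np.argmax(scores))  # greedy selection

# One time step: instruction + image observation -> action index.
observation = rng.random((8, 8, 3))  # toy 8x8 RGB "image" of the environment
scores = policy_network("pick up the red object", observation)
action = select_action(scores)
```

In the claimed method the weights would be learned end to end with reinforcement learning rather than drawn at random as here.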

Inventors

  • K. M. Herman
  • P. Brunson
  • F.G. Hill

Assignees

  • DeepMind Technologies Limited (渊慧科技有限公司)

Dates

Publication Date
2026-05-08
Application Date
2018-06-05
Priority Date
2017-06-05

Claims (20)

  1. A method performed by one or more computers for selecting actions to be performed by an agent interacting with a real-world environment, wherein the agent is a robot or a semi-autonomous or autonomous vehicle, the method comprising, at each of a plurality of time steps: receiving a current text string in natural language, the current text string expressing information about a current task being performed by the agent; receiving, from a sensor of the agent, a current observation that characterizes a current state of the real-world environment, the current observation being an image of the real-world environment; processing an input comprising the current text string and the current observation using a policy neural network to generate an action selection output, wherein the action is a control input controlling the agent, the processing comprising: combining, by the policy neural network and in accordance with values of a set of policy neural network parameters, the current text string and the current observation to generate a combined embedding, and generating, by the policy neural network and in accordance with the values of the set of policy neural network parameters, the action selection output based on the combined embedding; and selecting an action to be performed by the agent at the time step based on the action selection output; wherein the policy neural network has been trained from end to end using reinforcement learning; the method further comprising generating, as a next current observation, a predicted image of the real-world environment to be received after the agent performs the selected action, using an auxiliary neural network that receives as input an embedding of the selected action and a current observation embedding.
  2. The method of claim 1, further comprising: receiving, at each of the plurality of time steps, a current reward as a result of the agent performing the action in response to the current observation; and training the policy neural network from end to end using reinforcement learning based on the rewards received over the plurality of time steps.
  3. The method of claim 1, wherein combining, by the policy neural network and in accordance with the values of the set of policy neural network parameters, the current text string and the current observation to generate the combined embedding comprises: processing the current text string using a language encoder model of the policy neural network to generate a current text embedding of the current text string; processing the current observation using an observation encoder neural network of the policy neural network to generate a current observation embedding of the current observation; and combining the current observation embedding and the current text embedding to generate the combined embedding.
  4. The method of claim 3, wherein generating, by the policy neural network and in accordance with the values of the set of policy neural network parameters, the action selection output based on the combined embedding comprises: processing the combined embedding using an action selection neural network of the policy neural network to generate the action selection output.
  5. The method of claim 3, wherein the language encoder model is a recurrent neural network.
  6. The method of claim 3, wherein the language encoder model is a bag-of-words encoder.
  7. The method of claim 3, wherein the current observation embedding is a feature matrix of the current observation, and wherein the current text embedding is a feature vector of the current text string.
  8. The method of claim 7, wherein combining the current observation embedding and the current text embedding comprises: flattening the feature matrix of the current observation; and concatenating the flattened feature matrix and the feature vector of the current text string.
  9. The method of claim 1, wherein, at each of the plurality of time steps, the current text string is a natural language instruction for the agent to perform the current task.
  10. The method of claim 1, wherein, at each of the plurality of time steps: the action selection output defines a probability distribution over possible actions to be performed by the agent, and selecting the action to be performed by the agent comprises: sampling the action from the probability distribution or selecting the action with the highest probability according to the probability distribution.
  11. The method of claim 1, wherein, at each of the plurality of time steps: the action selection output includes, for each of a plurality of possible actions to be performed by the agent, a respective Q value that is an estimate of the return that would result from the agent performing the possible action in response to the current observation, and selecting the action to be performed by the agent comprises: selecting the action with the highest Q value.
  12. The method of claim 1, wherein, at each of the plurality of time steps: the action selection output identifies the best possible action to be performed by the agent in response to the current observation, and selecting the action to be performed by the agent comprises: selecting the best possible action.
  13. The method of claim 1, wherein the current text string is the same for each observation received during performance of the current task.
  14. The method of claim 1, wherein the current text string is different from a previous text string received during performance of the current task.
  15. A system, comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for selecting actions to be performed by an agent interacting with a real-world environment, wherein the agent is a robot or a semi-autonomous or autonomous vehicle, the operations comprising, at each of a plurality of time steps: receiving a current text string in natural language, the current text string expressing information about a current task being performed by the agent; receiving, from a sensor of the agent, a current observation that characterizes a current state of the real-world environment, the current observation being an image of the real-world environment; processing an input comprising the current text string and the current observation using a policy neural network to generate an action selection output, wherein the action is a control input controlling the agent, the processing comprising: combining, by the policy neural network and in accordance with values of a set of policy neural network parameters, the current text string and the current observation to generate a combined embedding, and generating, by the policy neural network and in accordance with the values of the set of policy neural network parameters, the action selection output based on the combined embedding; and selecting an action to be performed by the agent at the time step based on the action selection output; wherein the policy neural network has been trained from end to end using reinforcement learning; the operations further comprising generating, as a next current observation, a predicted image of the real-world environment to be received after the agent performs the selected action, using an auxiliary neural network that receives as input an embedding of the selected action and a current observation embedding.
  16. One or more non-transitory computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations for selecting actions to be performed by an agent interacting with a real-world environment, wherein the agent is a robot or a semi-autonomous or autonomous vehicle, the operations comprising, at each of a plurality of time steps: receiving a current text string in natural language, the current text string expressing information about a current task being performed by the agent; receiving, from a sensor of the agent, a current observation that characterizes a current state of the real-world environment, the current observation being an image of the real-world environment; processing an input comprising the current text string and the current observation using a policy neural network to generate an action selection output, wherein the action is a control input controlling the agent, the processing comprising: combining, by the policy neural network and in accordance with values of a set of policy neural network parameters, the current text string and the current observation to generate a combined embedding, and generating, by the policy neural network and in accordance with the values of the set of policy neural network parameters, the action selection output based on the combined embedding; and selecting an action to be performed by the agent at the time step based on the action selection output; wherein the policy neural network has been trained from end to end using reinforcement learning; the operations further comprising generating, as a next current observation, a predicted image of the real-world environment to be received after the agent performs the selected action, using an auxiliary neural network that receives as input an embedding of the selected action and a current observation embedding.
  17. The non-transitory computer storage medium of claim 16, wherein the operations further comprise: receiving, at each of the plurality of time steps, a current reward as a result of the agent performing the action in response to the current observation; and training the policy neural network from end to end using reinforcement learning based on the rewards received over the plurality of time steps.
  18. The non-transitory computer storage medium of claim 16, wherein combining, by the policy neural network and in accordance with the values of the set of policy neural network parameters, the current text string and the current observation to generate the combined embedding comprises: processing the current text string using a language encoder model of the policy neural network to generate a current text embedding of the current text string; processing the current observation using an observation encoder neural network of the policy neural network to generate a current observation embedding of the current observation; and combining the current observation embedding and the current text embedding to generate the combined embedding.
  19. The non-transitory computer storage medium of claim 17, wherein generating, by the policy neural network and in accordance with the values of the set of policy neural network parameters, the action selection output based on the combined embedding comprises: processing the combined embedding using an action selection neural network of the policy neural network to generate the action selection output.
  20. The non-transitory computer storage medium of claim 18, wherein the language encoder model is a recurrent neural network.
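Claims 7 and 8 describe combining the observation feature matrix with the text feature vector by flattening and concatenation. A minimal numpy sketch, with toy shapes chosen purely for illustration:

```python
import numpy as np

# Toy shapes: a 2x2x3 feature map standing in for the observation encoder
# output, and a 4-dimensional text embedding from the language encoder.
obs_features = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
text_embedding = np.array([0.5, -1.0, 2.0, 0.0])

flattened = obs_features.reshape(-1)                    # flatten the feature matrix
combined = np.concatenate([flattened, text_embedding])  # concatenate the two

# The result is a single 16-dimensional combined embedding that the action
# selection neural network of claim 4 could consume.
```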

Description

Selecting actions using multimodal input

The present application is a divisional application of the patent application having a filing date of June 5, 2018, application number 201880026852.4, and the title "Selecting actions using multimodal input".

Background

The present specification relates to reinforcement learning. In a reinforcement learning system, an agent interacts with an environment by performing actions selected by the reinforcement learning system in response to receiving observations characterizing a current state of the environment. Some reinforcement learning systems select the actions that an agent will perform in response to receiving a given observation in accordance with the output of a neural network. Neural networks are machine-learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks are deep neural networks that include one or more hidden layers in addition to the output layer. The output of each hidden layer serves as the input to the next layer in the network (i.e., the next hidden layer or the output layer). Each layer of the network generates an output from the received input according to the current values of a respective set of parameters.

Disclosure of Invention

This specification describes a system, implemented as computer programs on one or more computers in one or more locations, that selects actions to be performed by a reinforcement learning agent interacting with an environment. According to a first aspect, there is provided a system for selecting actions to be performed by an agent interacting with an environment, the system comprising one or more computers and one or more storage devices storing instructions. The instructions, when executed by the one or more computers, cause the one or more computers to implement a language encoder model, an observation encoder neural network, and a subsystem.
The language encoder model is configured to receive an input text string in a particular natural language and to process the input text string to generate a text embedding of the input text string. The observation encoder neural network is configured to receive an input observation characterizing a state of the environment and to process the input observation to generate an observation embedding of the input observation. The subsystem is configured to receive a current text string in the particular natural language that expresses information about a current task currently being performed by the agent. The subsystem provides the current text string as input to the language encoder model to obtain a current text embedding of the current text string. The subsystem receives a current observation characterizing a current state of the environment. The subsystem provides the current observation as input to the observation encoder neural network to obtain a current observation embedding of the current observation. The subsystem combines the current observation embedding and the current text embedding to generate a current combined embedding. Using the current combined embedding, the subsystem selects an action to be performed by the agent in response to the current observation. In some implementations, the instructions further cause the one or more computers to implement an action selection neural network. The action selection neural network is configured to receive an input combined embedding and to process the input combined embedding to generate an action selection output. In some implementations, using the current combined embedding to select the action to be performed by the agent in response to the current observation includes providing the current combined embedding as input to the action selection neural network to obtain a current action selection output, and using the current action selection output to select the action to be performed by the agent in response to the current observation.
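The subsystem pipeline described above (language encoder, observation encoder, combination, action selection network) can be sketched end to end. Everything below is a hypothetical stand-in rather than the patented implementation: the vocabulary, the bag-of-words encoder (one option named in the claims), the mean-pooling "observation encoder", and the fixed linear weights are all chosen only so the example runs.

```python
import numpy as np

# Hypothetical five-word vocabulary for the bag-of-words language encoder.
VOCAB = {"pick": 0, "up": 1, "the": 2, "red": 3, "object": 4}

def language_encoder(text):
    """Bag-of-words encoder: one count per vocabulary word."""
    emb = np.zeros(len(VOCAB))
    for word in text.lower().split():
        if word in VOCAB:
            emb[VOCAB[word]] += 1.0
    return emb

def observation_encoder(image):
    """Stand-in for a convolutional observation encoder: mean-pool channels."""
    return image.mean(axis=(0, 1))

def action_selection_network(combined, weights):
    """Linear stand-in for the action selection neural network."""
    return weights @ combined

image = np.ones((4, 4, 3))                     # toy observation
text_emb = language_encoder("pick up the red object")
obs_emb = observation_encoder(image)
combined = np.concatenate([obs_emb, text_emb])  # current combined embedding
weights = np.eye(8)[:3]                         # hypothetical parameters, 3 actions
output = action_selection_network(combined, weights)
action = int(np.argmax(output))                 # greedy action selection
```

In the described system these three modules would be trained jointly, end to end, with reinforcement learning.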
In some implementations, the current action selection output defines a probability distribution over possible actions to be performed by the agent, and selecting the action to be performed by the agent comprises sampling the action from the probability distribution or selecting the action with the highest probability according to the probability distribution. In some implementations, for each of a plurality of possible actions to be performed by the agent, the current action selection output includes a respective Q value that is an estimate of the return that would result from the agent performing the possible action in response to the current observation, and selecting the action to be performed by the agent comprises selecting the action having the highest Q value. In some implementations, the current action selection output identifies a best possible action to be performed by the agent in response to the current observation, and selecting the action to be performed by the agent includes selecting the best possible action. In some implementations, the language encoder model is a recurrent neural network.
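The action-selection variants above (sampling from a probability distribution, taking the highest-probability action, or taking the highest Q value) can be illustrated with toy numbers; the logits and Q values below are arbitrary examples, not values from the described system.

```python
import numpy as np

rng = np.random.default_rng(42)

# Variant 1: the output defines a probability distribution over actions.
logits = np.array([2.0, 0.5, 1.0])             # raw action selection output
probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> distribution

sampled = int(rng.choice(len(probs), p=probs))  # sample from the distribution
greedy = int(np.argmax(probs))                  # or take the most probable action

# Variant 2: the output is a Q value (estimated return) per possible action.
q_values = np.array([1.2, 3.4, -0.5])
best_by_q = int(np.argmax(q_values))            # pick the highest estimated return
```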