
US-12626181-B2 - Model based reinforcement learning based on generalized hidden parameter Markov decision processes

US 12626181 B2

Abstract

A machine learning model for reinforcement learning uses parameterized families of Markov decision processes (MDPs) with latent variables. The system uses latent variables to improve the ability of models to transfer knowledge and generalize to new tasks. Accordingly, trained machine learning based models are able to work in unseen environments or combinations of conditions and factors that the machine learning model was never trained on. For example, robots or self-driving vehicles based on such models are robust to changing goals and are able to adapt flexibly to novel reward functions or tasks while transferring knowledge about environments and agents to new tasks.
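To make the structure concrete, below is a minimal sketch, not taken from the patent, of how a transition function and a reward function might each be parameterized by a separate set of latent variables. The class name LatentParamMDP, the use of PyTorch, and all layer sizes and latent dimensions are illustrative assumptions.

```python
# Hypothetical sketch of an MDP whose transition and reward functions are
# parameterized by latent variables (z_dynamics for hidden agent/environment
# factors, z_reward for hidden changes in the reward function).
import torch
import torch.nn as nn

class LatentParamMDP(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=8, hidden=128):
        super().__init__()
        # Transition head: predicts the next state from (state, action, z_dynamics).
        self.transition = nn.Sequential(
            nn.Linear(state_dim + action_dim + latent_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )
        # Reward head: predicts a scalar reward from (state, action, z_reward).
        self.reward = nn.Sequential(
            nn.Linear(state_dim + action_dim + latent_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, z_dynamics, z_reward):
        # All tensors are batched with shape (batch, dim).
        next_state = self.transition(torch.cat([state, action, z_dynamics], dim=-1))
        reward = self.reward(torch.cat([state, action, z_reward], dim=-1))
        return next_state, reward
```

Holding the network weights fixed and varying only z_dynamics or z_reward then corresponds to moving within the parameterized family of MDPs that the abstract describes.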

Inventors

  • Theofanis Karaletsos
  • Felipe Petroski Such
  • Christian Francisco Perez

Assignees

  • UBER TECHNOLOGIES, INC.

Dates

Publication Date
2026-05-12
Application Date
2020-05-22

Claims (19)

  1. A computer implemented method comprising:
     accessing, by a system, a machine learning model for reinforcement learning using Markov decision processes (MDP), the MDP represented using a state space, an action space, a transition function corresponding to a first environment, and a reward function corresponding to a first type of task,
       wherein the transition function is parameterized by first variations of a first set of latent variables and the reward function is parameterized by first variations of a second set of latent variables,
       wherein the machine learning model is configured for execution by an agent in an environment,
       wherein the first set of latent variables or the second set of latent variables comprise (a) a first latent variable corresponding to a hidden parameter representing an attribute of the agent and (b) a second latent variable corresponding to a hidden parameter representing a change in the reward function, and
       wherein the transition function indicates a probability of transitioning from a current state to a new state responsive to an action taken by the agent in the first environment, and the reward function indicates a reward that the agent receives after taking the action and transitioning to the new state in the first environment;
     training the machine learning model, the training comprising: training based on the first variations of the first set of latent variables, and training based on the first variations of the second set of latent variables;
     executing the machine learning model to perform a new type of task by the agent in a second environment, the executing comprising:
       determining a set of sequences of proposed actions,
       for each sequence of proposed actions in the set, determining a sequence of future states responsive to performing the sequence of proposed actions based on the transition function, and determining an expected reward for the sequence of future states based on the reward function, and
       selecting a subset of the sequences of proposed actions based on their expected rewards;
     determining a distribution of states or rewards in the second environment; and
     inferring second variations of the first set of latent variables or the second set of latent variables in the second environment based on the distribution of states or rewards, the inferred second variations of the first set of latent variables or the second set of latent variables aligning the transition function or the reward function with the second environment.
  2. The computer implemented method of claim 1, further comprising: initializing a dataset representing transitions; and repeatedly: training the machine learning model using the dataset; and augmenting the dataset using new transitions.
  3. The computer implemented method of claim 1, wherein the agent represents a robot and the environment represents an obstacle course in which the robot is moving.
  4. The computer implemented method of claim 3, wherein the robot comprises sensors for capturing data describing the environment of the robot, and wherein the machine learning model receives as input sensor data captured by the sensors of the robot and predicts information describing one or more objects in the environment of the robot.
  5. The computer implemented method of claim 1, wherein the agent represents a self-driving vehicle and the environment represents traffic through which the self-driving vehicle is moving.
  6. The computer implemented method of claim 5, wherein the self-driving vehicle has one or more sensors mounted on the self-driving vehicle, and wherein the machine learning model receives as input sensor data captured by the one or more sensors of the self-driving vehicle and predicts information describing one or more entities in the environment through which the self-driving vehicle is driving.
  7. The computer implemented method of claim 1, wherein the agent represents a pricing engine for setting pricing for a business and the environment represents the business.
  8. A non-transitory computer readable storage medium storing instructions that, when executed by a processor, cause the processor to perform steps comprising:
     accessing, by a system, a machine learning model for reinforcement learning using Markov decision processes (MDP), the MDP represented using a state space, an action space, a transition function corresponding to a first environment, and a reward function corresponding to a first type of task,
       wherein the transition function is parameterized by first variations of a first set of latent variables and the reward function is parameterized by first variations of a second set of latent variables,
       wherein the machine learning model is configured for execution by an agent in an environment,
       wherein the first set of latent variables or the second set of latent variables comprise (a) a first latent variable corresponding to a hidden parameter representing an attribute of the agent and (b) a second latent variable corresponding to a hidden parameter representing a change in the reward function, and
       wherein the transition function indicates a probability of transitioning from a current state to a new state responsive to an action taken by the agent in the first environment, and the reward function indicates a reward that the agent receives after taking the action and transitioning to the new state in the first environment;
     training the machine learning model, the training comprising: training based on the first variations of the first set of latent variables, and training based on the first variations of the second set of latent variables;
     executing the machine learning model to perform a new type of task by the agent in a second environment, the executing comprising:
       determining a set of sequences of proposed actions,
       for each sequence of proposed actions in the set, determining a sequence of future states responsive to performing the sequence of proposed actions based on the transition function, and determining an expected reward for the sequence of future states based on the reward function, and
       selecting a subset of the sequences of proposed actions based on their expected rewards;
     determining a distribution of states or rewards in the second environment; and
     inferring second variations of the first set of latent variables or the second set of latent variables in the second environment based on the distribution of states or rewards, the inferred second variations of the first set of latent variables or the second set of latent variables aligning the transition function or the reward function with the second environment.
  9. The non-transitory computer readable storage medium of claim 8, wherein the steps further comprise: initializing a dataset representing transitions; and repeatedly: training the machine learning model using the dataset; and augmenting the dataset using new transitions.
  10. The non-transitory computer readable storage medium of claim 8, wherein the agent represents a robot and the environment represents an obstacle course in which the robot is moving.
  11. The non-transitory computer readable storage medium of claim 10, wherein the robot comprises sensors for capturing data describing the environment of the robot, and wherein the machine learning model receives as input sensor data captured by the sensors of the robot and predicts information describing one or more objects in the environment of the robot.
  12. The non-transitory computer readable storage medium of claim 8, wherein the agent represents a self-driving vehicle and the environment represents traffic through which the self-driving vehicle is moving.
  13. The non-transitory computer readable storage medium of claim 12, wherein the self-driving vehicle has one or more sensors mounted on the self-driving vehicle, and wherein the machine learning model receives as input sensor data captured by the one or more sensors of the self-driving vehicle and predicts information describing one or more entities in the environment through which the self-driving vehicle is driving.
  14. A computer system comprising:
     a processor; and
     a non-transitory computer readable storage medium storing instructions that, when executed by the processor, cause the processor to perform steps comprising:
     accessing, by a system, a machine learning model for reinforcement learning using Markov decision processes (MDP), the MDP represented using a state space, an action space, a transition function corresponding to a first environment, and a reward function corresponding to a first type of task,
       wherein the transition function is parameterized by first variations of a first set of latent variables and the reward function is parameterized by first variations of a second set of latent variables,
       wherein the machine learning model is configured for execution by an agent in an environment,
       wherein the first set of latent variables or the second set of latent variables comprise (a) a first latent variable corresponding to a hidden parameter representing an attribute of the agent and (b) a second latent variable corresponding to a hidden parameter representing a change in the reward function, and
       wherein the transition function indicates a probability of transitioning from a current state to a new state responsive to an action taken by the agent in the first environment, and the reward function indicates a reward that the agent receives after taking the action and transitioning to the new state in the first environment;
     training the machine learning model, the training comprising: training based on the first variations of the first set of latent variables, and training based on the first variations of the second set of latent variables;
     executing the machine learning model to perform a new type of task by the agent in a second environment, the executing comprising:
       determining a set of sequences of proposed actions,
       for each sequence of proposed actions in the set, determining a sequence of future states responsive to performing the sequence of proposed actions based on the transition function, and determining an expected reward for the sequence of future states based on the reward function, and
       selecting a subset of the sequences of proposed actions based on their expected rewards;
     determining a distribution of states or rewards in the second environment; and
     inferring second variations of the first set of latent variables or the second set of latent variables in the second environment based on the distribution of states or rewards, the inferred second variations of the first set of latent variables or the second set of latent variables aligning the transition function or the reward function with the second environment.
  15. The computer system of claim 14, wherein the steps further comprise: initializing a dataset representing transitions; and repeatedly: training the machine learning model using the dataset; and augmenting the dataset using new transitions.
  16. The computer system of claim 14, wherein the agent represents a robot and the environment represents an obstacle course in which the robot is moving.
  17. The computer system of claim 16, wherein the robot comprises sensors for capturing data describing the environment of the robot, and wherein the machine learning model receives as input sensor data captured by the sensors of the robot and predicts information describing one or more objects in the environment of the robot.
  18. The computer system of claim 14, wherein the agent represents a self-driving vehicle and the environment represents traffic through which the self-driving vehicle is moving.
  19. The computer system of claim 18, wherein the self-driving vehicle has one or more sensors mounted on the self-driving vehicle, and wherein the machine learning model receives as input sensor data captured by the one or more sensors of the self-driving vehicle and predicts information describing one or more entities in the environment through which the self-driving vehicle is driving.
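
Claims 1, 8, and 14 each recite the same execution procedure: determine a set of proposed action sequences, roll each sequence forward through the transition function, score the resulting future states with the reward function, and select the subset of sequences with the highest expected rewards. A minimal sketch of that procedure as random-shooting model-predictive control follows; it assumes the hypothetical LatentParamMDP model sketched after the Abstract, and the function name plan and all hyperparameters (num_sequences, horizon, top_k) are illustrative assumptions, not part of the patent.

```python
import torch

def plan(model, state, z_dynamics, z_reward,
         num_sequences=100, horizon=10, action_dim=4, top_k=10):
    # state, z_dynamics, and z_reward are (1, dim)-shaped tensors.
    # Determine a set of sequences of proposed actions (here, sampled at random).
    actions = torch.randn(num_sequences, horizon, action_dim)
    s = state.expand(num_sequences, -1)
    zd = z_dynamics.expand(num_sequences, -1)
    zr = z_reward.expand(num_sequences, -1)
    total_reward = torch.zeros(num_sequences)
    # For each sequence, determine the future states via the transition
    # function and accumulate the expected reward via the reward function.
    for t in range(horizon):
        s, r = model(s, actions[:, t], zd, zr)
        total_reward += r.squeeze(-1)
    # Select the subset of sequences with the highest expected rewards.
    best = total_reward.topk(top_k).indices
    return actions[best]
```

In practice the agent would typically execute only the first action of the best sequence and replan from the next state, though the claims do not prescribe that detail.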

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 62/851,858, filed May 23, 2019, which is incorporated by reference in its entirety.

BACKGROUND

1. Technical Field

The subject matter described generally relates to artificial intelligence and machine learning, and in particular to deep reinforcement learning techniques based on Markov decision processes.

2. Background Information

Artificial intelligence is used for performing complex tasks, for example, natural language processing, computer vision, speech recognition, bioinformatics, and recognizing patterns in images. Artificial intelligence techniques used for these tasks include machine learning based models, for example, neural networks. One application of artificial intelligence is reinforcement learning based systems, for example, systems that monitor their environment and take appropriate actions to achieve a task. Examples of such systems include a robot that monitors its surroundings using a camera while navigating an obstacle course, and an autonomous vehicle that monitors road traffic using various sensors, including cameras and LIDAR (light detection and ranging) sensors, while driving through traffic on a road. Such systems need to be able to operate in various environments and under varying conditions. For example, a robot should be able to work in varying environments such as clear conditions, rainy conditions, icy conditions, and so on. Furthermore, internal factors related to the robot may affect its operation, for example, rust in joints, faults in components, or improper servicing of components. A robot using a model trained under one set of conditions may not be able to operate under a different set of conditions. Conventional techniques require such models to be trained under all possible conditions in which they may operate. This requires an enormous amount of training data that may be very difficult to obtain, making the training of such models inefficient.

SUMMARY

Embodiments use parameterized families of generalized hidden parameter Markov decision processes (GHP-MDPs) based models with structured latent spaces. Use of latent spaces improves the ability of models to transfer knowledge, generalize to new tasks, and handle combinatorial problems. Accordingly, trained models are able to work in unseen environments or in combinations of conditions and factors that the model was never trained on. Embodiments are used in various applications of reinforcement learning based models, for example, models used by robots or self-driving vehicles. Embodiments allow robots to be robust to changing goals and to adapt flexibly to novel reward functions or tasks while transferring knowledge about environments and agents to new tasks. Other embodiments use the disclosed techniques for other applications, for example, self-driving vehicles.

According to an embodiment, a system accesses a machine learning model for reinforcement learning. The machine learning model is based on Markov decision processes (MDP) represented using a state space, an action space, a transition function, and a reward function. The transition function and the reward function are parameterized by sets of latent variables. The machine learning model is configured for execution by an agent in an environment.
Each hidden parameter corresponds to one or more of: (a) a factor representing an environment in which the machine learning model is executed, or (b) an attribute of an agent executing the machine learning based model. The machine learning model is trained based on variations of the sets of latent variables corresponding to the transition function and the reward function. The trained machine learning model is executed in a new environment. The execution of the machine learning model is based on a combination of latent variables, from the sets of latent variables corresponding to the transition function and the reward function, that is distinct from the combinations of latent variables used during training of the machine learning based model.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a networked computing environment 100 in which deep reinforcement learning may be used, according to an embodiment.
FIG. 2 illustrates a system for training and using deep reinforcement learning based models, according to one embodiment.
FIG. 3 illustrates a model based on a Markov decision process with structured latent variables for dynamics, agent variation, and reward functions, according to one embodiment.
FIG. 4 is a flowchart illustrating a process for training and using deep reinforcement learning based models, according to one embodiment.
FIG. 5 is a high-level block diagram illustrating an example of a computer suitable for use in the system environment of FIGS. 1-2, according to one embodiment.

The Figures (FIGS.) and the following description describe certain embodiments by way of illustration only.
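
Claims 2, 9, and 15 recite initializing a dataset of transitions and then repeatedly training the model on the dataset and augmenting it with new transitions, while claims 1, 8, and 14 and the Summary describe inferring new variations of the latent variables from the distribution of states or rewards observed in a second environment. The sketch below combines both under stated assumptions: it reuses the hypothetical LatentParamMDP and plan functions sketched earlier, assumes an env object with reset() and step() methods exchanging (1, dim)-shaped tensors, and replaces full posterior inference over the latent variables with a simple point estimate fit by gradient descent. All names, interfaces, and hyperparameters are illustrative, not the patent's own.

```python
import torch

def train_and_augment(model, env, dataset, z_dynamics, z_reward, rounds=10):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(rounds):
        # Train the model on the current dataset of (s, a, s_next, r) transitions.
        for s, a, s_next, r in dataset:
            pred_s, pred_r = model(s, a, z_dynamics, z_reward)
            loss = ((pred_s - s_next) ** 2).mean() + ((pred_r - r) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Augment the dataset with new transitions gathered by planning.
        s = env.reset()
        for _ in range(100):
            a = plan(model, s, z_dynamics, z_reward)[0, :1]  # first action of best sequence
            s_next, r, done = env.step(a)
            dataset.append((s, a, s_next, r))
            s = env.reset() if done else s_next

def infer_latents(model, observed, latent_dim=8, steps=200):
    # Infer latent variables for a second environment by fitting them to a
    # small set of observed transitions while the model weights stay fixed.
    for p in model.parameters():
        p.requires_grad_(False)
    zd = torch.zeros(1, latent_dim, requires_grad=True)
    zr = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([zd, zr], lr=1e-2)
    for _ in range(steps):
        loss = torch.zeros(())
        for s, a, s_next, r in observed:
            pred_s, pred_r = model(s, a, zd, zr)
            loss = loss + ((pred_s - s_next) ** 2).mean() + ((pred_r - r) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return zd.detach(), zr.detach()
```

The returned latent estimates can then be passed to plan() in place of the training-time latents, which is one simple way to realize the claimed step of aligning the transition and reward functions with the second environment.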