EP-4492291-B1 - DISTRIBUTED TRAINING USING OFF-POLICY ACTOR-CRITIC REINFORCEMENT LEARNING
Inventors
- SOYER, Hubert Josef
- HARLEY, Timothy James Alexander
- DUNNING, Iain
- ESPEHOLT, Lasse
- SIMONYAN, Karen
- DORON, Yotam
- FIROIU, Vlad
- MNIH, Volodymyr
- KAVUKCUOGLU, Koray
- MUNOS, Remi
- WARD, Thomas
Dates
- Publication Date
- 2026-05-13
- Application Date
- 2019-02-05
Claims (12)
- A method performed by one or more computers, the method comprising: jointly training an action selection neural network (302) and a state value neural network (306) of a reinforcement learning system (300), wherein: the action selection neural network (102; 302) is configured to process an observation (112) of an environment (108), in accordance with current values of a set of action selection neural network parameters, to generate an output that defines respective learner policy scores (110; 318) for each action (104) in a predetermined set of actions that can be performed by an agent (106) to interact with the environment; the state value neural network (306) is configured to process an input comprising an observation of the environment to generate a state value for the observation that defines an estimate of a cumulative reward that will be received by the agent, starting from a state of the environment represented by the observation, by selecting actions using a current action selection policy defined by the current values of the set of action selection neural network parameters; and training the action selection neural network (302) and the state value neural network (306) comprises iteratively, over multiple iterations, updating the values of the action selection network parameters (308) and the values of the state value network parameters (310); the training comprising: obtaining an off-policy experience tuple trajectory (304) that characterizes interaction of the agent with the environment over a sequence of time steps as the agent performed actions selected in accordance with an off-policy action selection policy that is different from the current action selection policy; processing each observation in the off-policy experience tuple trajectory (304) using the state value neural network (306) to generate a respective state value (316); processing each action in the off-policy experience tuple trajectory (304) using the action selection neural network (302) to generate a respective learner policy score (318); and determining state value network parameter updates (314) and action selection network parameter updates (312) from the respective state values and learner policy scores; wherein each experience tuple in the off-policy experience tuple trajectory (304) includes a respective behavior policy score that was assigned to the selected action of the experience tuple by the off-policy action selection policy when the action was selected, and the training takes account of discrepancies between the learner policy scores (318) and the behavior policy scores of the actions in the off-policy experience tuple trajectory (304); and wherein the action selection neural network (102; 302) is used to select actions (104) to be performed by the agent (106) interacting with the environment (108) at each of multiple time steps, at each time step the action selection network (102; 302) processing an observation (112) characterizing a current state of the environment (108) to generate policy scores (110) that are used to select the action (104) to be performed by the agent (106) in response to the observation, and wherein: i) the environment is a real-world environment, the actions are performed by the agent in the real-world environment, the observation (112) characterizes a current state of the real-world environment, and the agent is a mechanical agent interacting with the real-world environment to perform a task in the real-world environment; or ii) the environment is a protein folding environment and a state of the environment is a respective state of a protein chain, the observation (112) characterizes a current state of the protein chain, the agent is a computer system for determining how to fold the protein chain, and the actions are folding actions for folding the protein chain, to fold the protein so that the protein is stable and achieves a particular biological function; or iii) the environment is a drug design environment and a state of the environment is a respective state of a potential pharma chemical drug, the observation (112) characterizes a state of the potential pharma chemical drug, the agent is a computer system for determining elements of the pharma chemical drug or a synthetic pathway for the pharma chemical drug, and the actions are for determining the elements of or synthetic pathway for the pharma chemical drug; or iv) the environment is a real-world environment of a data center or manufacturing plant or service facility, including items of equipment, the observation (112) characterizes power or water usage by equipment in the environment, the agent controls actions in the real-world environment to reduce the power or water usage, and the actions control or impose operating conditions on the items of equipment; or v) the environment is a real-world environment of a grid mains power distribution system including items of equipment, the observation (112) characterizes power generation or distribution control, and the agent controls actions in the real-world environment to increase efficiency by controlling or imposing operating conditions on the items of equipment; or vi) the environment is a real-world computing environment, the observation (112) characterizes the real-world computing environment, the agent manages distribution of tasks across computing resources in the environment, and the actions assign computing tasks to particular computing resources.
- The method of claim 1, comprising: determining a state value target that defines a state value that should be generated by the state value neural network (306) by processing a first observation in the off-policy experience tuple trajectory; and determining the state value network parameter updates (314) based on a gradient of a loss function that characterizes a discrepancy between the state value target and a corresponding state value generated by the state value network (306); wherein the training takes account of discrepancies between the learner policy scores (318) and the behavior policy scores of the actions in the off-policy experience tuple trajectory (304) by determining the state value target using a correction factor to correct for differences between the behavior policy used to select the actions of the off-policy experience tuple trajectory and the current action selection policy.
- The method of claim 2, comprising determining the correction factor based on each experience tuple in the off-policy experience tuple trajectory (304).
- The method of claim 3, wherein the correction factor comprises a trace coefficient for the experience tuple based on a ratio of the learner policy score for the selected action in the experience tuple to the behavior policy score for the selected action in the experience tuple.
- The method of claim 4, further comprising, for each experience tuple: determining a correction factor for the experience tuple based on the trace coefficient for the experience tuple and the trace coefficients for any experience tuples that precede the experience tuple in the trajectory; determining a state value temporal difference for the experience tuple that represents a difference between the state value of the observation and the state value of the subsequent observation in the experience tuple; and determining the state value target for the off-policy experience tuple trajectory based on (i) the correction factors, (ii) the state value temporal differences, and (iii) the state value for the observation included in the first experience tuple in the trajectory.
- The method of claim 5, wherein determining the correction factor for a given experience tuple comprises: truncating the trace coefficient for the experience tuple at a first truncation value; and truncating the trace coefficients for any experience tuples that precede the experience tuple at a second truncation value.
- The method of claim 6, wherein the first truncation value is greater than or equal to the second truncation value.
- The method of any of claims 2-7, comprising determining the state value target as: $v = V(x_0) + \sum_{t=0}^{n-1} \gamma^t \cdot C_t \cdot \delta_t V$, where $V(x_0)$ is the state value for the observation included in the first experience tuple in the off-policy experience tuple trajectory, $n$ is the total number of experience tuples in the off-policy experience tuple trajectory, $t$ indexes the experience tuples in the off-policy experience tuple trajectory, $\gamma$ is a discount factor, $C_t$ is the correction factor for the $t$-th experience tuple, and $\delta_t V$ is the state value temporal difference for the $t$-th experience tuple (see the illustrative sketch following the claims).
- The method of any preceding claim, wherein determining the action selection network parameter updates comprises: updating the current parameter values of the action selection network based on the gradient of the learner policy score for the selected action included in the first experience tuple with respect to the current parameter values of the action selection network, wherein the learner policy score for the selected action included in the first experience tuple is the score assigned to the selected action by the learner policy scores generated by the action selection network (302) for the observation included in the first experience tuple.
- The method of any preceding claim, wherein each experience tuple in the off-policy experience tuple trajectory (304) corresponds to a respective time step and comprises: (i) the observation at the time step, (ii) the action that was selected to be performed by the agent at the time step, (iii) the respective behavior policy score, (iv) a subsequent observation characterizing a subsequent state of the environment subsequent to the agent performing the selected action, and (v) a reward received subsequent to the agent performing the selected action.
- A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the respective operations of the method of any one of claims 1-10.
- One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the respective operations of the method of any one of claims 1-10.
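Claims 2 to 9 together describe a V-trace-style off-policy correction: truncated importance ratios between learner and behavior policy scores weight the temporal differences that form the state value target. The sketch below is a minimal illustration under stated assumptions, not the claimed implementation: the function and argument names are hypothetical, the inputs are assumed to be per-tuple scalars already extracted from a trajectory, the temporal difference is taken to include the reward of claim 10, and `rho_bar` and `c_bar` stand for the first and second truncation values of claims 6 and 7 (with `rho_bar >= c_bar` per claim 7).

```python
import numpy as np

def vtrace_state_value_target(values, next_values, rewards,
                              learner_scores, behavior_scores,
                              gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Sketch of the state value target v of claim 8 for one trajectory.

    values[t]          -- state value V(x_t) of the t-th observation
    next_values[t]     -- state value V(x_{t+1}) of the subsequent observation
    rewards[t]         -- reward in the t-th experience tuple (claim 10)
    learner_scores[t]  -- learner policy score of the selected action
    behavior_scores[t] -- behavior policy score stored in the tuple
    """
    ratios = (np.asarray(learner_scores, dtype=float)
              / np.asarray(behavior_scores, dtype=float))
    target = float(values[0])  # V(x_0), the first term of claim 8
    trace = 1.0  # running product of preceding trace coefficients,
                 # each truncated at c_bar (claim 6)
    for t in range(len(values)):
        # Correction factor C_t (claims 5-7): the tuple's own trace
        # coefficient truncated at rho_bar, times the truncated product
        # of the coefficients of all preceding tuples.
        c_t = trace * min(rho_bar, ratios[t])
        # State value temporal difference delta_t V (claim 5), here assumed
        # to include the reward as in a conventional TD error.
        delta_t = rewards[t] + gamma * next_values[t] - values[t]
        target += (gamma ** t) * c_t * delta_t  # the sum of claim 8
        trace *= min(c_bar, ratios[t])
    return target

# Example: a three-step trajectory with stored behavior policy scores.
values = [0.5, 0.6, 0.4]            # V(x_t) from the state value network 306
next_values = [0.6, 0.4, 0.3]       # V(x_{t+1})
rewards = [1.0, 0.0, 1.0]
learner_scores = [0.4, 0.5, 0.2]    # learner policy scores 318
behavior_scores = [0.5, 0.4, 0.3]   # recorded when the actions were selected
v = vtrace_state_value_target(values, next_values, rewards,
                              learner_scores, behavior_scores)
```

For the actor update of claim 9, the gradient of the learner policy score (in practice often its logarithm) for the selected action in the first experience tuple would be scaled by an advantage built from such a target; that weighting is omitted from the sketch.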
Description
BACKGROUND

This specification relates to reinforcement learning. An agent can interact with an environment by performing actions that are selected in response to receiving observations that characterize the current state of the environment. The action to be performed by the agent in response to receiving a given observation can be determined in accordance with the output of a neural network. Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks are deep neural networks that include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters. WO2016/127045 describes a distributed system for Deep Q-learning using multiple actors and learners.

SUMMARY

This specification describes a distributed training system, implemented as computer programs on one or more computers in one or more locations, that can train an action selection network using off-policy actor-critic reinforcement learning techniques. The invention is set out in claims 1, 11 and 12; further aspects are defined in the dependent claims. The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example data flow for using an action selection network to select actions to be performed by an agent interacting with an environment.
FIG. 2 shows an example training system.
FIG. 3 shows an example reinforcement learning system.
FIG. 4 is a flow diagram of an example of an iterative process for training an action selection network and a state value network using an off-policy actor-critic reinforcement learning technique.
FIG. 5 is a flow diagram of an example process for determining a state value target for a state value network.
Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

This specification describes a distributed training system and a reinforcement learning system. The distributed training system separates acting from learning by using multiple actor computing units to generate experience tuple trajectories, which are processed by one or more learner computing units to train an action selection network. The reinforcement learning system implements an off-policy actor-critic reinforcement learning technique, that is, an actor-critic reinforcement learning technique that can be used to train an action selection network based on off-policy experience tuple trajectories.

FIG. 1 illustrates an example data flow 100 for using an action selection neural network 102 to select actions 104 to be performed by an agent 106 interacting with an environment 108 at each of multiple time steps. At each time step, the action selection network 102 processes data characterizing the current state of the environment 108, e.g., an image of the environment 108, to generate policy scores 110 that are used to select an action 104 to be performed by the agent 106 in response to the received data.
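As a minimal sketch of the FIG. 1 data flow, assuming a hypothetical `policy_scores` function standing in for a forward pass of the action selection network 102, and sampling from the scores as one common selection rule (the description does not mandate a particular one):

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_scores(observation, num_actions=4):
    # Stand-in for the action selection network 102: this placeholder
    # ignores the observation, whereas a real network would condition on it.
    logits = rng.normal(size=num_actions)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # normalized policy scores, one per action

observation = np.zeros((84, 84))            # e.g., an image of the environment 108
scores = policy_scores(observation)         # policy scores 110
action = rng.choice(len(scores), p=scores)  # action 104 performed by the agent 106
```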
Data characterizing a state of the environment 108 will be referred to in this specification as an observation. At each time step, the state of the environment 108 at the time step (as characterized by the observation 112) depends on the state of the environment 108 at the previous time step and the action 104 performed by the agent 106 at the previous time step.

At each time step, the agent 106 receives a reward 114 based on the current state of the environment 108 and the action 104 of the agent 106 at the time step. In general, the reward 114 is a numerical value. The reward 114 can be based on any event or aspect of the environment. For example, the reward 114 may indicate whether the agent 106 has accomplished a task (e.g., navigating to a target location in the environment 108) or the progress of the agent 106 towards accomplishing a task.

In some implementations, the environment is a real-world environment and the agent is a mechanical agent interacting with the real-world environment. For example, the agent may be a robot interacting with the environment to accomplish a specific task, e.g., to locate an object of interest in the environment, to move an object of interest to a specified location in the environment, or to navigate to a specified destination in the environment; or the agent may be an autonomous or semi-autonomous land, air, or sea vehicle navigating through the environment. In these implementations, the observations may include, for example, one or more of images,