US-12626123-B2 - Controlling operation of actor and learner computing units based on a usage rate of a replay memory

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training an action selection neural network used to select actions to be performed by an agent to interact with an environment. In one aspect, a system comprises: a plurality of actor computing units; a replay memory that stores trajectories generated by the plurality of actor computing units; one or more learner computing units; and a control subsystem that is configured to perform operations comprising: determining a usage rate of the replay memory; and in response to determining that the usage rate of the replay memory is outside a range of allowable usage rates: preventing each of one or more of the actor computing units from generating trajectories, or preventing each of one or more of the learner computing units from sampling trajectories from the replay memory for use in training the action selection neural network.

Inventors

Albin Rasmus Cassirer

Assignees

GDM HOLDING LLC

Dates

Publication Date: 20260512
Application Date: 20210513

Claims (20)

1 . A system for training an action selection neural network used to select actions to be performed by an agent to interact with an environment, the system comprising: a plurality of actor computing units, wherein each actor computing unit is configured to control interaction of a respective instance of the agent with a respective instance of the environment to generate trajectories for use in training the action selection neural network; a replay memory that stores trajectories generated by the plurality of actor computing units; one or more learner computing units, wherein each learner computing unit is configured to train the action selection neural network on trajectories selected from the replay memory; and a control system that is configured to autonomously: determine a usage rate of the replay memory characterizing how frequently trajectories from the replay memory have been sampled for use in training the action selection neural network; and in response to determining that the usage rate of the replay memory is outside a range of allowable usage rates: prevent each of one or more of the actor computing units from generating trajectories, or prevent each of one or more of the learner computing units from sampling trajectories from the replay memory for use in training the action selection neural network.
2 . The system of claim 1 , wherein determining the usage rate of the replay memory characterizing how frequently trajectories from the replay memory have been sampled for use in training the action selection neural network comprises: determining the usage rate of the replay memory based on: (i) a respective number of times that each trajectory that has been stored in the replay memory has been sampled by the learner computing units for use in training the action selection neural network, and (ii) a number of trajectories that have been stored in the replay memory.
3 . The system of claim 2 , wherein determining the usage rate of the replay memory based on: (i) a respective number of times that each trajectory that has been stored in the replay memory has been sampled by the learner computing units for use in training the action selection neural network, and (ii) a number of trajectories that have been stored in the replay memory, comprises: determining the usage rate of the replay memory based on a ratio between: (i) a sum of the respective number of times that each trajectory that has been stored in the replay memory has been sampled by the learner computing units for use in training the action selection neural network, and (ii) the number of trajectories that have been stored in the replay memory.
4 . The system of claim 1 , wherein the control system is configured to: determine that the usage rate of the replay memory is outside the range of allowable usage rates based on the usage rate of the replay memory being below a lower endpoint of the allowable range of usage rates; and prevent each of one or more of the actor computing units from generating trajectories.
5 . The system of claim 1 , wherein determining the usage rate of the replay memory characterizing how frequently trajectories from the replay memory have been sampled for use in training the action selection neural network comprises: receiving a request from an actor computing unit to store one or more trajectories generated by the actor computing unit in the replay memory; and determining the usage rate of the replay memory as a usage rate that would result from storing the one or more trajectories generated by the actor computing unit in the replay memory.
6 . The system of claim 5 , wherein the control system is configured to: prevent the actor computing unit from storing the one or more trajectories generated by the actor computing unit in the replay memory.
7 . The system of claim 1 , wherein the control system is configured to: determine that the usage rate of the replay memory is outside the range of allowable usage rates based on the usage rate of the replay memory being above an upper endpoint of the allowable range of usage rates; and prevent each of one or more of the learner computing units from sampling trajectories from the replay memory for use in training the action selection neural network.
8 . The system of claim 1 , wherein determining the usage rate of the replay memory characterizing how frequently trajectories from the replay memory have been sampled for use in training the action selection neural network comprises: receiving a request from a learner computing unit to sample one or more trajectories from the replay memory; and determining the usage rate of the replay memory as a usage rate that would result from the learner computing unit sampling the one or more trajectories from the replay memory.
9 . The system of claim 1 , wherein the control system is configured to: determine that the usage rate of the replay memory is outside the range of allowable usage rates based on the usage rate of the replay memory being below a lower endpoint of the allowable range of usage rates; and determine that one or more learner computing units that were previously prevented from sampling trajectories from the replay memory for use in training the action selection neural network should resume sampling trajectories from the replay memory for use in training the action selection neural network.
10 . The system of claim 1 , wherein the control system is configured to: determine that the usage rate of the replay memory is outside the range of allowable usage rates based on the usage rate of the replay memory being above an upper endpoint of the allowable range of usage rates; and determine that one or more actor computing units that were previously prevented from generating trajectories should resume generating trajectories.
11 . The system of claim 1 , wherein the system comprises at least one hundred actor computing units.
12 . The system of claim 1 , wherein the control system continuously monitors the usage rate of the replay memory to maintain the usage rate of the replay memory within the range of allowable usage rates.
13 . The system of claim 1 , further comprising a memory management system that is configured to: receive an original trajectory generated by an actor computing unit; store a plurality of trajectory data elements representing the original trajectory in respective slots of the replay memory; subdivide the original trajectory into a plurality of overlapping new trajectories; instantiate a respective trajectory representation of each new trajectory in the replay memory as a sequence of pointers that each point to a respective slot of the replay memory storing a trajectory data element of the original trajectory; and make each of the trajectory representations representing the new trajectories available to the learner computing units for sampling from the replay memory.
14 . The system of claim 13 , wherein each of the plurality of trajectory data elements representing the original trajectory corresponds to a respective time step in the original trajectory and includes data representing interaction of an agent with an environment at the time step.
15 . The system of claim 14 , wherein each of the plurality of trajectory data elements includes data representing: (i) an observation of the environment at the corresponding time step, (ii) an action performed by the agent at the corresponding time step, and (iii) a reward received by the agent at the corresponding time step.
16 . The system of claim 13 , wherein storing the plurality of trajectory data elements representing the original trajectory in respective slots in the replay memory comprises: compressing each of the plurality of trajectory data elements.
17 . The system of claim 13 , wherein the memory management system is configured to: track, for each trajectory data element stored in the replay memory, a respective number of times that a slot storing the trajectory data element is referenced by pointers from trajectory representations stored in the replay memory; and determine that one or more of the trajectory data elements stored in the replay memory should be removed from the replay memory based on the number of times that the slot storing the trajectory data element is referenced by pointers from trajectory representations stored in the replay memory.
18 . The system of claim 17 , wherein determining that one or more of the trajectory data elements stored in the replay memory should be removed from the replay memory comprises: determining that any trajectory data element that is not referenced by any pointers from trajectory representations stored in the replay memory should be removed from the replay memory.
19 . A method performed by one or more data processing apparatus for training an action selection neural network used to select actions to be performed by an agent interacting with an environment, the method comprising: maintaining a replay memory that stores trajectories generated by a plurality of actor computing units, wherein each actor computing unit is configured to control interaction of a respective instance of the agent with a respective instance of the environment to generate trajectories for use in training the action selection neural network, wherein each of one or more learner computing units is configured to train the action selection neural network on trajectories selected from the replay memory; determining a usage rate of the replay memory characterizing how frequently trajectories from the replay memory have been sampled for use in training the action selection neural network; and in response to determining that the usage rate of the replay memory is outside a range of allowable usage rates: preventing each of one or more of the actor computing units from generating trajectories, or preventing each of one or more of the learner computing units from sampling trajectories from the replay memory for use in training the action selection neural network.
20 . One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for training an action selection neural network used to select actions to be performed by an agent interacting with an environment, the operations comprising: maintaining a replay memory that stores trajectories generated by a plurality of actor computing units, wherein each actor computing unit is configured to control interaction of a respective instance of the agent with a respective instance of the environment to generate trajectories for use in training the action selection neural network, wherein each of one or more learner computing units is configured to train the action selection neural network on trajectories selected from the replay memory; determining a usage rate of the replay memory characterizing how frequently trajectories from the replay memory have been sampled for use in training the action selection neural network; and in response to determining that the usage rate of the replay memory is outside a range of allowable usage rates: preventing each of one or more of the actor computing units from generating trajectories, or preventing each of one or more of the learner computing units from sampling trajectories from the replay memory for use in training the action selection neural network.
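
The usage-rate control recited in claims 1 through 10 can be illustrated with a minimal sketch. The class name `UsageRateController`, its fields, and the rule that the very first store request is always admitted (so the memory can be seeded) are assumptions made for illustration and do not appear in the claims. The usage rate is computed as the ratio of claim 3, and each request is evaluated against the rate that would result from granting it, in the manner of claims 5 and 8.

```python
from dataclasses import dataclass


@dataclass
class UsageRateController:
    """Hypothetical control system that gates store and sample requests."""

    min_rate: float        # lower endpoint of the allowable range
    max_rate: float        # upper endpoint of the allowable range
    num_inserted: int = 0  # trajectories stored in the replay memory so far
    num_sampled: int = 0   # sum of per-trajectory sample counts

    def usage_rate(self, extra_inserts: int = 0, extra_samples: int = 0) -> float:
        """Ratio of total samples to total stored trajectories, optionally
        as it would be after a pending request is granted."""
        inserted = self.num_inserted + extra_inserts
        if inserted == 0:
            return 0.0
        return (self.num_sampled + extra_samples) / inserted

    def try_insert(self, n: int = 1) -> bool:
        """Admit an actor's store request unless the resulting rate would
        fall below the lower endpoint. Assumption: the first request is
        always admitted so that the memory is never empty forever."""
        if self.num_inserted > 0 and self.usage_rate(extra_inserts=n) < self.min_rate:
            return False  # actor is prevented from adding trajectories
        self.num_inserted += n
        return True

    def try_sample(self, n: int = 1) -> bool:
        """Admit a learner's sample request unless the memory is empty or
        the resulting rate would exceed the upper endpoint."""
        if self.num_inserted == 0 or self.usage_rate(extra_samples=n) > self.max_rate:
            return False  # learner is prevented from sampling
        self.num_sampled += n
        return True


controller = UsageRateController(min_rate=1.0, max_rate=2.0)
controller.try_insert()   # admitted: seeds the memory
controller.try_insert()   # refused: rate would drop to 0 / 2
controller.try_sample()   # admitted: rate becomes 1 / 1
```

In this sketch a refused caller simply receives `False` and may retry later; a refused actor becomes eligible again once learners have sampled enough to raise the rate, and a refused learner becomes eligible again once actors have stored enough to lower it. A deployed system might instead block the caller until the request becomes admissible.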

Description

BACKGROUND

This specification relates to processing data using machine learning models.
Machine learning models receive an input and generate an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models and generate the output based on the received input and on values of the parameters of the model.
Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers that each apply a non-linear transformation to a received input to generate an output.

SUMMARY

This specification describes a training system implemented as computer programs on one or more computers in one or more locations that trains an action selection neural network used to control an agent that is interacting with an environment.
As used throughout this specification, a computing unit may be, e.g., a computer, a core within a computer having multiple cores, or other hardware or software, e.g., a dedicated thread, within a computer capable of independently performing operations. The computing units may include processor cores, processors, microprocessors, special-purpose logic circuitry, e.g., an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), or any other appropriate computing units. In some examples, the computing units are all the same type of computing unit. In other examples, the computing units may be different types of computing units. For example, one computing unit may be a central processing unit (CPU) while other computing units may be graphics processing units (GPUs).
As used throughout this specification, a trajectory refers to data characterizing interaction of an agent with an environment over a sequence of one or more time steps. More specifically, for each time step in a sequence of time steps, a trajectory can represent: (i) an observation characterizing the state of the environment at the time step, (ii) an action performed by the agent at the time step, and (iii) a reward received at the time step. A trajectory can represent other data as well, e.g., a discount factor for each time step, a state of an action selection neural network being used to control the agent at each time step (e.g., where the state can be a hidden state of a recurrent action selection neural network, e.g., a cell state of a long short-term memory (LSTM) network), or both.
As used throughout this specification, a “memory,” e.g., a replay memory, can be a physical data storage device or a logical data storage area.
As used throughout this specification, a “reinforcement learning technique” can refer to any appropriate reinforcement learning training technique, e.g., a Q learning technique or a policy gradient technique. Training an action selection neural network using a reinforcement learning technique can refer to backpropagating gradients of a reinforcement learning objective function through the action selection neural network to adjust the parameter values of the action selection neural network. Training an action selection neural network using a reinforcement learning technique can increase a cumulative measure of rewards (e.g., a time-discounted sum of rewards) received by an agent by performing actions selected using the action selection neural network.
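
As an illustration only, the trajectory and the time-discounted sum of rewards described above can be sketched as follows. The names `TimeStep` and `discounted_return` are hypothetical, and the convention that the discount factor stored at a time step is applied to all rewards received after that time step is an assumption rather than something this specification prescribes.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class TimeStep:
    """One element of a trajectory (hypothetical layout)."""

    observation: Any       # characterizes the environment state at this step
    action: Any            # action performed by the agent at this step
    reward: float          # reward received at this step
    discount: float = 1.0  # optional per-step discount factor


def discounted_return(trajectory: List[TimeStep]) -> float:
    """Time-discounted sum of rewards, accumulated from the last step back.

    Assumed convention: G_t = r_t + discount_t * G_{t+1}, with G = 0
    after the final time step.
    """
    total = 0.0
    for step in reversed(trajectory):
        total = step.reward + step.discount * total
    return total


# Three time steps, each with reward 1 and discount factor 0.5.
trajectory = [TimeStep(observation=t, action=0, reward=1.0, discount=0.5)
              for t in range(3)]
print(discounted_return(trajectory))  # 1 + 0.5 * (1 + 0.5 * 1) = 1.75
```

Accumulating from the last time step backwards keeps the computation linear in the trajectory length and accommodates a different discount factor at each time step.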
According to one aspect there is provided a system for training an action selection neural network used to select actions to be performed by an agent to interact with an environment, the system comprising: a plurality of actor computing units, wherein each actor computing unit is configured to control interaction of a respective instance of the agent with a respective instance of the environment to generate trajectories for use in training the action selection neural network; a replay memory that stores trajectories generated by the plurality of actor computing units; one or more learner computing units, wherein each learner computing unit is configured to train the action selection neural network on trajectories selected from the replay memory; and a control subsystem that is configured to: determine a usage rate of the replay memory characterizing how frequently trajectories from the replay memory have been sampled for use in training the action selection neural network; and in response to determining that the usage rate of the replay memory is outside a range of allowable usage rates: prevent each of one or more of the actor computing units from generating trajectories, or prevent each of one or more of the learner computing units from sampling trajectories from the replay memory for use in training the action selection neural network. In some implementations, determining the usage rate of the replay memory characterizing how frequently trajectories from the replay memory have been sampled for use in training