
US-12619879-B2 - Training neural networks using learned optimizers

US 12619879 B2

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a neural network. One of the methods includes performing, using a plurality of training examples, a training step to obtain respective gradients of a loss function with respect to each of the parameters in the parameter tensors; obtaining a validation loss for a plurality of validation examples that are different from the plurality of training examples; generating an optimizer input from at least the respective gradients and the validation loss; processing the optimizer input using an optimizer neural network to generate an output defining a respective update for each of the parameters in the parameter tensors of the neural network; and for each of the parameters in the parameter tensors, applying the respective update to a current value of the parameter to generate an updated value for the parameter.
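The loop in the abstract can be sketched as follows. Everything concrete here is a hypothetical stand-in, not from the patent: the quadratic loss, the analytic gradient, and the toy `optimizer_net` that merely scales the negative gradient by a validation-loss-dependent factor in place of a learned network. The sketch only illustrates the data flow: gradients from training examples plus a held-out validation loss feed an optimizer that emits one update per parameter, which is applied to the current parameter values.

```python
import numpy as np

# Hypothetical stand-ins: a quadratic loss and its gradient.
def loss_fn(params, batch):
    return float(np.mean((batch @ params) ** 2))

def grad_fn(params, batch):
    return 2.0 * batch.T @ (batch @ params) / len(batch)

def optimizer_net(grads, val_loss):
    # A real optimizer network would be learned; this stand-in just
    # produces a validation-loss-aware step instead of a fixed rule.
    return -0.1 * grads / (1.0 + val_loss)

def train_step(params, train_batch, val_batch):
    grads = grad_fn(params, train_batch)    # gradients on training examples
    val_loss = loss_fn(params, val_batch)   # loss on held-out examples
    update = optimizer_net(grads, val_loss) # one update per parameter
    return params + update                  # apply to current values
```

Running a few such steps drives the training loss down, with the step size implicitly modulated by the validation loss.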

Inventors

  • Luke Shekerjian Metz
  • Niruban Maheswaranathan
  • Christian Daniel Freeman
  • Benjamin Poole
  • Jascha Narain Sohl-Dickstein

Assignees

  • GOOGLE LLC

Dates

Publication Date
2026-05-05
Application Date
2021-09-21

Claims (20)

  1. A method of training a first neural network configured to perform a machine learning task by processing a network input in accordance with at least a set of parameter tensors each including a plurality of respective parameters to generate a network output for the machine learning task, the method comprising repeatedly performing operations comprising: performing, using a plurality of training examples, a training step to obtain respective gradients of a loss function for the machine learning task with respect to each of the parameters in the parameter tensors of the first neural network; obtaining a validation loss that measures a performance of the first neural network on the machine learning task for a plurality of validation examples that are different from the plurality of training examples used to obtain the respective gradients; processing an optimizer input that includes i) features derived from the respective gradients of the loss function for the machine learning task obtained using the training examples and ii) features that are computed from the validation loss that measures a performance of the first neural network on the plurality of validation examples that are different from the plurality of training examples using an optimizer neural network to generate an output defining a respective update for each of the parameters in the parameter tensors of the first neural network, wherein the respective updates automatically regularize the training of the first neural network and reduce time required to train the first neural network; and for each of the parameters in the parameter tensors, applying the respective update to a current value of the parameter to generate an updated value for the parameter.
  2. The method of claim 1, further comprising: generating, from results of the training step, training data for training the optimizer neural network; and performing a training step to train the optimizer neural network on the training data to optimize an objective that measures a performance of the optimizer neural network in generating at least the respective updates.
  3. The method of claim 2, wherein the objective measures (i) the performance of the optimizer neural network in generating the respective updates and (ii) a performance of the optimizer neural network in generating updates during training a plurality of other neural networks to perform a plurality of other machine learning tasks.
  4. The method of claim 2, wherein performing the training step comprises performing one or more iterations of an evolution strategies (ES) technique to optimize the objective.
  5. The method of claim 1, wherein the optimizer neural network has been trained to optimize an objective that measures a quality of parameter updates generated by the optimizer neural network for a plurality of machine learning tasks that does not include the machine learning task.
  6. The method of claim 1, wherein the optimizer neural network comprises: (i) a per-tensor neural network that operates independently for each of the parameter tensors, and (ii) a per-parameter neural network that operates independently for each of the plurality of parameters of each of the parameter tensors.
  7. The method of claim 6, wherein the per-tensor neural network is a recurrent neural network and the per-parameter neural network is a feedforward neural network.
  8. The method of claim 7, wherein the per-parameter neural network is a multi-layer perceptron.
  9. The method of claim 6, wherein the per-parameter neural network generates, for each parameter, an output that comprises (i) a direction for the parameter update for the parameter and (ii) a magnitude value for the parameter update for the parameter.
  10. The method of claim 9, further comprising: for each parameter, generating the update, comprising exponentiating the magnitude value for the parameter to generate an exponentiation and multiplying the exponentiation by the direction for the parameter to generate a product.
  11. The method of claim 10, wherein generating the update further comprises applying gradient clipping to the product to generate the update.
  12. The method of claim 6, wherein the optimizer input comprises a respective tensor input for each of the parameter tensors and a respective parameter input for each of the parameters of each of the parameter tensors.
  13. The method of claim 12, wherein generating the optimizer input comprises: generating the tensor input for each of the parameter tensors from at least (i) the validation loss and (ii) gradients for the parameters in the parameter tensor.
  14. The method of claim 13, wherein generating the tensor input for each of the parameter tensors further comprises generating the tensor input from at least (iii) a training loss for the training step for the plurality of training examples.
  15. The method of claim 12, wherein generating the tensor input for each of the parameter tensors further comprises generating the tensor input from at least (iv) outputs generated by the per-tensor neural network when updating the parameters at a preceding training step.
  16. The method of claim 12, wherein generating the tensor input for each of the parameter tensors further comprises generating the tensor input from at least (v) outputs generated by the per-parameter neural network for the parameters in the corresponding parameter tensor when updating the parameters at a preceding training step.
  17. The method of claim 12, wherein generating the optimizer input comprises: generating the parameter input for each of the parameters from at least (i) the gradient for the parameter and (ii) an output of the per-tensor neural network generated by processing the corresponding tensor input for the parameter tensor to which the parameter belongs.
  18. The method of claim 17, wherein generating the parameter input for each of the parameters further comprises generating the parameter input from at least (iii) a current value of the parameter.
  19. The method of claim 1, wherein: the optimizer neural network generates updates to the parameters at each training step in a sequence of training steps, and the validation loss is updated after each of a proper subset of the training steps.
  20. One or more non-transitory computer-readable media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for training a first neural network configured to perform a machine learning task by processing a network input in accordance with at least a set of parameter tensors each including a plurality of respective parameters to generate a network output for the machine learning task, the operations comprising: performing, using a plurality of training examples, a training step to obtain respective gradients of a loss function for the machine learning task with respect to each of the parameters in the parameter tensors of the first neural network; obtaining a validation loss that measures a performance of the first neural network on the machine learning task for a plurality of validation examples that are different from the plurality of training examples used to obtain the respective gradients; processing an optimizer input that includes i) features derived from the respective gradients of the loss function for the machine learning task obtained using the training examples and ii) features that are computed from the validation loss that measures a performance of the first neural network on the plurality of validation examples that are different from the plurality of training examples using an optimizer neural network to generate an output defining a respective update for each of the parameters in the parameter tensors of the first neural network, wherein the respective updates automatically regularize the training of the first neural network and reduce time required to train the first neural network; and for each of the parameters in the parameter tensors, applying the respective update to a current value of the parameter to generate an updated value for the parameter.
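Claims 9 through 11 describe how a per-parameter update is formed from a predicted direction and magnitude: the magnitude is exponentiated, the result is multiplied by the direction, and the product is clipped. A minimal sketch of that rule follows; the clipping threshold is an assumed hyperparameter, since the claims do not specify one.

```python
import numpy as np

def per_parameter_update(direction, magnitude, clip_threshold=1.0):
    """Sketch of the update rule in claims 9-11.

    direction, magnitude: per-parameter outputs of the (hypothetical)
    per-parameter network. clip_threshold is an assumed hyperparameter.
    """
    # Exponentiating the magnitude keeps the step scale positive and lets
    # the network express step sizes across many orders of magnitude.
    product = np.exp(magnitude) * direction
    # Clipping the product bounds the final update (claim 11).
    return np.clip(product, -clip_threshold, clip_threshold)
```

With a zero magnitude the step equals the raw direction; a very large magnitude saturates at the clipping threshold.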

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to U.S. Provisional Application No. 63/081,269, filed on Sep. 21, 2020. The disclosure of the prior application is considered part of and is incorporated by reference in the disclosure of this application.

BACKGROUND

This specification relates to training neural networks. Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

Some neural networks are recurrent neural networks. A recurrent neural network is a neural network that receives an input sequence and generates an output sequence from the input sequence. In particular, a recurrent neural network can use some or all of the internal state of the network from a previous time step in computing an output at a current time step. An example of a recurrent neural network is a long short-term memory (LSTM) neural network that includes one or more LSTM memory blocks. Each LSTM memory block can include one or more cells that each include an input gate, a forget gate, and an output gate that allow the cell to store previous states for the cell, e.g., for use in generating a current activation or to be provided to other components of the LSTM neural network.

SUMMARY

This specification describes a system implemented as computer programs on one or more computers in one or more locations that trains a trainee neural network that is configured to perform a particular machine learning task using an optimizer neural network that generates outputs that specify updates to the parameters of the trainee neural network.
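The background's description of an LSTM memory cell, with an input gate, a forget gate, and an output gate that together let the cell store previous state, can be sketched as follows. The weight-matrix names and shapes are illustrative assumptions, not part of the specification.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W):
    """One step of an LSTM memory cell as described in the background.

    W is a dict of hypothetical weight matrices, each applied to the
    concatenation of the input x and the previous hidden state h_prev.
    """
    z = np.concatenate([x, h_prev])
    i = sigmoid(W["i"] @ z)                   # input gate
    f = sigmoid(W["f"] @ z)                   # forget gate: retains prior state
    o = sigmoid(W["o"] @ z)                   # output gate
    c = f * c_prev + i * np.tanh(W["g"] @ z)  # stored cell state
    h = o * np.tanh(c)                        # current activation
    return h, c
```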
Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. This specification generally describes an optimizer neural network that determines updates to parameter values of another neural network during the training of the other neural network. By using the described optimizer neural network to determine the updates, i.e., instead of an optimization rule or a different optimizer neural network, the training of the other neural network can be improved, resulting in the other network being trained to have improved performance on the machine learning task, the training consuming fewer computational resources, or both. More specifically, because the optimizer neural network has access to additional features beyond those derived from gradients or parameter values, such as the validation loss, the updates automatically regularize the training of the other neural network without needing any additional regularization terms or manually specified regularizers.

The update steps that are generated by the optimizer exhibit behaviors that are distinct from those generated by existing optimizers, e.g., first-order optimizers. For example, the described optimizer neural network generates update steps that do not necessarily move in the direction of the gradient, have implicit regularization, adapt as the problem hyperparameters (e.g., batch size) or architecture (e.g., neural network width) change, have different step sizes per layer of the neural network, and more. All of these features can serve to tailor the updates to the specific other neural network that is being trained, greatly improving the performance of the trained other neural network, reducing the wall-clock time required to train the other neural network, or both.
Moreover, after being trained, the optimizer neural network generalizes to a wide variety of problems and network architectures, and does not require any user-specified hyperparameters.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an example neural network training system. FIG. 2 is a flow diagram of an example process for performing a training step during the training of the trainee neural network. FIG. 3 shows the operation of the optimizer neural network at a training step during the training of the trainee neural network. FIG. 4 is a flow diagram of an example process for performing a training step during the training of the optimizer neural network. Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

This specification describes a system implemented as computer programs on one or more comp