US-12626121-B1 - Execution of machine-trained network
Abstract
Some embodiments provide a method for executing a machine-trained (MT) network that includes multiple layers. For an input set for the network divided into at least two blocks of input data, the method propagates each respective block of the input data separately through a first set of the layers of the MT network to generate respective blocks of intermediate data. The method combines the blocks of intermediate data into a set of intermediate data. The method propagates the set of intermediate data together through a second set of the layers of the MT network to generate output data for the input set.
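To make the abstract's flow concrete, here is a minimal sketch in plain NumPy-style Python, assuming a toy network in which `first_layers` and `second_layers` are lists of ordinary Python callables; all names here are illustrative assumptions, not from the patent.

```python
import numpy as np

def run_in_blocks(input_data, first_layers, second_layers, num_blocks=2):
    """Sketch of the claimed flow: propagate blocks of the input separately
    through a first set of layers, combine the intermediate blocks, then
    propagate the combined data through a second set of layers."""
    # Divide the input into blocks (here, along the row axis).
    blocks = np.array_split(input_data, num_blocks, axis=0)

    # Propagate each block separately through the first set of layers,
    # so only one block's activations need to be live at a time.
    intermediate_blocks = []
    for block in blocks:
        x = block
        for layer in first_layers:
            x = layer(x)
        intermediate_blocks.append(x)

    # Combine the blocks of intermediate data into one set.
    intermediate = np.concatenate(intermediate_blocks, axis=0)

    # Propagate the combined intermediate data together through the second set.
    for layer in second_layers:
        intermediate = layer(intermediate)
    return intermediate
```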
Inventors
- Justin Tantiongloc
- Brian Thomas
- Steven L. Teig
Assignees
- AMAZON TECHNOLOGIES, INC.
Dates
- Publication Date: 2026-05-12
- Application Date: 2021-03-11
Claims (20)
- 1. A method for executing a machine-trained (MT) network comprising a plurality of layers, the method comprising:
  receiving and storing input data for the MT network comprising a plurality of input channels;
  storing, by a first dot product core, a first activation value for a first layer of the MT network in a first memory address of the first dot product core, the first activation value associated with a first coordinate and a first input channel, and storing, by a second dot product core, a second activation value for the first layer of the MT network in a second memory address of the second dot product core, where the second memory address is the same as the first memory address, the second activation value associated with the first coordinate and a second input channel, wherein the first activation value and the second activation value are associated with a first block of the input data;
  computing, by the first dot product core and the second dot product core and based on the first activation value and the second activation value, a first intermediate value by propagating the first block of input data through a first set of layers of the MT network, wherein the first set of layers comprises a plurality of convolutional layers;
  computing, by the first dot product core and the second dot product core, a second intermediate value by propagating a second block of input data through the first set of layers of the MT network, wherein the second block of input data is distinct from the first block of input data;
  combining the first intermediate value and the second intermediate value into a set of intermediate data; and
  propagating the set of intermediate data together through a second set of layers of the MT network to generate output data for the input data.
- 2. The method of claim 1, wherein the MT network is trained by propagating training input sets together through the entire network without dividing the training input sets into blocks of the input data.
- 3. The method of claim 1, wherein: each of the first set of layers and the second set of layers comprises at least one convolutional layer; and at least one of the first set of layers and the second set of layers comprises a pooling layer.
- 4. The method of claim 1, wherein: the MT network is executed by a neural network inference circuit comprising a fixed amount of memory for weight values and intermediate activation values; and separately propagating blocks of the input data reduces a portion of the memory allocated to the intermediate activation values, thereby increasing a portion of the memory available to the weight values.
- 5. The method of claim 4, wherein the weight values are ternary weight values such that each weight is stored in the memory as one of {0, 1, −1}, wherein increasing the portion of the memory available to the weight values enables an increase in a percentage of the weight values that are non-zero, thereby increasing accuracy of the output data generated by the MT network.
- 6. The method of claim 4, wherein the first set of layers comprises layers with the largest numbers of intermediate activation values in the MT network.
- 7. The method of claim 4, wherein: a block of intermediate data comprising the first intermediate value and the second intermediate value is stored in one or more contiguous blocks of the memory comprising a second block of intermediate data; and combining the first block of intermediate data and the second block of intermediate data comprises moving at least a subset of each block of intermediate data within the memory.
- 8. The method of claim 4, wherein a subset of the fixed amount of the memory of the neural network inference circuit is reused for propagation of each block of the input data through the first set of layers of the MT network.
- 9. The method of claim 1, wherein each channel comprises an equally-sized grid of input values arranged in rows and columns.
- 10. The method of claim 9, wherein each respective block of the input data comprises respective input values from a respective block of rows of the respective input values across all channels.
- 11. The method of claim 10, wherein the respective block of rows overlaps a second block of rows such that a subset of rows belongs to two different input blocks.
- 12. The method of claim 11, wherein the overlap is based on a receptive field within the input values of kernels of a final layer in the first set of layers.
- 13. The method of claim 1, wherein computing the first intermediate value comprises: propagating the first block of the input data through the first set of layers to generate the first block of intermediate data and storing the first block of intermediate data in memory; after storing the first block of intermediate data in memory and prior to combining the first intermediate value and the second intermediate value or propagating the set of intermediate data through the second set of the layers of the MT network, propagating a second block of the input data through the first set of layers to generate the second block of intermediate data and storing the second block of intermediate data in the memory.
- 14. A non-transitory machine-readable medium storing a program which when executed by a neural network inference circuit causes the neural network inference circuit to execute a machine-trained (MT) network comprising a plurality of layers, the program comprising sets of instructions for:
  receiving and storing input data for the MT network comprising a plurality of input channels;
  storing, by a first dot product core, a first activation value for a first layer of the MT network in a first memory address of the first dot product core, the first activation value associated with a first coordinate and a first input channel, and storing, by a second dot product core, a second activation value for the first layer of the MT network in a second memory address of the second dot product core, where the second memory address is the same as the first memory address, the second activation value associated with the first coordinate and a second input channel, wherein the first activation value and the second activation value are associated with a first block of the input data;
  computing, by the first dot product core and the second dot product core and based on the first activation value and the second activation value, a first intermediate value by propagating the first block of input data through a first set of layers of the MT network, wherein the first set of layers comprises a plurality of convolutional layers;
  computing, by the first dot product core and the second dot product core, a second intermediate value by propagating a second block of input data through the first set of layers of the MT network, wherein the second block of input data is distinct from the first block of input data;
  combining the first intermediate value and the second intermediate value into a set of intermediate data; and
  propagating the set of intermediate data together through a second set of layers of the MT network to generate output data for the input data.
- 15. The non-transitory machine-readable medium of claim 14, wherein the MT network is trained by propagating training input sets together through the entire network without dividing the training input sets into blocks of the input data.
- 16. The non-transitory machine-readable medium of claim 14, wherein the neural network inference circuit comprises a fixed amount of memory for weight values and intermediate activation values.
- 17. The non-transitory machine-readable medium of claim 16, wherein: separately propagating blocks of the input data reduces a portion of the memory allocated to the intermediate activation values, thereby increasing a portion of the memory available to the weight values; the weight values are ternary weight values such that each weight is stored in the memory as one of {0, 1, −1}; and increasing the portion of the memory available to the weight values enables an increase in a percentage of the weight values that are non-zero, thereby increasing accuracy of the output data generated by the MT network.
- 18. The non-transitory machine-readable medium of claim 16, wherein: a block of intermediate data comprising the first intermediate value and the second intermediate value is stored in one or more contiguous blocks of the memory comprising a second block of intermediate data; and the set of instructions for combining each block of intermediate data comprises a set of instructions for moving at least a subset of each block of intermediate data within the memory.
- 19. The non-transitory machine-readable medium of claim 14, wherein: each channel comprises an equally-sized grid of input values arranged in rows and columns; and each respective block of the input data comprises respective input values from a respective block of rows of the respective input values across all channels.
- 20. The non-transitory machine-readable medium of claim 19, wherein: the respective block of rows overlaps a second block of rows such that a subset of rows belongs to two different input blocks; and the overlap is based on a receptive field within the input values of kernels of a final layer in the first set of the plurality of layers.
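Claims 11, 12, and 20 tie the overlap between adjacent row blocks to the receptive field of the final layer in the first set. The following sketch shows one way such overlapping blocks could be computed, assuming a stack of stride-1 convolutions whose kernel heights are known; the helper names and the halo heuristic are illustrative assumptions, not the patent's method.

```python
def receptive_field(kernel_heights):
    """Receptive field (in input rows) of a stack of stride-1
    convolutions with the given kernel heights."""
    rf = 1
    for k in kernel_heights:
        rf += k - 1
    return rf

def overlapping_row_blocks(num_rows, num_blocks, kernel_heights):
    """Yield (start, end) row ranges whose overlap gives each block
    enough context rows to be convolved independently."""
    halo = receptive_field(kernel_heights) // 2  # rows shared with a neighbor
    base = num_rows // num_blocks
    for i in range(num_blocks):
        start = max(0, i * base - halo)
        end = min(num_rows, (i + 1) * base + halo)
        yield start, end

# e.g., three 3x3 convolutions over 240 rows, split into two blocks:
# list(overlapping_row_blocks(240, 2, [3, 3, 3])) -> [(0, 123), (117, 240)]
```

With this sizing, a subset of rows belongs to both blocks, so each block can be propagated through the first set of layers without needing activations from the other block.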
Description
BACKGROUND

Machine learning automates the creation, based on historical data, of models that can then be used to make predictions. A class of models called deep neural networks (DNNs) has become popular over the last few years, and there is now a menagerie of types of DNNs. Some examples of DNNs include feed-forward, convolutional, recurrent, long short-term memory (LSTM), and Neural Turing Machine (NTM) networks. These neural networks typically involve many weights that are calculated during training and then used when the neural network is embedded into a device. For instance, layers in a ResNet50 network (a known network architecture for image analysis) may have up to 512 3×3 kernels (which may have a depth of up to 512) in a single layer, which amounts to over 2 million weights in that layer alone. These weights, along with intermediate activation data, need to be stored on the neural network execution fabric of a device. Recently, techniques have been introduced to address this issue in part by creating very sparse networks (i.e., with most weight values set to zero), which saves space. However, techniques for reducing the storage of intermediate activation data would also be helpful.

BRIEF SUMMARY

Some embodiments of the invention provide a method for executing a machine-trained (MT) network (e.g., a neural network) by dividing the input to the network into multiple blocks and propagating each block separately through a portion of the network. In so doing, the maximum amount of intermediate activation values requiring storage at one time is reduced, thereby enabling a larger portion of a fixed amount of memory to be allocated to storage of weight values. Specifically, for a first portion of the network (e.g., a first set of layers of the network), each block of the input data is propagated separately to generate respective blocks of intermediate data. These blocks of intermediate data are then combined, and the combined data is propagated together through a second portion of the network (e.g., a second set of layers) to generate network output data for the input.

In some embodiments, the MT network is a convolutional neural network. Such a network propagates an input data set through a series of layers (e.g., convolutional layers, pooling layers, element-wise operation layers) to generate output data, and is often used for image processing and/or analysis. A typical convolutional layer takes as input a set of input activations arranged in one or more channels, each channel being a two-dimensional grid of activation values (e.g., an input image with 320×240 pixels would be arranged as three 320×240 channels, one each for red, green, and blue). One or more (typically many) filters of weight values are convolved over the input activation values to compute dot products, to which additional operations (e.g., shift, scale, non-linear activation function) are applied to generate output activations. Each filter produces a channel of output activations, which are used as inputs to a subsequent layer (typically the next layer of the network).
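A minimal sketch of the convolutional-layer computation just described, assuming stride-1 "valid" convolution in plain NumPy; the shift/scale step is omitted and the non-linearity is reduced to a bare ReLU for brevity.

```python
import numpy as np

def conv_layer(activations, filters):
    """activations: (in_channels, H, W); filters: (out_channels, in_channels, kH, kW).
    Each filter is convolved over the input activations, and each filter
    produces one channel of output activations."""
    out_c, in_c, kh, kw = filters.shape
    _, h, w = activations.shape
    out = np.zeros((out_c, h - kh + 1, w - kw + 1))
    for o in range(out_c):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # Dot product of the filter with an input patch across all channels.
                patch = activations[:, i:i + kh, j:j + kw]
                out[o, i, j] = np.sum(patch * filters[o])
    return np.maximum(out, 0.0)  # non-linear activation function (ReLU)
```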
In some embodiments, the weight values and/or activation values within the network are quantized for use on a particular device. Specifically, some embodiments quantize activation values to a particular number of bits (e.g., 4 bits) during the execution of the network. For weight values, some embodiments use binary or ternary weight values. Binary weight values are typically trained such that each weight is either 0 or 1, and ternary weight values are typically trained such that each weight value is one of the set {0, 1, −1}. In either case, the weight values may be multiplied by a scale value determined for a layer or channel. To save memory, some embodiments train the networks to be extremely sparse, with a large majority (e.g., 85%, 90%) of the weights set to 0 (rather than 1 or −1). In some such embodiments, the weights are stored on the device (e.g., in the memory of a neural network inference circuit embedded in the device) in an encoded manner such that zero-value weights require less memory than non-zero weights. These networks can still be very predictive, but at the margins, decreasing sparsity (e.g., from 90% to 85%) improves prediction accuracy.

The neural network inference circuit of some embodiments that stores the weight and activation values during execution has a fixed amount of available memory for these values. In some embodiments, the neural network inference circuit stores the weights for all layers in memory for the entirety of the execution of the network, while the activation values are only stored as long as they are needed (e.g., for one or two layers in many cases) and then overwritten by activation values of later layers. The weight values are loaded into memory when the neural network inference circuit is initially powered up and remain until the circuit is powered off. When the neural network infer
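To see why sparsity matters within this fixed memory budget, here is a toy cost model, assuming (purely for illustration) that an encoded zero weight costs 1 bit and an encoded non-zero ternary weight costs 3 bits; the actual on-circuit encoding is not specified here.

```python
import numpy as np

def encoded_size_bits(weights, zero_bits=1, nonzero_bits=3):
    """Illustrative cost model: zero weights are cheaper to store
    than non-zero ternary weights (+1/-1)."""
    nonzero = np.count_nonzero(weights)
    return (weights.size - nonzero) * zero_bits + nonzero * nonzero_bits

# Compare a 90%-sparse and an 85%-sparse ternary layer of ~2 million weights:
rng = np.random.default_rng(0)
w90 = rng.choice([0, 1, -1], size=2_000_000, p=[0.90, 0.05, 0.05])
w85 = rng.choice([0, 1, -1], size=2_000_000, p=[0.85, 0.075, 0.075])
print(encoded_size_bits(w90) / 8 / 1024)  # ~293 KiB
print(encoded_size_bits(w85) / 8 / 1024)  # ~317 KiB
```

Under this model, trading five points of sparsity for accuracy costs roughly 24 KiB of the weight budget in this one layer, which is the kind of accuracy-for-memory trade-off the block-wise execution method is meant to ease by freeing activation memory for weights.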