CN-122021735-A - Neural network instruction set architecture
Abstract
The application relates to a neural network instruction set architecture. A computer-implemented method includes receiving, by a processing unit, an instruction specifying data values for performing a tensor computation. In response to receiving the instruction, the method may include performing, by the processing unit, the tensor computation by executing a loop nest comprising a plurality of loops, wherein a structure of the loop nest is defined based on one or more of the data values of the instruction. The tensor computation can be at least part of a computation of a neural network layer. The data values specified by the instruction can include at least one value specifying a type of the neural network layer, and the structure of the loop nest can be defined at least in part by the type of the neural network layer.
Inventors
- Ravi Narayanaswami
- Dong Hyuk Woo
- Olivier Temam
- Harshit Khaitan
Assignees
- Google LLC
Dates
- Publication Date: 2026-05-12
- Application Date: 2017-09-29
- Priority Date: 2016-10-27
Claims (20)
- 1. A method for accelerating tensor computations of a neural network comprising a plurality of neural network layers, the method comprising: obtaining, by a controller of a hardware computation module, a single instruction encoding a plurality of data values; identifying, by the controller, an opcode in the single instruction, the opcode specifying a tensor operation for a layer type of a neural network layer of the plurality of neural network layers, the tensor operation traversing a multi-dimensional input tensor using a loop nest of instructions, wherein two or more instructions of the loop nest are independently executable by the controller such that the tensor computation is performed, comprising: traversing a plurality of elements along a plurality of dimensions of the multi-dimensional input tensor based on the single instruction to obtain inputs stored at respective memory locations corresponding to the plurality of elements; and executing a respective subset of operations of the loop nest of instructions to process the inputs stored at the respective memory locations through the neural network layer; and generating, by the hardware computation module, an output of the neural network layer based on the plurality of data values, the layer type specified by the opcode, and the inputs corresponding to the plurality of elements of the multi-dimensional input tensor.
- 2. The method of claim 1, wherein generating the output comprises: performing a first portion of the tensor computation based on the single instruction; processing a set of inputs through the neural network layer of the layer type specified by the opcode in response to performing the first portion of the tensor computation; and generating the output based on the set of inputs processed by the neural network layer.
- 3. The method of claim 2, wherein: the set of inputs is derived from the multi-dimensional input tensor; and each element of the plurality of elements corresponds to a respective input in the set of inputs.
- 4. The method of claim 2, further comprising: determining, based on the opcode, that the layer type of the neural network layer is a convolutional layer type corresponding to a convolutional neural network layer; and determining, based on the opcode, that the tensor computation is for a convolution operation performed at the convolutional neural network layer.
- 5. The method of claim 2, wherein performing the tensor computation comprises: traversing a first plurality of elements along a first dimension of the multi-dimensional input tensor based on the single instruction.
- 6. The method of claim 5, wherein the multi-dimensional input tensor is an activation tensor, and the method further comprises: preloading, based on the single instruction, a plurality of activations of the activation tensor into a first memory of a computing unit that receives the single instruction.
- 7. The method of claim 6, wherein traversing the first plurality of elements along the first dimension comprises: accessing, based on the single instruction, a plurality of address locations of the first memory, wherein each address location of the plurality of address locations corresponds to a respective element of the activation tensor along the first dimension of the activation tensor.
- 8. The method of claim 6, wherein, based on the single instruction, the computing unit is instructed to perform only a subset of the total computation required to traverse the multi-dimensional input tensor.
- 9. The method of claim 1, wherein the plurality of elements along the plurality of dimensions of the multi-dimensional input tensor comprise: a first plurality of elements along an x-dimension of the multi-dimensional input tensor; a second plurality of elements along a y-dimension of the multi-dimensional input tensor; and a third plurality of elements along a z-dimension of the multi-dimensional input tensor.
- 10. The method of claim 2, wherein the single instruction comprises a plurality of opcodes, each opcode indicating an operation type that is a tensor operation.
- 11. The method of claim 2, wherein the single instruction comprises a plurality of opcodes, each opcode indicating an operation type that is a direct memory access (DMA) operation.
- 12. A system for accelerating tensor computations of a neural network having a plurality of neural network layers, the system comprising: a processor; and a non-transitory storage medium storing instructions executable by the processor to cause performance of operations comprising: obtaining, by a controller of a hardware computation module, a single instruction encoding a plurality of data values; identifying, by the controller, an opcode in the single instruction, the opcode specifying a tensor operation for a layer type of a neural network layer of the plurality of neural network layers, the tensor operation traversing a multi-dimensional input tensor using a loop nest of instructions, wherein two or more instructions of the loop nest are independently executable by the controller such that the tensor computation is performed, comprising: traversing a plurality of elements along a plurality of dimensions of the multi-dimensional input tensor based on the single instruction to obtain inputs stored at respective memory locations corresponding to the plurality of elements; and executing a respective subset of operations of the loop nest of instructions to process the inputs stored at the respective memory locations through the neural network layer; and generating, by the hardware computation module, an output of the neural network layer based on the plurality of data values, the layer type specified by the opcode, and the inputs corresponding to the plurality of elements of the multi-dimensional input tensor.
- 13. The system of claim 12, wherein generating the output comprises: performing a first portion of the tensor computation based on the single instruction; processing a set of inputs through the neural network layer of the layer type specified by the opcode in response to performing the first portion of the tensor computation; and generating the output based on the set of inputs processed by the neural network layer.
- 14. The system of claim 13, wherein: the set of inputs is derived from the multi-dimensional input tensor; and each element of the plurality of elements corresponds to a respective input in the set of inputs.
- 15. The system of claim 13, wherein the operations further comprise: determining, based on the opcode, that the layer type of the neural network layer is a convolutional layer type corresponding to a convolutional neural network layer; and determining, based on the opcode, that the tensor computation is for a convolution operation performed at the convolutional neural network layer.
- 16. The system of claim 13, wherein performing the tensor computation comprises: traversing a first plurality of elements along a first dimension of the multi-dimensional input tensor based on the single instruction.
- 17. The system of claim 16, wherein the multi-dimensional input tensor is an activation tensor, and the operations further comprise: preloading, based on the single instruction, a plurality of activations of the activation tensor into a first memory of a computing unit that receives the single instruction.
- 18. The system of claim 17, wherein traversing the first plurality of elements along the first dimension comprises: accessing, based on the single instruction, a plurality of address locations of the first memory, wherein each address location of the plurality of address locations corresponds to a respective element of the activation tensor along the first dimension of the activation tensor.
- 19. The system of claim 17, wherein, based on the single instruction, the computing unit is instructed to perform only a subset of the total computation required to traverse the multi-dimensional input tensor.
- 20. The system of claim 12, wherein the plurality of elements along the plurality of dimensions of the multi-dimensional input tensor comprise: a first plurality of elements along an x-dimension of the multi-dimensional input tensor; a second plurality of elements along a y-dimension of the multi-dimensional input tensor; and a third plurality of elements along a z-dimension of the multi-dimensional input tensor.
Description
Neural network instruction set architecture

Description of the Division
This application is a divisional application of Chinese patent application No. 201710909908.8, filed on September 29, 2017.

Technical Field
This specification relates to an instruction set for computation of deep neural networks ("DNNs").

Background
Neural networks are machine learning models that employ one or more layers of models to generate an output, e.g., a classification, for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer of the network. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

Some neural networks include one or more convolutional neural network layers. Each convolutional neural network layer has an associated set of kernels. Each kernel includes values established by a neural network model created by a user. In some implementations, the kernels identify particular image contours, shapes, or colors. The kernels can be represented as a matrix structure of weight inputs. Each convolutional layer is also able to process a set of activation inputs, and the set of activation inputs can also be represented as a matrix structure.

Disclosure of the Invention
One innovative aspect of the subject matter described in this specification can be embodied in a computer-implemented method. The method includes receiving, by a processing unit, an instruction specifying data values for performing a tensor computation.
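As a minimal illustrative sketch (not the patented hardware, and using names of my own choosing), the background's description of a convolutional layer — a weight kernel applied across an activation matrix — reduces to exactly the kind of loop nest the specification later parameterizes:

```python
# Sketch: a convolutional layer's core computation as a loop nest over
# an activation matrix and a kernel (weight) matrix. Names and shapes
# are illustrative only.

def conv2d(activations, kernel):
    """Valid 2D convolution of an activation matrix with a weight kernel."""
    h = len(activations) - len(kernel) + 1
    w = len(activations[0]) - len(kernel[0]) + 1
    out = [[0] * w for _ in range(h)]
    for y in range(h):                            # output rows
        for x in range(w):                        # output columns
            for ky in range(len(kernel)):         # kernel rows
                for kx in range(len(kernel[0])):  # kernel columns
                    out[y][x] += activations[y + ky][x + kx] * kernel[ky][kx]
    return out

edge = [[1, -1], [1, -1]]                  # toy vertical-edge kernel
image = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]  # toy activation matrix
result = conv2d(image, edge)               # [[0, -2], [0, -2]]
```

The four nested loops here are fixed in software; the claimed architecture instead derives such a nest's structure from fields of a single hardware instruction.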
The method may include performing, by the processing unit, the tensor computation by executing a loop nest comprising a plurality of loops in response to receiving the instruction, wherein a structure of the loop nest is defined based on one or more of the data values of the instruction. These and other implementations can each optionally include one or more of the following features. For example, the tensor computation can be at least part of a computation of a neural network layer. The data values specified by the instruction can include at least one value specifying a type of the neural network layer, and the structure of the loop nest can be defined at least in part by the type of the neural network layer. Executing the loop nest comprising a plurality of loops may thus refer to traversing elements of a tensor in an order specified by the structure of the nested loops, e.g., by the depth of the loop nest and the start index, end index, stride, and direction of each loop. In some implementations, the tensor computation is at least part of a computation of a neural network layer. In some implementations, the data values specified by the instruction include at least one value specifying a type of the neural network layer, and the structure of the loop nest is defined at least in part by the type of the neural network layer. In some implementations, the instruction causes the processing unit to access at least one element of a dimension of the tensor, the dimension being part of at least one index used in executing the loop nest during the tensor computation. In some implementations, the instruction causes the processing unit to access at least one memory address of an array in a storage medium, the memory address of the array comprising a variable read by the processing unit while performing the tensor computation.
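The idea of a loop nest whose structure (depth, bounds, stride, and direction per loop) is taken from data values of a single instruction can be sketched as follows; the instruction encoding here is a hypothetical stand-in, not the format defined by the patent:

```python
# Sketch: traverse tensor elements in the order defined by a loop nest
# whose structure comes from an "instruction" — here modeled as a list of
# per-loop (start, end, stride) triples, outermost loop first. A negative
# stride reverses the traversal direction of that loop.

def traverse(instruction):
    """Yield tensor index tuples in the order the loop nest specifies."""
    def nest(level, index):
        if level == len(instruction):
            yield tuple(index)       # innermost loop body reached
            return
        start, end, stride = instruction[level]
        for i in range(start, end, stride):
            yield from nest(level + 1, index + [i])
    yield from nest(0, [])

# A 2x3 traversal, outer dimension forward, inner dimension reversed:
order = list(traverse([(0, 2, 1), (2, -1, -1)]))
# order == [(0, 2), (0, 1), (0, 0), (1, 2), (1, 1), (1, 0)]
```

Changing the data values changes the nest's depth and each loop's bounds without changing the traversal code itself, which mirrors how one instruction can configure different traversals for different layer types.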
In some implementations, performing the tensor computation includes providing, by the processing unit, at least one control signal to a tensor traversal unit (TTU) to cause the TTU to emit loop indices used in executing the loop nest during the tensor computation. In some implementations, the method further includes providing, by the processing unit, at least one control signal to the TTU to cause an array reference of the TTU to generate addresses of referenced array elements used in executing the loop nest during the tensor computation. In some implementations, the instruction indicates a first TTU counter that is added to a second TTU counter to generate an address for an array reference associated with the TTU. In some implementations, performing the tensor computation includes executing, by the processing unit, a first synchronization procedure that manages one or more operands associated with performing the tensor computation, wherein managing the operands includes stalling one or more loop nests based on a sync flag condition. In some implementations, performing the tensor computation includes executing, by the processing unit, a second synchronization procedure that manages incrementing a counter associated with a characteristic of the loop nest. Another innovative aspect of the subject matter described in this specification can be embodied in an electronic system that includes a
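The counter-based address generation attributed to the TTU above — per-dimension counters that advance through the loop nest and are combined to form an array element's address — can be sketched as follows. This is a software model under my own assumptions (class and field names are illustrative), not the hardware described by the patent:

```python
# Sketch: TTU-style address generation. Each dimension has a counter;
# the address of the referenced array element is the base address plus
# the counters scaled by per-dimension strides and added together.

class TensorTraversalUnit:
    def __init__(self, base, extents, strides):
        self.base = base                   # base memory address of the array
        self.extents = extents             # loop bound per dimension
        self.strides = strides             # address step per dimension
        self.counters = [0] * len(extents) # one counter per loop in the nest

    def address(self):
        # Scaled counters are added to form the element address.
        return self.base + sum(c * s for c, s in zip(self.counters, self.strides))

    def step(self):
        """Advance the innermost counter, carrying into outer counters.
        Returns False once the whole tensor has been traversed."""
        for d in reversed(range(len(self.counters))):
            self.counters[d] += 1
            if self.counters[d] < self.extents[d]:
                return True
            self.counters[d] = 0
        return False

# Row-major traversal of a 2x3 array starting at address 0x1000:
ttu = TensorTraversalUnit(base=0x1000, extents=[2, 3], strides=[3, 1])
addrs = [ttu.address()]
while ttu.step():
    addrs.append(ttu.address())
# addrs == [0x1000, 0x1001, 0x1002, 0x1003, 0x1004, 0x1005]
```

The sum over scaled counters corresponds to the claim language in which one TTU counter is added to another to produce the address for an array reference.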