US-20260127436-A1 - METHOD FOR GENERATING COMMAND SET FOR NEURAL NETWORK OPERATION, AND COMPUTING DEVICE FOR SAME
Abstract
Disclosed is a method for generating an NPU command, comprising the steps of: generating a p-th partial network having the same structure as a structure of a first network defined by a first group of layers included in a predefined neural network; determining, in a first memory included in another computing device, a p-th read address, which is a location of an address where a p-th partial input activation, which is data to be input to an uppermost layer of the p-th partial network, is stored; determining, in the first memory, a p-th write address, which is a location of an address where a p-th partial output activation, which is data output by a lowermost layer of the p-th partial network, is to be stored; and generating an NPU command on the basis of the p-th read address and the p-th write address.
Inventors
- Hyun EUN
Assignees
- OPENEDGES TECHNOLOGY, INC.
Dates
- Publication Date
- 20260507
- Application Date
- 20231005
- Priority Date
- 20221006
Claims (13)
- 1 . A method of creating an NPU command, comprising: generating, by a computing device, a p-th partial network having the same structure as a structure of a first network defined by a first group of layers included in a predefined neural network; determining, by the computing device, in a first memory included in another computing device, a p-th read address, which is a location of an address where a p-th partial input activation, which is data to be input to an uppermost layer of the p-th partial network, is stored; determining, by the computing device, in the first memory, a p-th write address, which is a location of an address where a p-th partial output activation, which is data output by a lowermost layer of the p-th partial network, is to be stored; and generating, by the computing device, an NPU command [p] including a first command set, a second command set, and a third command set, wherein the first command set includes commands for causing an NPU included in the other computing device to read the p-th partial input activation from the first memory based on the p-th read address and store the p-th partial input activation in an internal memory of the NPU, the second command set includes commands for causing the NPU to generate the p-th partial output activation based on the p-th partial input activation stored in the internal memory, and the third command set includes commands for causing the NPU to store the p-th partial output activation in the first memory based on the p-th write address.
- 2 . The method of claim 1 , wherein the first memory is a memory provided outside the NPU, the p-th partial input activation is configured to be transferred from the first memory to the internal memory of the NPU through a bus of the other computing device, and the p-th partial output activation is configured to be transferred from the internal memory to the first memory through the bus.
- 3 . The method of claim 1 , wherein the p-th partial output activation is generated by performing operation on the p-th partial input activation stored in the internal memory based on operation rules of layers included in the p-th partial network.
- 4 . The method of claim 1 , wherein the generating of the p-th partial network comprises: defining, by the computing device, the first group composed of a plurality of consecutive layers included in a predefined neural network; generating, by the computing device, structure information about the first network composed of a plurality of layers included in the defined first group and a plurality of links; and generating, by the computing device, the p-th partial network having the same structure as the first network, and the structure information about the first network is information about layers constituting the first group, operation rules of the layers, and links indicating activation movement paths between the layers.
- 5 . The method of claim 1 , wherein the first group comprises a plurality of layers, the uppermost layer is a layer of the plurality of layers that receives an activation from outside the first group, and the lowermost layer is a layer of the plurality of layers that provides an activation to outside the first group.
- 6 . The method of claim 1 , wherein the p-th partial input activation is a part of an input activation to be input to an uppermost layer among the first group of the layers.
- 7 . A method of creating an NPU command, comprising: generating, by a computing device, a partitioned network including a p-th partial network based on a first network composed of a first group of layers included in a predefined neural network (p is 1, 2, …, and P); and generating, by the computing device, an NPU command [p] that is configured to be executed by an NPU included in another computing device with respect to the p-th partial network (p is 1, 2, …, or P), wherein the generating of the partitioned network comprises: defining, by the computing device, a p-th slice layer configured to receive an input activation to be input to the first group and output a partial input activation that is a part of the input activation (p is 1, 2, …, and P); defining, by the computing device, a p-th partial network that receives a p-th partial input activation output from the p-th slice layer (p is 1, 2, …, and P); defining, by the computing device, a concatenation layer that combines P partial output activations output from the P partial networks with each other; and completing, by the computing device, the partitioned network by defining a plurality of links indicating activation movement paths between the P slice layers, the P partial networks, and the concatenation layer.
- 8 . The method of claim 7 , wherein the p-th partial input activation is a part of an input activation configured to be input to an uppermost layer among the first group of the layers, and the input activation is restored using the first partial input activation to the P-th partial input activation.
- 9 . The method of claim 7 , wherein a structure of the p-th partial network is the same as a structure of the first network (p is 1, 2, …, and P), the generating of the NPU command [p] comprises: determining, by the computing device, in a first memory included in another computing device, a p-th read address, which is a location of an address where a p-th partial input activation, which is data to be input to an uppermost layer of the p-th partial network, is stored; determining, by the computing device, in the first memory, a p-th write address, which is a location of an address where a p-th partial output activation, which is data output by a lowermost layer of the p-th partial network, is to be stored; and generating, by the computing device, an NPU command [p] including a first command set, a second command set, and a third command set, the first command set includes commands for causing the NPU to read the p-th partial input activation from the first memory based on the p-th read address and store the p-th partial input activation in an internal memory of the NPU, the second command set includes commands for causing the NPU to generate the p-th partial output activation based on the p-th partial input activation stored in the internal memory, and the third command set includes commands for causing the NPU to store the p-th partial output activation in the first memory based on the p-th write address.
- 10 . The method of claim 9 , wherein the first memory is a memory provided outside the NPU, the p-th partial input activation is configured to be transferred from the first memory to the internal memory of the NPU through a bus of the other computing device, and the p-th partial output activation is configured to be transferred from the internal memory to the first memory through the bus.
- 11 . The method of claim 7 , wherein the generating of the p-th partial network comprises: defining, by the computing device, the first group composed of a plurality of consecutive layers included in a predefined neural network; generating, by the computing device, structure information about the first network composed of a plurality of layers included in the defined first group and a plurality of links; and generating, by the computing device, the p-th partial network having the same structure as the first network, and the structure information about the first network is information about layers constituting the first group, operation rules of the layers, and links indicating activation movement paths between the layers.
- 12 . A computing device comprising: a storage unit; and a main processor, wherein, in the storage unit, a program comprising commands that cause the main processor to execute: generating a p-th partial network having the same structure as a structure of a first network defined by a first group of layers included in a predefined neural network; determining, in a first memory included in another computing device, a p-th read address, which is a location of an address where a p-th partial input activation, which is data to be input to an uppermost layer of the p-th partial network, is stored; determining, in the first memory, a p-th write address, which is a location of an address where a p-th partial output activation, which is data output by a lowermost layer of the p-th partial network, is to be stored; and generating an NPU command [p] including a first command set, a second command set, and a third command set, is written, the first command set includes commands for causing an NPU included in the other computing device to read the p-th partial input activation from the first memory based on the p-th read address and store the p-th partial input activation in an internal memory of the NPU, the second command set includes commands for causing the NPU to generate the p-th partial output activation based on the p-th partial input activation stored in the internal memory, and the third command set includes commands for causing the NPU to store the p-th partial output activation in the first memory based on the p-th write address.
- 13 . A computing device comprising: a storage unit; and a main processor, wherein, in the storage unit, a program comprising commands that cause the main processor to execute: generating a partitioned network including a p-th partial network based on a first network composed of a first group of layers included in a predefined neural network (p is 1, 2, …, and P); and generating an NPU command [p] that is configured to be executed by an NPU included in another computing device with respect to the p-th partial network (p is 1, 2, …, or P), is written, and the generating of the partitioned network comprises: defining, by the computing device, a p-th slice layer configured to receive an input activation to be input to the first group and output a partial input activation that is a part of the input activation (p is 1, 2, …, and P); defining, by the computing device, a p-th partial network that receives a p-th partial input activation output from the p-th slice layer (p is 1, 2, …, and P); defining, by the computing device, a concatenation layer that combines P partial output activations output from the P partial networks with each other; and completing, by the computing device, the partitioned network by defining a plurality of links indicating activation movement paths between the P slice layers, the P partial networks, and the concatenation layer.
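The three-command-set structure recited in the claims (load the p-th partial input activation from a read address, compute the p-th partial output activation, store it at a write address) can be illustrated with the following sketch. This is not the patented implementation; all names (`build_npu_command`, `LOAD`, `COMPUTE`, `STORE`, the addresses, and the layer operation list) are hypothetical and chosen only to mirror the claim language.

```python
# Illustrative sketch of the claimed command generation (hypothetical names).
# For each partial network p, NPU command [p] bundles three command sets:
#   first:  read the p-th partial input activation (external memory -> NPU),
#   second: run the p-th partial network on the NPU's internal memory,
#   third:  write the p-th partial output activation (NPU -> external memory).

def build_npu_command(p, read_addr, write_addr, layer_ops):
    """Return an NPU command [p] dict holding the three claimed command sets."""
    first_set = [("LOAD", read_addr, "internal_mem")]
    second_set = [("COMPUTE", op, "internal_mem") for op in layer_ops]
    third_set = [("STORE", "internal_mem", write_addr)]
    return {"p": p, "first": first_set, "second": second_set, "third": third_set}

# Example: P = 3 partial networks whose slices occupy contiguous regions
# of the first memory (base addresses and slice size are made up).
P = 3
slice_bytes = 0x1000
read_base, write_base = 0x8000_0000, 0x9000_0000
commands = [
    build_npu_command(
        p,
        read_addr=read_base + p * slice_bytes,
        write_addr=write_base + p * slice_bytes,
        layer_ops=["conv", "relu", "pool"],
    )
    for p in range(P)
]
```

Each `commands[p]` entry is self-contained, so an NPU controller could in principle execute the P commands independently, one partial network at a time.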
Description
TECHNICAL FIELD The present invention relates to a technology for generating commands to improve the efficiency of a neural network operation and the utilization efficiency of computing resources in a computing device including a neural processing unit (NPU). BACKGROUND ART This invention relates to a neural network operation executed in an NPU installed on a computing device. FIG. 1 illustrates an example of a neural network operation, using a convolutional neural network (CNN) according to an embodiment. Hereinafter, a description will be given with reference to FIG. 1. First, convolution layers 52 may be generated by performing convolution operations using a plurality of kernels on input image data 51 stored in an internal memory. The generating of the convolution layers 52 may include performing a non-linear operation (e.g., ReLU, sigmoid, or tanh) on a plurality of feature maps obtained as a result of performing the convolution operations. Next, pooling layers 53 may be generated by performing pooling on the convolution layers 52. Each convolution layer 52 may include data that can be represented in the form of an M×N matrix. Next, an array to be input to an internal neural network 54 may be generated by flattening the pooling layers 53. Next, an output may be generated from the internal neural network 54 by inputting the array into the internal neural network 54. Each of the distinct operation processes illustrated in FIG. 1 may be considered a different layer. In addition, the neural network according to the present invention may be considered to include all layers illustrated in FIG. 1, or the neural network may be considered to mean only the internal neural network 54. FIG. 1 is an example to aid understanding, and thus the scope of the neural network according to the present invention is not limited to the above-described content.
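The FIG. 1 pipeline (convolution, non-linear activation, pooling, flattening, and an internal dense network) can be sketched in miniature as follows. The sizes, kernel, and weights are illustrative only and are not taken from the patent; the sketch simply walks one tiny input through each stage.

```python
# Minimal pure-Python sketch of the FIG. 1 pipeline (illustrative sizes):
# convolution -> ReLU -> max pooling -> flatten -> small dense layer.

def conv2d_valid(img, kernel):
    """Valid (no-padding) 2D convolution/correlation of one channel."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]

def relu(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, k=2):
    """Non-overlapping k x k max pooling (stride = k)."""
    return [[max(fmap[i + a][j + b] for a in range(k) for b in range(k))
             for j in range(0, len(fmap[0]) - k + 1, k)]
            for i in range(0, len(fmap) - k + 1, k)]

def flatten(fmap):
    return [v for row in fmap for v in row]

def dense(vec, weights, bias):
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

image = [[1, 2, 0, 1],
         [0, 1, 3, 1],
         [2, 1, 0, 0],
         [1, 0, 1, 2]]
kernel = [[1, 0],
          [0, -1]]
fmap = relu(conv2d_valid(image, kernel))   # 3x3 feature map after ReLU
pooled = max_pool(fmap)                    # pooled feature map
arr = flatten(pooled)                      # array fed to the dense layer
out = dense(arr, weights=[[0.5] * len(arr)], bias=[0.0])
```

In the patent's terms, each of the five functions above would correspond to a distinct layer, and the `image`/`fmap`/`pooled` values are the activations that move along the links between those layers.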
In the neural network, data may be operated on and converted each time it passes through a layer while moving along the stream. This conversion and flow of data can be expressed in terms of a stream. The neural network may include a first layer and a second layer. In this case, if an output activation output from the first layer is input to the second layer as it is, or after being further converted, the first layer may be referred to as a layer existing further upstream than the second layer, and the second layer may be referred to as a layer existing further downstream than the first layer. The terms upstream and downstream are introduced for the convenience of the description of the present invention. A computing device, such as a desktop computer, a laptop computer, a smartphone, or a tablet, may be equipped with a neural processing unit (NPU). The NPU may have a structure suitable for a neural network operation. In this case, in order for the NPU to execute the neural network operation, a controller in the NPU should execute predetermined commands for the neural network operation to control resources in the NPU. The commands may be stored in the NPU in the process of manufacturing the user device, or may be provided to the NPU after the user device is manufactured. When a predetermined neural network is operated on the NPU, the size of the input/output data of a specific layer defined in the predetermined neural network may be larger than the capacity of the internal memory of the NPU. In this case, it is necessary to divide the input/output data into pieces small enough to be stored in the internal memory and process them. In order to execute an operation corresponding to one specific layer, the NPU may obtain the input data required for the operation, such as an input activation and other input data (e.g., weights) that are to be input to the specific layer, from a memory (e.g., DRAM) external to the NPU through a bus.
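The division described above, splitting data that exceeds the internal memory into pieces and fetching each piece from external memory over the bus, amounts to a tiling plan: each tile needs an external-memory read address for its input and a write address for its output. The following sketch shows one way such a plan could be derived; the sizes and base addresses are hypothetical, and the simplifying assumption that each output tile is the same size as its input tile is not taken from the patent.

```python
# Hypothetical sketch: split an activation stored in external memory
# (e.g., DRAM) into tiles that each fit the NPU's internal memory, and
# derive the per-tile read address and per-tile output write address.
# Simplifying assumption: the output tile is the same size as the input
# tile, so the write offset mirrors the read offset.

def plan_tiles(total_bytes, internal_bytes, read_base, write_base):
    """Return a list of (read_addr, write_addr, tile_bytes) per tile."""
    tiles = []
    offset = 0
    while offset < total_bytes:
        size = min(internal_bytes, total_bytes - offset)
        tiles.append((read_base + offset, write_base + offset, size))
        offset += size
    return tiles

# Example: a 10 KiB activation and a 4 KiB internal buffer -> 3 tiles.
plan = plan_tiles(total_bytes=10 * 1024, internal_bytes=4 * 1024,
                  read_base=0x8000_0000, write_base=0x9000_0000)
```

Each entry in `plan` plays the role of the p-th read address and p-th write address in the claims: it tells the NPU where in external memory to fetch a piece from and where to deposit the corresponding result.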
Also, an output activation (output data) output by the one specific layer may again be provided to the memory external to the NPU through the bus. Since a write/read operation is performed on the external memory through the bus whenever the operation for each layer is performed, there is a problem that, as the number of layers in the neural network increases, more computing resources are consumed and the overall operation efficiency decreases. This problem also occurs when the input/output data is divided into pieces small enough to be stored in the internal memory and the operations are performed on the pieces. Since the layers constituting a neural network may be connected by a neural network manufacturer in a large number of input/output configurations, it is difficult to perform effective operation division for all connection cases. For this reason, there is a problem that efficient hardware operation is difficult in terms of power and bandwidth. In one implementation of a neural network operation method, data such as input tensors, layer parameters, weights, and biases are r