US-20260127439-A1 - APPARATUS, METHOD, AND SYSTEM FOR DEPLOYING NEURAL NETWORK MODEL

US 20260127439 A1

Abstract

A method may comprise receiving a first neural network (NN) model including one or more functions; generating a second NN model in a form of directed acyclic graph (DAG) including one or more graph modules by converting the one or more functions; calculating one or more scale values by obtaining maximum and minimum values of parameters input to the one or more graph modules; updating the parameters based on the one or more scale values; and generating a third NN model, in a form of machine code executable on a particular neural processing unit, including the updated parameters.

Inventors

  • Lok Won Kim
  • Jang Min SON
  • You Jun KIM
  • Bum Jun JUNG

Assignees

  • DEEPX CO., LTD.

Dates

Publication Date
20260507
Application Date
20260102
Priority Date
20240202

Claims (20)

  1. A neural network model deployment apparatus, comprising: a communication interface to communicate with an edge device; a memory storing instructions; and at least one processor configured to execute the instructions to: convert a source neural network (NN) model into a graph-based NN model having a graph structure; obtain parameter values from nodes of the graph-based NN model by applying a calibration dataset; determine scale values based on a difference between a maximum value and a minimum value of the obtained parameter values, wherein each scale value defines a quantization resolution; generate machine code executable by a neural processing unit (NPU) of the edge device by updating parameters of the graph-based NN model based on the scale values and compiling the graph-based NN model according to hardware attributes of the NPU; and transmit the machine code to the edge device via the communication interface.
  2. The neural network model deployment apparatus of claim 1, wherein the hardware attributes include at least one of an internal memory capacity, a number of processing elements, and supported operation types of the NPU.
  3. The neural network model deployment apparatus of claim 1, wherein the machine code includes a multiply-and-accumulate operation comprising at least one of a convolution operation and a matrix multiplication operation to be executed by the NPU.
  4. The neural network model deployment apparatus of claim 1, wherein the machine code is configured to cause the NPU to process weight parameters as integers and process an output of an activation function as a floating-point number.
  5. The neural network model deployment apparatus of claim 1, wherein the instructions further cause the at least one processor to include quantization parameters in the machine code, the quantization parameters comprising the scale values and offset values associated with the scale values, to be processed by a dequantization circuit of the NPU.
  6. The neural network model deployment apparatus of claim 1, wherein the updating of the parameters is performed based on compilation options including at least one of outlier alleviation, parameter refinement, layer-wise training, and quantization-aware self-distillation (QASD).
  7. The neural network model deployment apparatus of claim 6, wherein the outlier alleviation includes adjusting input parameters and weight parameters of a target graph module in the graph-based NN model to mitigate outliers prior to a multiply-and-accumulate operation.
  8. A method for deploying a neural network model, performed by a deployment apparatus, the method comprising: converting a source neural network (NN) model into a graph-based NN model having a graph structure by transforming function call instructions of the source NN model into corresponding graph modules; obtaining parameter values from nodes of the graph-based NN model by applying a calibration dataset; determining scale values based on a difference between a maximum value and a minimum value of the obtained parameter values; generating machine code executable by a neural processing unit (NPU) of an edge device by updating parameters of the graph-based NN model based on the scale values and compiling the graph-based NN model according to hardware attributes of the NPU; and transmitting the machine code to the edge device.
  9. The method of claim 8, wherein the determining of the scale values includes calculating a scale value based on a target quantization bitwidth and the difference between the maximum value and the minimum value.
  10. The method of claim 8, wherein the updating of the parameters includes performing parameter refinement by selecting updated scale values that maximize a cosine similarity between a computation result of the graph-based NN model with quantization and a computation result without quantization.
  11. The method of claim 8, wherein the updating of the parameters includes performing layer-wise training to update weight parameters of each layer of the graph-based NN model to reduce a quantization loss relative to an output of a corresponding layer before quantization.
  12. The method of claim 8, wherein the updating of the parameters includes performing quantization-aware self-distillation (QASD) by retraining the graph-based NN model using an output of the source NN model as a teacher signal.
  13. The method of claim 8, wherein the updating of the parameters includes performing pruning by zeroing weight parameters based on a threshold value or a predetermined mask pattern.
  14. The method of claim 8, wherein the hardware attributes of the NPU include configuration information of an on-chip memory and a plurality of processing elements included in the NPU.
  15. A neural network porting system comprising: a deployment apparatus configured to convert a source neural network model into a graph-based model, collect parameter values using a calibration dataset, determine scale values based on a difference between maximum and minimum values of the collected parameter values, generate machine code using parameters updated based on the scale values, and transmit the machine code; and an edge device comprising a neural processing unit (NPU) and an on-chip memory, wherein the edge device is configured to receive the machine code and execute the machine code using the NPU to perform inference.
  16. The neural network porting system of claim 15, wherein the NPU comprises: first circuitry provided for the on-chip memory; second circuitry provided for a plurality of processing elements configured to perform multiply-and-accumulate operations using integer weight parameters included in the machine code; and third circuitry provided for a controller configured to manage data flow between the on-chip memory and the plurality of processing elements according to the machine code.
  17. The neural network porting system of claim 16, wherein the NPU further comprises a dequantization circuit configured to convert integer output parameters from the plurality of processing elements into floating-point parameters using the scale values included in the machine code.
  18. The neural network porting system of claim 17, wherein the NPU further comprises an activation function circuit configured to apply an activation function to the floating-point parameters.
  19. The neural network porting system of claim 18, wherein the NPU further comprises a quantization circuit configured to convert an output of the activation function circuit into integer parameters.
  20. The neural network porting system of claim 15, wherein the deployment apparatus is configured to generate the machine code such that the machine code causes the NPU to execute operations in an order determined based on data locality of the graph-based model and an on-chip memory size.
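The min/max-based scale calculation recited in claims 1 and 9, together with the integer/floating-point round trip performed by the dequantization and quantization circuits of claims 17 and 19, can be sketched as follows. This is a minimal illustration assuming a simple asymmetric uniform quantization scheme; the function names and the exact rounding and offset conventions are assumptions for illustration, not the claimed implementation.

```python
import numpy as np

def calc_scale(values, bitwidth=8):
    """Derive a quantization scale and offset from calibrated min/max values.

    The scale (quantization resolution) is the observed value range divided
    by the number of integer levels available at the target bitwidth.
    """
    vmax, vmin = float(np.max(values)), float(np.min(values))
    levels = 2 ** bitwidth - 1            # e.g. 255 levels for 8-bit
    scale = (vmax - vmin) / levels
    offset = round(-vmin / scale)         # zero-point so that vmin maps to 0
    return scale, offset

def quantize(values, scale, offset, bitwidth=8):
    # Float parameters to clipped integers, as a quantization circuit might.
    q = np.round(values / scale) + offset
    return np.clip(q, 0, 2 ** bitwidth - 1).astype(np.int32)

def dequantize(q, scale, offset):
    # Integers back to floats, mirroring a dequantization circuit.
    return (q.astype(np.float32) - offset) * scale

# Parameter values collected with a (hypothetical) calibration dataset.
params = np.array([-1.0, -0.25, 0.0, 0.5, 1.55], dtype=np.float32)
scale, offset = calc_scale(params, bitwidth=8)
restored = dequantize(quantize(params, scale, offset), scale, offset)
```

For an 8-bit target, the scale here is the value range divided by 255, so a narrower calibrated range yields a finer quantization resolution; the maximum round-trip error is half of one scale step.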

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This is a continuation of U.S. patent application Ser. No. 19/035,231, filed on Jan. 23, 2025, which is a continuation-in-part of U.S. patent application Ser. No. 18/603,346, filed on Mar. 13, 2024, which claims priority to Republic of Korea Patent Application No. 10-2024-0016687, filed on Feb. 2, 2024, Republic of Korea Patent Application No. 10-2024-0048548, filed on Apr. 11, 2024, Republic of Korea Patent Application No. 10-2024-0102154, filed on Jul. 31, 2024, and Republic of Korea Patent Application No. 10-2024-0128002, filed on Sep. 23, 2024, all of which are incorporated herein by reference in their entirety.

BACKGROUND OF THE DISCLOSURE

Technical Field

The present disclosure relates to techniques for improving neural network models operating on low-power neural processing units in edge devices.

Background Art

The human brain is made up of a vast number of nerve cells called neurons. Each neuron is connected to hundreds or thousands of other neurons through connections called synapses. A system that mimics human intelligence by modeling the behavior of biological neurons and the connections between them is called a neural network (NN) model. In other words, a neural network is a system of nodes that mimic neurons, connected in a layered structure. Neural network models are categorized as "single-layer" or "multi-layer" based on the number of layers. A typical multi-layer neural network consists of an input layer, a hidden layer, and an output layer. The input layer receives external data, and the number of neurons in the input layer can correspond to the number of input variables. At least one hidden layer is located between the input and output layers; it receives signals from the input layer, extracts features, and passes them to the output layer. The output layer receives signals from the at least one hidden layer and outputs them to the outside world.
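The layered input-hidden-output structure described above can be sketched as a toy forward pass; the layer sizes, sigmoid activation, and random weights below are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def forward(x, weights, biases):
    """Forward pass through a toy multi-layer network: input -> hidden -> output.

    Each layer computes a weighted sum of its inputs and applies a sigmoid
    activation, mirroring the layered structure described in the text.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for w, b in zip(weights, biases):
        x = sigmoid(x @ w + b)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                                    # 4 input variables
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 2))]   # one hidden layer
biases = [np.zeros(8), np.zeros(2)]
y = forward(x, weights, biases)                                # 2 output neurons
```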
The input signals between neurons are multiplied by their respective connection strengths, each having a value between 0 and 1, and then summed; if the sum exceeds the neuron's threshold, the neuron is activated and produces an output value through an activation function. To realize higher levels of artificial intelligence, the number of hidden layers is increased, and the resulting network is called a deep neural network (DNN). There are many types of DNNs, but the convolutional neural network (CNN) is known to be well suited to extracting features of input data and identifying patterns in those features. A convolutional neural network functions similarly to the way the visual cortex of the human brain processes images and is known to be well suited for image processing. A convolutional neural network may include repeated convolution and pooling stages, and most of its computation time is taken up by convolution operations. Convolutional neural networks recognize objects by extracting the features of each channel's image with a matrix-like kernel and providing invariance to translation and distortion through pooling. In each channel, a feature map is obtained by convolving the input data with the kernel; an activation function such as the rectified linear unit (ReLU) is applied to generate an activation map for that channel, and pooling can then be applied. The neural network that actually classifies the pattern is located at the end of the feature-extraction network and is called the fully connected layer. In the computational processing of a convolutional neural network, most of the computation is performed through convolution or matrix operations.
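The convolution, ReLU activation, and pooling pipeline described above can be sketched in a few lines; this is a minimal single-channel illustration with an assumed 2x2 difference kernel and non-overlapping max pooling, not the patented computation.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution of a single-channel image with a small kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply-and-accumulate over the kernel window.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    # Rectified linear unit: clamp negative values to zero.
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling, discarding any ragged border."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=np.float32).reshape(6, 6)  # toy 6x6 input channel
kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])           # simple difference kernel
feature_map = relu(conv2d(image, kernel))              # activation map
pooled = max_pool(feature_map, size=2)                 # pooled activation map
```

Pooling here halves each spatial dimension, which is one way a network gains a degree of invariance to small translations of the input.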
With the development of AI inference capabilities, various electronic devices such as AI speakers, smartphones, smart refrigerators, VR devices, AR devices, AI CCTV, AI robot vacuum cleaners, tablets, laptops, self-driving cars, bipedal robots, quadrupedal robots, industrial robots, and the like provide inference services such as sound recognition, speech recognition, image recognition, object detection, driver drowsiness detection, danger-moment detection, and gesture detection using AI. With the recent development of deep learning technology, the performance of neural network inference services is improving through big-data-based learning. These inference services repeatedly train a neural network on a large amount of training data and infer various complex data through the trained neural network model. Accordingly, various services are provided to the above-mentioned electronic devices by utilizing neural network technology. In addition, in recent years, neural processing units (NPUs) have been developed to accelerate computation for artificial intelligence (AI). However, as the capabilities and accuracy required for inference services utilizing neural networks