US-12626132-B2 - Method and apparatus for compressing neural network model by using device characteristics
Abstract
Provided are a method and apparatus for compressing a neural network model by using device characteristics. The method includes: obtaining the neural network model that is executed by a device; adjusting a target number of output channels of a target layer included in the neural network model, based on an arithmetic intensity obtained from a roofline model and a latency characteristic of a staircase pattern of the device; and compressing the neural network model such that the number of output channels of the target layer is equal to the adjusted target number of output channels.
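For orientation, the sketch below shows how the device-side quantity named in the abstract, the arithmetic intensity that a roofline model compares against a device's compute and memory ceilings, might be computed for a convolutional layer. This is a minimal illustration under standard roofline definitions; the function names, the 4-byte element size, and the stride-1 activation approximation are assumptions, not taken from the patent.

```python
# Hypothetical sketch of the roofline quantities; names, 4-byte elements,
# and the stride-1 activation-size approximation are assumptions.

def conv2d_arithmetic_intensity(c_in, c_out, k, h_out, w_out, bytes_per_elem=4):
    """FLOPs per byte of memory traffic for a k x k convolution (stride 1 assumed)."""
    flops = 2 * c_in * c_out * k * k * h_out * w_out             # multiply-accumulates
    weight_bytes = c_in * c_out * k * k * bytes_per_elem         # weights read once
    act_bytes = (c_in + c_out) * h_out * w_out * bytes_per_elem  # input + output maps
    return flops / (weight_bytes + act_bytes)

def roofline_attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Attainable throughput: the lower of the compute and memory rooflines."""
    return min(peak_flops, intensity * peak_bandwidth)
```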
Inventors
- Shin Kook Choi
- Jun Kyeong Choi
Assignees
- NOTA, INC.
Dates
- Publication Date: 2026-05-12
- Application Date: 2022-11-18
- Priority Date: 2022-10-31
Claims (11)
- 1. A method of compressing a neural network model, which is performed by a computing device, by using device characteristics, the method comprising: obtaining the neural network model that is executed by a device; adjusting a target number of output channels of a target layer included in the neural network model, based on an arithmetic intensity obtained from a roofline model and a latency characteristic of a staircase pattern of the device; and compressing the neural network model by using at least one of channel pruning and filter decomposition on the target layer to be compressed such that a number of output channels of the target layer is equal to the adjusted target number of output channels; wherein the adjusting comprises: comparing an arithmetic intensity of the target layer with an arithmetic intensity of a next layer which is connected to the target layer; selecting, as a reference layer, the layer having the higher arithmetic intensity based on a result of the comparison; and adjusting the target number of output channels based on a latency characteristic of the selected reference layer, wherein in a case in which the selected reference layer is the target layer, the adjusting is based on a first latency characteristic for the output channels of the target layer, and in a case in which the reference layer is the next layer, the adjusting is based on a second latency characteristic for the input channels of the next layer. (A sketch of this selection step appears after the claims.)
- 2. The method of claim 1, wherein the latency characteristic of the reference layer is one of a latency characteristic for the number of output channels of the target layer and a latency characteristic for the number of input channels of the next layer, and wherein the target number of output channels is equal to the number of input channels of the next layer.
- 3. The method of claim 1, wherein the adjusting further comprises: determining, as a reference step, a step corresponding to the target number of output channels from among a plurality of steps of the latency characteristic related to the reference layer; and adjusting the target number of output channels based on any one of a largest number of channels of the reference step and a largest number of channels of a previous step having a lower latency than the latency of the reference step; wherein each of the plurality of steps is a section of channel counts, spanning a step size, that correspond to the same latency.
- 4. The method of claim 1, wherein the adjusting further comprises: determining, as a reference step, a step corresponding to the target number of output channels from among a plurality of steps of the latency characteristic related to the reference layer; and adjusting the target number of output channels to be the number of channels that minimizes the adjustment to the target number of output channels, from among a largest number of channels of the reference step and a largest number of channels of a previous step having a lower latency than the latency of the reference step.
- 5. The method of claim 1, wherein the adjusting further comprises: adjusting the target number of output channels to be a number of channels that maximizes the performance of the neural network model among a plurality of steps of the latency characteristic related to the reference layer; adjusting the target number of output channels to be a number of channels that reduces the latency of the neural network model among the plurality of steps; or adjusting the target number of output channels to be the number of channels that minimizes the adjustment to the target number of output channels, from among a number of channels that maximizes the performance of the neural network model and a number of channels that reduces the latency of the neural network model, among the plurality of steps, wherein each of the plurality of steps is a section of channel counts, spanning a step size, that correspond to the same latency. (The staircase-snapping sketch after the claims illustrates these three options.)
- 6. A method of compressing a neural network model, which is performed by a computing device, by using device characteristics, the method comprising: obtaining the neural network model that is executed by a device; adjusting a target number of output channels of a target sub-layer of a layer to be compressed from among layers included in the neural network model, based on an arithmetic intensity obtained from a roofline model and a latency characteristic of a staircase pattern of the device; and compressing the neural network model by decomposing the layer to be compressed such that the number of output channels of the target sub-layer is equal to the adjusted target number of output channels; wherein the target sub-layer is one of a plurality of sub-layers generated by decomposing the layer to be compressed; and wherein the adjusting comprises: comparing an arithmetic intensity of the target sub-layer with an arithmetic intensity of a next sub-layer which is connected to the target sub-layer; selecting, as a reference sub-layer, the sub-layer having the higher arithmetic intensity based on a result of the comparison; and adjusting the target number of output channels based on a latency characteristic of the selected reference sub-layer, wherein in a case in which the selected reference sub-layer is the target sub-layer, the adjusting is based on a first latency characteristic for the output channels of the target sub-layer, and in a case in which the reference sub-layer is the next sub-layer, the adjusting is based on a second latency characteristic for the input channels of the next sub-layer. (See the decomposition sketch after the claims.)
- 7. The method of claim 6, wherein the latency characteristic of the reference sub-layer is one of the latency characteristic for the number of output channels of the target sub-layer and the latency characteristic for the number of input channels of the next sub-layer, and wherein the target number of output channels is equal to the number of input channels of the next sub-layer.
- 8. The method of claim 6, wherein the adjusting further comprises: determining, as a reference step, a step corresponding to the target number of output channels from among a plurality of steps of the latency characteristic related to the reference sub-layer; and adjusting the target number of output channels based on any one of a largest number of channels of the reference step and a largest number of channels of a previous step having a lower latency than the latency of the reference step; wherein each of the plurality of steps is a section of channel counts, spanning a step size, that correspond to the same latency.
- 9. The method of claim 6, wherein the adjusting further comprises: determining, as a reference step, a step corresponding to the target number of output channels from among a plurality of steps of the latency characteristic related to the reference sub-layer; and adjusting the target number of output channels to be the number of channels that minimizes the adjustment to the target number of output channels, from among a largest number of channels of the reference step and a largest number of channels of a previous step having a lower latency than the latency of the reference step.
- 10. The method of claim 6, wherein the adjusting further comprises: adjusting the target number of output channels to be a number of channels that maximizes the performance of the neural network model among a plurality of steps of the latency characteristic related to the reference sub-layer; adjusting the target number of output channels to be a number of channels that reduces the latency of the neural network model among the plurality of steps; or adjusting the target number of output channels to be the number of channels that minimizes the adjustment to the target number of output channels, from among a number of channels that maximizes the performance of the neural network model and a number of channels that reduces the latency of the neural network model, among the plurality of steps, wherein each of the plurality of steps is a section of channel counts, spanning a step size, that correspond to the same latency.
- 11. An apparatus for performing, by a computing device, compression of a neural network model by using device characteristics, the apparatus comprising: a memory storing at least one program; and at least one processor configured to execute the neural network model by executing the at least one program, wherein the at least one processor is further configured to: obtain the neural network model that is executed by a device; adjust a target number of output channels of a target layer included in the neural network model, based on an arithmetic intensity obtained from a roofline model and a latency characteristic of a staircase pattern of the device; compare an arithmetic intensity of the target layer with an arithmetic intensity of a next layer which is connected to the target layer; select, as a reference layer, the layer having the higher arithmetic intensity based on a result of the comparison; adjust the target number of output channels of the target layer based on a latency characteristic of the selected reference layer, wherein in a case in which the selected reference layer is the target layer, the adjusting is based on a first latency characteristic for the output channels of the target layer, and in a case in which the reference layer is the next layer, the adjusting is based on a second latency characteristic for the input channels of the next layer; and compress the neural network model by using at least one of channel pruning and filter decomposition on the target layer to be compressed such that a number of output channels of the target layer is equal to the adjusted target number of output channels.
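Claim 1's adjustment compares the arithmetic intensity of the target layer with that of the next connected layer, then uses the latency characteristic of whichever is more compute-intense. Below is a minimal sketch of that selection, assuming each layer carries its arithmetic intensity and per-channel-count latency tables; the Layer record, its field names, and the injected snap function (one is sketched next) are hypothetical, not the patent's data structures.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Layer:
    name: str
    arithmetic_intensity: float    # FLOPs per byte, e.g. from a roofline model
    out_latency: Dict[int, float]  # measured latency vs. number of output channels
    in_latency: Dict[int, float]   # measured latency vs. number of input channels

def adjust_target_channels(target: Layer, next_layer: Layer,
                           target_channels: int,
                           snap: Callable[[Dict[int, float], int], int]) -> int:
    """Pick the reference layer by comparing arithmetic intensities (claim 1),
    then adjust the target output-channel count with that layer's latency table."""
    if target.arithmetic_intensity >= next_layer.arithmetic_intensity:
        # Target layer is the reference: use its output-channel latency characteristic.
        table = target.out_latency
    else:
        # Next layer is the reference: its input channels equal the target layer's
        # output channels, so use its input-channel latency characteristic.
        table = next_layer.in_latency
    return snap(table, target_channels)
```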
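Claims 3 to 5 exploit the staircase shape of the latency characteristic: channel counts within one step cost the same latency, so the largest count of the reference step maximizes model capacity at no extra latency, while the largest count of the previous step lowers latency. The sketch below implements the three policies of claim 5, assuming the latency table covers the target count; the policy names are illustrative.

```python
def snap_to_staircase(latency_by_channels, target, policy="min_change"):
    """Snap a target channel count to a staircase-step boundary.

    Assumes a staircase characteristic (monotone latency, counts within one
    step share a latency) and that `target` does not exceed the measured range.
    """
    counts = sorted(latency_by_channels)
    # Latency of the step containing the target count (the "reference step").
    step_latency = latency_by_channels[min(c for c in counts if c >= target)]
    top_of_step = max(c for c in counts
                      if latency_by_channels[c] == step_latency)
    cheaper = [c for c in counts if latency_by_channels[c] < step_latency]
    top_of_prev = max(cheaper) if cheaper else top_of_step

    if policy == "max_performance":
        return top_of_step          # most channels at the same latency
    if policy == "reduce_latency":
        return top_of_prev          # largest count of the cheaper previous step
    # "min_change": the option that minimizes the adjustment itself (claims 4, 9).
    return min((top_of_step, top_of_prev), key=lambda c: abs(c - target))
```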
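Claims 6 to 10 apply the same adjustment to the sub-layers produced by filter decomposition. One common decomposition, shown below in a PyTorch-flavored sketch, splits a convolution into a pointwise sub-layer and a full-kernel sub-layer joined by an intermediate width; that width is the target sub-layer's output-channel count and would be snapped to a staircase boundary before decomposing. This particular factorization, and the omitted weight fitting (e.g. by truncated SVD), are assumptions rather than the patent's specified scheme.

```python
import torch.nn as nn

def decompose_conv(conv: nn.Conv2d, rank: int) -> nn.Sequential:
    """Replace one convolution with two sub-layers sharing `rank` intermediate
    channels; `rank` plays the role of the target sub-layer's output-channel
    count in claims 6-10. Fitting the two weight tensors to approximate the
    original filter is omitted here."""
    first = nn.Conv2d(conv.in_channels, rank, kernel_size=1, bias=False)
    second = nn.Conv2d(rank, conv.out_channels,
                       kernel_size=conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    return nn.Sequential(first, second)
```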
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application is based on and claims priority under 35 U.S.C. § 119 to Korean Patent Application No. 10-2022-0142354, filed on Oct. 31, 2022, in the Korean Intellectual Property Office, the disclosure of which is incorporated by reference herein in its entirety.

BACKGROUND

1. Field

The present disclosure provides a method and apparatus for compressing a neural network model by using device characteristics.

2. Description of the Related Art

Convolutional neural networks (CNNs) are models that extract a feature map by using a plurality of convolutional layers and reduce its dimensionality through subsampling to retain only the important parts of the feature map. CNNs are essential to various tasks, such as image classification, object detection, and image segmentation. However, because a CNN relies on numerous model parameters and computations, improving its performance results in a larger model size, a greater amount of computation, and a larger memory footprint. It is therefore difficult to use CNNs on devices with limited computational performance, such as mobile devices, autonomous vehicles, or edge computing devices.

To address this issue, pruning techniques are used to reduce the size of a neural network model by removing unnecessary parameters. A weight pruning technique may achieve a significantly high compression ratio by removing weights with low importance in a filter, but it creates unstructured sparsity; thus, computing speed in general-purpose device environments, such as central processing units (CPUs) or graphics processing units (GPUs), may improve only to a limited extent or even deteriorate. A filter pruning technique removes a filter of a convolutional layer and changes only the dimensionality of the weight tensor; it is thus suitable for general-purpose devices and enables actual inference acceleration without special software or device support. In addition, it may be easily applied to various CNN models, and thus has high scalability and compatibility. Many studies have built on these advantages of filter pruning, and among them, automatic channel pruning methods find the optimal structure of a pruned network by finding the optimal number of channels in each layer, rather than selecting channels to be pruned manually or heuristically.

The related art described above is technical information that the inventor(s) of the present disclosure achieved to derive the present disclosure, or achieved during its derivation, and thus cannot be considered to have been published to the public before the filing of the present disclosure.

SUMMARY

The present disclosure provides a method and apparatus for compressing a neural network model by using device characteristics. Technical objects of the present disclosure are not limited to the foregoing, and other unmentioned objects or advantages of the present disclosure would be understood from the following description and be more clearly understood from the embodiments of the present disclosure. In addition, it would be appreciated that the objects and advantages of the present disclosure can be implemented by the means provided in the claims and combinations thereof. Additional aspects will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of the presented embodiments of the present disclosure.
A first aspect of the present disclosure may provide a method of compressing a neural network model, which is performed by a computing device, by using device characteristics, the method including: obtaining a target number of output channels of a target layer to be compressed, among layers included in the neural network model that is executed by a device; adjusting the target number of output channels to meet a certain purpose, based on at least one of a first latency characteristic for output channels of the target layer and a second latency characteristic for input channels of a next layer, which is connected to the target layer, according to the device characteristics of the device; and compressing the neural network model such that a number of output channels of the target layer is equal to the adjusted target number of output channels.

A second aspect of the present disclosure may provide an apparatus for compressing a neural network model by using device characteristics, the apparatus including: a memory storing at least one program; and at least one processor configured to execute the neural network model by executing the at least one program, wherein the at least one processor is further configured to: obtain a target number of output channels of a target layer to be compressed, among layers included in the neural network model that is executed by a device; adjust the target number of output channels to meet a certain purpose, based on at least one of a first latency characteristic for output channels of the target layer and a second latency characteristic for input channels of a next layer, which is connected to the target layer, according to the device characteristics of the device; and compress the neural network model such that a number of output channels of the target layer is equal to the adjusted target number of output channels.
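Tying the aspects together, a compression pass per the first aspect could iterate over connected layer pairs, derive a raw target width from a global ratio, and adjust it against the device's latency staircase before pruning. The loop below reuses the hypothetical helpers sketched after the claims; treating the largest measured channel count as a layer's current width, and pruning by filter importance afterwards, are assumptions for illustration only.

```python
def compress_model(layers, compression_ratio):
    """End-to-end sketch of the first aspect: derive a raw target
    output-channel count per layer, adjust it with the device's staircase
    latency characteristic, and return the adjusted widths to prune to.
    `layers` are the Layer records from the earlier sketch; the largest
    measured channel count stands in for a layer's current width."""
    widths = {}
    for target, nxt in zip(layers, layers[1:]):
        current_width = max(target.out_latency)
        raw_target = max(1, int(current_width * compression_ratio))
        widths[target.name] = adjust_target_channels(
            target, nxt, raw_target, snap_to_staircase)
    # A real implementation would now remove filters (e.g. lowest L1-norm
    # first) until each layer matches widths[name], and shrink the next
    # layer's input channels to match.
    return widths
```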