
US-20260127430-A1 - DYNAMIC QUANTIZATION FOR ENERGY EFFICIENT DEEP LEARNING


Abstract

A method performed by a deep neural network (DNN) includes receiving, at a layer of the DNN during an inference stage, a layer input comprising content associated with a DNN input received at the DNN. The method also includes quantizing one or more parameters of a plurality of parameters associated with the layer based on the content of the layer input. The method further includes performing a task corresponding to the DNN input, the task performed with the one or more quantized parameters.

Inventors

  • Randy ARDYWIBOWO
  • Venkata Ravi Kiran Dayana
  • Hau Hwang

Assignees

  • QUALCOMM INCORPORATED

Dates

Publication Date
May 7, 2026
Application Date
Nov. 4, 2025

Claims (20)

  1. A method performed by a deep neural network (DNN), comprising: receiving, at a layer of the DNN during an inference stage, a layer input comprising content associated with a DNN input received at the DNN; quantizing one or more parameters of a plurality of parameters associated with the layer based on the content of the layer input; and performing a task corresponding to the DNN input, the task performed with the one or more quantized parameters.
  2. The method of claim 1, in which the plurality of parameters comprise a set of weights and a set of activations.
  3. The method of claim 2, in which quantizing the one or more parameters comprises quantizing one or both of the respective set of weights or the respective set of activations of one or more output channels associated with the layer.
  4. The method of claim 1, in which a first amount of quantization of a first parameter of the plurality of parameters is different than a second amount of quantization of a second parameter of the plurality of parameters.
  5. The method of claim 1, in which quantizing the one or more parameters comprises generating an adjusted bit-width by adjusting a size of an original bit-width associated with the one or more parameters.
  6. The method of claim 5, in which generating the adjusted bit-width comprises discarding bits of the original bit-width from least significant bits to most significant bits until the size of the original bit-width equals a size of the adjusted bit-width determined based on the content of the layer input.
  7. The method of claim 6, further comprising training the DNN to determine the size for adjusting the original bit-width based on a total loss, the total loss being a function of a performance loss and a regularization loss.
  8. The method of claim 7, in which the performance loss comprises a cross-entropy loss or a mean-squared error.
  9. The method of claim 8, in which the regularization loss is a bitwise L0 regularization loss that penalizes one of: the adjusted bit-width and a complexity metric associated with a bit-level operation; or a number of bits allocated to the adjusted bit-width.
  10. The method of claim 9, in which the complexity metric comprises one or more of a number of binary operations of the DNN, a memory footprint of the DNN, or a compute power of the DNN.
  11. The method of claim 9, further comprising: reformulating the bitwise L0 regularization loss as a Bernoulli distribution; relaxing the reformulated bitwise L0 regularization loss based on a sigmoid function; and minimizing the performance loss and the regularization loss based on the number of bits selected for the adjusted bit-width.
  12. The method of claim 1, in which the layer is one layer of a plurality of layers of the DNN, and the method further comprises quantizing one or more parameters of a respective plurality of parameters of each layer of the plurality of layers based on content of a respective layer input.
  13. The method of claim 12, in which a quantization amount is different for each layer of the plurality of layers.
  14. An apparatus for implementing a deep neural network (DNN), comprising: a processor; a memory coupled with the processor; and instructions stored in the memory and operable, when executed by the processor, to cause the apparatus to: receive, at a layer of the DNN during an inference stage, a layer input comprising content associated with a DNN input received at the DNN; quantize one or more parameters of a plurality of parameters associated with the layer based on the content of the layer input; and perform a task corresponding to the DNN input, the task performed with the one or more quantized parameters.
  15. The apparatus of claim 14, in which the plurality of parameters comprise a set of weights and a set of activations.
  16. The apparatus of claim 15, in which the instructions further cause the apparatus to quantize the one or more parameters by quantizing one or both of the respective set of weights or the respective set of activations of one or more output channels associated with the layer.
  17. The apparatus of claim 14, in which a first amount of quantization of a first parameter of the plurality of parameters is different than a second amount of quantization of a second parameter of the plurality of parameters.
  18. The apparatus of claim 14, in which the instructions further cause the apparatus to quantize the one or more parameters by generating an adjusted bit-width by adjusting a size of an original bit-width associated with the one or more parameters.
  19. The apparatus of claim 18, in which the instructions further cause the apparatus to generate the adjusted bit-width by discarding bits of the original bit-width from least significant bits to most significant bits until the size of the original bit-width equals a size of the adjusted bit-width determined based on the content of the layer input.
  20. The apparatus of claim 19, in which the instructions further cause the apparatus to determine, during a training stage, the size for adjusting the original bit-width based on a total loss, the total loss being a function of a performance loss and a regularization loss.
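
For illustration, here is a minimal PyTorch sketch of the content-dependent quantization recited in claims 1-6: a small predictor derives a bit-width from the content of the layer input, and weights are quantized by discarding least significant bits until the adjusted width is reached. The names BitWidthPredictor and quantize_to_bits, the pooling-based predictor, and the 8-bit original width are illustrative assumptions, not details taken from the patent.

```python
# Hypothetical sketch of claims 1-6; names and architecture are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitWidthPredictor(nn.Module):
    """Maps a summary of the layer input's content to a bit-width."""
    def __init__(self, in_channels: int, max_bits: int = 8):
        super().__init__()
        self.max_bits = max_bits
        self.fc = nn.Linear(in_channels, max_bits)  # scores candidate widths

    def forward(self, x: torch.Tensor) -> int:
        pooled = x.mean(dim=(2, 3))                # (N, C) content summary
        scores = self.fc(pooled).mean(dim=0)       # one score per candidate width
        return int(scores.argmax().item()) + 1     # chosen width in [1, max_bits]

def quantize_to_bits(w: torch.Tensor, bits: int, total_bits: int = 8) -> torch.Tensor:
    """Quantize to total_bits integers, then discard LSBs down to bits."""
    scale = w.abs().max() / (2 ** (total_bits - 1) - 1)
    ints = torch.round(w / scale).to(torch.int32)
    shift = total_bits - bits                      # number of LSBs to discard
    ints = (ints >> shift) << shift                # zero the discarded LSBs
    return ints.to(w.dtype) * scale

# Usage: the quantization amount depends on the content of the layer input,
# so different inputs (and different layers) can receive different widths.
layer = nn.Conv2d(16, 32, kernel_size=3, padding=1)
predictor = BitWidthPredictor(in_channels=16)
x = torch.randn(1, 16, 28, 28)                     # layer input
bits = predictor(x)                                # content-dependent bit-width
q_weight = quantize_to_bits(layer.weight.data, bits)
y = F.conv2d(x, q_weight, layer.bias, padding=1)
```

Because the predictor runs per input, the same layer can execute at a different precision for each input, which is what makes the quantization dynamic rather than fixed at compile time.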

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 17/488,261, filed on Sep. 28, 2021, and titled “DYNAMIC QUANTIZATION FOR ENERGY EFFICIENT DEEP LEARNING,” which claims the benefit of U.S. Provisional Patent Application No. 63/084,902, filed on Sep. 29, 2020, and titled “DYNAMIC QUANTIZATION FOR ENERGY EFFICIENT DEEP LEARNING,” the disclosures of which are expressly incorporated by reference in their entireties.

BACKGROUND

Field

Aspects of the present disclosure generally relate to dynamic quantization for energy efficient deep learning neural networks.

Background

Convolutional neural networks, such as deep convolutional neural networks (DCNNs), may use a large amount of computational and storage resources. As such, it may be difficult to deploy conventional neural networks on systems with limited resources, such as cloud systems, embedded systems, or federated learning systems. Some conventional neural networks are pruned and/or quantized to reduce processor load and/or memory use. It is desirable to improve quantization methods to reduce computational and storage resources.

SUMMARY

In one aspect of the present disclosure, a method performed by a deep neural network (DNN) is disclosed. The method includes receiving, at a layer of the DNN during an inference stage, a layer input comprising content associated with a DNN input received at the DNN. The method also includes quantizing one or more parameters of a plurality of parameters associated with the layer based on the content of the layer input. The method further includes performing a task corresponding to the DNN input, the task performed with the one or more quantized parameters.

Another aspect of the present disclosure is directed to an apparatus including means for receiving, at a layer of a DNN during an inference stage, a layer input comprising content associated with a DNN input received at the DNN. The apparatus also includes means for quantizing one or more parameters of a plurality of parameters associated with the layer based on the content of the layer input. The apparatus further includes means for performing a task corresponding to the DNN input, the task performed with the one or more quantized parameters.

In another aspect of the present disclosure, a non-transitory computer-readable medium with non-transitory program code recorded thereon is disclosed. The program code is for a DNN. The program code is executed by a processor and includes program code to receive, at a layer of the DNN during an inference stage, a layer input comprising content associated with a DNN input received at the DNN. The program code also includes program code to quantize one or more parameters of a plurality of parameters associated with the layer based on the content of the layer input. The program code further includes program code to perform a task corresponding to the DNN input, the task performed with the one or more quantized parameters.

Another aspect of the present disclosure is directed to an apparatus having a memory, one or more processors coupled to the memory, and instructions stored in the memory. The instructions are operable, when executed by the one or more processors, to cause the apparatus to receive, at a layer of a DNN during an inference stage, a layer input comprising content associated with a DNN input received at the DNN. The instructions also cause the apparatus to quantize one or more parameters of a plurality of parameters associated with the layer based on the content of the layer input. The instructions additionally cause the apparatus to perform a task corresponding to the DNN input, the task performed with the one or more quantized parameters.

Aspects generally include a method, apparatus, system, computer program product, non-transitory computer-readable medium, user equipment, base station, wireless communication device, and processing system as substantially described with reference to and as illustrated by the accompanying drawings and specification.

The foregoing has outlined rather broadly the features and technical advantages of examples according to the disclosure in order that the detailed description that follows may be better understood. Additional features and advantages will be described hereinafter. The conception and specific examples disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. Such equivalent constructions do not depart from the scope of the appended claims. Characteristics of the concepts disclosed, both their organization and method of operation, together with associated advantages, will be better understood from the following description when considered in connection with the accompanying figures. Each of the figures is provided for the purposes of illustration and description, and not as a definition of the limits of the claims.
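
The training objective recited in claims 7-11 combines a performance loss (cross-entropy or mean-squared error) with a bitwise L0 regularization loss; reformulating the bit allocation as Bernoulli gates and relaxing them with a sigmoid makes the expected bit count differentiable. The sketch below is a hedged illustration under those assumptions; the gate parameterization (bit_logits) and the weighting factor (l0_weight) are hypothetical names, not taken from the specification.

```python
# Hypothetical sketch of the total loss in claims 7-11.
import torch
import torch.nn.functional as F

def bitwise_l0_loss(bit_logits: torch.Tensor) -> torch.Tensor:
    """Expected number of kept bits under sigmoid-relaxed Bernoulli gates."""
    keep_probs = torch.sigmoid(bit_logits)   # P(bit b is kept), differentiable
    return keep_probs.sum()                  # expected bits allocated

def total_loss(logits: torch.Tensor, targets: torch.Tensor,
               bit_logits: torch.Tensor, l0_weight: float = 1e-3) -> torch.Tensor:
    performance = F.cross_entropy(logits, targets)  # or F.mse_loss for regression
    regularization = bitwise_l0_loss(bit_logits)    # penalizes the bit budget
    return performance + l0_weight * regularization

# Usage: one trainable gate per candidate bit of the adjusted bit-width.
bit_logits = torch.zeros(8, requires_grad=True)
logits = torch.randn(4, 10, requires_grad=True)  # stand-in network outputs
targets = torch.randint(0, 10, (4,))
loss = total_loss(logits, targets, bit_logits)
loss.backward()  # gradients reach the bit gates through the sigmoid relaxation
```

Minimizing this total loss trades task accuracy against the number of bits retained, which is how training can determine the size used to adjust the original bit-width.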