US-20260127421-A1 - QUANTIZATION FOR NEURAL NETWORK

Abstract

A described example relates to a processor-implemented method that includes receiving a set of input data to a neural network. The method also includes selecting, from multiple sets of quantization scales for the neural network, a set of quantization scales for the neural network and the set of input data. Each set of the multiple sets of quantization scales is stored in memory and is associated with a respective input data cluster of multiple input data clusters. The method also includes performing an inferencing operation on the set of input data using the neural network with the selected set of quantization scales.
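As a rough illustrative sketch only (the cluster centroids, scale values, and function names below are hypothetical, not taken from the patent), the selection step described in the abstract — classify the input into a cluster, then use that cluster's stored set of quantization scales — might look like:

```python
import numpy as np

# Hypothetical per-cluster quantization scales, one scale per network layer.
# Values and cluster layout are illustrative, not from the patent.
CLUSTER_SCALES = {
    0: [0.05, 0.02, 0.08],  # scales for the layers, input data cluster 0
    1: [0.10, 0.04, 0.12],  # scales for the layers, input data cluster 1
}
# Feature-space centroids of the input data clusters (assumed precomputed).
CLUSTER_CENTROIDS = np.array([[0.0, 0.0], [5.0, 5.0]])

def classify_input(features: np.ndarray) -> int:
    """Assign the input's extracted features to the nearest cluster centroid."""
    dists = np.linalg.norm(CLUSTER_CENTROIDS - features, axis=1)
    return int(np.argmin(dists))

def select_scales(features: np.ndarray) -> list:
    """Select the set of quantization scales stored for the input's cluster."""
    return CLUSTER_SCALES[classify_input(features)]
```

The inference step would then run the network with the returned scales loaded into its layers.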

Inventors

  • Arthur Redfern
  • John Robertson

Assignees

  • TEXAS INSTRUMENTS INCORPORATED

Dates

Publication Date
2026-05-07
Application Date
2025-10-31

Claims (20)

  1. A processor-implemented method comprising: receiving a set of input data to a neural network; selecting, from multiple sets of quantization scales for the neural network, a set of quantization scales for the neural network and the set of input data, wherein each set of the multiple sets of quantization scales is stored in memory and is associated with a respective input data cluster of multiple input data clusters; and performing an inferencing operation on the set of input data using the neural network with the selected set of quantization scales.
  2. The processor-implemented method of claim 1, wherein selecting the set of quantization scales for the neural network comprises: extracting, from the set of input data, one or more features of the set of input data; and classifying the set of input data into a first input data cluster of the multiple input data clusters based on at least a portion of the one or more features of the set of input data, wherein the selected set of quantization scales is stored in the memory for the first input data cluster.
  3. The processor-implemented method of claim 2, wherein the neural network is a first neural network, and classifying the set of input data comprises: predicting, by a second neural network, that the set of input data belongs to the first input data cluster of the multiple input data clusters more than any other input data clusters of the multiple input data clusters, wherein the second neural network is trained to predict which of the multiple input data clusters a respective set of input data is associated with.
  4. The processor-implemented method of claim 2, wherein the neural network includes a first network portion and a second network portion, the first network portion comprises one or more layers at an input of the neural network, the second network portion comprises a remaining portion of the neural network, and extracting the one or more features of the set of input data comprises: performing a first operation on the set of input data using the first network portion to generate an intermediate set of data that indicates the one or more features of the set of input data, wherein the first input data cluster of the multiple input data clusters is selected based on the intermediate set of data.
  5. The processor-implemented method of claim 4, wherein performing the inferencing operation comprises performing a second operation on the intermediate set of data using the second network portion with the selected set of quantization scales for layers of the second network portion.
  6. The processor-implemented method of claim 4, further comprising loading a predetermined set of one or more quantization scales for the one or more layers of the first network portion, wherein the first operation is performed on the set of input data using the first network portion with quantization based on the predetermined set of one or more quantization scales.
  7. The processor-implemented method of claim 1, further comprising loading the selected set of quantization scales from the memory to layers of the neural network before performing the inferencing operation.
  8. The processor-implemented method of claim 1, wherein the multiple sets of quantization scales include a respective set of quantization scales that has been determined for each input data cluster of the multiple input data clusters based on results of processing multiple sets of training data using the neural network.
  9. The processor-implemented method of claim 1, wherein performing the inferencing operation on the set of input data comprises quantizing an output data set of a respective layer of the neural network from floating-point values to integer values having a reduced number of bits based on at least one quantization scale of the selected set of quantization scales for the respective layer of the neural network.
  10. The processor-implemented method of claim 9, wherein the quantization of the output data set of the respective layer comprises an asymmetric quantization or a symmetric quantization.
  11. The processor-implemented method of claim 1, wherein the neural network comprises a set of instructions compiled for one or more processors and/or accelerators and stored in a non-transitory storage medium, wherein the set of instructions, when executed by the one or more processors and/or accelerators, cause the one or more processors and/or accelerators to perform the method of claim 1.
  12. An integrated circuit, comprising: one or more processors; and memory storing data and instructions, wherein the data comprises multiple sets of quantization scales, and parameters of a neural network, and wherein the instructions, when executed by the one or more processors, cause the one or more processors to: provide a set of input data to an input layer of the neural network; select, from the multiple sets of quantization scales, a set of quantization scales for the neural network and the set of input data, based on an analysis of the set of input data; and perform an inferencing operation on the set of input data using the neural network, the inferencing operation including quantization based on the selected set of quantization scales.
  13. The integrated circuit of claim 12, wherein the instructions further cause the one or more processors to: extract, from the set of input data, one or more features of the set of input data; and classify the set of input data into a first input data cluster of multiple input data clusters based on at least a portion of the one or more features of the set of input data, wherein each of the multiple sets of quantization scales is associated with a respective one of the multiple input data clusters.
  14. The integrated circuit of claim 13, wherein the neural network includes a first network portion and a second network portion, the first network portion comprises one or more layers, including the input layer, at an input of the neural network, and the second network portion comprises a remaining portion of the neural network, wherein the instructions further cause the one or more processors to: perform a first operation on the set of input data using the first network portion to generate an intermediate set of data that indicates or includes the one or more features of the set of input data, wherein the set of input data is classified into the first input data cluster based on the intermediate set of data; and load the selected set of quantization scales from the memory to respective layers of the second network portion.
  15. The integrated circuit of claim 14, wherein the instructions to perform the inferencing operation comprise instructions to perform operations on the intermediate set of input data using the second network portion with the selected set of quantization scales that are loaded to the respective layers of the second network portion.
  16. The integrated circuit of claim 15, wherein the instructions further cause the one or more processors to: load a predetermined set of one or more quantization scales to the one or more layers of the first network portion, wherein the first operation is performed on the set of input data using the first network portion with quantization based on the predetermined set of one or more quantization scales.
  17. The integrated circuit of claim 13, wherein the multiple sets of quantization scales include a respective set of quantization scales that has been determined for each input data cluster of the multiple input data clusters based on results of processing multiple sets of training data using the neural network.
  18. The integrated circuit of claim 12, wherein the one or more processors include an accelerator, and the neural network comprises a set of instructions compiled for the accelerator and stored in the memory.
  19. The integrated circuit of claim 12, wherein the instructions further cause the one or more processors to: quantize an output data set of a respective layer of the neural network from floating-point values to integer values having a reduced number of bits based on a quantization scale of the selected set of quantization scales for the respective layer of the neural network.
  20. The integrated circuit of claim 19, wherein the quantization of the output data set of the respective layer comprises an asymmetric quantization or a symmetric quantization.
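Claims 10 and 20 recite that the per-layer quantization may be asymmetric or symmetric. As a hedged sketch of the asymmetric variant only (the range-based scale/zero-point derivation below is a common convention, not necessarily the formulation the patent contemplates):

```python
import numpy as np

def quantize_asymmetric(x: np.ndarray, bits: int = 8):
    """Asymmetric quantization: derive a scale and zero-point from the data
    range, then map floating-point values onto unsigned integers."""
    qmin, qmax = 0, 2**bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin) if x_max > x_min else 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_asymmetric(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map unsigned integers back to approximate floating-point values."""
    return (q.astype(np.float32) - zero_point) * scale
```

The zero-point lets an asymmetric input range (e.g. post-ReLU activations) use the full integer range, at the cost of an extra offset term in the arithmetic.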

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of and priority to U.S. provisional patent application No. 63/715,687, filed on Nov. 4, 2024, and entitled “Quantization for Neural Network,” which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

This disclosure relates to machine learning models, such as neural networks, and, more specifically, to conditional quantization for neural networks or other machine learning models.

BACKGROUND

A neural network is a directed acyclic graph in which data flows on edges between nodes, and each node performs an operation. Floating-point computations and fixed-point computations may be employed when implementing a neural network. Converting a higher-precision floating-point representation to a lower-precision fixed-point representation via an affine transformation and a rounding operation is a process known as quantization. Quantization can allow the layers of the neural network to perform fixed-point computations, which can be converted (or dequantized) back to floating-point data at the output of the neural network. Existing methods of quantization, such as static or dynamic quantization, may result in accuracy loss and/or higher computational costs.

SUMMARY

One example relates to a processor-implemented method that includes receiving a set of input data to a neural network. The method also includes selecting, from multiple sets of quantization scales for the neural network, a set of quantization scales for the neural network and the set of input data. Each set of the multiple sets of quantization scales is stored in memory and is associated with a respective input data cluster of multiple input data clusters. The method also includes performing an inferencing operation on the set of input data using the neural network with the selected set of quantization scales. Another example relates to an integrated circuit that includes one or more processors and memory.
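The affine-transform-and-round process described in the background can be sketched for the symmetric case as follows (a minimal illustration with an illustrative scale value; the patent does not prescribe this exact formulation):

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, scale: float, bits: int = 8) -> np.ndarray:
    """Quantize floating-point values to signed integers by scaling, rounding,
    and clipping to the representable integer range."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map integer values back to approximate floating-point values."""
    return q.astype(np.float32) * scale

x = np.array([0.5, -1.2, 0.0], dtype=np.float32)
scale = 0.01  # illustrative value; in practice derived from the data range
q = quantize_symmetric(x, scale)
x_hat = dequantize(q, scale)
```

The round trip is lossy: the reconstruction error per element is bounded by roughly half a quantization step (plus clipping error at the range boundaries), which is why the choice of scale per layer matters for accuracy.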
The memory can store data and instructions, in which the data includes multiple sets of quantization scales, and parameters of a neural network. The instructions, when executed by the one or more processors, cause the one or more processors to provide a set of input data to an input layer of the neural network and select, from the multiple sets of quantization scales, a set of quantization scales for the neural network and the set of input data, based on an analysis of the set of input data. The instructions can further cause the one or more processors to perform an inferencing operation on the set of input data using the neural network, the inferencing operation including quantization based on the selected set of quantization scales. Yet another example relates to a processor-implemented method that includes providing multiple sets of input data to a neural network. The method also includes determining, for each set of input data of the multiple sets of input data, a respective set of quantization scales for layers of the neural network. The method also includes clustering the multiple sets of input data into multiple data clusters based on the respective sets of quantization scales for the multiple sets of input data. The method also includes determining, for each data cluster of the multiple data clusters, a respective set of cluster quantization scales that includes quantization scales for the layers of the neural network. The respective set of cluster quantization scales for each data cluster of the multiple data clusters can be stored in memory for use during inferencing.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an example of a neural network system that includes conditional quantization.
FIG. 2 is a block diagram of another example of a neural network system that includes conditional quantization.
FIG. 3 is a block diagram of an example of a semiconductor device that can implement a neural network that includes conditional quantization.
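The training-time method in the summary — determine a per-input vector of quantization scales, then cluster those vectors — could be sketched with a simple k-means (the use of k-means here is an assumption for illustration; the patent describes the clustering generically):

```python
import numpy as np

def kmeans_scale_clusters(scale_vectors: np.ndarray, k: int,
                          iters: int = 20, seed: int = 0):
    """Cluster per-input scale vectors (one scale per layer) into k clusters.
    Returns the cluster centroids (candidate cluster quantization scales)
    and the cluster label assigned to each input."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct input scale vectors.
    centroids = scale_vectors[rng.choice(len(scale_vectors), k, replace=False)]
    for _ in range(iters):
        # Assign each scale vector to its nearest centroid.
        dists = np.linalg.norm(scale_vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = scale_vectors[labels == j].mean(axis=0)
    return centroids, labels
```

Each centroid is then a candidate set of cluster quantization scales to store in memory, and the labels provide the training targets for a cluster predictor.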
FIG. 4 is a flow diagram illustrating an example of a method for processing a set of input data using a neural network that includes conditional quantization.
FIG. 5 is a flow diagram illustrating an example of a method to determine sets of quantization scales and train a cluster predictor for a neural network.
FIG. 6 is a graph illustrating part of an example neural network.
FIGS. 7-11 are tables demonstrating examples of data at various stages of the method of FIG. 5.
FIG. 12 is a flow diagram illustrating an example of a method for determining quantization scales for layers of a neural network.
FIG. 13 is a flow diagram illustrating an example of a clustering method that can be used for determining quantization scales for conditional quantization in a neural network.
FIG. 14 is a diagram illustrating an example of determining conditional quantization scales at a layer of a neural network that may be performed in the method of FIG. 13.
FIG. 15 is a table illustrating an example of clustering performed according to the meth