US-12619854-B2 - Neural network inference quantization

US 12619854 B2

Abstract

One or more computer processors, responsive to neural network run-time, reduce one or more sets of maximum activations along a hidden dimension respectively associated with one or more activation tensors and one or more layers of a neural network. The one or more computer processors compute an interquartile range (IQR) clip threshold for each reduced set for each sequence dimension in the neural network. The one or more computer processors clip one or more activations based on respective computed IQR clip thresholds. The one or more computer processors quantize the clipped activations.
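As a rough illustration of the pipeline the abstract describes, the following is a minimal NumPy sketch, not the patented implementation. The tensor shape, the symmetric signed 8-bit scheme, and the function name iqr_clip_quantize are assumptions made for illustration.

    import numpy as np

    def iqr_clip_quantize(activations):
        # activations: (sequence, hidden) activation matrix for one layer.
        # Reduce along the hidden dimension: one maximum per token (row).
        row_max = np.abs(activations).max(axis=1)

        # IQR-based clip threshold over the reduced set of row maximums.
        q1, q3 = np.percentile(row_max, [25, 75])
        clip = q3 + 1.5 * (q3 - q1)

        # Clip outlier activations, then quantize to signed 8-bit integers.
        clipped = np.clip(activations, -clip, clip)
        scale = clip / 127.0
        return np.round(clipped / scale).astype(np.int8), scale

    # Example: quantize a (sequence=4, hidden=8) activation tensor.
    x = np.random.randn(4, 8).astype(np.float32)
    q, scale = iqr_clip_quantize(x)

Because the threshold is derived from per-token maximums rather than the global maximum, a single outlier token no longer dictates the quantization scale for the whole tensor.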

Inventors

  • Yousef El-Kurdi
  • Jerome L. Quinn
  • Robert Todd Ward

Assignees

  • INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date
2026-05-05
Application Date
2022-02-23

Claims (11)

  1. A computer-implemented method comprising: responsive to neural network inference, reducing, by one or more computer processors, one or more sets of maximum activations along a hidden dimension respectively associated with one or more activation tensors and one or more layers of a quantized neural network; computing, by one or more computer processors, an interquartile range (IQR) clip threshold for each reduced set for each sequence dimension in the neural network; clipping, by one or more computer processors, one or more activations based on respective computed IQR clip thresholds into a plurality of quantized partitions; and responsive to a second feed-forward general matrix multiply (GEMM) operation utilizing a single instruction multiple data (SIMD) instruction set, applying, by one or more computer processors, the clipped and quantized activations to one or more input activations of the operation.
  2. The computer-implemented method of claim 1, wherein computing the IQR clip threshold for each reduced set for each sequence dimension in the neural network further comprises: row-sorting, by one or more computer processors, the one or more reduced sets of activations based on row-based maximum activations; and computing, by one or more computer processors, the IQR clip threshold as q3 + 1.5(q3 − q1), wherein q3 is a median value of 75% of row-wise outlier activations and q1 is a median value of 25% of row-wise outlier activations.
  3. The computer-implemented method of claim 2, further comprising: isolating, by one or more computer processors, activation outlier values and rows comprised in an activation matrix into a plurality of partitions utilizing the computed IQR clip threshold, wherein each partition in the plurality of partitions is quantized utilizing separate 8-bit quantization, reducing the sensitivity of the quantization scale to outlier activations.
  4. The computer-implemented method of claim 1, further comprising: initiating, by one or more computer processors, one or more natural language tasks utilizing the quantized neural network.
  5. A computer program product comprising: one or more computer readable storage media and program instructions stored on the one or more computer readable storage media, the stored program instructions comprising: program instructions to, responsive to neural network inference, reduce one or more sets of maximum activations along a hidden dimension respectively associated with one or more activation tensors and one or more layers of a quantized neural network; program instructions to compute an interquartile range (IQR) clip threshold for each reduced set for each sequence dimension in the neural network; program instructions to clip one or more activations based on respective computed IQR clip thresholds into a plurality of quantized partitions; and program instructions to, responsive to a second feed-forward general matrix multiply (GEMM) operation utilizing a single instruction multiple data (SIMD) instruction set, apply the clipped and quantized activations to one or more input activations of the operation.
  6. The computer program product of claim 5, wherein the program instructions to compute the IQR clip threshold for each reduced set for each sequence dimension in the neural network comprise: program instructions to row-sort the one or more reduced sets of activations based on row-based maximum activations; and program instructions to compute the IQR clip threshold as q3 + 1.5(q3 − q1), wherein q3 is a median value of 75% of row-wise outlier activations and q1 is a median value of 25% of row-wise outlier activations.
  7. The computer program product of claim 6, wherein the program instructions, stored on the one or more computer readable storage media, further comprise: program instructions to isolate activation outlier values and rows comprised in an activation matrix into a plurality of partitions utilizing the computed IQR clip threshold, wherein each partition in the plurality of partitions is quantized utilizing separate 8-bit quantization, reducing the sensitivity of the quantization scale to outlier activations.
  8. The computer program product of claim 5, wherein the program instructions, stored on the one or more computer readable storage media, further comprise: program instructions to initiate one or more natural language tasks utilizing the quantized neural network.
  9. A computer system comprising: one or more computer processors; one or more computer readable storage media; and program instructions stored on the computer readable storage media for execution by at least one of the one or more processors, the stored program instructions comprising: program instructions to, responsive to neural network inference, reduce one or more sets of maximum activations along a hidden dimension respectively associated with one or more activation tensors and one or more layers of a quantized neural network; program instructions to compute an interquartile range (IQR) clip threshold for each reduced set for each sequence dimension in the neural network; program instructions to clip one or more activations based on respective computed IQR clip thresholds into a plurality of quantized partitions; and program instructions to, responsive to a second feed-forward general matrix multiply (GEMM) operation utilizing a single instruction multiple data (SIMD) instruction set, apply the clipped and quantized activations to one or more input activations of the operation.
  10. The computer system of claim 9, wherein the program instructions to compute the IQR clip threshold for each reduced set for each sequence dimension in the neural network comprise: program instructions to row-sort the one or more reduced sets of activations based on row-based maximum activations; and program instructions to compute the IQR clip threshold as q3 + 1.5(q3 − q1), wherein q3 is a median value of 75% of row-wise outlier activations and q1 is a median value of 25% of row-wise outlier activations.
  11. The computer system of claim 10, wherein the program instructions, stored on the one or more computer readable storage media, further comprise: program instructions to isolate activation outlier values and rows comprised in an activation matrix into a plurality of partitions utilizing the computed IQR clip threshold, wherein each partition in the plurality of partitions is quantized utilizing separate 8-bit quantization, reducing the sensitivity of the quantization scale to outlier activations.
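A hedged sketch of the threshold and partitioning logic recited in claims 2 and 3: row maximums are sorted, the clip threshold is computed as q3 + 1.5(q3 − q1), and rows whose maximum exceeds the threshold are isolated into an outlier partition quantized with its own 8-bit scale. The function name, the symmetric scheme, and the epsilon guard are illustrative assumptions, not the patent's implementation.

    import numpy as np

    def partition_and_quantize(acts):
        # acts: (sequence, hidden) activation matrix for one layer.
        row_max = np.abs(acts).max(axis=1)
        sorted_max = np.sort(row_max)                  # row-sort (claim 2)
        q1, q3 = np.percentile(sorted_max, [25, 75])
        threshold = q3 + 1.5 * (q3 - q1)               # IQR clip threshold

        # Isolate outlier rows into their own partition (claim 3).
        outlier = row_max > threshold
        partitions = {}
        for name, mask in (("inlier", ~outlier), ("outlier", outlier)):
            block = acts[mask]
            if block.size == 0:
                continue
            # Separate 8-bit scale per partition, so outlier rows do not
            # inflate the scale used for the bulk of the activations.
            scale = max(np.abs(block).max(), 1e-8) / 127.0
            partitions[name] = (np.round(block / scale).astype(np.int8),
                                scale, mask)
        return partitions

Quantizing the two partitions with separate scales reflects the claims' stated goal: the inlier scale stays small even when a few rows carry very large activations.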

Description

BACKGROUND

The present invention relates generally to the field of machine learning, and more particularly to neural network inference quantization. Neural networks (NNs) are computing systems inspired by biological neural networks. NNs are not simply algorithms, but rather a framework for machine learning algorithms to process complex inputs. Such systems "learn" to perform tasks by considering training examples, generally without being programmed with any task-specific rules. NNs are based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. An artificial neuron receives a signal, can process it, and can then transmit the signal to the artificial neurons connected to it. In common NN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by a non-linear function of the sum of its inputs. Artificial neurons and edges typically have respective weights that adjust as learning proceeds; a weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that a signal is only sent if the aggregate signal crosses the threshold. Typically, artificial neurons are aggregated into layers, where a plurality of layers perform a plurality of data transformations on inputs.

SUMMARY

Embodiments of the present invention disclose a computer-implemented method, a computer program product, and a system. The computer-implemented method includes one or more computer processors, responsive to neural network run-time, reducing one or more sets of maximum activations along a hidden dimension respectively associated with one or more activation tensors and one or more layers of a neural network. The one or more computer processors compute an interquartile range (IQR) clip threshold for each reduced set for each sequence dimension in the neural network. The one or more computer processors clip one or more activations based on respective computed IQR clip thresholds. The one or more computer processors quantize the clipped activations.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a functional block diagram illustrating a computational environment, in accordance with an embodiment of the present invention; FIG. 2 is a flowchart depicting operational steps of a program, on a server computer within the computational environment of FIG. 1, for neural network quantization with token-maximums interquartile range clipping, in accordance with an embodiment of the present invention; FIG. 3 illustrates exemplary algorithmic steps of the program within the computational environment of FIG. 1, in accordance with an embodiment of the present invention; FIG. 4 is an exemplary table, in accordance with an embodiment of the present invention; FIG. 5 is an exemplary table, in accordance with an embodiment of the present invention; and FIG. 6 is a block diagram of components of the server computer, in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION

Transformer-based neural networks (NNs) such as Bidirectional Encoder Representations from Transformers (BERT) and XLM-R, pre-trained on large amounts of data, have led to state-of-the-art (SOTA) results on many natural language processing (NLP) tasks such as machine translation, text classification, and question answering. However, run-time inference of such large models is very costly due to their large computational requirements. In addition, deploying these models on smaller-footprint computing devices (e.g., mobile devices) or cost-effective central processing unit (CPU) based machines requires aggressive optimization techniques for both speed and network size. One popular optimization technique is NN quantization, where network weights and activations are transformed from 32-bit floating-point representations to integers (e.g., 8-bit). Running inference using integer operations has two key advantages. First, the model size footprint is considerably reduced; for example, 8-bit quantization shrinks models by a factor of four. Second, inference throughput is significantly increased by using more efficient integer-based single instruction multiple data (SIMD) instructions while improving memory bandwidth utilization, typically the bottleneck limiting computational throughput. Fundamentally, however, these techniques lose information due to the lowered numerical precision; as a result, applying integer quantization directly to NN models results in a considerable drop in accuracy. The majority of quantization research involves a mix of quantization-aware training (QAT) and post-training calibration techniques with varying complexities to resolve the quantization performance gap. While these techniques typically cl
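To make the size and throughput advantages above concrete, here is a generic NumPy sketch of symmetric 8-bit quantization feeding an integer GEMM with 32-bit accumulation. This illustrates standard int8 inference arithmetic, not the patent's kernel; the matrix shapes and the quantize_sym helper are assumptions.

    import numpy as np

    def quantize_sym(x):
        # Symmetric signed 8-bit quantization: one scale per tensor.
        scale = np.abs(x).max() / 127.0
        return np.round(x / scale).astype(np.int8), scale

    a = np.random.randn(16, 32).astype(np.float32)
    b = np.random.randn(32, 8).astype(np.float32)
    qa, sa = quantize_sym(a)
    qb, sb = quantize_sym(b)

    # Integer GEMM with 32-bit accumulation, rescaled back to float.
    c_int = qa.astype(np.int32) @ qb.astype(np.int32)
    c_approx = c_int * (sa * sb)

    print(np.abs(c_approx - a @ b).max())  # small quantization error
    print(a.nbytes / qa.nbytes)            # 4.0: the 4x footprint reduction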