EP-4738197-A1 - NEURAL NETWORK ACCELERATOR PERFORMING OPERATION WITH MIXED-FORMAT WEIGHTS
Abstract
A data processing unit may include a memory, processing elements (PEs), and a control unit. The memory may store weight blocks within a weight tensor of a neural network operation. Each weight block has an input channel (IC) dimension and an output channel (OC) dimension and includes subblocks. A subblock includes one or more weights having a first data precision and one or more other weights having a second data precision. The second data precision is lower than the first data precision. The control unit may distribute different ones of the subblocks to different ones of the PEs. A PE may receive a subblock and perform a first multiply-accumulate (MAC) operation on a weight having the first data precision and a second MAC operation on a weight having the second data precision. The first MAC operation may consume more computation cycles or more multipliers than the second MAC operation.
Inventors
- RAHA, Arnab
- MATHAIKUTTY, DEEPAK ABRAHAM
- WU, MICHAEL
- SHARMA, Daksha
- LANGHAMMER, MARTIN
Assignees
- Altera Corporation
Dates
- Publication Date
- 20260506
- Application Date
- 20250825
Claims (15)
1. An apparatus, comprising: a memory configured to store a weight block of a neural network operation, the weight block comprising weights having different data precisions; one or more processing elements, a processing element comprising a multiply-accumulate (MAC) unit; and a control unit configured to distribute the weights to the one or more processing elements, wherein the processing element is configured to perform a first MAC operation on a weight having a first data precision and a second MAC operation on a weight having a second data precision, and the second data precision is lower than the first data precision.
2. The apparatus of claim 1, wherein the weight block comprises a plurality of subblocks, a subblock comprises one or more weights having the first data precision and one or more other weights having the second data precision, and the one or more weights and one or more other weights are all in different input channels of the neural network operation.
3. The apparatus of claim 2, wherein the one or more weights and one or more other weights are all in a same output channel of the neural network operation.
4. The apparatus of claim 2 or 3, wherein the subblocks have a same number of weights having the second data precision.
5. The apparatus of any one of claims 1-4, wherein the first MAC operation is performed in more computation cycles than the second MAC operation.
6. The apparatus of claim 5, wherein the MAC unit comprises a multiplier, a shifter, and an adder.
7. The apparatus of claim 6, wherein the multiplier is configured to compute a first product and a second product in two computation cycles, respectively, for the first MAC operation, the shifter is configured to shift the first product, and the adder is configured to add an output of the shifter with the second product.
8. The apparatus of any one of claims 1-7, wherein the processing element comprises a plurality of multipliers, and the first MAC operation is performed by using more multipliers than the second MAC operation.
9. The apparatus of claim 8, wherein the first MAC operation is performed in a first computation cycle, the second MAC operation is performed in a second computation cycle, and the control unit is configured to distribute more activations to the processing element for the second computation cycle than the first computation cycle.
10. The apparatus of any one of claims 1-9, wherein the first MAC operation or the second MAC operation is performed further on an input activation of the neural network operation, and the input activation has the first data precision.
11. A method of executing a neural network, the method comprising: storing a weight block of a neural network operation, the weight block comprising weights having different data precisions; distributing the weights to one or more processing elements; and performing, by a processing element, a first MAC operation on a weight having a first data precision and a second MAC operation on a weight having a second data precision, the second data precision lower than the first data precision.
12. The method of claim 11, wherein the weight block comprises subblocks, a subblock comprises one or more weights having the first data precision and one or more other weights having the second data precision, and the one or more weights and one or more other weights are all in different input channels of the neural network operation.
13. The method of claim 11 or 12, wherein: the first MAC operation is performed in more computation cycles than the second MAC operation; or the processing element comprises a plurality of multipliers, the first MAC operation is performed by using more multipliers than the second MAC operation, and the method further comprises distributing more activations to the processing element for the second MAC operation than the first MAC operation.
14. The method of any one of claims 11-13, wherein performing the first MAC operation comprises: computing, by a multiplier in the processing element, a first product and a second product in two computation cycles, respectively; shifting, by a shifter in the processing element, the first product; and accumulating, by an adder in the processing element, an output of the shifter with the second product.
15. One or more non-transitory computer-readable media storing instructions executable to perform the method of any one of claims 11-14.
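The two-cycle shift-add MAC recited in claims 7 and 14 can be illustrated with a short sketch. This is an illustrative model only, not the claimed hardware: it assumes an unsigned 8-bit weight split into two 4-bit slices that pass, one per computation cycle, through a single 4-bit-wide multiplier path; the claims do not fix these widths, and a real PE would also handle sign extension.

```python
# Illustrative model of the claimed two-cycle MAC (assumed widths: unsigned
# 8-bit weight, 4-bit multiplier slices). Not the patented implementation.

def mac_two_cycle(activation: int, weight: int, accumulator: int = 0) -> int:
    """Higher-precision (8-bit) MAC emulated on a narrow multiplier."""
    lo = weight & 0xF          # low 4-bit slice of the weight
    hi = (weight >> 4) & 0xF   # high 4-bit slice of the weight
    first_product = activation * hi    # cycle 1: multiplier output
    second_product = activation * lo   # cycle 2: multiplier output
    # Shifter realigns the first product; adder combines both with the
    # running accumulator, as in claims 7 and 14.
    return accumulator + (first_product << 4) + second_product

def mac_one_cycle(activation: int, weight4: int, accumulator: int = 0) -> int:
    """Lower-precision (4-bit) MAC completes in a single cycle."""
    return accumulator + activation * weight4

# The two-cycle path reproduces a full 8-bit multiply:
assert mac_two_cycle(57, 0xB3) == 57 * 0xB3
```

Under these assumptions the higher-precision weight costs two cycles on the same multiplier that finishes a lower-precision weight in one, which is exactly the cycle-count asymmetry recited in claim 5.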
Description
Technical Field
This disclosure relates generally to neural networks (also referred to as "deep neural networks" or "DNNs"), and more specifically, to DNN accelerators that perform operations in DNNs with mixed-format weights.
Background
DNNs are used extensively for a variety of artificial intelligence (AI) applications ranging from computer vision to speech recognition and natural language processing due to their ability to achieve high accuracy. However, the high accuracy comes at the expense of significant computation cost. DNNs have extremely high computing demands, as there can be a large number of operations as well as a large amount of data to read and write. Therefore, techniques to improve the efficiency of DNNs are needed.
Brief Description of the Drawings
Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
FIG. 1 illustrates an example DNN, in accordance with various embodiments.
FIG. 2 illustrates an example convolution, in accordance with various embodiments.
FIG. 3 is a block diagram of a DNN system, in accordance with various embodiments.
FIG. 4 is a block diagram of a DNN module, in accordance with various embodiments.
FIG. 5 illustrates an example sparse cell, in accordance with various embodiments.
FIG. 6 illustrates an example sparse cell array, in accordance with various embodiments.
FIG. 7 illustrates an example processing element (PE), in accordance with various embodiments.
FIG. 8 illustrates a computation schedule for a group of PEs, in accordance with various embodiments.
FIG. 9 illustrates a mixed-format map, in accordance with various embodiments.
FIG. 10 illustrates an example PE with an 8×4 multiplier, in accordance with various embodiments.
FIG. 11 illustrates an example PE that can perform computations with mixed-format weights, in accordance with various embodiments.
FIG. 12 illustrates another example PE that can perform computations with mixed-format weights, in accordance with various embodiments.
FIG. 13 illustrates an example bitmap used for accelerating computations in a PE, in accordance with various embodiments.
FIG. 14 illustrates mixed-format maps of different patterns, in accordance with various embodiments.
FIG. 15 illustrates an example search tree, in accordance with various embodiments.
FIG. 16 is a flowchart of a method of executing a DNN, in accordance with various embodiments.
FIG. 17 is a block diagram of an example computing device, in accordance with various embodiments.
Detailed Description
Overview
The last decade has witnessed a rapid rise in AI-based data processing, particularly based on DNNs. DNNs are widely used in the domains of computer vision, speech recognition, and image and video processing, mainly due to their ability to achieve beyond human-level accuracy. A DNN typically includes a sequence of layers. A DNN layer may include one or more operations, such as convolution, interpolation, layer normalization, batch normalization, SoftMax operation, pooling, elementwise operation, linear operation, nonlinear operation, and so on. These operations are referred to as deep learning operations or neural network operations. Neural network operations may be tensor operations. Input or output data of neural network operations may be arranged in data structures called tensors. Taking a convolutional layer for example, the input tensors include an activation tensor (also referred to as an "input feature map (IFM)" or "input activation tensor") including one or more activations (also referred to as "input elements") and a weight tensor. The weight tensor may be a kernel (a 2D weight tensor), a filter (a 3D weight tensor), or a group of filters (a 4D weight tensor).
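The convolution over an input activation tensor and a weight tensor described above can be made concrete with a small sketch. The shapes below are hypothetical (the document fixes no sizes), and NumPy stands in for the accelerator's MAC hardware; each output element is a multiply-accumulate reduction over the input-channel and kernel dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

IC, OC = 4, 2    # input channels, output channels (hypothetical sizes)
H, W = 5, 5      # spatial size of the input feature map (IFM)
KH, KW = 3, 3    # kernel height and width

ifm = rng.random((IC, H, W))             # 3D input activation tensor
filters = rng.random((OC, IC, KH, KW))   # 4D weight tensor (a group of filters)

# Direct convolution, stride 1, no padding: each output element is a
# MAC reduction over IC * KH * KW activation/weight pairs.
OH, OW = H - KH + 1, W - KW + 1
ofm = np.zeros((OC, OH, OW))
for oc in range(OC):
    for y in range(OH):
        for x in range(OW):
            ofm[oc, y, x] = np.sum(ifm[:, y:y + KH, x:x + KW] * filters[oc])

print(ofm.shape)  # (2, 3, 3): one 2D output feature map per output channel
```

Each filter (a 3D weight tensor) produces one output channel, which is why the 4D weight tensor's leading dimension matches the output tensor's channel dimension.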
A convolution may be performed on the input activation tensor and the weight tensor to compute an output activation tensor in the convolutional layer.
A tensor is a data structure having multiple elements across one or more dimensions. Examples of tensors include vectors (one-dimensional (1D) tensors), matrices (two-dimensional (2D) tensors), three-dimensional (3D) tensors, four-dimensional (4D) tensors, and even higher-dimensional tensors. A dimension of a tensor may correspond to an axis, e.g., an axis in a coordinate system. A dimension may be measured by the number of data points along the axis. The dimensions of a tensor may define the shape of the tensor. A DNN layer may receive one or more input tensors and compute an output tensor from the one or more input tensors. In some embodiments, a 3D tensor may have an X-dimension, a Y-dimension, and a Z-dimension. The X-dimension of a tensor may be the horizontal dimension, the length of which may be the width of the tensor; the Y-dimension may be the vertical dimension, the length of wh