EP-4740138-A1 - APPROXIMATING ACTIVATION FUNCTIONS WITH TAYLOR SERIES

Abstract

An activation function unit can compute activation functions approximated by Taylor series. The activation function unit may include a plurality of compute elements. Each compute element may include two multipliers and an accumulator. The first multiplier may compute intermediate products using an activation, such as an output activation of a DNN layer. The second multiplier may compute terms of Taylor series approximating an activation function based on the intermediate products from the first multiplier and coefficients of the Taylor series. The accumulator may compute a partial sum of the terms as an output of the activation function. The number of the terms may be determined based on a predetermined accuracy of the output of the activation function. The activation function unit may process multiple activations. Different activations may be input into different compute elements in different clock cycles. The activation function unit may compute activation functions with different accuracies.
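As a concrete illustration of the scheme the abstract describes, the following is a minimal software sketch of the term-by-term evaluation: one multiplier raises the activation to successive powers (the intermediate products), a second multiplier scales each power by a precomputed coefficient, and an accumulator keeps the partial sum. The function name, the tolerance-based stopping rule, and the exp(x) example are illustrative assumptions, not details from the patent.

```python
import math

def approx_activation(x, coeffs, tol=1e-6):
    """Evaluate a truncated Taylor series sum(c[n] * x**n) term by term."""
    power = 1.0   # x**0; the "first multiplier" raises this each step
    acc = 0.0     # partial sum held by the "accumulator"
    for c in coeffs:
        term = c * power     # "second multiplier": coefficient * intermediate product
        acc += term          # accumulator folds the term into the partial sum
        if abs(term) < tol:  # illustrative accuracy-based cut-off on the term count
            break
        power *= x           # "first multiplier": next intermediate product
    return acc

# Example: exp(x) around 0 has Taylor coefficients 1/n!
coeffs = [1.0 / math.factorial(n) for n in range(12)]
print(approx_activation(0.5, coeffs))  # ~1.6487, close to math.exp(0.5)
```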

Inventors

  • Cheema, Umer Iftikhar
  • Mathaikutty, Deepak Abraham
  • Raha, Arnab
  • Kondru, Dinakar
  • Sung, Raymond Jit-Hung
  • Ghosh, Soumendu Kumar

Assignees

  • Intel Corporation

Dates

Publication Date
2026-05-13
Application Date
2023-12-07

Claims (20)

  1. A compute element for computing an activation function, the compute element comprising: a first multiplier configured to compute one or more intermediate products using an activation, the activation computed in a layer of a neural network; a second multiplier configured to compute one or more terms of an approximation of the activation function based on the one or more intermediate products from the first multiplier and one or more coefficients of the approximation; and an accumulator configured to compute an output of the activation function based on a polynomial comprising the one or more terms of the approximation, wherein a degree of the polynomial is determined based on a predetermined accuracy of the output of the activation function.
  2. The compute element of claim 1, wherein the approximation of the activation function is a Taylor series, and the one or more coefficients of the approximation comprises one or more coefficients of the Taylor series that are computed before the activation is computed.
  3. The compute element of claim 1 or 2, further comprising: a storage unit associated with the accumulator, the storage unit configured to store an intermediate sum computed by the accumulator.
  4. The compute element of claim 3, wherein the accumulator is configured to compute the output of the activation function by accumulating the intermediate sum with a term of the approximation computed by the second multiplier.
  5. The compute element of any one of claims 1-4, wherein the first multiplier is configured to compute the one or more intermediate products by: computing a first intermediate product in a first clock cycle; and after computing the first intermediate product, computing a second intermediate product in a second clock cycle based on the activation and the first intermediate product.
  6. The compute element of any one of claims 1-5, wherein the second multiplier is configured to compute the one or more terms of the approximation in a sequence of clock cycles, and the second multiplier is configured to use a different coefficient of the approximation in each clock cycle in the sequence.
  7. The compute element of any one of claims 1-6, wherein: the compute element is included in a plurality of compute elements for computing outputs of the activation function using a plurality of activations, and the plurality of activations is computed in the layer of the neural network and comprises the activation.
  8. The compute element of claim 7, wherein the plurality of activations is input into different ones of the plurality of compute elements in different clock cycles.
  9. The compute element of claim 8, wherein a first output of the activation function based on a first activation of the plurality of activations has a higher predetermined accuracy than a second output of the activation function based on a second activation of the plurality of activations.
  10. The compute element of claim 9, wherein the first output of the activation function is computed by more compute elements than the second output of the activation function.
  11. An apparatus for a deep learning operation, the apparatus comprising: one or more processing elements configured to compute one or more activations by performing the deep learning operation in a neural network; a memory configured to store one or more coefficients of an approximation of an activation function in the neural network; and one or more compute elements configured to receive the one or more activations from the one or more processing elements and receive the one or more coefficients from the memory, a compute element comprising: a first multiplier configured to compute one or more intermediate products using an activation of the one or more activations, a second multiplier configured to compute one or more terms of the approximation based on the one or more intermediate products from the first multiplier and the one or more coefficients, and an accumulator configured to compute an output of the activation function based on a polynomial comprising the one or more terms of the approximation, wherein a degree of the polynomial is determined based on a predetermined accuracy of the output of the activation function.
  12. The apparatus of claim 11, wherein the one or more processing elements are coupled to the memory through a data transfer path, and the compute element is on the data transfer path.
  13. The apparatus of claim 11 or 12, wherein the first multiplier is configured to compute the one or more intermediate products by: computing a first intermediate product in a first clock cycle; and after computing the first intermediate product, computing a second intermediate product in a second clock cycle based on the activation and the first intermediate product.
  14. The apparatus of any one of claims 11-13, wherein the second multiplier is configured to compute the one or more terms of the approximation in a sequence of clock cycles.
  15. The apparatus of claim 14, wherein the second multiplier is configured to use a different coefficient of the approximation in each clock cycle in the sequence.
  16. The apparatus of any one of claims 11-15, wherein different ones of the one or more activations are input into different ones of the one or more compute elements in different clock cycles.
  17. The apparatus of claim 16, wherein: a first output of the activation function based on a first activation of the one or more activations has a higher predetermined accuracy than a second output of the activation function based on a second activation of the one or more activations, and the first output of the activation function is computed by more compute elements than the second output of the activation function.
  18. The apparatus of any one of claims 11-17, wherein: the deep learning operation is in a first layer of the neural network, and the output of the activation function is input into a second layer of the neural network.
  19. The apparatus of claim 18, wherein the second layer is after the first layer in the neural network.
  20. A method for deep learning, comprising: receiving one or more precomputed coefficients of an approximation of an activation function in a neural network; receiving an activation computed in a layer of the neural network; computing, by a first multiplier, one or more intermediate products using the activation; computing, by a second multiplier, one or more terms of the approximation based on the one or more intermediate products from the first multiplier and the one or more precomputed coefficients; and computing, by an accumulator, an output of the activation function based on a polynomial comprising the one or more terms of the approximation, wherein a degree of the polynomial is determined based on a predetermined accuracy of the output of the activation function.
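Claims 5-8 describe a per-cycle dataflow: each compute element produces one Taylor term per clock cycle with a new coefficient each cycle, and different activations enter different compute elements in different cycles. A cycle-level toy model of that behavior might look like the sketch below (all names, the three-element array, and the exp(x) coefficients are hypothetical illustrations, not the patent's implementation):

```python
import math

class ComputeElement:
    """Toy model of one compute element: one Taylor term per clock cycle."""

    def __init__(self, coeffs):
        self.coeffs = coeffs
        self.reset()

    def reset(self, activation=None):
        self.x = activation
        self.power = 1.0  # intermediate product x**n (claim 5)
        self.acc = 0.0    # intermediate sum (claims 3-4)
        self.step = 0

    def clock(self):
        """One cycle: a new coefficient is consumed each cycle (claim 6)."""
        if self.x is None or self.step >= len(self.coeffs):
            return                                         # stalled or finished
        self.acc += self.coeffs[self.step] * self.power    # second multiplier + accumulator
        self.power *= self.x                               # first multiplier: next power
        self.step += 1

# Stagger three activations into three compute elements (claims 8/16).
coeffs = [1.0 / math.factorial(n) for n in range(6)]  # exp(x), degree 5
ces = [ComputeElement(coeffs) for _ in range(3)]
activations = [0.1, 0.2, 0.3]

for cycle in range(len(coeffs) + len(ces) - 1):
    if cycle < len(ces):                   # a new activation enters each cycle
        ces[cycle].reset(activations[cycle])
    for ce in ces:
        ce.clock()

print([round(ce.acc, 6) for ce in ces])    # ~[exp(0.1), exp(0.2), exp(0.3)]
```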

Description

APPROXIMATING ACTIVATION FUNCTIONS WITH TAYLOR SERIES

Cross-Reference to Related Application

[0001] This application claims the benefit of priority from U.S. Patent Application No. 18/346,992, filed on July 05, 2023, and entitled “APPROXIMATING ACTIVATION FUNCTIONS WITH TAYLOR SERIES,” which is hereby incorporated by reference herein in its entirety.

Technical Field

[0002] This disclosure relates generally to deep neural networks (DNNs), and more specifically, to approximating activation functions in DNNs with Taylor series.

Background

[0003] DNNs are used extensively for a variety of artificial intelligence applications ranging from computer vision to speech recognition and natural language processing due to their ability to achieve high accuracy. However, the high accuracy comes at the expense of significant computation cost. DNNs have extremely high computing demands, as each inference can require hundreds of millions of MAC (multiply-accumulate) operations as well as a large amount of data to read and write. DNN inference also requires computation of activation functions. Therefore, techniques to improve the efficiency of DNNs are needed.

Brief Description of the Drawings

[0004] Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

[0005] FIG. 1 illustrates an example DNN, in accordance with various embodiments.
[0006] FIG. 2 illustrates an example convolution, in accordance with various embodiments.
[0007] FIG. 3 is a block diagram of a DNN accelerator, in accordance with various embodiments.
[0008] FIG. 4 illustrates an example activation function unit, in accordance with various embodiments.
[0009] FIG. 5 illustrates an example operation of the activation function unit with no stall cycle, in accordance with various embodiments.
[0010] FIG. 6 illustrates computation of an activation function based on the same accuracy, in accordance with various embodiments.
[0011] FIG. 7 illustrates an example operation of the activation function unit with a stall cycle, in accordance with various embodiments.
[0012] FIG. 8 illustrates computation of an activation function based on different accuracies, in accordance with various embodiments.
[0013] FIG. 9 illustrates a processing element (PE) array, in accordance with various embodiments.
[0014] FIG. 10 is a block diagram of a PE, in accordance with various embodiments.
[0015] FIG. 11 is a flowchart showing a method of computing activation functions in neural networks, in accordance with various embodiments.
[0016] FIG. 12 is a block diagram of an example computing device, in accordance with various embodiments.

Detailed Description

Overview

[0017] The last decade has witnessed a rapid rise in AI (artificial intelligence) based data processing, particularly based on DNNs. DNNs are widely used in the domains of computer vision, speech recognition, and image and video processing, mainly due to their ability to achieve beyond human-level accuracy. The significant improvements in DNN model size and accuracy, coupled with the rapid increase in the computing power of execution platforms, have led to the adoption of DNN applications even within resource-constrained mobile and edge devices that have limited energy availability.
[0018] A DNN layer may include one or more deep learning operations, such as convolution, pooling, elementwise operation, linear operation, nonlinear operation, and so on. A deep learning operation in a DNN may be performed on one or more internal parameters of the DNN (e.g., weights), which are determined during the training phase, and one or more activations. An activation may be a data point (also referred to as a “data element” or “element”). Activations or weights of a DNN layer may be elements of a tensor of the DNN layer. A tensor is a data structure having multiple elements across one or more dimensions. Example tensors include a vector, which is a one-dimensional tensor, and a matrix, which is a two-dimensional tensor. There can also be three-dimensional tensors and even higher dimensional tensors. A DNN layer may have an input tensor (also referred to as an “input feature map (IFM)”) including one or more input activations (also referred to as “input elements”) and a weight tensor including one or more weights. A weight is an element in the weight tensor. A weight tensor of a convolution may be a kernel, a filter, or a group of filters. The output data of the DNN layer may be an output tensor (also referred to as an “output feature map (OFM)”) that includes one or more output activations (also referred to as “output elements”).

[0019] Activation functions are important parts of DNNs. An activation function can decide whether a neuron should or should not be activated by computing the weighted sum of its inputs.
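The idea in paragraph [0019] can be made concrete in a few lines of code. The following is a minimal sketch; the function name and the choice of sigmoid as the activation function are illustrative assumptions, not details taken from the patent:

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through an activation.

    The activation function (sigmoid here, purely as an example) maps the
    weighted sum to a bounded value that decides how strongly the neuron fires.
    """
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

print(neuron_output([0.5, -1.2], [0.8, 0.3], 0.1))  # z = 0.14, sigmoid(z) ~0.535
```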