
EP-4735992-A1 - COMPUTATIONALLY EFFICIENTLY DISCRETIZING FLOATING POINT NUMBERS


Abstract

A computation engine includes a discretizing unit for transforming a floating point number (X) with an exponent having a first number of bits (e) and a mantissa with a second number of bits (m). The number (X) is a product of its mantissa value (VMX) and 2 raised to its exponent value (VEX). The unit performs the following: it computes a difference exponent value (VEA) by subtracting a reference exponent value (VEY) from VEX. If VEA is less than 0, the output is 0. If VEA is greater than or equal to 0, it outputs a discretized number with an exponent equal to VEA and a mantissa derived from the n most significant bits of X, where n equals VEA.
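As a rough illustration of the rule summarized above, the following Python sketch applies the three steps to a simplified floating point representation given by an unbiased exponent value and a normalized mantissa bit string. This is not code from the patent; the function and variable names are our own, and the clamp of n to the mantissa width is our added safety assumption.

```python
def discretize(v_ex: int, mant_bits: str, v_ey: int, m: int = 10):
    """Discretize X = 1.mant_bits * 2**v_ex against a reference
    exponent value v_ey, following the steps in the abstract."""
    v_ea = v_ex - v_ey               # step a): difference exponent value
    if v_ea < 0:                     # step b): below the level -> output 0
        return None                  # None stands for the value 0 here
    n = min(v_ea, m)                 # keep at most m mantissa bits (our assumption)
    kept = mant_bits[:n] + "0" * (m - n)  # step c): keep n MSBs, zero the rest
    return (v_ea, kept)              # exponent component v_ea, truncated mantissa
```

For example, with VEX = 5, mantissa bits 1010000000 and VEY = 2, the difference exponent is 3 and only the top 3 mantissa bits survive.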

Inventors

  • BAMBERG, Lennart
  • POURTAHERIAN, Arash
  • Waeijen, Luc Johannes Wilhelmus
  • PIRES DOS REIS MOREIRA, ORLANDO MIGUEL

Assignees

  • Snap Inc.

Dates

Publication Date
2026-05-06
Application Date
2024-06-28

Claims (20)

  1. A computation engine comprising a discretizing unit for discretizing a floating point number (X) to provide a discretized floating point number, the floating point number (X) to be discretized having an exponent component with a first number (e) of bits and a mantissa component with a second number (m) of bits, the exponent component having an exponent value and the mantissa component having a mantissa value, the floating point number (X) to be discretized having a value equal to a product of the mantissa value (VMX) and the value 2 raised to the exponent value (VEX), the discretizing unit being configured to: a) compute a difference exponent value (VEA) of the discretized floating point number by subtracting a reference exponent value (VEY) from the exponent value (VEX) of the floating point number to be discretized (X); b) output a value 0 as the discretized floating point value (Discretize(X,Y)) of the discretized floating point number if the difference exponent value (VEA) is less than 0; c) if the difference exponent value (VEA) is greater than or equal to 0, output a value of the discretized floating point number having an exponent component equal to the difference exponent value (VEA) and having a mantissa component of which the n most significant bits are equal to the n most significant bits of the floating point number (X) to be discretized, where the value of n is equal to the difference exponent value (VEA).
  2. The computation engine according to claim 1, configured to set to 0 any bits of the mantissa of the discretized floating point number other than the most significant bits.
  3. The computation engine according to claim 1, wherein the floating point number (X) to be discretized is specified in the FP16 number format.
  4. The computation engine according to claim 2, wherein the floating point number (X) to be discretized is specified in the FP16 number format.
  5. A computing device configured to perform neural network operations of a neural network, the computing device comprising a computation engine according to claim 1, the computation engine being configured to apply a discretization operation to floating point data selected from the group consisting of: floating point data to be exchanged between neurons in the neural network; floating point data representing neuron states; floating point data of a feature map.
  6. The computing device according to claim 5, wherein the floating point number (X) to be discretized is specified in the FP16 number format.
  7. A computing device configured to perform neural network operations of a neural network, the computing device comprising a computation engine according to claim 2, the computation engine being configured to apply a discretization operation to floating point data selected from the group consisting of: floating point data to be exchanged between neurons in the neural network; floating point data representing neuron states; floating point data of a feature map.
  8. A computing device configured to perform neural network operations of a neural network, the computing device comprising a computation engine according to claim 4, the computation engine being configured to apply a discretization operation to floating point data selected from the group consisting of: floating point data to be exchanged between neurons in the neural network; floating point data representing neuron states; floating point data of a feature map.
  9. A method to be performed by a computation engine for discretizing a floating point number (X) having a mantissa component with a first number (m) of bits and an exponent component with a second number (e) of bits, the exponent component having an exponent value and the mantissa component having a mantissa value, the floating point number (X) to be discretized having a value equal to a product of the mantissa value (VMX) and the value 2 raised to the exponent value (VEX), the method comprising: a) computing (S1) an exponent difference value (VEA) as the difference between the value (VEX) of the exponent component of the floating point number to be discretized (X) and a reference exponent component value (VEY); b) outputting a value 0 for the discretized floating point number (Discretize(X,Y)) if the exponent difference value (VEA) is less than 0; c) if the difference exponent value (VEA) is greater than or equal to 0, outputting a value of the discretized floating point number having an exponent component equal to the difference exponent value (VEA) and having a mantissa component of which the n most significant bits are equal to the n most significant bits of the floating point number (X) to be discretized, where the value of n is equal to the difference exponent value (VEA).
  10. The method according to claim 9, comprising setting to 0 any bits of the mantissa of the discretized floating point number other than the most significant bits.
  11. The method according to claim 9, wherein the floating point number (X) to be discretized is specified in the FP16 number format.
  12. The method according to claim 10, wherein the floating point number (X) to be discretized is specified in the FP16 number format.
  13. A method of performing neural network operations of a neural network, according to claim 9, comprising discretizing floating point data selected from the group consisting of: floating point data to be exchanged between neurons in the neural network; floating point data representing neuron states; floating point data of a feature map.
  14. A method of performing neural network operations of a neural network, according to claim 10, comprising discretizing floating point data selected from the group consisting of: floating point data to be exchanged between neurons in the neural network; floating point data representing neuron states; floating point data of a feature map.
  15. A method of performing neural network operations of a neural network, according to claim 11, comprising discretizing floating point data selected from the group consisting of: floating point data to be exchanged between neurons in the neural network; floating point data representing neuron states; floating point data of a feature map.
  16. A method of performing neural network operations of a neural network, according to claim 12, comprising discretizing floating point data selected from the group consisting of: floating point data to be exchanged between neurons in the neural network; floating point data representing neuron states; floating point data of a feature map.
  17. A tangible computer-readable medium having computer-executable instructions stored thereon that, when executed by a processor, perform a method for discretizing a floating point number (X), wherein the floating point number (X) includes a mantissa component having a first number (m) of bits and an exponent component having a second number (e) of bits, the exponent component having an exponent value and the mantissa component having a mantissa value, the floating point number (X) to be discretized having a value equal to a product of the mantissa value (VMX) and the value 2 raised to the exponent value (VEX), the method comprising: a) computing (S1) an exponent difference value (VEA) as the difference between the value (VEX) of the exponent component of the floating point number to be discretized (X) and a reference exponent component value (VEY); b) outputting a value 0 for the discretized floating point number (Discretize(X,Y)) if the exponent difference value (VEA) is less than 0; c) if the difference exponent value (VEA) is greater than or equal to 0, outputting a value of the discretized floating point number having an exponent component equal to the difference exponent value (VEA) and having a mantissa component of which the n most significant bits are equal to the n most significant bits of the floating point number (X) to be discretized, where the value of n is equal to the difference exponent value (VEA).
  18. The tangible computer-readable medium according to claim 17, wherein the method to be executed by the processor comprises setting to 0 any bits of the mantissa of the discretized floating point number other than the most significant bits.
  19. The tangible computer-readable medium according to claim 17, wherein the floating point number (X) to be discretized is specified in the FP16 number format.
  20. The tangible computer-readable medium according to claim 17, wherein the method to be executed by the processor comprises discretizing floating point data selected from the group consisting of: floating point data to be exchanged between neurons in the neural network; floating point data representing neuron states; floating point data of a feature map.

Description

COMPUTATIONALLY EFFICIENTLY DISCRETIZING FLOATING POINT NUMBERS

CLAIM OF PRIORITY

This application claims the benefit of priority to European Patent Application Serial No. 23306051.6, filed on June 28, 2023, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The present disclosure pertains to a computation engine for discretizing floating point numbers. The present disclosure further pertains to a computing device configured to perform neural network operations of a neural network. The present disclosure still further pertains to a computation method for discretizing floating point numbers. The present disclosure also pertains to a neural network method configured to perform neural network operations of a neural network, using the computation method.

BACKGROUND

For many applications it is desirable to use a floating point format in view of its large value range. An exemplary application is neural network computing. However, developments in neural network technology tend to result in ever more complex neural networks, with more layers and more neural network operations to be performed. There is a need to mitigate the computational effort involved in these operations, so that even these more complex neural networks can be executed with modest computational means.

SUMMARY

According to a first aspect of the present disclosure, a computation engine for discretizing floating point numbers is provided herein. According to a second aspect, a computing device is provided comprising a computation engine for performing discretization operations for the purpose of neural network processing. According to a third aspect of the present disclosure, a computation method for discretizing floating point numbers is provided herein.
The present disclosure further pertains to a tangible or non-transitory computer-readable medium having computer-executable instructions stored thereon that, when executed by a processor, perform the computation method. According to a fourth aspect of the present disclosure, a computation method for discretizing floating point numbers for the purpose of neural network operations is provided herein. The present disclosure further pertains to a tangible or non-transitory computer-readable medium having computer-executable instructions stored thereon that, when executed by a processor, perform the neural network method.

The computation engine according to the first aspect comprises a discretizing unit for discretizing a floating point number in order to provide a discretized floating point number. Floating point numbers comprise an exponent component with a first number of bits and a mantissa component with a second number of bits. Optionally, floating point numbers also have a sign bit. The exponent component has an exponent value and the mantissa component has a mantissa value. The floating point number to be discretized has a value equal to a product of the mantissa value and the value 2 raised to the exponent value. A sign bit, if included, indicates whether the floating point number has a positive or a negative value. For neural network applications the FP16 format is very useful. Numbers specified in this format have, in order, a sign bit, 5 exponent bits and 10 mantissa bits; however, other formats may be useful as well depending on accuracy requirements and the availability of computational resources.

The inventors recognized that a substantial computational effort is involved in the discretization of a floating-point number with respect to a predefined level in the hardware. Conventionally this requires a floating-point division operation and a multiplication operation.
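For reference, the FP16 field layout mentioned above (1 sign bit, 5 exponent bits, 10 mantissa bits) can be unpacked with simple shifts and masks. This is a generic illustration of the format, not code from the patent; the function name is our own.

```python
def fp16_fields(bits: int):
    """Split a 16-bit pattern into the sign, exponent and mantissa
    fields of the FP16 (IEEE 754 half precision) format."""
    sign = (bits >> 15) & 0x1        # 1 sign bit
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits (biased by 15)
    mantissa = bits & 0x3FF          # 10 mantissa bits
    return sign, exponent, mantissa
```

For example, the pattern 0x3C00 (the FP16 encoding of 1.0) splits into sign 0, exponent field 15 and mantissa field 0.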
The overall operation of discretization of a floating-point number X with respect to a floating-point positive number Y, referred to as the discretization level, can be formulated as:

Discretize(X, Y) = sign(X) * floor(|X| / Y) * Y

The conventional way of discretization also substantially contributes to latency in the operation of the neural network. The computation engine comprises a discretizing unit that performs this operation in a computationally efficient way. For this purpose, it is configured to: a) compute a difference exponent value of the discretized floating point number by subtracting a reference exponent value from the exponent value (VEX) of the floating point number to be discretized; b) output a value 0 as the discretized floating point value of the discretized floating point number if the difference exponent value is less than 0; c) if the difference exponent value is greater than or equal to 0, output a value of the discretized floating point number having an exponent component equal to the difference exponent value and having a mantissa component of which the n most significant bits are equal to the n most significant bits of the floating point number to be discretized, where the value of n is equal to the difference exponent value. Contrary to conventional solutions, the discretizing un
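Steps a) to c) above can be sketched directly on raw FP16 bit patterns using only comparisons, shifts and masks, with no division. This is an illustrative sketch with our own names, treating the 5-bit exponent field directly as the exponent value (ignoring the IEEE bias and denormal handling) and clamping n to the mantissa width; it is not the patent's hardware implementation.

```python
def discretize_fp16_bits(x_bits: int, v_ey: int) -> int:
    """Apply steps a)-c) to a 16-bit FP16-style pattern, assuming a
    normalized input and an unbiased reference exponent value v_ey."""
    sign = x_bits & 0x8000            # preserve the sign bit
    v_ex = (x_bits >> 10) & 0x1F      # 5-bit exponent field
    mant = x_bits & 0x3FF             # 10-bit mantissa field
    v_ea = v_ex - v_ey                # step a): difference exponent value
    if v_ea < 0:                      # step b): output 0
        return 0
    n = min(v_ea, 10)                 # clamp to mantissa width (our assumption)
    mask = ((1 << n) - 1) << (10 - n) # keep only the n most significant bits
    return sign | (v_ea << 10) | (mant & mask)  # step c)
```

The only arithmetic is an integer subtraction on the exponent fields; the mantissa truncation is a mask, which is the source of the computational efficiency described above.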