EP-3767455-B1 - APPARATUS AND METHOD FOR PROCESSING FLOATING-POINT NUMBERS
Inventors
- ELLIOTT, Sam
Dates
- Publication Date: 20260513
- Application Date: 20200716
Claims (13)
- A machine-implemented method of processing an input set comprising two floating-point numbers (A, B), each floating-point number having a sign, to generate a sum (A+B) and a difference (A-B) of the two floating-point numbers, the method comprising: receiving (804) the two floating-point numbers of the input set; calculating (806) a sum of the absolute values of the two floating-point numbers, using a same-sign floating-point adder (1020), to produce a first result; calculating (808) a difference of the absolute values of the two floating-point numbers, using a floating-point subtractor (1032), to produce a second result; and generating (810, 812) the sum (A+B) of the two floating-point numbers and the difference (A-B) of the two floating-point numbers based on: the first result, the second result, and the sign of each floating-point number, wherein the same-sign floating-point adder (1020) is implemented in fixed function circuitry configured to add together floating-point numbers having the same sign, and wherein the same-sign floating-point adder does not include circuitry configured to add together numbers having different signs, wherein generating the sum (A+B) of the two floating-point numbers and the difference (A-B) of the two floating-point numbers comprises correcting (810) a sign of the first result and a sign of the second result, wherein, if the first number (A) of the two numbers is positive, the sign of the first result is not changed and the sign of the second result is not changed, and if the first number (A) of the two numbers is negative, the sign of the first result is set to denote a negative number and the sign of the second result is changed.
- The method of claim 1, wherein generating the sum (A+B) of the two floating-point numbers and the difference (A-B) of the two floating-point numbers comprises: generating (812) the sum (A+B) of the two floating-point numbers from one of the first result and the second result; and generating (812) the difference (A-B) of the two floating-point numbers from the other of the first result and the second result.
- The method of claim 1, wherein the floating-point subtractor is implemented in fixed function circuitry and/or by a mixed-sign floating-point adder.
- A circuit configured to process an input set comprising two floating-point numbers (A, B), each floating-point number having a sign, to generate a sum (A+B) and a difference (A-B) of the two floating-point numbers, the circuit comprising: an input, configured to receive (804) the two floating-point numbers of the input set; a same-sign floating-point adder (1020), configured to calculate (806) a sum of the absolute values of the two floating-point numbers, to produce a first result; a floating-point subtractor (1032), configured to calculate (808) a difference of the absolute values of the two floating-point numbers, to produce a second result; and multiplexing and sign-correction logic (1010), configured to generate the sum (A+B) of the two floating-point numbers and the difference (A-B) of the two floating-point numbers based on: the first result, the second result, and the sign of each floating-point number (A, B), wherein the same-sign floating-point adder is implemented in fixed function circuitry configured to add together floating-point numbers having the same sign, and wherein the same-sign floating-point adder does not include circuitry configured to add together numbers having different signs, wherein the multiplexing and sign-correction logic (1010) is configured to correct a sign of the first result and a sign of the second result, wherein, if the first number (A) of the two numbers is positive, the sign of the first result is not changed and the sign of the second result is not changed, and if the first number (A) of the two numbers is negative, the sign of the first result is set to denote a negative number and the sign of the second result is changed.
- The circuit of claim 4, wherein the floating-point subtractor (1032) is implemented in fixed function circuitry and/or by a mixed-sign floating-point adder.
- The circuit of any one of claims 4 to 5, wherein the multiplexing and sign-correction logic (1010) is configured to: generate the sum (A+B) of the two floating-point numbers from one of the first result and the second result; and generate the difference (A-B) of the two floating-point numbers from the other of the first result and the second result.
- A processing system comprising the circuit of any one of claims 4 to 6.
- A processing system configured to perform the method of any one of claims 1 to 3.
- The processing system of claim 7 or 8 wherein the processing system is a graphics processing system or an artificial intelligence accelerator system.
- A method of manufacturing, using an integrated circuit manufacturing system, a circuit as claimed in any one of claims 4 to 6 or a processing system as claimed in any of claims 7 to 9.
- An integrated circuit definition dataset that, when processed in an integrated circuit manufacturing system, configures the integrated circuit manufacturing system to manufacture a circuit as claimed in any one of claims 4 to 6 or a processing system as claimed in any of claims 7 to 9.
- A computer-implemented method of processing a computer-readable description of an integrated circuit to generate a representation of the integrated circuit, the method comprising: receiving the computer-readable description of the integrated circuit; identifying, in the computer-readable description of the integrated circuit, a description of one or more functional blocks for calculating a sum and difference of two floating-point numbers; and generating the representation of the integrated circuit, wherein said one or more functional blocks are represented, in the representation of the integrated circuit, as a representation of a circuit according to any one of claims 4 to 6.
- Computer program code configured to cause one or more processors to perform the method of claim 12 when the code is run on the one or more processors.
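The data flow recited in claim 1 can be illustrated with a short software sketch in Python. This is only a behavioural model under stated assumptions, not the claimed circuit: ordinary float arithmetic stands in for the same-sign adder (1020) and the subtractor (1032), the function name `sum_and_difference` is hypothetical, and the rule used to select which corrected result becomes the sum (same input signs: the first result; different input signs: the second result) is derived here from the identities A = ±|A| and B = ±|B| rather than quoted from the claims.

```python
import math


def sum_and_difference(a, b):
    """Return (a + b, a - b) using only |a| + |b| and |a| - |b|."""
    # Read the sign bits (copysign also handles signed zeros).
    a_neg = math.copysign(1.0, a) < 0
    b_neg = math.copysign(1.0, b) < 0

    # First result: sum of absolute values. Both operands are
    # non-negative, so a same-sign adder would suffice here.
    first = abs(a) + abs(b)
    # Second result: difference of absolute values (the subtractor).
    second = abs(a) - abs(b)

    # Sign correction: unchanged if A is positive. If A is negative,
    # the first result (never negative before this step) is made
    # negative and the sign of the second result is flipped.
    if a_neg:
        first = -first
        second = -second

    # Multiplexing (assumed selection rule, see lead-in): the sum comes
    # from one corrected result and the difference from the other.
    if a_neg == b_neg:
        return first, second
    return second, first
```

For example, `sum_and_difference(-1.5, 2.25)` computes 3.75 and -0.75 as the first and second results, negates both because A is negative, and then swaps them because the input signs differ, giving a sum of 0.75 and a difference of -3.75.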
Description
Background
Floating-point arithmetic is useful in a variety of applications, including but not limited to graphics, data processing, image processing, signal processing, control algorithms and scientific programming. Adding together floating-point numbers is one of the most fundamental operations in floating-point arithmetic, and it is ubiquitous across applications and implementations. Floating-point addition may be implemented in software, e.g. by executing suitable instructions on a general purpose processing unit. Alternatively, floating-point addition may be implemented in hardware, e.g. by configuring fixed-function circuitry appropriately. Generally, a software implementation allows for greater flexibility than a hardware implementation (e.g. in terms of changing the operation of the addition after design time, such as changing how many numbers are added together), whereas a hardware implementation generally provides more efficient operation (e.g. in terms of lower latency and lower power consumption). Therefore, if the efficiency of the operation is deemed more important than flexibility (e.g. if a specific type of addition is known to be performed many times in a device where power consumption and latency are important, such as a battery-powered mobile device like a smart phone, tablet or laptop), a hardware implementation may be more appropriate than a software implementation.
US 8,645,449 B1 discloses circuitry configured to simultaneously add and subtract a first and a second signed floating-point input number to provide a sum and a difference of the first and second signed floating-point input numbers. The combined floating-point addition and subtraction circuitry comprises an adder and at least two subtractors.
Summary
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
When implementing any functionality (e.g. floating-point addition) in dedicated hardware, the size of the hardware is a consideration, particularly if the hardware is to be used in a device whose size is tightly constrained, e.g. in a mobile device. Therefore, when designing hardware for processing units, there is a trade-off to be made between: (i) power consumption, (ii) processing performance, and (iii) size (which may also be referred to as "semiconductor area" or "silicon area"). Improvements in one of these factors (e.g. reduced power consumption, increased processing performance or reduced silicon area) can be made but this may result in a worsening in one or both of the other factors (e.g. increased power consumption, reduced processing performance or increased silicon area).
Adder circuits and associated methods for processing a set of at least three floating-point numbers to be added together are described herein which can provide an improvement in one or more of these factors without necessarily resulting in a worsening of the other factor(s). The method comprises identifying, from among the at least three numbers, at least two numbers that have the same sign - that is, at least two numbers that are both positive or both negative. The identified at least two numbers are added together using one or more same-sign floating-point adders. A same-sign floating-point adder comprises circuitry configured to add together floating-point numbers having the same sign and does not include circuitry configured to add together numbers having different signs.
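The identification step described above can be sketched in Python for the three-number case. This is a minimal sketch under stated assumptions, not the described hardware: the names `is_negative`, `same_sign_add` and `partial_sum_of_three` are hypothetical, and ordinary float arithmetic stands in for the fixed-function same-sign adder. How the partial summation result is then combined with the remaining number is outside the scope of the sketch.

```python
import math


def is_negative(x):
    """True if the sign bit of x is set (also true for -0.0)."""
    return math.copysign(1.0, x) < 0


def same_sign_add(x, y):
    """Hypothetical stand-in for a same-sign adder.

    Adds the magnitudes and reattaches the shared sign. It rejects
    operands whose signs differ, mirroring an adder that has no
    circuitry for mixed-sign addition.
    """
    if is_negative(x) != is_negative(y):
        raise ValueError("operands must have the same sign")
    return math.copysign(abs(x) + abs(y), x)


def partial_sum_of_three(nums):
    """Find two same-sign numbers among three and add them.

    Returns (partial_sum, remaining_number).
    """
    if len(nums) != 3:
        raise ValueError("expected exactly three numbers")
    # With only two possible signs, at least one of these three
    # pairs must share a sign.
    for i, j, k in ((0, 1, 2), (0, 2, 1), (1, 2, 0)):
        if is_negative(nums[i]) == is_negative(nums[j]):
            return same_sign_add(nums[i], nums[j]), nums[k]
    raise AssertionError("unreachable: some pair must share a sign")
```

For the input `[1.0, -2.0, 3.0]`, the sketch pairs the two positive numbers and returns the partial sum 4.0 together with the remaining number -2.0.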
According to an aspect not presently claimed there is provided a machine-implemented method of processing an input set comprising at least three floating-point numbers to be summed, the input set including one or more positive numbers and one or more negative numbers, the method comprising: receiving the at least three floating-point numbers of the input set; identifying at least two numbers in the input set that have the same sign; and adding together the identified at least two numbers using one or more same-sign floating-point adders, to produce one or more partial summation results, wherein the one or more same-sign floating-point adders are implemented in fixed function circuitry configured to add together floating-point numbers having the same sign, and wherein the one or more same-sign floating-point adders do not include circuitry configured to add together numbers having different signs.
The present inventors have recognised two things. Firstly, it is easier to add together floating-point numbers if it is known in advance that those numbers have the same sign. Secondly, in any set of three numbers there must be at least two numbers having the same sign (or, more generally, in any se