EP-3842926-B1 - RANK-BASED DOT PRODUCT CIRCUITRY


Inventors

LANGHAMMER, MARTIN

Dates

Publication Date: 20260506
Application Date: 20200827

Claims (12)

An integrated circuit, comprising: a partial product generation circuit configured to receive input operands and to generate corresponding partial products; a first compressor circuit (602-1) configured to receive a first group of the partial products all having a first rank and configured to output first vectors; a second compressor circuit (602-2) configured to receive a second group of the partial products all having a second rank that is different than the first rank and configured to output second vectors; first shifting circuits (606) configured to shift the second vectors relative to the first vectors; a third compressor circuit (602-3) configured to receive a third group of the partial products all having a third rank that is different than the first and second ranks and configured to output third vectors; a fourth compressor circuit (602-4) configured to receive a fourth group of the partial products all having a fourth rank that is different than the first, second, and third ranks and configured to output fourth vectors; second shifting circuits (607) configured to shift the fourth vectors relative to the third vectors; a fifth compressor (608-1) configured to compress the first vectors and the second shifted vectors and configured to output fifth vectors; a sixth compressor (608-2) configured to compress the third vectors and the fourth shifted vectors and configured to output sixth vectors; and third shifting circuits (610) configured to shift the sixth vectors relative to the fifth vectors, wherein the third shifting circuits (610) are selectively bypassable to support a plurality of input precisions.
The integrated circuit of claim 1, wherein the first group of the partial products are not shifted relative to each other.
The integrated circuit of claim 2, wherein the second group of the partial products are not shifted relative to each other.
The integrated circuit of claim 1, further comprising: a seventh compressor configured to compress the fifth vectors and the sixth vectors to output corresponding seventh vectors; and a carry-propagate adder configured to receive the seventh vectors and to output a corresponding dot product value.
The integrated circuit of any one of claims 1-4, further comprising: an aggregation circuit configured to aggregate one's to two's complement conversion bits associated with the partial products.
The integrated circuit of claim 5, wherein the aggregation circuit is further configured to aggregate the conversion bits into a single vector.
The integrated circuit of claim 5, wherein the one's to two's complement conversion bit aggregation circuit is further configured to aggregate the conversion bits into at least two different vectors.
An integrated circuit, comprising: partial product generation circuitry configured to receive input signals and to generate a plurality of partial products; and a compressor tree divided into a plurality of compressor groups organized based on the rank of the partial products received at each of the plurality of compressor groups, and wherein the partial products in each of the plurality of compressor groups have identical ranks; a one's to two's complement conversion bit aggregation circuit configured to generate at least one vector that is injected at a single point in the compressor tree; wherein the compressor tree comprises a set of shifting circuits (610) that is switched into use when operating the compressor tree to support a first precision mode and that is switched out of use when operating the compressor tree to support a second precision mode different than the first precision mode.
The integrated circuit of claim 8, further comprising: a first one's to two's complement conversion bit aggregation circuit; a second one's to two's complement conversion bit aggregation circuit; and a multiplexer configured to select only the first one's to two's complement conversion bit aggregation circuit during the first precision mode and to select only the second one's to two's complement conversion bit aggregation circuit during the second precision mode.
The integrated circuit of claim 8, further comprising: dot product circuitry (600) that is decomposed into a first dot group and a second dot group to reduce compressor word growth in the dot product circuitry, wherein the first dot group has a first number of multiplies, and wherein the second dot group has a second number of multiplies that is different than the first number of multiplies.
The integrated circuit of claim 10, further comprising: a first aggregation circuit configured to aggregate conversion bits associated with the first dot group; a second aggregation circuit configured to aggregate conversion bits associated with the second dot group; and a compressor configured to compress values received from the first and second aggregation circuit.
The integrated circuit of claim 10 or 11, wherein the dot product circuitry is optionally further decomposed into a third dot group having a third number of multiplies that is different than the first and second numbers of multiplies.
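
By way of illustration only, the rank-based arrangement recited in claim 1 may be modeled arithmetically as in the following Python sketch. The sketch is not taken from the patent and rests on simplifying assumptions: the operands are unsigned, the second operand is split into plain 2-bit digits rather than radix-4 Booth digits, each compressor is modeled as an ordinary integer sum, and the shift amounts (2 bits and 4 bits) follow from that 2-bit digit width. The names radix4_digits, rank_based_dot, and bypass_final_shift are hypothetical. How operands are mapped onto the circuit when the third shifting circuits are bypassed is not stated in the claims, so the bypass branch shows only the arithmetic effect of omitting the shift in this model.

```python
def radix4_digits(b, num_ranks=4):
    """Split an unsigned operand into 2-bit digits; digit r is treated as rank r.

    Assumption: plain base-4 digits stand in for Booth-encoded digits.
    """
    return [(b >> (2 * r)) & 0b11 for r in range(num_ranks)]


def rank_based_dot(a_vec, b_vec, bypass_final_shift=False):
    """Model a dot product whose partial products are grouped by rank.

    a_vec: unsigned multiplicands; b_vec: unsigned 8-bit multipliers.
    """
    digits = [radix4_digits(b) for b in b_vec]

    # One "compressor" per rank: partial products of equal rank are summed
    # without any shift relative to each other.
    rank_sums = [
        sum(a * d[r] for a, d in zip(a_vec, digits)) for r in range(4)
    ]

    # Shift the second-rank sum relative to the first, then combine
    # (models the first shifting circuits and the fifth compressor).
    fifth = rank_sums[0] + (rank_sums[1] << 2)
    # Same for the third and fourth ranks (second shifters, sixth compressor).
    sixth = rank_sums[2] + (rank_sums[3] << 2)

    if bypass_final_shift:
        # Final shift bypassed: the two halves are added at equal weight.
        return fifth + sixth
    # Final shift applied: the upper two ranks weigh 2**4 more than the lower two.
    return fifth + (sixth << 4)


if __name__ == "__main__":
    a = [3, 5, 7, 11]
    b = [200, 17, 255, 64]
    # The rank-grouped result matches the element-by-element dot product.
    assert rank_based_dot(a, b) == sum(x * y for x, y in zip(a, b))
    print(rank_based_dot(a, b))
```

In this model, each shift is applied once per rank group rather than once per partial product, which is the property the rank-based grouping relies on. With bypass_final_shift=True the function returns the sum of two narrower dot products, one over the low 4 bits of each multiplier and one over the high 4 bits; this is a property of the sketch and should not be read as the patent's definition of its second precision mode.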

Description

Background

This invention relates generally to integrated circuits and, in particular, to integrated circuits operable to support dot product arithmetic.

Recent developments in artificial intelligence such as advancements in machine learning and deep learning involve training and inference, which have necessitated a much higher density of dot product computations with multiple precisions.

Conventional dot product circuitry includes different multiplier groups, each of which is configured to compute a different product. For example, a 4-element dot product circuit for computing the dot product of a first vector [a3, a2, a1, a0] and a second vector [b3, b2, b1, b0] will include a first multiplier group for computing a0*b0, a second multiplier group for computing a1*b1, a third multiplier group for computing a2*b2, and a fourth multiplier group for computing a3*b3. Forming dot product circuits using this conventional structure may require a significant amount of circuit area, which is exacerbated as the precision of each element ai or bi increases beyond 4 bits, beyond 8 bits, or beyond 10 bits.

This section is intended to introduce the reader to various aspects of art that may be related to various aspects of the present disclosure, which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.

The article "Physically Aware Affinity-Driven Multiplier Implementation", by Maltabashi Or et al., IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, IEEE Service Center, Piscataway, NJ, US, vol. 39, no. 10, 25 July 2019, pages 2886-2897, XP011811064, proposes a physical-aware approach to FDP implementation based on the affinity between the logic gates that make up the gate-level structure. The proposed clustered DP (CDP) algorithm enables the place and route tools to cluster gates with high affinity, leading to higher placement utilization and lower routing congestion. DP calculations with up to 78 multipliers were implemented with a 65-nm CMOS standard cell library, providing power reduction of up to 63%, up to 60% lower area, and performance improvements as high as 2.5x, as compared to similar implementations based on commercial macros based on post-layout results.

The article "A Fast Inner Product Processor Based on Equal Alignments", by S. P. Smith et al., Journal of Parallel and Distributed Computing, vol. 2, no. 4, 1 November 1985, pages 376-390, XP000084818, describes the design of a fast inner product processor, with appreciably reduced latency and cost. The inner product processor is implemented with a tree of carry-propagate or carry-save adders. The partial products, to be summed in producing an inner product, are reordered according to their "minimum alignments." This reordering brings approximately a 20% saving in hardware, including adders and data paths.

The invention is defined in independent claim 1. Embodiments of the invention are described in the dependent claims.

Brief Description of the Drawings

FIG. 1 is a diagram of an illustrative integrated circuit that includes digital signal processing (DSP) blocks in accordance with an embodiment.
FIG. 2 is a diagram of an illustrative 3-element dot product circuit.
FIG. 3 is a diagram illustrating multiplier products generated using a radix-4 Booth's encoding in accordance with an embodiment.
FIG. 4 is a diagram showing one implementation of a 3-element dot product circuit that is susceptible to substantial word growth in compressor outputs.
FIG. 5 is a diagram illustrating additional circuitry that is needed for implementing sign extension and one's complement to two's complement conversion in accordance with an embodiment.
FIG. 6 is a diagram of illustrative rank-based dot product circuitry in accordance with an embodiment.
FIG. 7A is a diagram illustrating how the additional one for the one's to two's complement conversion for each partial product may first be aggregated as a single vector prior to compression in accordance with an embodiment.
FIG. 7B is a diagram illustrating how the aggregated vector may be injected into the compressor tree in accordance with an embodiment.
FIG. 8A is a diagram illustrating the aggregation of the additional ones when there are six partial products in accordance with an embodiment.
FIG. 8B is a flow chart of illustrative steps for inserting multiple aggregated vectors at different points in the compressor tree in accordance with an embodiment.
FIG. 8C is a flow chart of illustrative steps for inserting a combined aggregated vector at a single point in the compressor tree in accordance with an embodiment.
FIG. 9 is a diagram illustrating the aggregation of the additional ones when there are 12 partial products in accordance