CN-121996887-A - Arithmetic circuit, accelerator, chip and board card


Abstract

The present disclosure relates to an arithmetic circuit, an accelerator, a chip, and a board card. The arithmetic circuit may be implemented as a computing device included in a combined processing device, which may comprise one or more data processing devices. The combined processing device may also include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by the user. The combined processing device may further comprise a storage device connected to the computing device and to the other processing devices, for storing data of the computing device and the other processing devices. With this scheme, the repeated multiplications in a matrix multiplication can be reduced to counter-based unary accumulation, which saves dynamic power in the hardware.

Inventors

Request for anonymity
Request for anonymity
Request for anonymity
Request for anonymity
Request for anonymity

Assignees

中科寒武纪科技股份有限公司

Dates

Publication Date: 20260508
Application Date: 20241101

Claims (11)

1. An arithmetic circuit comprising: a data storage circuit storing a pre-calculation look-up table in which pre-calculation results relating to all possible values of the products of elements in the two matrix vectors of a matrix multiplication calculation are stored; a processing unit array including a plurality of processing units, adjacent processing units being electrically connected, wherein each processing unit includes a plurality of counters, and the operation result of each counter is configured to index a pre-calculation result in the pre-calculation look-up table; and a data conversion circuit electrically connected to each processing unit in the processing unit array and to the data storage circuit, so as to calculate a matrix multiplication result between the matrix vectors according to all operation results of the counters in each processing unit and the corresponding pre-calculation results.
2. The arithmetic circuit of claim 1, wherein the pre-calculation results in the pre-calculation look-up table are all possible values of the product of elements in the two matrix vectors of the matrix multiplication calculation, and the counters in each processing unit correspond to a one-hot encoding of the pre-calculation results, the arithmetic circuit being configured to perform the matrix multiplication calculation as follows: each processing unit in the processing unit array receives elements of the two input vectors to be subjected to the matrix multiplication calculation and, in each calculation period, activates the corresponding counter to perform an increment operation according to the received elements; and the data conversion circuit multiplies all operation results of the counters in each processing unit by the corresponding pre-calculation results to obtain the matrix multiplication result of the two input vectors.
3. The arithmetic circuit of claim 2, wherein the elements of both matrix vectors of the matrix multiplication calculation are of the INT4 type, the pre-calculation look-up table stores pre-calculation results of the products between the matrix vectors {-8, -7, ..., -1, 0, 1, ..., 7} and {-8, -7, ..., -1, 0, 1, ..., 7}, and each processing unit in the processing unit array contains 256 or 225 counters.
4. The arithmetic circuit of claim 1, wherein all possible values of the element product in the two matrix vectors of the matrix multiplication calculation are converted according to: [formula not reproduced in the source text] wherein x and y are elements of the two matrix vectors of the matrix multiplication calculation and are integers, and the pre-calculation result in the pre-calculation look-up table is: [formula not reproduced in the source text] wherein n is x+y or x-y.
5. The arithmetic circuit of claim 4, wherein each processing unit includes a plurality of positive counters corresponding to sums of elements in the two matrix vectors of the matrix multiplication and a plurality of negative counters corresponding to differences between elements in the two matrix vectors of the matrix multiplication, the positive and negative counters corresponding to a one-hot encoding of the pre-calculation results, the arithmetic circuit being configured to perform the matrix multiplication as follows: each processing unit in the processing unit array receives elements of the two input vectors to be subjected to the matrix multiplication calculation and, in each calculation period, activates the corresponding positive counter to perform an increment operation according to the sum of the received elements and activates the corresponding negative counter to perform an increment operation according to the difference of the received elements; and the data conversion circuit multiplies the difference between all operation results of the positive counters and all operation results of the negative counters in each processing unit by the corresponding pre-calculation results to obtain the matrix multiplication result of the two input vectors.
6. The arithmetic circuit of claim 5, wherein the two matrix vectors of the matrix multiplication calculation comprise low-bit matrix vectors.
7. The arithmetic circuit of claim 6, wherein the elements of both matrix vectors of the matrix multiplication calculation are of the INT4 type, n has a range of values [2,16], and each processing unit in the processing unit array includes 15 positive counters and 14 negative counters.
8. The arithmetic circuit according to any one of claims 1 to 7, wherein the counter in the processing unit comprises a ripple counter.
9. An accelerator, wherein a main computing unit of the accelerator employs the arithmetic circuit according to any one of claims 1 to 8.
10. A chip comprising the arithmetic circuit according to any one of claims 1 to 8.
11. A board card comprising the chip of claim 10.
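
The formula referred to in claim 4 is not reproduced in this text. One identity that is consistent with the counter counts in claim 7 is the classical quarter-square identity, x*y = floor((x+y)^2/4) - floor((x-y)^2/4), because floor(n^2/4) is zero for |n| < 2 and therefore needs no counter. The Python sketch below assumes that identity; the names `QUARTER_SQUARE` and `dot_quarter_square` are hypothetical and do not come from the patent, and indexing the counters by the absolute value of the sum or difference is likewise an assumption.

```python
# Hypothetical sketch of the positive/negative counter scheme of claims 4 to 7.
# ASSUMPTION: the missing formula is x*y = floor((x+y)^2/4) - floor((x-y)^2/4).

# Assumed pre-calculation table: floor(n^2 / 4) for n = |x + y| or |x - y|.
# Entries for n = 0 and n = 1 are zero, so they are left out (n in [2, 16]).
QUARTER_SQUARE = {n: (n * n) // 4 for n in range(2, 17)}


def dot_quarter_square(xs, ys):
    """Dot product of two INT4 vectors using only counter increments in the loop."""
    pos = {n: 0 for n in range(2, 17)}  # 15 positive counters, |x + y| in [2, 16]
    neg = {n: 0 for n in range(2, 16)}  # 14 negative counters, |x - y| in [2, 15]
    for x, y in zip(xs, ys):            # inputs are assumed to lie in [-8, 7]
        s, d = abs(x + y), abs(x - y)
        if s >= 2:
            pos[s] += 1                 # increment selected by the sum
        if d >= 2:
            neg[d] += 1                 # increment selected by the difference
    # Data-conversion step: weight (positive - negative) counts by the table.
    return sum((pos[n] - neg.get(n, 0)) * q for n, q in QUARTER_SQUARE.items())
```

Under these assumptions the per-element work is two counter increments, and the only multiplications happen once per counter in the final conversion step. For INT4 inputs this gives 15 + 14 = 29 counters per processing unit, matching the figures stated in claim 7.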

Description

Arithmetic circuit, accelerator, chip and board card

Technical Field

The present disclosure relates generally to the field of circuits. More particularly, the present disclosure relates to an arithmetic circuit, an accelerator, a chip, and a board card.

Background

Deep neural networks (DNNs) have become very important in various fields, including computer vision, natural language processing (NLP), and speech recognition. Current research shows that the scale of a model is closely related to its capability. This trend is visible in the evolution from early recurrent neural networks (15 million parameters) to larger models such as the Transformer (65 million parameters) and BERT (up to 340 million parameters), and on to modern large models such as LLaMA (up to 70 billion parameters) and GPT-4. The parameter count of DNNs has gradually grown from millions to billions. However, this scale also places great demands on the computational resources required for inference on various devices, which has driven the rise of quantization techniques. Large models with billions of parameters exceed the memory capacity of even the most powerful consumer hardware, which has made their low-bit versions more popular.

In the prior art, a processor that supports matrix multiplication (such as a systolic array built from multiply-accumulate units) performs a large number of repeated multiplications and high-bit-width accumulations during matrix multiplication. As the data bit width decreases, the efficiency gains diminish, and once the data precision drops to a sufficiently low bit width (such as INT4), the actual improvement in computing efficiency is no longer significant.
Disclosure of Invention

In view of the technical problems described in the Background section, the present disclosure proposes an arithmetic circuit. With the arithmetic circuit of the present disclosure, the repeated multiplications in a matrix multiplication can be converted into accumulations, which effectively reduces the power consumption of the hardware.

In a first aspect, the present disclosure provides an arithmetic circuit comprising: a data storage circuit storing a pre-calculation look-up table, the table storing pre-calculation results relating to all possible values of the products of elements in the two matrix vectors of a matrix multiplication calculation; a processing unit array comprising a plurality of processing units, adjacent processing units being electrically connected, wherein each processing unit comprises a plurality of counters, the operation result of each counter being configured to index a pre-calculation result in the pre-calculation look-up table; and a data conversion circuit electrically connected to each processing unit in the processing unit array and to the data storage circuit, for calculating a matrix multiplication result between the matrix vectors based on all operation results of the counters in each processing unit and the corresponding pre-calculation results.

In a second aspect, the present disclosure provides an accelerator whose main computing unit employs the arithmetic circuit according to the first aspect.

In a third aspect, the present disclosure provides a chip comprising the arithmetic circuit according to the first aspect.

In a fourth aspect, the present disclosure provides a board card comprising the chip according to the third aspect.

With the scheme provided by the above aspects, both the computational complexity of matrix multiplication and the hardware power consumption can be effectively reduced.
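
The arithmetic circuit of the first aspect can be illustrated in software. The following is a minimal Python sketch, assuming INT4 operands and one counter per entry of the look-up table; the names `LUT` and `dot_with_counters` are hypothetical, and the sketch models only the arithmetic, not the processing unit array or its interconnect.

```python
from itertools import product

# Hypothetical software model; names are illustrative, not from the patent.
INT4_VALUES = range(-8, 8)  # assumed INT4 range: -8 .. 7

# Pre-calculation look-up table: one product per possible (x, y) pair.
LUT = {(x, y): x * y for x, y in product(INT4_VALUES, repeat=2)}


def dot_with_counters(xs, ys):
    """Dot product of two INT4 vectors with no multiplication inside the loop."""
    counters = dict.fromkeys(LUT, 0)  # one counter per table entry (256 here)
    for x, y in zip(xs, ys):
        counters[(x, y)] += 1         # the element pair selects exactly one counter
    # Data-conversion step: weight each count by its pre-calculated product.
    return sum(count * LUT[pair] for pair, count in counters.items())
```

In this model the inner loop only increments counters, and multiplication is deferred to a single pass over the counters at the end. With the full INT4 range there are 16 × 16 = 256 counters; the alternative figure of 225 = 15 × 15 in claim 3 would be consistent with dropping one value from each operand's range, although the text here does not say which.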
Specifically, the pre-calculation results relating to all possible values of the element products involved in the matrix multiplication can be stored in a pre-calculation look-up table, and, by means of the counters contained in the processing units of the processing unit array together with the data conversion circuit, the repeated multiplications in the matrix multiplication are reduced to counter-based unary accumulation, so that the dynamic power consumption of the hardware can be reduced. Further, in some embodiments, a formula [not reproduced in the source text] is used to convert all possible values of the element products in the two matrix vectors of the matrix multiplication calculation so as to reduce the number of counters in a processing unit, so that repeated arithmetic can be saved, accumulation intensity is reduced, and static power consumption overhead is effectively reduced.

Drawings

The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of t