CN-121980123-A - Computing circuit, vector operation method, matrix processing circuit and chip

Abstract

Embodiments of the invention disclose a computing circuit, a vector operation method, a matrix processing circuit, and a chip. In the computing circuit, a storage area stores a data set to be processed in a target data format. A decoding circuit decodes the data items of a first vector and a second vector in the data set to be processed, determines the sign, exponent, and mantissa of each data item, pre-shifts the mantissa of each data item based on a preset exponent that is determined according to the exponent corresponding to the target data format, and stores the pre-shifted data items in a buffer area. An operation circuit then performs multiply-add calculation on the pre-shifted data items of the first vector and the second vector to determine the dot product of the two vectors. Because each data item is pre-shifted, fewer alignment operations are needed during the dot product, which reduces the consumption of the hardware operation unit while still implementing the vector dot product operation.
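
The pre-shift idea in the abstract can be illustrated with a minimal software sketch. It is not the circuit of the embodiments. It assumes a 4-bit format with 1 sign bit, 2 exponent bits, and 1 mantissa bit with bias 1 (the common E2M1 layout), and it assumes the preset exponent is the smallest exponent of that format. The names `decode`, `pre_shift`, `dot`, `PRESET_EXPONENT`, and `FRACTION_BITS` are hypothetical.

```python
# Assumption: 4-bit codes laid out as [sign | 2 exponent bits | 1 mantissa bit],
# exponent bias 1 (E2M1). This layout is not stated in the source text.
PRESET_EXPONENT = 0  # hypothetical choice: smallest unbiased exponent of the format
FRACTION_BITS = 1    # one mantissa bit after the binary point


def decode(code):
    """Split a 4-bit code into (sign, unbiased exponent, integer significand)."""
    sign = (code >> 3) & 1
    exp_field = (code >> 1) & 0b11
    man_field = code & 1
    if exp_field == 0:
        # Subnormal: 0.m * 2^0, no hidden leading one.
        return sign, 0, man_field
    # Normal: 1.m * 2^(exp_field - 1), hidden one made explicit.
    return sign, exp_field - 1, 2 | man_field


def pre_shift(code):
    """Return the signed significand shifted onto the shared preset exponent."""
    sign, exponent, significand = decode(code)
    shifted = significand << (exponent - PRESET_EXPONENT)
    return -shifted if sign else shifted


def dot(codes_a, codes_b):
    """Integer multiply-add on pre-shifted items, with one final scaling step."""
    acc = sum(pre_shift(a) * pre_shift(b) for a, b in zip(codes_a, codes_b))
    # Each operand carries the scale 2^(PRESET_EXPONENT - FRACTION_BITS).
    return acc * 2.0 ** (2 * (PRESET_EXPONENT - FRACTION_BITS))


# Codes for [1.5, -2.0, 6.0] and [0.5, 4.0, 3.0] under the assumed layout.
result = dot([0b0011, 0b1100, 0b0111], [0b0001, 0b0110, 0b0101])
```

Because every pre-shifted item shares one exponent, the products can be summed as plain integers and no per-product alignment shift is needed inside the accumulation loop.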

Inventors

DING QIAN
ZHOU JINYUAN
LIU LU
ZOU YUNXIAO

Assignees

平头哥(上海)半导体技术有限公司

Dates

Publication Date: 2026-05-05
Application Date: 2025-05-09

Claims (12)

1. A computing circuit, the computing circuit comprising: a storage area for storing a data set to be processed in a target data format, the data set to be processed comprising a first vector and a second vector of a target data dimension; a decoding circuit for decoding the data items in the first vector and the second vector, determining the sign, exponent and mantissa of each data item, pre-shifting the mantissa of each data item based on a preset exponent, and storing the pre-shifted data items in a buffer area, wherein the preset exponent is determined according to the exponent corresponding to the target data format; and an operation circuit for performing multiply-add calculation on the data items in the pre-shifted first vector and second vector to determine the dot product of the first vector and the second vector.
2. The computing circuit of claim 1, wherein the computing circuit further comprises: a bus for broadcasting a first data packet and a second data packet to the operation circuit, the first data packet comprising data items in a first dimension range in the first vector and data items in the first dimension range in the second vector, the second data packet comprising data items in a second dimension range in the first vector and data items in the second dimension range in the second vector.
3. The computing circuit of claim 2, wherein the operation circuit comprises: a first operation unit for performing mantissa multiplication on low-dimensional data items in the first data packet to determine a first mantissa product corresponding to each low-dimensional data item; a second operation unit for performing multiply-accumulate calculation on the data items in the second data packet to determine a third accumulation result corresponding to the second data packet; an exponent selector for outputting a target exponent according to a high-dimensional scaling factor, a low-dimensional scaling factor and an accumulated offset value, wherein the high-dimensional scaling factor is a scaling parameter for adjusting a mantissa dot product in a high-dimensional space, the low-dimensional scaling factor is a scaling parameter for adjusting the mantissa dot product in a low-dimensional space, and the accumulated offset value is a scaling parameter of the sum of mantissa dot products corresponding to the number of iteration layers in accumulated iterative calculation; a third operation unit for summing the second mantissa products to determine a second accumulation result, and for aligning and summing the second accumulation result and the third accumulation result according to the target exponent to determine an intermediate accumulation result; and a fourth operation unit for summing the first mantissa products to determine a first accumulation result, and for summing the first accumulation result and the intermediate accumulation result according to the target exponent to determine the dot product corresponding to the first vector and the second vector.
4. The computing circuit of claim 3, wherein the first operation unit comprises a plurality of parallel multipliers, each multiplier being configured to read, from the first data packet, a preset number of data items of the first vector and the data items of the second vector in the corresponding dimensions, and to multiply the mantissas of the data items read for each dimension, so as to determine a first mantissa product corresponding to each low-dimensional data item and a second mantissa product corresponding to each high-dimensional data item in the first data packet, wherein the preset number is determined according to a data bit width of the multiplier and a data bit width corresponding to the target data format.
5. The computing circuit of claim 3, wherein the third operation unit comprises: a sign compensator for adjusting the sign of each second mantissa product according to the sign of the data items corresponding to that second mantissa product; a multiplexer for accumulating the sign-adjusted second mantissa products to determine a second accumulation result; a first shifter for shifting the second accumulation result and the third accumulation result according to the target exponent, such that the exponent of the second accumulation result and the exponent of the third accumulation result are aligned with the target exponent; and a first adder for summing the aligned second accumulation result and the aligned third accumulation result to determine an intermediate accumulation result.
6. The computing circuit of claim 3, wherein the fourth operation unit comprises: a second shifter for shifting each of the first mantissa products according to the target exponent so that the first mantissa products are aligned; and a second adder for accumulating the aligned first mantissa products to determine a first accumulation result, and for accumulating the first accumulation result and the intermediate accumulation result to determine the dot product corresponding to the first vector and the second vector.
7. The computing circuit of claim 1, wherein the computing circuit further comprises: a normalization logic circuit for converting the data format of the dot product so as to output the dot product in the target data format.
8. The computing circuit of claim 3, wherein the target data format employs a 4-bit floating point number.
9. A method of vector operation, the method comprising: reading a data set to be processed in a target data format, wherein the data set to be processed comprises a first vector and a second vector of a target data dimension; decoding the data items in the first vector and the second vector, and determining the sign, exponent and mantissa of each data item; pre-shifting the mantissa of each data item based on a preset exponent, wherein the preset exponent is determined according to the exponent corresponding to the target data format; and performing multiply-add calculation on the data items in the pre-shifted first vector and second vector to determine the dot product of the first vector and the second vector.
10. A matrix processing circuit, the matrix processing circuit comprising: the computing circuit of at least one of claims 1-8, configured to read row vectors in a first matrix and column vectors in a second matrix according to a matrix multiplication computation rule, and to determine the dot products of the read row vectors and column vectors.
11. A chip having disposed thereon a computing circuit as claimed in any one of claims 1 to 8 or a matrix processing circuit as claimed in claim 10.
12. An electronic device, characterized in that it comprises a chip as claimed in claim 11.
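
The matrix multiplication rule referred to in claim 10 pairs each row vector of the first matrix with each column vector of the second matrix and takes their dot product. The following is a minimal software sketch of that rule only, using ordinary Python numbers rather than the target data format. The function names `dot` and `matmul` and the `dot_fn` parameter are hypothetical and stand in for whatever dot product unit is used.

```python
def dot(u, v):
    """Plain multiply-add dot product; a stand-in for a hardware dot product unit."""
    return sum(x * y for x, y in zip(u, v))


def matmul(first, second, dot_fn=dot):
    """Multiply two matrices given as lists of rows.

    Each output element is the dot product of one row of `first`
    and one column of `second`.
    """
    if any(len(row) != len(second) for row in first):
        raise ValueError("inner dimensions do not match")
    columns = list(zip(*second))  # transpose to get the column vectors
    return [[dot_fn(row, col) for col in columns] for row in first]


product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Passing the dot product as a parameter keeps the row and column traversal separate from the arithmetic, which mirrors how the claim treats the computing circuit as a reusable building block.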

Description

Computing circuit, vector operation method, matrix processing circuit and chip

Technical Field

The present invention relates to the field of computer technology, and more particularly to a computing circuit, a vector operation method, a matrix processing circuit, and a chip.

Background

When a conventional tensor operation unit (a Tensor Core) performs a floating-point dot product (DP), the operation proceeds as follows. The mantissas of the data in the vectors are multiplied pairwise to obtain k products. The mantissa products of the input row and column vectors are then aligned to a uniform exponent sum, where the shift amount applied to each mantissa product is the difference between its exponent sum and the maximum exponent sum. Finally, the aligned mantissa products are accumulated to obtain a signed sum before normalization. If micro-scaling is to be supported, the intermediate results must also be aligned by scale before the result is output, and the output must be normalized to the floating-point output format specified by the instruction. However, with the development of large language models (LLMs) and the steadily increasing compute demands placed on hardware operation units, reducing the consumption of hardware resources while still implementing the vector dot product operation is of practical importance.

Disclosure of Invention

Accordingly, embodiments of the present invention aim to provide a computing circuit, a vector operation method, a matrix processing circuit, and a chip that reduce hardware resource consumption while still implementing the vector dot product operation.
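
For comparison, the conventional flow described in the Background (multiply mantissas, align every product to the maximum exponent sum, then accumulate) can be sketched as follows. This is an illustrative model under stated assumptions, not a description of any particular Tensor Core. Each data item is assumed to be given as a `(sign, exponent, mantissa)` tuple with an integer mantissa, so its value is (-1)^sign * mantissa * 2^exponent. The function name `baseline_dot` and the `guard_bits` parameter are hypothetical, and the input vectors are assumed to be non-empty.

```python
def baseline_dot(vec_a, vec_b, guard_bits=8):
    """Dot product with per-product alignment to the maximum exponent sum."""
    # Step 1: multiply mantissas, add exponents, combine signs.
    products = [
        (sa ^ sb, ea + eb, ma * mb)
        for (sa, ea, ma), (sb, eb, mb) in zip(vec_a, vec_b)
    ]
    # Step 2: find the maximum exponent sum that every product is aligned to.
    max_exp = max(exp for _, exp, _ in products)
    acc = 0
    for sign, exp, mantissa in products:
        # Shift amount = maximum exponent sum minus this product's exponent sum.
        shift = max_exp - exp
        # Guard bits keep some of the low bits that the right shift would drop.
        aligned = (mantissa << guard_bits) >> shift
        # Step 3: signed accumulation of the aligned products.
        acc += -aligned if sign else aligned
    return acc * 2.0 ** (max_exp - guard_bits)


# Values 3 and -4 dotted with 0.5 and 8.
value = baseline_dot([(0, 0, 3), (1, 1, 2)], [(0, -1, 1), (0, 2, 2)])
```

The sketch makes the cost visible: every product needs its own exponent comparison and shift before it can be added, which is the per-product alignment work that the embodiments aim to reduce.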
In a first aspect, embodiments of the present invention provide a computing circuit, including: a storage area for storing a data set to be processed in a target data format, the data set to be processed comprising a first vector and a second vector of a target data dimension; a decoding circuit for decoding the data items in the first vector and the second vector, determining the sign, exponent and mantissa of each data item, pre-shifting the mantissa of each data item based on a preset exponent, and storing the pre-shifted data items in a buffer area, wherein the preset exponent is determined according to the exponent corresponding to the target data format; and an operation circuit for performing multiply-add calculation on the data items in the pre-shifted first vector and second vector to determine the dot product of the first vector and the second vector.
Further, the computing circuit further includes: a bus for broadcasting a first data packet and a second data packet to the operation circuit, the first data packet comprising data items in a first dimension range in the first vector and data items in the first dimension range in the second vector, the second data packet comprising data items in a second dimension range in the first vector and data items in the second dimension range in the second vector.
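
The two data packets described above each hold the same dimension range from both vectors, so the full dot product is the sum of the partial dot products of the packets. A minimal sketch of that split follows. It assumes the first dimension range is `[0, boundary)` and the second is `[boundary, n)`, which the text does not specify, and the names `split_into_packets`, `partial_dot`, `packet_dot`, and `boundary` are hypothetical.

```python
def split_into_packets(vec_a, vec_b, boundary):
    """Split two equal-length vectors into two packets by dimension range."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same dimension")
    first_packet = (vec_a[:boundary], vec_b[:boundary])    # first dimension range
    second_packet = (vec_a[boundary:], vec_b[boundary:])   # second dimension range
    return first_packet, second_packet


def partial_dot(packet):
    """Multiply-add over the data items carried by one packet."""
    items_a, items_b = packet
    return sum(x * y for x, y in zip(items_a, items_b))


def packet_dot(vec_a, vec_b, boundary):
    """Full dot product as the sum of the two packets' partial results."""
    first_packet, second_packet = split_into_packets(vec_a, vec_b, boundary)
    return partial_dot(first_packet) + partial_dot(second_packet)


total = packet_dot([1, 2, 3, 4], [5, 6, 7, 8], boundary=2)
```

Keeping matching dimensions of both vectors in the same packet means each packet can be processed independently before the partial results are combined.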
Further, the operation circuit includes: a first operation unit for performing mantissa multiplication on low-dimensional data items in the first data packet to determine a first mantissa product corresponding to each low-dimensional data item; a second operation unit for performing multiply-accumulate calculation on the data items in the second data packet to determine a third accumulation result corresponding to the second data packet; an exponent selector for outputting a target exponent according to a high-dimensional scaling factor, a low-dimensional scaling factor and an accumulated offset value, wherein the high-dimensional scaling factor is a scaling parameter for adjusting a mantissa dot product in a high-dimensional space, the low-dimensional scaling factor is a scaling parameter for adjusting the mantissa dot product in a low-dimensional space, and the accumulated offset value is a scaling parameter of the sum of mantissa dot products corresponding to the number of iteration layers in accumulated iterative calculation; a third operation unit for summing the second mantissa products to determine a second accumulation result, and for aligning and summing the second accumulation result and the third accumulation result according to the target exponent to determine an intermediate accumulation result; and a fourth operation unit for summing the first mantissa products to determine a first accumulation result, and for summing the first accumulation result and the intermediate accumulation result according to the target exponent to determine the dot product corresponding to the first vector and the second vector. Further, the first operation unit includes a plurality of parallel multipliers, each multiplier i