CN-121722356-B - Multiply-accumulate operation method and device, neural network processing device and electronic equipment

Abstract

The application discloses a multiply-accumulate operation method and device, a neural network processing device and an electronic device. In the multiply-accumulate operation method, the original weight data and the input activation data are converted, according to their respective data types, into signed weight data and signed activation data, and Booth encoding is performed on the signed activation data to generate encoded activation values. In each selection shift module, one of the signed weight data is selected as target data according to the encoded activation value and shifted to obtain the multiplication result of the signed weight data and the encoded activation value. The multiplication results of the multiple rows in the same column are then added to obtain an intra-column row accumulation result, shift accumulation is performed on all the intra-column row accumulation results to generate an initial multiply-accumulate result, and the initial multiply-accumulate result is compensated using a compensation code. The application can improve throughput, optimize power consumption and computing power, and improve the overall energy efficiency ratio and efficiency of multiply-accumulate operations.

Inventors

YANG XIAOFENG
LIU HONGJIE

Assignees

深圳市九天睿芯科技有限公司

Dates

Publication Date: 20260508
Application Date: 20260226

Claims (18)

1. A multiply-accumulate method, comprising: obtaining corresponding signed weight data according to the data type of the original weight data, wherein the signed weight data comprises positive weight data and negative weight data, and broadcasting the signed weight data to the selection shift modules in a storage and calculation array, wherein the storage and calculation array comprises m rows and n columns of selection shift modules; obtaining corresponding signed activation data according to the data type of the input activation data, performing Booth encoding on the signed activation data to generate a plurality of encoded activation values, and broadcasting the encoded activation values to the selection shift modules in the storage and calculation array; selecting, by each selection shift module, one of the received signed weight data as target data according to the type of the encoded activation value received by that selection shift module, and performing a shift operation on the target data to obtain a multiplication result of the signed weight data and the encoded activation value; for the selection shift modules in the storage and calculation array, adding the multiplication results of the plurality of rows in the same column to obtain an intra-column row accumulation result; performing shift accumulation on all the intra-column row accumulation results to generate an initial multiply-accumulate result; and determining a compensation code according to the data type of the original weight data, the data type of the input activation data and the encoded activation values, and compensating the initial multiply-accumulate result with the compensation code to obtain a final multiply-accumulate result.
2. The multiply-accumulate method of claim 1, wherein the obtaining the corresponding signed weight data according to the data type of the original weight data comprises: If the data type of the original weight data is an unsigned number, subtracting a first offset from the original weight data to convert the unsigned original weight data into signed weight data comprising positive and negative weight data; And if the data type of the original weight data is signed numbers, taking the original weight data as signed weight data.
3. The multiply-accumulate method of claim 1, wherein the obtaining the corresponding signed activation data according to the data type of the input activation data, and performing booth encoding on the signed activation data, generating a plurality of encoded activation values, comprises: If the data type of the input activation data is an unsigned number, subtracting a second offset from the input activation data to obtain signed activation data; If the data type of the input activation data is signed number, the input activation data is used as signed activation data; the signed activation data is booth encoded to generate a number of encoded activation values including a negative number and a non-negative number.
4. The multiply-accumulate method of claim 3, wherein the Booth encoding is radix-4 Booth encoding, and performing Booth encoding on the signed activation data to generate a plurality of encoded activation values including negative numbers and non-negative numbers comprises: S1, appending a zero after the least significant bit of the signed activation data to form extension data; S2, starting from the least significant bit of the extension data, selecting Z consecutive bits leftwards as a first coding window; S3, taking the most significant bit of the first coding window as the least significant bit of the next coding window; S4, continuing to select Z-1 consecutive bits leftwards in the extension data to form, together with the least significant bit of the next coding window, a new coding window; S5, repeating steps S3 and S4 until all bits of the extension data have been selected; S6, mapping each coding window to a corresponding encoded activation value according to a preset Booth encoding rule, thereby obtaining a plurality of encoded activation values including negative numbers and non-negative numbers.
5. The multiply-accumulate method of claim 4, wherein the selecting shift module selects one of the received signed weight data as target data according to the type of the encoded activation value received by the selecting shift module, comprising: When the code activation value received by the selection shift module is negative, the selection shift module selects negative weight data as target data; When the code activation value received by the selection shift module is a non-negative number, the selection shift module selects positive weight data as target data.
6. The multiply-accumulate method of claim 1, wherein the adding the multiplied results in a plurality of rows in a same column to obtain an intra-column row accumulate result comprises: inputting the multiplication results of the same column to a first-stage adder for accumulation, and executing inversion operation on the highest bit of the accumulation result to obtain an in-column intermediate result; and accumulating all the in-column intermediate results to obtain in-column row accumulation results.
7. The multiply-accumulate method of any one of claims 1-6, wherein the determining a compensation code according to the data type of the original weight data, the data type of the input activation data, and the encoded activation values comprises: determining a first compensation component based on the data type of the original weight data; determining a second compensation component based on the data type of the input activation data; determining a third compensation component based on the negative numbers among the encoded activation values; determining a fourth compensation component based on the negation operation performed on the most significant bit of the accumulation result output by the first-stage adder; and adding the first compensation component, the second compensation component, the third compensation component and the fourth compensation component to obtain the compensation code.
8. The method of claim 7, wherein if the data type of the original weight data is an unsigned number, the first compensation component is the product of the sum of all the input activation data and a first offset; if the data type of the input activation data is an unsigned number, the second compensation component is the product of the sum of all the original weight data and a second offset; if the data type of the input activation data is a signed number, the second compensation component is zero; the third compensation component is the sum of the compensation values corresponding to all negative encoded activation values in each calculation period; and the fourth compensation component is the inverse of the sum of the fixed offsets introduced by performing the negation operation on the most significant bit of the accumulation result.
9. The multiply-accumulate method of claim 2 or 8, wherein the first offset is 2^(M-1), where M is the bit width of the original weight data.
10. The multiply-accumulate method of claim 3 or 8, wherein the second offset is 2^(N-1), where N is the bit width of the input activation data.
11. A multiply-accumulate operation apparatus, comprising: a weight data encoding module for obtaining corresponding signed weight data according to the data type of the original weight data, wherein the signed weight data comprises positive weight data and negative weight data, and for broadcasting the signed weight data to the selection shift modules in a memory array; an activation data encoding module for obtaining corresponding signed activation data according to the data type of the input activation data, performing Booth encoding on the signed activation data to generate a plurality of encoded activation values, and broadcasting the encoded activation values to the selection shift modules in the memory array; the memory array and a plurality of addition processing modules, wherein the memory array comprises m rows and n columns of selection shift modules, and each selection shift module is used for selecting one of the received signed weight data as target data according to the type of the received encoded activation value and performing a shift operation on the target data to obtain a multiplication result of the signed weight data and the encoded activation value; a shift accumulation module for performing shift accumulation on all the intra-column row accumulation results to generate an initial multiply-accumulate result; and a compensation calculation module for determining a compensation code according to the data type of the original weight data, the data type of the input activation data and the encoded activation values, and compensating the initial multiply-accumulate result with the compensation code to obtain a final multiply-accumulate result.
12. The multiply-accumulate operation apparatus of claim 11, further comprising: an activation data caching module for caching the signed activation data obtained through processing by the activation data encoding module; and/or a weight data caching module for caching the signed weight data obtained through processing by the weight data encoding module.
13. The multiply-accumulate apparatus of claim 12, wherein the memory array is a memory-based array.
14. The multiply-accumulate operation apparatus of claim 11, wherein the weight data encoding module is configured to: subtract a first offset from the original weight data when the data type of the original weight data is an unsigned number, so that the unsigned original weight data is converted into signed weight data comprising positive weight data and negative weight data; use the original weight data as the signed weight data when the data type of the original weight data is a signed number; and broadcast the signed weight data to the selection shift modules in the memory array.
15. The multiply-accumulate operation apparatus of claim 11, wherein the activation data encoding module is configured to: subtract a second offset from the input activation data to obtain signed activation data when the data type of the input activation data is an unsigned number; use the input activation data as the signed activation data when the data type of the input activation data is a signed number; perform Booth encoding on the signed activation data to generate a plurality of encoded activation values including negative numbers and non-negative numbers; and broadcast the encoded activation values to the selection shift modules in the memory array.
16. A neural network processing device, comprising a multiply-accumulate operation device as claimed in any one of claims 11 to 15.
17. The neural network processing device of claim 16, wherein the neural network processing device includes a compiler, the weight data encoding module multiplexing the compiler such that weight encoding operations are completed while the compiler is offline.
18. An electronic device comprising multiply-accumulate operation means as claimed in any one of claims 11 to 15.
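
The flow of claims 1, 3, 4 and 5 can be illustrated with a minimal Python sketch. It is not the claimed hardware: it assumes a 3-bit coding window (Z = 3) with the standard radix-4 Booth table, an even activation bit width, negative weight data modeled simply as -w, and one array column per Booth digit position. Only the second compensation component (unsigned activations with signed weights) is modeled; the first, third and fourth components and the most-significant-bit inversion are omitted. All function names are hypothetical.

```python
# Standard radix-4 Booth table: 3-bit window -> digit in {-2, -1, 0, 1, 2}.
BOOTH_TABLE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_radix4_digits(a: int, n_bits: int) -> list[int]:
    """Encode a signed n_bits value into radix-4 Booth digits, LSB digit first."""
    pattern = a & ((1 << n_bits) - 1)   # two's-complement bit pattern
    ext = pattern << 1                  # S1: append a zero below the LSB
    # S2-S5: 3-bit windows that overlap the previous window by one bit.
    return [BOOTH_TABLE[(ext >> i) & 0b111] for i in range(0, n_bits, 2)]


def select_shift(pos_w: int, neg_w: int, digit: int) -> int:
    """Select +w or -w by the sign of the digit, then shift by its magnitude."""
    if digit == 0:
        return 0                        # assumption: a zero digit contributes nothing
    target = neg_w if digit < 0 else pos_w
    return target << (abs(digit) - 1)   # |digit| = 1: no shift, |digit| = 2: shift by 1


def mac(weights: list[int], activations: list[int], n_bits: int,
        act_unsigned: bool = False) -> int:
    """Multiply-accumulate signed weights with n_bits activations."""
    # Unsigned activations are made signed by subtracting 2^(N-1).
    offset = (1 << (n_bits - 1)) if act_unsigned else 0
    digits = [booth_radix4_digits(a - offset, n_bits) for a in activations]

    total = 0
    for col in range(n_bits // 2):
        # Add the multiplication results of all rows in the same column.
        col_sum = sum(select_shift(w, -w, digits[row][col])
                      for row, w in enumerate(weights))
        # Shift accumulation: digit position col has weight 4^col.
        total += col_sum << (2 * col)

    # Second compensation component: sum of the weights times the offset.
    return total + offset * sum(weights)
```

With unsigned 8-bit activations, `mac([3, -5, 7], [200, 17, 255], 8, act_unsigned=True)` matches the direct dot product of the two lists, which is the property the compensation step is meant to restore.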

Description

Multiply-accumulate operation method and device, neural network processing device and electronic equipment

Technical Field

The embodiments of the application relate to the technical field of artificial intelligence, and in particular to a multiply-accumulate operation method and device, a neural network processing device and electronic equipment.

Background

With the wide application of deep learning and artificial intelligence (AI) technology in fields such as inference, autonomous driving and large language models, unprecedented demands are placed on the computing power and energy efficiency of the underlying hardware. The core computation of modern AI is massive multiply-accumulate (MAC) operation, and the core computing units of existing neural network processors rely mainly on it. In the traditional von Neumann architecture, the processor is separated from the memory, and data must be moved frequently during operation, which causes large data-movement energy consumption and memory access latency. This is the "memory wall" problem, and it severely limits improvements in the performance and energy efficiency of AI chips. To break through the memory wall bottleneck fundamentally, storage and calculation integrated technology has emerged. Its core idea is to complete computation inside the memory cells or near the memory, thereby minimizing data movement. Digital storage and calculation integrated architectures based on static random access memory (SRAM) and other memories, which are compatible with complementary metal oxide semiconductor (CMOS) processes, have become a key technical path for realizing energy-efficient AI accelerators owing to their high computational bit width and high scalability.
In a digital storage and calculation integrated architecture, efficient implementation of the multiply-accumulate operation is the core of the design. Early schemes mostly adopt bit-serial computation, in which the input is multiplied with the weight and accumulated one bit at a time. The structure is simple, but one N-bit multiplication takes N clock cycles, so the computation latency is large and the throughput is low, which becomes a main bottleneck of system performance. To accelerate computation, the prior art introduced an optimization scheme based on radix-4 Booth encoding. This scheme encodes two consecutive input bits at a time, reducing the theoretical number of computation cycles from N to about N/2 and significantly improving throughput. However, Booth-encoding-based schemes introduce new and more complex inherent problems in hardware implementation. Because such schemes are essentially designed for signed operation, an additional "zero-extension" preprocessing cycle is required to process unsigned data. The most common approach is to zero-extend the N-bit unsigned number to N+1 bits (i.e., to prepend a '0' as the most significant bit) before computation, so that it formally becomes a positive signed number that can be fed into the encoder. This extension and reformatting forces at least one additional clock cycle, so the theoretical cycle reduction (e.g., from N cycles to N/2 cycles) cannot be fully realized when processing unsigned data streams: the actual speedup drops significantly and the gain in computational throughput is limited.
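
The cost of zero extension can be checked with a small Python sketch. It assumes the standard radix-4 Booth table and, as a modeling choice not stated in the text, pads odd widths to an even width by sign extension. The function names are illustrative only. Encoding the unsigned 8-bit value 200 directly as if it were signed yields the wrong value with 4 digits, while zero-extending it to 9 bits yields the correct value at the cost of a fifth digit, i.e. one extra cycle.

```python
# Standard radix-4 Booth table: 3-bit window -> digit in {-2, -1, 0, 1, 2}.
BOOTH_TABLE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_digits(pattern: int, width: int) -> list[int]:
    """Radix-4 Booth digits of a width-bit two's-complement pattern, LSB first."""
    if width % 2:
        # Assumption: pad odd widths to even by repeating the sign bit.
        sign = (pattern >> (width - 1)) & 1
        pattern |= sign << width
        width += 1
    ext = pattern << 1  # implicit zero below the least significant bit
    return [BOOTH_TABLE[(ext >> i) & 0b111] for i in range(0, width, 2)]


def booth_value(digits: list[int]) -> int:
    """Value represented by a list of radix-4 digits, LSB digit first."""
    return sum(d * 4 ** i for i, d in enumerate(digits))


direct = booth_digits(200, 8)         # 8-bit pattern read as a signed number
zero_extended = booth_digits(200, 9)  # '0' prepended as the ninth bit

print(len(direct), booth_value(direct))
print(len(zero_extended), booth_value(zero_extended))
```

For N = 8 this gives 4 digits without extension and 5 with it, compared with 8 cycles for bit-serial computation, which is consistent with the at-least-one-extra-cycle overhead described above.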
In addition, to generate and process signed binary partial products, the hardware must integrate complex negation, shift and dynamic sign-extension circuits (such as dedicated negation and shift circuits placed at the periphery of, or integrated inside, the storage array). These circuits not only increase system area and power consumption; because they lie on the critical computation path, they also increase the combinational logic depth, which limits the operating frequency of the system and makes timing closure difficult. Moreover, since negative partial products are involved, all intermediate operations must be performed in the two's-complement domain, which forces the design to introduce dynamic sign-bit extension logic and further increases circuit area and power consumption. Therefore, while improving throughput, the prior-art implementation of multiply-accumulate computation increases computation power consumption when the input data are unsigned numbers, degrades the computing power and energy efficiency ratio of the multiply-accumulate computation, and also affects its computation period.

Disclosure of Invention

In order to solve the technical problems, the embod