CN-117521718-B - Error-adaptive approximation multiplier for energy-efficient self-attention mechanism computation


Abstract

The invention provides an error-adaptive approximate multiplier for energy-efficient self-attention mechanism computation, which adapts its calculation error to the magnitude of the operand. An approximate partial product generator with negative calculation error and an approximate 4:2 compressor with positive calculation error are designed so that their errors compensate each other. This compensation effectively reduces the overall error and tolerates more approximated bits, yielding an approximate multiplier circuit with smaller area and lower power consumption. In addition, a power_gating control circuit is added to the approximate compressors of the first-stage compressor array of the Wallace tree, so that the calculation error is adaptively adjusted according to the computed value.
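The compensation principle can be shown with a short, purely numeric sketch. The error magnitudes below are hypothetical, not the patent's actual encodings; the point is only that a stage that under-estimates combined with a stage that over-estimates leaves a smaller net error than either approximation alone.

```python
# Hypothetical illustration of the error-compensation principle in the
# abstract: a partial-product stage with negative error feeding a
# compressor stage with positive error yields a small net error.
# The error values below are invented for illustration only.

exact_pp = 118
approx_pp = 114                                  # negative error: -4
ppg_error = approx_pp - exact_pp                 # -4

approx_compressed = approx_pp + 3                # positive error: +3
compressor_error = approx_compressed - approx_pp # +3

net_error = ppg_error + compressor_error         # -1
print(f"PPG error {ppg_error:+d}, compressor error {compressor_error:+d}, "
      f"net error {net_error:+d}")
```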

Inventors

  • Wang Zhongfeng
  • Zhang Xu
  • Wang Meiqi
  • Sha Jin
  • Zou Dingyang

Assignees

  • Nanjing University

Dates

Publication Date
2026-05-08
Application Date
2023-11-06

Claims (8)

  1. An error-adaptive approximation multiplier for energy-efficient self-attention mechanism computation, characterized by comprising an exact partial product generator, an approximate partial product generator with negative calculation error, an exact tree compressor module, an approximate tree compressor module with positive calculation error, and a power gating module; the exact partial product generator generates the high-order segment of the partial products, and its logical expression is as follows: [equations (1)-(2) omitted in source] where a_j denotes the j-th bit of the multiplicand A, a_{j-1} the (j-1)-th bit of A, and b_{2i-1}, b_{2i}, b_{2i+1} the (2i-1)-th, 2i-th, and (2i+1)-th bits of the multiplier B; ⊕ denotes the exclusive-OR operation; the approximate partial product generator generates the low-order segment of the partial products, with logical expression: [equation omitted in source]
  2. The error-adaptive approximation multiplier for energy-efficient self-attention mechanism computation of claim 1, wherein the coding error of the approximate partial product generator is negative: PPG_A - PPG_E < 0 (3), where PPG_A denotes the encoded value of the approximate partial product generator and PPG_E the encoded value of the exact partial product generator.
  3. The error-adaptive approximation multiplier for energy-efficient self-attention mechanism computation of claim 2, wherein the exact tree compressor module is an exact 4:2 compressor, whose logical expression is as follows: [equations (4)-(6) omitted in source] where a_1, a_2, a_3, a_4 are the four addend input signals of the exact 4:2 compressor, C_in is its carry input signal, Carry is its carry output signal, C_out is its carry overflow signal, and Sum is its sum output signal.
  4. The error-adaptive approximation multiplier for energy-efficient self-attention mechanism computation of claim 3, wherein the approximate tree compressor module comprises, for the low-order segments, an approximate 4:2 compressor with positive error, whose logical expression is: Sum = a_1 + a_2 + a_3 + a_4 (7) (a behavioural sketch of the exact and approximate compressors follows the claims).
  5. The error-adaptive approximation multiplier for energy-efficient self-attention mechanism computation of claim 4, wherein the calculation error of the approximate 4:2 compressor is positive: PPA_A - PPA_E > 0 (9), where PPA_A denotes the calculated value of the approximate 4:2 compressor and PPA_E the calculated value of the exact 4:2 compressor.
  6. The error-adaptive approximation multiplier for energy-efficient self-attention mechanism computation of claim 5, wherein the power gating module generates a string of control codes power_gating[gating_num-1:0], gating_num denoting the number of control-code bits.
  7. The error-adaptive approximation multiplier for energy-efficient self-attention mechanism computation of claim 6, wherein the power gating module comprises a sign detector and cascaded logic gates. The sign detector judges the sign of the multiplicand: if the multiplicand is positive, a cascaded OR gate performs the OR operation gating_num times, starting from the most significant bit of the multiplicand, gating_num being a natural number; if the multiplicand is negative, a cascaded AND gate performs the AND operation gating_num times, starting from the most significant bit. The resulting gating_num-bit control code controls whether the gating_num approximate 4:2 compressors starting from the least significant bit of the approximate tree compressor module are switched off (a behavioural sketch of this module also follows the claims).
  8. The error-adaptive approximation multiplier for energy-efficient self-attention mechanism computation of claim 7, wherein the power gating module itself carries power gating controlled by the sign detection bit, switching off the cascaded AND gates when the multiplicand is positive and the cascaded OR gates when the multiplicand is negative.
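The following minimal Python sketch illustrates claims 3-5. The exact 4:2 compressor equations did not survive extraction, so the textbook form matching claim 3's port list is used; for the approximate compressor only equation (7), Sum as the OR of the four inputs, survives, and the approximate Carry/C_out logic is therefore omitted here.

```python
from itertools import product

def exact_compressor_4to2(a1, a2, a3, a4, cin):
    """Textbook exact 4:2 compressor with the port list of claim 3
    (the patent's own equations (4)-(6) were lost in extraction)."""
    sum_ = a1 ^ a2 ^ a3 ^ a4 ^ cin
    cout = a3 if (a1 ^ a2) else a1            # carry from the four addends
    carry = cin if (a1 ^ a2 ^ a3 ^ a4) else a4  # carry combining cin
    return sum_, carry, cout

def approx_sum_bit(a1, a2, a3, a4):
    """Approximate Sum per the surviving equation (7): the OR of the four
    inputs. The approximate Carry/C_out equations are not recoverable."""
    return a1 | a2 | a3 | a4

# Exhaustively verify the exact compressor arithmetic identity:
#   a1 + a2 + a3 + a4 + cin == Sum + 2*(Carry + Cout)
for a1, a2, a3, a4, cin in product((0, 1), repeat=5):
    s, c, co = exact_compressor_4to2(a1, a2, a3, a4, cin)
    assert a1 + a2 + a3 + a4 + cin == s + 2 * (c + co)
print("exact 4:2 compressor identity holds for all 32 input patterns")
```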
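The power-gating behaviour of claims 6-8 can likewise be sketched. The bit ordering, control-code polarity, and the mapping of code bits to compressors are our reading of the claim text (the figures are not available), so this is a hypothetical behavioural model, not the patent's circuit.

```python
def power_gating_code(multiplicand, width, gating_num):
    """Hypothetical behavioural model of the power-gating module of
    claims 6-8. A sign detector selects the reduction: for a positive
    multiplicand the control bits are cascaded ORs of the leading bits;
    for a negative one (two's complement, leading ones) cascaded ANDs
    are used. Bit k of power_gating[gating_num-1:0] decides whether the
    k-th approximate 4:2 compressor from the LSB is switched off;
    whether 0 or 1 means "off" is a polarity detail not recoverable
    from the extracted text."""
    bits = [(multiplicand >> (width - 1 - i)) & 1 for i in range(width)]
    sign = bits[0]                 # MSB = sign bit in two's complement
    code, acc = [], bits[0]
    for i in range(1, gating_num + 1):
        acc = (acc & bits[i]) if sign else (acc | bits[i])
        code.append(acc)
    return code

# Small positive value: leading zeros keep the OR chain at 0.
print(power_gating_code(0b00000110, 8, 4))   # [0, 0, 0, 0]
# Large positive value: the OR chain saturates to 1 quickly.
print(power_gating_code(0b01100000, 8, 4))   # [1, 1, 1, 1]
# Small negative value (-6): leading ones keep the AND chain at 1.
print(power_gating_code(0b11111010, 8, 4))   # [1, 1, 1, 1]
```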

Description

Error-adaptive approximation multiplier for energy-efficient self-attention mechanism computation

Technical Field

The invention relates to an error-adaptive approximate multiplier for energy-efficient self-attention mechanism computation.

Background

Transformer-based models achieve superior accuracy over conventional convolutional neural networks (CNNs) in many artificial intelligence tasks such as natural language processing and computer vision. This accuracy benefits from the self-attention mechanism, which gives the Transformer a global rather than local receptive field. On the other hand, the global self-attention mechanism requires roughly 100 times more computation than CNNs, and existing CNN accelerators cannot process it efficiently because the computation types differ. This motivates the urgent need to design dedicated Transformer processors. In the global self-attention mechanism, redundant content in human language or images produces a large number of naturally occurring weakly related tokens (WR-Tokens), whose attention results, after normalization by the softmax function, are zero or near zero. This leads to redundant computation, excessive energy consumption, and low hardware utilization, making energy-efficient self-attention computation challenging.

Approximate computing is a flexible computing paradigm suited to many application scenarios, especially those with inherent fault tolerance, such as deep learning, image processing, and scientific simulation (reference: Liu W, Lombardi F, Schulte M. A retrospective and prospective view of approximate computing [Point of View]. Proceedings of the IEEE, 2020, 108(3): 394-399). Approximate computation improves computational efficiency and accelerates model inference and experimentation, while small errors do not seriously affect the acceptability of the results. For big data analytics, Internet of Things devices, and embedded systems, approximate computation helps achieve higher energy efficiency and longer device lifetimes under limited resources. In these cases, moderate approximation can speed up data processing and decision making while preserving the overall trends in the data. Meanwhile, the explosive development of Transformer models such as the Generative Pre-trained Transformer 2 (GPT-2), the Vision Transformer (ViT), and the Swin Transformer has become one of the most important advances in deep learning, and the self-attention mechanism plays a vital role in the Transformer model's great success.

The computation of the self-attention mechanism can be briefly described as follows. First, the input sequence is divided into a series of tokens, and these tokens are linearly transformed to obtain Query (Q), Key (K), and Value (V) vectors. Then, Q is multiplied by the transpose of K to generate a matrix of attention scores (the score matrix) that measures the degree of association between different tokens. Next, each row of the score matrix is normalized with a softmax function, which maps the scores to probabilities (P) representing the relevance of a particular token to all other tokens. Finally, the probabilities are quantized and multiplied by V to obtain the output (a minimal numeric sketch of this computation follows).
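Here is a generic floating-point sketch of the scaled dot-product attention just described (numpy; the dimensions and the scaling by sqrt(d) are standard practice, not taken from the patent, and the quantization step is omitted). Note how the softmax pushes small scores toward near-zero probabilities, which is the WR-Token effect discussed next.

```python
import numpy as np

# Generic sketch of self-attention as described above
# (floating point, no quantization; not the patent's datapath).
rng = np.random.default_rng(0)
n_tokens, d = 4, 8
X = rng.normal(size=(n_tokens, d))            # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv              # linear projections
scores = Q @ K.T / np.sqrt(d)                 # attention score matrix

# Row-wise softmax: small scores shrink exponentially toward zero,
# which is why weakly related tokens (WR-Tokens) contribute little.
P = np.exp(scores - scores.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)

out = P @ V                                   # weighted sum of values
print(np.round(P, 3))                         # many near-zero entries
```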
Each output is a weighted sum of all input tokens, with the more strongly related tokens (SR-Tokens) receiving larger weights. The global attention mechanism contains a large number of weakly related tokens (WR-Tokens) whose attention scores are small; after the softmax normalization, these small scores are exponentially reduced to probabilities approaching zero, and the contribution of WR-Tokens to accuracy is sharply weakened. However, these small scores account for the bulk of the computational energy consumption, limiting the energy efficiency of the attention block. According to the statistics of the literature "Wang Y, Qin Y, Deng D, et al. An energy-efficient transformer processor exploiting dynamic weak relevances in global attention[J]. IEEE Journal of Solid-State Circuits, 2022, 58(1): 227-242", the weakly related tokens in a Transformer model contribute only 6.3% to accuracy but consume 93.1% of the computational power. The analysis in the same literature shows that the large number of weakly related tokens has high computational fault tolerance, and only the small number of strongly related tokens needs to be computed exactly to guarantee the accuracy of the Transformer model. Therefore, a design method of approximate computation can be introduced; the approximate calculation is car