CN-121997984-A - Method for performing attention calculations using a circuit arrangement and circuit arrangement
Abstract
The invention provides a method for performing attention calculation using a circuit arrangement, a circuit arrangement, an electronic device, and a non-transitory computer-readable storage medium. The circuit arrangement comprises a first buffer module configured to buffer first input data; a second buffer module configured to buffer second input data; a transpose module configured to receive the first input data from the first buffer module and output transposed data; and a kernel module configured to receive the transposed data from the transpose module, receive the second input data from the second buffer module, perform kernel computation using the transposed data and the second input data, and feed the computation result data back to the first buffer module. By arranging the transpose module after the first buffer module, multiple computation nodes of the attention calculation are fused into a single data stream, and the fused data-stream paths are executed in parallel, improving the performance and efficiency of attention calculation.
Inventors
- Jiao Li
- Cai Quanxiong
- Niu Xinyu
Assignees
- 深圳鲲云信息科技有限公司 (Shenzhen Corerain Technologies Co., Ltd.)
Dates
- Publication Date
- 20260508
- Application Date
- 20251230
Claims (10)
- 1. A circuit arrangement for attention computation, comprising: a first buffer module configured to buffer first input data; an address generator configured to control the order in which the first buffer module reads and writes data according to generated read-write addresses; a second buffer module configured to buffer second input data; a transpose module configured to receive the first input data from the first buffer module and output transposed data; and a kernel module configured to receive the transposed data from the transpose module, receive the second input data from the second buffer module, perform kernel computation using the transposed data and the second input data, and input computation result data to the first buffer module.
- 2. The circuit arrangement of claim 1, further comprising a quantization module configured to receive the first input data output by the first buffer module or the computation result data output by the kernel module, perform a quantization operation, and output the quantization result to an external memory.
- 3. The circuit arrangement of claim 2, further comprising a first data selector, wherein the kernel module is further configured to input computation result data to the first buffer module through the first data selector, and the quantization module is further configured to output the quantization result to the external memory through the first data selector.
- 4. The circuit arrangement of claim 3, further comprising a second data selector, wherein the kernel module is further configured to receive the transposed data from the transpose module or the first input data from the first buffer module through the second data selector.
- 5. The circuit arrangement of claim 4, further comprising a third data selector, wherein the quantization module is further configured to receive, through the third data selector, the first input data output by the first buffer module or the computation result data output by the kernel module, perform a quantization operation, and output the quantization result to the external memory.
- 6. The circuit arrangement of claim 1, wherein the address generator is further configured to generate read-write addresses in a transpose operation order and store the first input data to the first buffer module in accordance with the read-write addresses.
- 7. A method of performing attention calculations using a circuit arrangement as claimed in any one of claims 1 to 6, comprising: performing a linear operation on the first input data and the second input data using the kernel module to obtain a linear calculation result; and storing the linear calculation result in the first buffer module, using the address generator, in the order of the transpose operation.
- 8. The method as recited in claim 7, further comprising: performing a transpose operation on the calculation result in the first buffer module, using the transpose module, to obtain transposed result data; performing a quantization calculation on the transposed result data using the quantization module to obtain quantized result data; and outputting the quantized result data to an external storage space, where a data-splicing operation is performed.
- 9. An electronic device, comprising: a processor; and a memory storing a computer program which, when executed by the processor, causes the processor to perform the method of claim 7 or 8.
- 10. A non-transitory computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to perform the method of claim 7 or 8.
Description
Method for performing attention calculations using a circuit arrangement, and circuit arrangement

Technical Field

The present invention relates to the field of integrated-circuit technology, and more particularly to a method for performing attention calculations using a circuit arrangement, a circuit arrangement, an electronic device, and a non-transitory computer-readable storage medium.

Background

In recent years, large language models based on the Transformer architecture have developed rapidly: their scale keeps growing, their performance keeps improving, and dialogue applications built on them are increasingly widespread. Because dialogue applications demand immediacy, the inference process has strict latency requirements; at the same time, large language models are very large and their inference exhibits a certain sparsity, so running them on hardware requires extremely high memory-access bandwidth and computing power. The core of the Transformer architecture is the attention module (Attention), whose structure is shown in fig. 1 and whose computation is given by formula (1):

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    (1)

Q (Query), K (Key), and V (Value) are obtained from the input through three similar linear layers. The Q and K branches embed position information through a rotary position embedding (RoPE) operation and are then combined by matrix multiplication (BMM); the result is passed through a Softmax operation to obtain the feature correlations, which are finally multiplied (BMM) with the V branch to produce the output of Attention.
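The computation in formula (1) can be sketched in a few lines of NumPy. The √d_k scaling follows the standard attention formulation; all shapes and variable names below are illustrative, not taken from the patent.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q: (seq_q, d_k), k: (seq_k, d_k), v: (seq_k, d_v)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # first BMM: Q x K^T
    weights = softmax(scores, axis=-1)   # feature correlations
    return weights @ v                   # second BMM with the V branch

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((6, 8))
v = rng.standard_normal((6, 8))
out = attention(q, k, v)
print(out.shape)  # (4, 8)
```

Note that the transpose of K inside the first BMM is exactly the operation the claimed transpose module performs in hardware.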
In large language models for dialogue applications, the model's ability to remember history is generally realized by saving the K-branch and V-branch data as history data and, after splicing them with each new input, performing the Attention computation again. This property of the algorithm model means that, during inference, a relatively large external memory space is typically allocated for caching the Kcache and Vcache data, and every time the model generates an output, data is read from or written to the Kcache and Vcache in that external memory space. Taking the V branch as an example, when an accelerator executes each operation node of the Attention structure in sequence according to the algorithm graph, it writes intermediate data out to external storage (DDR) and reads it back for the next operation, so the whole process reads and writes external storage frequently. As the session length grows, the Vcache data grows rapidly while the amount of linear and subsequent BMM computation grows comparatively little, so moving Vcache data between external storage and the accelerator becomes the bottleneck of executing the Attention structure.

Disclosure of Invention

The present invention aims to propose a method for performing attention calculations using a circuit arrangement, a circuit arrangement, an electronic device, and a non-transitory computer-readable storage medium, to solve the problem of low computation efficiency caused by frequent reads and writes of external data during attention calculation.
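The KV-cache splicing pattern described above can be modeled as follows: K and V rows of past tokens are kept as history, and each decode step appends (splices) the new rows before attention runs over the full cache. This is only a software illustration of the memory-traffic pattern; names and shapes are assumptions.

```python
import numpy as np

k_cache = np.zeros((0, 8))  # grows by one row per generated token
v_cache = np.zeros((0, 8))

def attention_step(q_new, k_new, v_new):
    global k_cache, v_cache
    # Splice new K/V rows onto the cached history -- the external-memory
    # read-modify-write that the patent identifies as the bottleneck.
    k_cache = np.concatenate([k_cache, k_new], axis=0)
    v_cache = np.concatenate([v_cache, v_new], axis=0)
    scores = q_new @ k_cache.T / np.sqrt(q_new.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v_cache

rng = np.random.default_rng(1)
for _ in range(3):  # three decode steps; the cache grows each time
    out = attention_step(rng.standard_normal((1, 8)),
                         rng.standard_normal((1, 8)),
                         rng.standard_normal((1, 8)))
print(k_cache.shape)  # (3, 8)
```

Each step reads the entire cache and writes one new row, so traffic grows linearly with session length while per-step compute stays nearly flat, matching the bottleneck analysis above.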
According to an aspect of the present invention, there is provided a circuit arrangement for attention calculation, comprising: a first buffer module configured to buffer first input data; an address generator configured to control the order in which the first buffer module reads and writes data according to generated read-write addresses; a second buffer module configured to buffer second input data; a transpose module configured to receive the first input data from the first buffer module and output transposed data; and a kernel module configured to receive the transposed data from the transpose module, receive the second input data from the second buffer module, perform kernel computation using the transposed data and the second input data, and input computation result data to the first buffer module. According to some embodiments, the circuit arrangement further comprises a quantization module configured to receive the first input data output by the first buffer module or the computation result data output by the kernel module, perform a quantization operation, and output the quantization result to an external memory. According to some embodiments, the circuit arrangement further comprises a first data selector; the kernel module is further configured to input computation result data to the first buffer module through the first data selector, and the quantization module is further configured to output the quantization result to the external memory through the first data selector.
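The pipeline of claims 7 and 8 — a linear kernel computation, a write into the first buffer in transpose order, a transpose readback, quantization, and output for splicing in external storage — can be sketched as a minimal software model. The symmetric int8 scheme is an assumption (the patent does not fix a quantization format), and all names are illustrative.

```python
import numpy as np

def kernel_linear(x, w):
    # kernel module: linear operation on the first and second input data
    return x @ w

def store_in_transpose_order(result):
    # address generator: write addresses follow transpose order, so the
    # first buffer module already holds the transposed result and the
    # transpose module can stream it out with no separate rearrange pass
    return result.T

def quantize_int8(x):
    # quantization module (assumed symmetric int8 scheme)
    scale = max(float(np.abs(x).max()), 1e-12) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

x = np.arange(6.0).reshape(2, 3)              # first input data
w = np.ones((3, 4))                           # second input data (weights)
buffered = store_in_transpose_order(kernel_linear(x, w))
quantized, scale = quantize_int8(buffered)    # ready for external splicing
print(buffered.shape, quantized.dtype)        # (4, 2) int8
```

Writing the linear result in transpose order is the key fusion step: the transpose that the first BMM of Attention needs is absorbed into the buffer's addressing rather than performed as a separate external-memory pass.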