
CN-121981167-A - Edge-side large language model inference acceleration method and accelerator

CN 121981167 A

Abstract

The invention relates to the technical field of network acceleration and discloses an edge-side large language model inference acceleration method and accelerator. The method reconstructs the computation flow of the decoding stage and deeply fuses the multi-head attention mechanism with the computation of the feed-forward network. Weight and key-value data are stored in HBM, while coefficients and accumulated attention scores are stored in DDR. Linear matrix computation is performed by a unified matrix computation unit supporting multi-precision operation. Nonlinear functions are computed by mathematical transformation and linear fitting: the Softmax function is converted to base-2 operations via the change-of-base formula, and the Sigmoid function is approximated by truncation and third-order linear fitting. A key-value screening algorithm based on accumulated attention scores dynamically adjusts key-value storage positions, maintaining a recent key-value cache region and an important key-value cache region within limited cache space, achieving efficient key-value caching during long-text inference.

Inventors

  • ZHANG JUN
  • LI QIU

Assignees

  • Central South University (中南大学)

Dates

Publication Date
2026-05-05
Application Date
2025-12-30

Claims (10)

  1. An edge-side large language model inference acceleration method, characterized by comprising the following steps: reconstructing the computation flow of the decoding stage through an activation-resident data scheduling strategy, and deeply fusing the multi-head attention mechanism with the computation of the feed-forward network, so as to optimize on-chip storage of activation data; adopting a three-level data storage architecture comprising off-chip memory, on-chip cache, and registers, wherein the off-chip memory uses a hybrid HBM-and-DDR storage mechanism, weight and key-value data are stored in HBM, and coefficients and accumulated attention scores are stored in DDR; for linear matrix computation, adopting a parallelized hardware mapping strategy in which a unified matrix computation unit performs multi-precision matrix operations, including support for INT4×INT4 direct accumulation and FIX16×INT4 shift accumulation; for nonlinear function computation, adopting mathematical transformation and linear fitting, converting the Softmax function to base-2 operations via the change-of-base formula, and approximating the Sigmoid function by truncation and third-order linear fitting; and constructing a key-value screening algorithm based on accumulated attention scores, dynamically adjusting key-value storage positions, and maintaining a recent key-value cache region and an important key-value cache region within limited cache space to achieve efficient key-value caching during long-text inference.
  2. The edge-side large language model inference acceleration method according to claim 1, wherein in the data scheduling and storage step, activation values reside in the on-chip cache throughout, duplex access is realized through pseudo-dual-port RAM, weight parameters are streamed through first-word-fall-through FIFOs, and auxiliary coefficients are stored separately in RAM and FIFO by category.
  3. The edge-side large language model inference acceleration method according to claim 1, wherein in the data computation step, the matrix computation unit adopts a 4-bit multiplier array with 32×32 parallelism to support INT4 and FIX16 precision operations, and the matrix addition module accumulates and adjusts the multiplication results according to the operation mode to support a residual connection function.
  4. The edge-side large language model inference acceleration method according to claim 1, wherein in the nonlinear function computation, the Softmax function is converted by the change-of-base formula: the exponential operation is converted to a shift operation, and the fractional part is computed by a multi-segment linear fitting approximation.
  5. The edge-side large language model inference acceleration method according to claim 1, wherein the Sigmoid function computation adopts truncation and third-order linear fitting: when the input satisfies x ≤ −5, the output is 0; when x ≥ 5, the output is 1; when −5 < x < 5, an approximation is computed by a third-order polynomial.
  6. The edge-side large language model inference acceleration method according to claim 1, wherein in the key-value management step, key-value screening is based on the accumulated attention score and a scoring function, by which the key-value cache is dynamically maintained.
  7. The edge-side large language model inference acceleration method according to claim 1, further comprising a matrix partitioning strategy that divides the input matrix into 32 parallel partitions to match the 32 attention-head computations, and divides the feed-forward network layer matrix into 32 "pseudo-heads" to unify the parallel computation dimensions.
  8. The edge-side large language model inference acceleration method according to claim 1, wherein nonlinear function computation is realized through a multi-stage pipeline; taking the Sigmoid function as an example, a six-stage pipeline architecture sequentially performs sign processing, range detection, first-order and square term computation, cube term and constant accumulation, polynomial combination, and result adjustment; and wherein the steps of data scheduling, matrix computation, nonlinear function computation, and key-value updating are executed repeatedly during inference until inference of the whole large language model is completed.
  9. The edge-side large language model inference acceleration method according to claim 1, wherein the method is applied to the Llama 2-7B model, and model redundancy is reduced through fine-grained structured pruning and 4-bit quantization, improving edge-side deployment efficiency.
  10. An edge-side large language model inference accelerator, comprising: a global controller connected to the CPU through an AXI bus, for receiving configuration information, generating control and address signals, and coordinating the work of each unit; a data scheduling and storage unit comprising a data transfer module, an on-chip storage module, and a data preprocessing module, for data movement, storage, and preprocessing; a matrix computation unit comprising a matrix multiplication module and a matrix addition module, supporting multi-precision matrix operations and residual connection; a nonlinear function computation unit comprising Softmax, reciprocal-square-root, and Sigmoid computation modules, realizing approximate computation through mathematical transformation and fitting; and a key-value dynamic management unit comprising an accumulated attention score computation module and a key-value address adjustment module, for dynamically managing the key-value cache.
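The two-region key-value screening of claims 1 and 6 can be sketched in software. The patent's scoring function is not reproduced in this text, so plain accumulated attention mass is used as a stand-in; the class name, capacities, and promotion policy are illustrative assumptions, not the claimed hardware mechanism:

```python
from collections import OrderedDict

class KVCacheManager:
    """Sketch of accumulated-attention-score key-value screening:
    a recency region plus an importance region in a bounded cache."""

    def __init__(self, recent_capacity: int, important_capacity: int):
        self.recent_capacity = recent_capacity
        self.important_capacity = important_capacity
        self.recent = OrderedDict()   # token_id -> (key, value); insertion order = recency
        self.important = {}           # token_id -> (key, value)
        self.acc_score = {}           # token_id -> accumulated attention score

    def accumulate(self, attn_weights):
        """Add the newest query's attention weights into the running scores."""
        for tid, w in attn_weights.items():
            self.acc_score[tid] = self.acc_score.get(tid, 0.0) + w

    def insert(self, token_id, kv):
        """Insert a new key-value pair; overflow from the recent region is
        promoted to the important region or dropped, based on score."""
        self.recent[token_id] = kv
        self.acc_score.setdefault(token_id, 0.0)
        if len(self.recent) > self.recent_capacity:
            evicted_id, evicted_kv = self.recent.popitem(last=False)  # oldest entry
            self._promote_or_drop(evicted_id, evicted_kv)

    def _promote_or_drop(self, token_id, kv):
        if len(self.important) < self.important_capacity:
            self.important[token_id] = kv
            return
        # Replace the weakest important entry only if the evictee scores higher.
        weakest = min(self.important, key=lambda t: self.acc_score.get(t, 0.0))
        if self.acc_score.get(token_id, 0.0) > self.acc_score.get(weakest, 0.0):
            del self.important[weakest]
            self.important[token_id] = kv
```

Keeping the recency region as an ordered structure makes eviction O(1), while the importance region only pays a scan on overflow, which matches the bounded-cache setting the claims describe.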

Description

Edge-side large language model inference acceleration method and accelerator

Technical Field

The invention relates to the technical field of network acceleration, in particular to an edge-side large language model inference acceleration method and accelerator.

Background

In recent years, large language models (LLMs) based on the Transformer architecture have demonstrated powerful capabilities in tasks such as natural language processing, intelligent question answering, and content generation. However, LLMs have large parameter counts and high computational complexity, placing extremely high demands on hardware computing power and memory bandwidth. In edge computing scenarios in particular, deployment and inference of large language models face serious challenges, limited by device power consumption, storage capacity, and computing resources. Currently, mainstream large-model inference acceleration schemes rely mainly on the central processing unit (CPU) or the graphics processing unit (GPU). The CPU is inefficient at massively parallel tasks and cannot fully utilize hardware resources; the GPU, while offering strong parallel computing capability, has limited memory bandwidth, high power consumption, and large latency caused by frequent data movement, and is therefore difficult to fit the low-power, high-real-time requirements of the edge.
Existing FPGA-based acceleration schemes improve energy efficiency and flexibility to a certain extent, but still have the following problems: key operations in large language models (such as the multi-head attention mechanism and the feed-forward network) lack hardware-level deep fusion optimization; the absence of a continuous-residence and efficient scheduling mechanism for activation data causes frequent off-chip memory accesses and low bandwidth utilization; hardware implementation of nonlinear functions (such as Softmax and Sigmoid) is complex and hard to balance between precision and resource overhead; and the key-value cache (KV Cache) occupies large memory during long-text inference and lacks a dynamic management mechanism, affecting inference speed and energy efficiency. Therefore, a dedicated acceleration method and hardware architecture for edge-oriented large language model inference are needed, capable of efficient, low-power model inference under limited resources.

Disclosure of Invention

The invention provides an edge-side large language model inference acceleration method and accelerator to solve the problems of low efficiency and high power consumption of existing network acceleration approaches.
To achieve the above object, the invention is realized by the following technical scheme. In a first aspect, the invention provides an edge-side large language model inference acceleration method, comprising the following steps. A data scheduling and storage step: through an activation-resident data scheduling strategy, deeply fusing the multi-head attention mechanism with the computation of the feed-forward network, reconstructing the computation flow of the decoding stage, and optimizing on-chip storage of activation data. A data computation step: performing multi-precision matrix operations with a unified matrix computation unit according to a parallelized hardware mapping strategy, including support for INT4×INT4 direct accumulation and FIX16×INT4 shift accumulation. A key-value management step: constructing a key-value screening algorithm based on accumulated attention scores, dynamically adjusting key-value storage positions, and maintaining a recent key-value cache region and an important key-value cache region within limited cache space to achieve efficient key-value caching during long-text inference. Optionally, in the data scheduling and storage step, activation values reside in the on-chip cache throughout, duplex access is realized through pseudo-dual-port RAM, weight parameters are streamed through first-word-fall-through FIFOs, and auxiliary coefficients are stored separately in RAM and FIFO by category. Optionally, in the data computation step, the matrix computation unit adopts a 4-bit multiplier array with 32×32 parallelism to support INT4 and FIX16 precision operations, and the matrix addition module accumulates and adjusts the multiplication results according to the operation mode to support a residual connection function.
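The two multi-precision modes named above can be illustrated with a minimal software sketch. The nibble-slicing reading of "FIX16×INT4 shift accumulation" (the 16-bit fixed-point operand is cut into four 4-bit slices, each multiplied on the same 4-bit array and shift-accumulated) is an assumption about how such a unit could work, not a statement of the actual datapath:

```python
def int4_dot(a, b):
    """INT4 x INT4 dot product with direct accumulation.
    Operands are assumed already clipped to the signed 4-bit range [-8, 7]."""
    assert all(-8 <= v <= 7 for v in a + b)
    return sum(x * y for x, y in zip(a, b))

def fix16_times_int4_shift_accumulate(w_fix16, x_int4):
    """FIX16 x INT4 product built entirely from 4-bit multiplies plus shifts.

    The 16-bit magnitude is sliced into four unsigned 4-bit nibbles; each
    nibble-by-INT4 partial product is shifted by its nibble position and
    accumulated. Sign handling is simplified by working on magnitudes.
    """
    sign = -1 if (w_fix16 < 0) != (x_int4 < 0) else 1
    w, x = abs(w_fix16), abs(x_int4)
    acc = 0
    for i in range(4):                  # nibble index, least significant first
        nibble = (w >> (4 * i)) & 0xF   # unsigned 4-bit slice of the FIX16 operand
        acc += (nibble * x) << (4 * i)  # 4-bit multiply, then shift-accumulate
    return sign * acc
```

Reusing one 4-bit multiplier array for both modes is what lets a single unified matrix unit serve INT4 weights and FIX16 activations, at the cost of four passes (or four parallel lanes) for the wider operand.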
Optionally, in the nonlinear function computation, the Softmax function is converted by the change-of-base formula: the exponential operation is converted to a shift operation, and the fractional part is computed by a multi-segment linear fitting approximation.