CN-121996426-A - Eso-LMs hardware acceleration system for dynamic sequence length block attention calculation based on FPGA

CN 121996426 A

Abstract

The invention discloses an FPGA-based hardware acceleration system for a language model fusing the autoregressive and masked-diffusion paradigms, and relates to the fields of FPGAs and machine learning. Targeting the model's combination of an autoregressive model and a masked-diffusion model, its revised attention mechanism, and its parallel KV cache generation, the invention provides a hardware acceleration architecture oriented to the model's core computation flow, implemented here with the Eso-LMs (Esoteric Language Models) model as an example. The system comprises a layer normalization and adaptive layer normalization (AdaLN) modulation module, a QKV projection module, a rotary position embedding (RoPE) application module, a KV cache management module, a multi-head attention computation module, an output projection and residual connection module, and a multi-layer perceptron (MLP) module. By optimizing the computation order and data flow of each module and applying pipeline parallelism and resource multiplexing, the invention achieves efficient inference acceleration of the dual-paradigm autoregressive/masked-diffusion language model on an FPGA platform and markedly improves inference speed. The system exploits the FPGA's customizable hardware acceleration, builds a full-flow pipelined processing architecture by restructuring the data flow, improves hardware resource utilization, eliminates data-pipeline stalls, reduces computation latency and storage overhead, and suits deployment in edge computing scenarios.

Inventors

  • HUANG YIHUA
  • LIU BOHUAI
  • ZHOU JUNHAO
  • XU RUI

Assignees

  • Sun Yat-sen University
  • Liu Bohuai

Dates

Publication Date
2026-05-08
Application Date
2026-01-30

Claims (10)

  1. An Eso-LMs hardware acceleration system based on FPGA dynamic-sequence-length block attention calculation, characterized by comprising a first layer normalization and adaptive layer normalization (AdaLN) modulation module, a QKV projection module, a rotary position embedding (RoPE) application module, a KV cache management module, a block attention calculation module, an output projection and gated residual connection module, a second layer normalization and AdaLN modulation module, an MLP module and a second gated residual connection module, wherein the hardware acceleration system divides the computation of each Transformer block of the Eso-LMs model into 12 pipeline stages and passes data between them as a pipelined stream; the first layer normalization and AdaLN modulation module is the first pipeline stage and is used for applying layer normalization to the input feature vector, obtaining AdaLN modulation parameters by table lookup according to the time condition, and modulating the normalized feature vector; the QKV projection module is the second pipeline stage and is used for projecting the modulated feature vector into a query vector Q, a key vector K and a value vector V through linear transformation; the RoPE application module is the third pipeline stage and is used for applying rotary position embedding to the first head_dim/2 dimensions of Q and K, while the value vector V receives no rotary position embedding; the KV cache management module is the fourth pipeline stage and is used for combining the key vector K and value vector V of the current step with the cached historical key and value vectors, realizing dynamic update of the KV cache; the block attention calculation module spans the fifth, sixth and seventh pipeline stages, performing Q@K^T computation, activation-function normalization and weighted-V aggregation respectively in a three-stage pipeline; the output projection and gated residual connection module is the eighth pipeline stage and is used for applying a linear projection to the attention output and connecting it residually with the input feature vector through a gating mechanism; the second layer normalization and AdaLN modulation module is the ninth pipeline stage and is used for applying layer normalization to the input feature vector and modulating it with AdaLN parameters; the MLP module spans the tenth and eleventh pipeline stages, applying the first-layer linear transformation, GELU activation and second-layer linear transformation to the modulated feature vector in a two-stage pipeline; the second gated residual connection module is the twelfth pipeline stage and is used for residually connecting the MLP output with the input feature vector through a gating mechanism to obtain the output feature vector.
  2. The Eso-LMs hardware acceleration system based on FPGA dynamic-sequence-length block attention calculation according to claim 1, wherein the first layer normalization and AdaLN modulation module comprises a layer normalization unit, an AdaLN modulation-parameter lookup-table storage unit, and a lookup and multiply-add fusion unit; the layer normalization unit computes the mean and variance of the input feature vector and normalizes it; the AdaLN modulation-parameter lookup-table storage unit pre-computes the modulation parameters of all diffusion steps and stores them in BRAM/URAM; the modulation formula is y = x · (1 + scale) + shift, wherein x is the normalized feature vector, scale is the AdaLN scaling parameter and shift is the AdaLN offset parameter; the lookup and multiply-add fusion unit fuses the table lookup and the modulation computation into a single pipeline stage (a reference sketch of this modulation follows the claims).
  3. The FPGA-based dynamic-sequence-length block attention computation Eso-LMs hardware acceleration system of claim 1, wherein the QKV projection module performs matrix multiplication between the input feature vector and the weight matrix W_qkv to obtain the QKV tensor, and separates it into a query vector Q, a key vector K and a value vector V through a rearrangement operation.
  4. The Eso-LMs hardware acceleration system based on FPGA (field programmable gate array) dynamic-sequence-length block attention calculation according to claim 1, wherein the RoPE application module comprises a partial-dimension rotary position embedding unit, a rotate_half special unit, a cos/sin lookup-table optimization unit, and a Q/K parallel processing unit; the partial-dimension rotary position embedding unit rotates only the first head_dim/2 dimensions while the last head_dim/2 dimensions pass through unchanged, reducing the computation by 50%; the rotate_half special unit implements the data rearrangement [x0, x1, x2, x3, ...] to [-x1, x0, -x3, x2, ...] in hardware, using simple data rearrangement logic instead of multiplication and thereby reducing hardware resource consumption; the cos/sin lookup-table optimization unit pre-stores sine and cosine values in BRAM/URAM indexed by position; and the Q/K parallel processing unit applies the rotation to Q and K in parallel and supports pipelined execution, improving processing efficiency (a reference sketch of this rotation follows the claims).
  5. The Eso-LMs hardware acceleration system for FPGA-based dynamic-sequence-length block attention computation of claim 1, wherein the KV cache management module comprises a dynamic-sequence-length-driven cache allocation control unit, an L1/L2 layered cache storage unit, a double-buffered KV cache unit, an incremental update computation unit, a predictive prefetch control unit, and a cache-line-aligned access optimization unit; the dynamic-sequence-length-driven cache allocation control unit dynamically adjusts the L1/L2 cache allocation strategy according to the current sequence length and window position, rather than using a fixed cache size or static allocation, making full use of the FPGA's hierarchical storage resources; the L1/L2 layered cache storage unit comprises an L1 cache (URAM/BRAM, low-latency access) and an L2 cache (DDR, high-capacity storage), optimized for the characteristics of FPGA resources; the incremental update computation unit computes and stores only the K/V of the new token at the hardware level, reusing the historical cache and reducing computation and storage cost (a reference sketch of this incremental update appears at the end of the description); the predictive prefetch control unit predicts the K/V positions likely to be accessed in the next step according to the window position, loads them in advance in hardware, and hides DDR access latency; the cache-line-aligned access optimization unit optimizes DDR access patterns, batching and merging access requests for adjacent positions to improve bandwidth utilization.
  6. The Eso-LMs hardware acceleration system for FPGA-based dynamic-sequence-length block attention calculation of claim 1, wherein the block attention calculation module comprises a block computation hardware unit, a prefix-accumulation activation-function register, a history-block and current-window-block separation control unit, an on-chip fused computation pipeline, adaptive blocking control logic, and a block-level KV cache access optimization unit; the block computation hardware unit is the fifth pipeline stage: it divides the dynamically growing K sequence into fixed-size blocks, uses a GEMM unit to compute Q@K_block^T block by block, and stores each block's result in on-chip BRAM/URAM, avoiding computing the complete Q@K^T matrix at once and reducing the on-chip cache capacity requirement; the block attention score formula is S_block = Q@K_block^T, wherein Q is the query vector, K_block is the key vector of a block, and K_block^T is the transpose of the block key vector; the prefix-accumulation activation-function register is the sixth pipeline stage: dedicated registers maintain cross-block activation-function statistics (max_global_reg and sum_exp_reg) and are updated after each block is processed, supporting cross-block activation-function normalization and avoiding storing the complete attention score matrix in DDR (a reference sketch of this cross-block normalization follows the claims); the history-block and current-window-block separation control unit comprises a history-block processing path and a current-window-block processing path: when the hardware control logic detects a history block it skips the mask application unit, computes Q@K^T directly and may buffer the result in on-chip BRAM, and when it detects a current-window block it enables the mask application unit and fully computes and applies the causal mask, avoiding pipeline stalls from conditional branching through mask pre-computation and a fast application unit; the adaptive blocking control logic dynamically selects a blocking strategy according to the K sequence length and the on-chip cache capacity: short sequences use a one-shot computation mode, long sequences use a blocked mode, the block size is computed dynamically from the on-chip cache capacity and the sequence length, and the pipeline configuration is adjusted dynamically to optimize throughput; the block-level KV cache access optimization unit reads the K/V data of history blocks from the KV cache block by block, reading in batches through a DMA or dedicated cache controller, reducing random-access overhead and supporting pipelined data flow.
  7. The Eso-LMs hardware acceleration system based on FPGA dynamic-sequence-length block attention calculation according to claim 1, wherein the output projection and gated residual connection module comprises an output projection unit and a gated residual connection unit, and the gated residual connection formula is out = gate · new_value + skip_value, wherein gate is the gating value, new_value is the feature vector after output projection, and skip_value is the input feature vector (the feature vector before multi-head self-attention); the gated residual connection unit fuses the gating multiplication and the residual addition into one hardware unit, realizing the fused computation gate · new_value + skip_value with the data kept on chip (gate, new_value and skip_value are read from on-chip BRAM/URAM and the result is written directly back), thereby avoiding writing intermediate results back to DDR and reducing memory access latency and bandwidth consumption (a reference sketch of this gated residual follows the claims).
  8. The Eso-LMs hardware acceleration system based on FPGA dynamic-sequence-length block attention calculation according to claim 1, characterized in that the second layer normalization and AdaLN modulation module comprises a layer normalization unit, an AdaLN modulation-parameter lookup-table storage unit, and a lookup and multiply-add fusion unit; the AdaLN modulation-parameter lookup-table storage unit pre-computes the modulation parameters of all diffusion steps and stores them in BRAM/URAM, and if all blocks share AdaLN parameters only one lookup table is needed, reducing storage resource consumption; the lookup and multiply-add fusion unit fuses the table lookup and the modulation computation x · (1 + scale) + shift into one pipeline stage, reducing intermediate data accesses and pipeline stalls.
  9. The Eso-LMs hardware acceleration system based on FPGA dynamic-sequence-length block attention calculation according to claim 1, wherein the MLP module comprises a first-layer linear transformation unit, a GELU activation function unit, and a second-layer linear transformation unit; the first-layer linear transformation unit is the tenth pipeline stage and performs matrix multiplication between the modulated feature vector and the weight matrix W_MLP1; the GELU activation function unit and the second-layer linear transformation unit form the eleventh pipeline stage, applying GELU activation and the second-layer linear transformation to the output of the first layer; the GELU activation function is realized by table lookup, avoiding floating-point operation units and reducing hardware resource consumption.
  10. The Eso-LMs hardware acceleration system based on FPGA dynamic-sequence-length block attention calculation according to claim 1, wherein the second gated residual connection module comprises a gated residual fusion hardware unit, a gating-value hardware computation unit, dual-skip-value cache management hardware, conditional skip control logic, a gated residual pipeline fusion unit, and a gating-multiplier resource multiplexing unit; the gated residual fusion hardware unit fuses the gating multiplication and the residual addition into one hardware unit, realizing the fused computation gate · new_value + skip_value with data kept on chip, avoiding writing intermediate results back to DDR and reducing memory access latency and bandwidth consumption; the gating-value hardware computation unit realizes nonlinear functions such as sigmoid by hardware table lookup or piecewise linear approximation, avoiding floating-point operation units and reducing hardware resource consumption (DSPs and LUTs); the gating value may be quantized to INT8 or stored in fixed-point format, and the table lookup and subsequent computation are fused into the pipeline; the dual-skip-value cache management hardware caches the two skip values (x_skip and x_skip2) in on-chip BRAM/URAM, adopting a dual-port BRAM or time-division multiplexing strategy to support pipelined execution of the two gated residual connections of the MSA and the MLP, optimizing the cache access pattern, reducing cache conflicts, and caching and forwarding skip values in advance to avoid pipeline stalls from waiting for the residual connection computation to complete; the conditional skip control logic detects in hardware whether the gating value is below a preset threshold, and if the gating value is very small the control logic outputs skip_value directly through a MUX selector and skips the gate · new_value multiplication, saving multiplier resources and power, reducing computation latency, and avoiding pipeline stalls from conditional branches (a reference sketch of this conditional skip follows the claims); the gated residual pipeline fusion unit fuses the gated residual computation with its preceding and following modules (such as the normalization module and the activation function) in the same hardware pipeline, so that gate computation, multiplication, addition and subsequent processing execute continuously in the pipeline, reducing pipeline stalls and intermediate buffering requirements; and the gating-multiplier resource multiplexing unit lets the two gated residual connections after the MSA and after the MLP share the same set of gating-multiplier hardware resources, reducing hardware resource consumption through time multiplexing, with a hardware control unit scheduling the two gated residual connections to use the multipliers sequentially.
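
The following sketches are editorial illustrations, not the patented RTL. First, a minimal Python/NumPy sketch of the AdaLN modulation y = x · (1 + scale) + shift from claims 2 and 8; the lookup table adaln_lut indexed by diffusion step is a hypothetical stand-in for the BRAM/URAM storage unit.

    import numpy as np

    def layernorm(x, eps=1e-5):
        # Mean/variance normalization, as in the layer normalization unit.
        return (x - x.mean()) / np.sqrt(x.var() + eps)

    def adaln_modulate(x, t, adaln_lut):
        # Look up the precomputed (scale, shift) for diffusion step t,
        # then apply the fused multiply-add y = x * (1 + scale) + shift.
        scale, shift = adaln_lut[t]
        return layernorm(x) * (1.0 + scale) + shift

    # Usage: a per-step parameter table standing in for BRAM/URAM.
    adaln_lut = {t: (np.float32(0.1 * t), np.float32(0.01 * t)) for t in range(4)}
    y = adaln_modulate(np.ones(8, dtype=np.float32), t=2, adaln_lut=adaln_lut)

Fusing the lookup and the multiply-add, as the claims describe, means the scale/shift fetch and the arithmetic occupy one pipeline stage rather than producing an intermediate modulated vector in memory.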
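Next, a sketch of the partial-dimension rotary position embedding from claim 4. The rotate_half rearrangement matches the claim's [x0, x1, x2, x3, ...] to [-x1, x0, -x3, x2, ...]; the interleaved-pair frequency layout of the cos/sin tables is an assumption based on the conventional RoPE formulation.

    import numpy as np

    def rotate_half(x):
        # Hardware rearrangement [x0, x1, x2, x3, ...] -> [-x1, x0, -x3, x2, ...]:
        # pure wiring plus sign flips, no multipliers needed.
        out = np.empty_like(x)
        out[0::2] = -x[1::2]
        out[1::2] = x[0::2]
        return out

    def apply_partial_rope(q, pos, cos_lut, sin_lut):
        # Rotate only the first head_dim/2 dims; the rest pass through,
        # halving the rotation work as stated in claim 4.
        half = q.shape[-1] // 2
        q_rot = q[:half] * cos_lut[pos] + rotate_half(q[:half]) * sin_lut[pos]
        return np.concatenate([q_rot, q[half:]])

    # Usage with hypothetical position-indexed tables (BRAM/URAM stand-ins);
    # each adjacent pair of rotated dims shares one frequency.
    head_dim = 8
    half = head_dim // 2
    freqs = np.repeat(1.0 / 10000 ** (np.arange(half // 2) / (half // 2)), 2)
    angles = np.outer(np.arange(16), freqs)   # [positions, half]
    cos_lut, sin_lut = np.cos(angles), np.sin(angles)
    q = np.arange(head_dim, dtype=np.float64)
    q_out = apply_partial_rope(q, pos=3, cos_lut=cos_lut, sin_lut=sin_lut)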
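Next, a sketch of the cross-block normalization from claim 6: a running maximum and a running exp-sum (the max_global_reg / sum_exp_reg registers) are updated per block so the full attention-score row never has to be materialized in DDR. The block size and the 1/sqrt(d) scaling are assumptions for illustration.

    import numpy as np

    def block_attention(q, K, V, block=4):
        # Online-softmax accumulation over K/V blocks, mirroring the
        # three-stage pipeline: Q@K_block^T, normalization stats, weighted V.
        scale = 1.0 / np.sqrt(q.shape[-1])
        max_global_reg = -np.inf          # running row maximum
        sum_exp_reg = 0.0                 # running sum of exponentials
        acc = np.zeros_like(V[0])         # running weighted-V accumulator
        for start in range(0, K.shape[0], block):
            k_blk, v_blk = K[start:start + block], V[start:start + block]
            s = (q @ k_blk.T) * scale                 # S_block = Q @ K_block^T
            m_new = max(max_global_reg, s.max())      # update running max
            corr = np.exp(max_global_reg - m_new)     # rescale old statistics
            p = np.exp(s - m_new)
            sum_exp_reg = sum_exp_reg * corr + p.sum()
            acc = acc * corr + p @ v_blk
            max_global_reg = m_new
        return acc / sum_exp_reg

    # Usage: one query row against a dynamically grown K/V cache.
    rng = np.random.default_rng(0)
    K, V = rng.standard_normal((10, 8)), rng.standard_normal((10, 8))
    out = block_attention(rng.standard_normal(8), K, V)

Because only the two scalar registers and the accumulator survive between blocks, on-chip BRAM/URAM suffices no matter how long the K sequence grows.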
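Finally, a sketch of the gated residual connection from claims 7 and 10, including the conditional-skip path; the sigmoid gate and the threshold value are assumptions for illustration.

    import numpy as np

    def gated_residual(gate_logit, new_value, skip_value, threshold=1e-3):
        # Gate via sigmoid (table lookup / piecewise-linear in hardware).
        gate = 1.0 / (1.0 + np.exp(-gate_logit))
        if gate < threshold:
            # Conditional skip: MUX straight to skip_value, no multiply.
            return skip_value
        # Fused multiply-add: out = gate * new_value + skip_value.
        return gate * new_value + skip_value

    # Usage for the MSA-side connection; the MLP-side connection reuses
    # the same multiplier hardware by time multiplexing (claim 10).
    x_skip = np.ones(8)
    attn_out = np.full(8, 0.5)
    y = gated_residual(gate_logit=-0.2, new_value=attn_out, skip_value=x_skip)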

Description

Eso-LMs hardware acceleration system for dynamic sequence length block attention calculation based on FPGA

Technical Field

The invention relates to the technical field of FPGA (field programmable gate array) hardware acceleration, and in particular to an FPGA-based Eso-LMs (Esoteric Language Models) hardware acceleration system for dynamic-sequence-length block attention calculation, used to efficiently realize inference acceleration, on an FPGA platform, of the Eso-LMs model fusing the autoregressive and masked-diffusion paradigms.

Background

With the rapid development of deep learning, large language models (Large Language Models, LLMs) have achieved significant results in tasks such as natural language processing and text generation. However, traditional large language models suffer from high computational complexity, large memory footprint, and long inference latency, limitations that are especially prominent in edge computing scenarios. Eso-LMs (Esoteric Language Models) is a novel language model that fuses the autoregressive (Autoregressive, AR) and masked diffusion (Masked Diffusion, MDM) paradigms; its core structure is based on the Diffusion Transformer (DiT) architecture. Unlike traditional autoregressive or diffusion models, Eso-LMs models exhibit dynamic sequence length, variable window size, a partial masking mechanism, a mixed generation mode, and dynamic KV cache growth during inference. Specifically: the sequence length grows gradually from an initial window to the complete sequence rather than being fixed; the window size is adjusted dynamically according to the generation stage; a mixed generation mode combining a diffusion stage and an autoregressive stage is supported; in attention computation, history tokens are not constrained by masks while only tokens of the current window apply a causal mask, which differs from the full-sequence-mask or no-mask schemes of traditional attention mechanisms; the model alternates between the diffusion and autoregressive stages during inference and switches dynamically between the two modes within the same cycle; and each layer must maintain an independent KV cache whose cached sequence length grows dynamically, requiring an efficient cache management mechanism.

These features make Eso-LMs models difficult to implement efficiently on general-purpose GPU or CPU platforms. Traditional neural network acceleration schemes are generally designed for fixed sequence lengths with statically allocated hardware resources and cannot adapt to the dynamic-sequence-length characteristics of Eso-LMs; when the sequence length grows dynamically, cache allocation, blocking strategies, and the like must be adjusted dynamically, which existing schemes struggle to support. The Eso-LMs model must maintain independent KV caches for multiple layers with dynamically growing cached sequence lengths, while traditional schemes generally adopt a fixed window size or static allocation strategy and cannot fully utilize the FPGA's hierarchical storage resources (URAM/BRAM and DDR), resulting in low cache access efficiency.
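
A sketch of the partial masking just described, assuming a boolean mask indexed [query, key]: history tokens are fully visible to every query, and a causal mask applies only within the current window. The representation is an editorial illustration, not the patented mask hardware.

    import numpy as np

    def partial_mask(history_len, window_len):
        # Rows: queries of the current window; columns: history + window keys.
        # History keys (first history_len columns) are always visible;
        # the current window applies a causal (lower-triangular) mask.
        total = history_len + window_len
        mask = np.zeros((window_len, total), dtype=bool)  # True = masked out
        win = np.triu(np.ones((window_len, window_len), dtype=bool), k=1)
        mask[:, history_len:] = win
        return mask

    # Usage: 4 history tokens, current window of 3 queries.
    print(partial_mask(4, 3).astype(int))
    # [[0 0 0 0 0 1 1]
    #  [0 0 0 0 0 0 1]
    #  [0 0 0 0 0 0 0]]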
In attention computation, the query sequence length of Eso-LMs models is relatively fixed (the current window) while the key/value sequence length grows dynamically; traditional schemes such as FlashAttention are mainly designed for parallel processing of many tokens across multiple streaming multiprocessors, with activation functions rescaled locally per block, and are thus ill-suited to the resource-constrained, single-token processing unit of an FPGA. The Eso-LMs model applies a causal mask only to the current window while history tokens are unmasked, whereas conventional masking hardware typically implements full-sequence masking or no masking at all and cannot support selective application of a partial mask, wasting computational resources. The Eso-LMs model must dynamically switch between the diffusion and autoregressive modes during inference, dynamically adjusting the window size and processing strategy, while traditional schemes usually fix the generation mode and cannot support dynamic switching of a mixed mode. In addition, the Eso-LMs model uses gated residual connections and must save two skip connection values (before the MSA and before the MLP), while conventional schemes generally realize simple additive residual connections and cannot efficiently fuse gated residual connections in hardware. Therefore, a dedicated FPGA hardware acceleration system targeting the characteristics of the Eso-LMs model is urgently needed, so that efficient inference acceleration of the Eso-LMs model on an FPGA platform can be realized.
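
A sketch of the incremental KV cache update from claim 5, assuming a simple append-style cache that grows with the sequence; the L1/L2 split, double buffering, and prefetching are elided, leaving only the "compute K/V for the new token, reuse the history" behavior.

    import numpy as np

    class KVCache:
        # Dynamically growing per-layer KV cache: only the new token's
        # K/V are computed and appended; history entries are reused as-is.
        def __init__(self, head_dim):
            self.K = np.empty((0, head_dim))
            self.V = np.empty((0, head_dim))

        def append(self, k_new, v_new):
            # Incremental update: store only the new token's K/V.
            self.K = np.vstack([self.K, k_new])
            self.V = np.vstack([self.V, v_new])
            return self.K, self.V  # full (history + current) views

    # Usage: the cache grows by one token per step.
    cache = KVCache(head_dim=8)
    for step in range(3):
        k = np.random.randn(1, 8)  # K projection of the new token
        v = np.random.randn(1, 8)  # V projection of the new token
        K_all, V_all = cache.append(k, v)
    print(K_all.shape)  # (3, 8)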