CN-121979580-A - Liquid systolic accelerator architecture for large language model reinforcement learning
Abstract
The invention discloses a liquid systolic accelerator architecture for reinforcement learning of large language models, belonging to the technical field of artificial intelligence hardware acceleration. The system addresses three problems that current general-purpose processors face when executing RLHF workloads: the mismatch between their instruction-driven paradigm and dataflow computation, fixed parallel granularity that cannot adapt to dynamic loads, and low efficiency because the architecture does not perceive data statistics. Its core is a dynamically reconfigurable liquid systolic array computing fabric whose execution is triggered by data flow, eliminating kernel boundaries and instruction overhead. The system integrates a global lookup table subsystem that exploits data-distribution priors to optimize nonlinear computation through query merging and parallel lookup, and is equipped with a dynamic scheduling unit that achieves flexible parallelism through a resource allocation algorithm supporting work stealing. The invention can remarkably improve throughput and energy efficiency in both the RLHF training and large language model inference stages, and provides a new design paradigm for the next generation of AI-specific computing architectures.
Inventors
- CAO XI
- MIAO JIYUAN
- JI YINHUAN
Assignees
- Beijing University of Chemical Technology (北京化工大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-02-06
Claims (10)
- 1. A liquid systolic accelerator architecture for large language model reinforcement learning, comprising an array of configurable computational cells for performing computations involving linear and nonlinear operations.
- 2. The architecture of claim 1, embodied as a hardware accelerator system specifically designed to accelerate the reinforcement learning from human feedback (RLHF) training process of large language models, wherein the system is the first open-source hardware acceleration architecture worldwide designed specifically for RLHF workloads, and three fundamental problems, namely the paradigm collision between Transformer dataflow and instruction-driven hardware, the mismatch between fixed parallel granularity and dynamic loads, and the architecture's lack of perception of data statistics, are systematically resolved by a dynamically reconfigurable computing paradigm called the liquid systolic array.
- 3. The hardware accelerator system of claim 2, wherein the run-time dynamically reconfigurable computation is embodied as a liquid systolic array that uses a purely dataflow-driven execution paradigm, whose core is a continuous computation stream triggered entirely by data availability, without instruction fetch or decode overhead, merging linear and nonlinear operators into the same systolic dataflow path, thereby eliminating the hardware stalls, kernel start-up delays, and explicit memory write-backs of intermediate results caused by operator switching (see the first sketch after the claims).
- 4. The hardware accelerator system of claim 3, wherein the liquid systolic array is formed by a plurality of fluid processing blocks connected directly through a low-latency, register-based systolic interconnection network, forming a hierarchical computation structure in which a plurality of fluid cores are dynamically aggregated into a fluid processing block, a plurality of fluid processing blocks form a cluster, and a plurality of clusters are distributed over different equipotential surfaces; each fluid processing block serves as a basic execution unit, any number of fluid cores can be dynamically bound at run time through a programmable fluid core mask, so as to form a data processing pipeline of non-fixed function optimized for a specific operator sequence (see the second sketch after the claims), and the systolic interconnection network establishes cyclic data paths within and across the equipotential surfaces, so that the autoregressive decoding process forms a closed computation loop on chip and frequent interaction with off-chip memory is avoided.
- 5. The hardware accelerator system of claim 4, wherein each fluid core internally integrates a mixed-precision computing unit comprising a 16-bit floating-point multiplier array for matrix block multiplication, a 32-bit floating-point accumulation tree for high-precision reduction, and format conversion logic between the 16-bit and 32-bit floating-point formats; the fluid processing block supports two configurable execution paths (see the third sketch after the claims): a standard computing path, in which the fluid core output is format-converted, reduced in the 32-bit floating-point accumulation tree, and then sent to the nonlinear computing unit, and a fast low-precision computing path, dedicated to the query-key multiply-accumulate path in the attention mechanism, in which the 16-bit floating-point multiplication results of the fluid core bypass the 32-bit accumulation tree and are sent directly to the subsequent nonlinear computing unit; the path selection is explicitly specified by a scheduling unit instruction or adaptively triggered by the monitored distribution characteristics of intermediate data.
- 6. The hardware accelerator system of claim 3, further comprising a global lookup table subsystem that provides uniform, low-latency nonlinear function approximation for all fluid processing blocks, the subsystem employing a centralized query-broadcast architecture instead of dedicated function units distributed within each processing unit; its innovation lies in exploiting the statistical distribution priors of large language model intermediate data, eliminating redundant computation by merging duplicate or neighboring query requests from multiple fluid processing blocks, and achieving high throughput through a parallel lookup mechanism.
- 7. The hardware accelerator system of claim 6, wherein the global lookup table subsystem comprises: a precision lookup table that distributes the input value range of the nonlinear function across multiple parallel memory banks through an interleaved address mapping algorithm to address the access hot-spot problem caused by the Gaussian distribution of input data, and integrates a hot-spot-aware memory bank replication mechanism, i.e., the access frequency of each memory bank is monitored by run-time counters, and when a hot-spot bank is detected, a copy of it is dynamically created and queries are split and dispatched across the copies by polling; a shared adaptive cache lookup table for storing recent query results, whose entries contain function values and function identification masks and which employs a hybrid replacement strategy based on operator usage-frequency priors and numerical statistics; a Newton-iteration fallback module for handling cache misses and guaranteeing computational accuracy; and a global query merge and routing module for receiving query vectors from each fluid processing block, performing global range checking and deduplication, distributing the merged queries to the precision lookup table and the adaptive cache lookup table, and finally broadcasting the results back to all source requesters (see the fourth sketch after the claims).
- 8. The hardware accelerator system of claim 4, further comprising a dynamic scheduling unit acting as a global controller responsible for mapping computing tasks to physical resources, whose core function is dynamic fluid-core-to-fluid-processing-block binding, i.e., the different functional blocks within a fluid core can be decomposed and independently, flexibly assigned to different fluid processing blocks; the scheduling unit maintains a fluid core configuration table and a scheduling state table, and reconfigures the computing capability of the fluid processing blocks in real time based on workload parameters in the input instruction stream and the resource utilization and task-progress feedback collected at run time.
- 9. The hardware accelerator system of claim 2, wherein the system is deeply optimized at the hardware level by exploiting the statistical prior that large language model intermediate data follow a Gaussian distribution; the optimization is embodied in the design and workflow of the global lookup table subsystem: a large number of similar queries in the concentrated region of the Gaussian distribution are merged into one or a few lookup operations by the query merging mechanism, concurrent accesses to the dense region of the distribution are handled efficiently by the interleaved address mapping and hot-spot-aware replication, and high-value entries near the distribution center are preferentially retained by the numerical-statistics-based replacement strategy of the adaptive cache lookup table; these mechanisms work cooperatively to essentially eliminate redundant nonlinear computation on data with Gaussian distribution characteristics, realizing data-driven intelligent hardware optimization.
- 10. A data processing method based on the hardware accelerator system of any one of claims 1 to 9, characterized by comprising the steps of: the dynamic scheduling unit analyzes a workload instruction and, according to the running state, determines the binding relation from fluid cores to fluid processing blocks through a dynamic fluid core allocation algorithm and configures the execution paths of the fluid processing blocks, wherein the dynamic fluid core allocation algorithm supports resource redundancy and work stealing (see the fifth sketch after the claims), comprising: reserving a portion of fluid cores as a redundant resource pool at system initialization; during execution, first activating redundant fluid cores when all regular fluid cores are fully allocated; if all fluid cores are still saturated, triggering work stealing, which temporarily collects idle fluid cores from concurrent computing tasks with lower current loads and assigns them to the high-load task, the stolen fluid cores being returned to their original tasks through a context recovery mechanism after the temporary task completes; injecting input tensor data into the liquid systolic array, where the data flow between the fluid cores and trigger computation according to the configuration, forming a continuous stream without kernel boundaries; during execution, nonlinear function queries from the fluid processing blocks are merged and served by the global lookup table subsystem and the results are broadcast back to the requesting blocks; and the scheduling unit adjusts the configuration based on run-time feedback until the computation is completed.
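The following is a minimal sketch of the dataflow-triggered execution described in claim 3, modeled as chained Python generators: each stage computes as soon as its input arrives, and the linear and nonlinear operators share one stream with no kernel boundary or intermediate write-back. The stage functions, the fixed weights, and the bare exponential standing in for a full nonlinear unit are illustrative assumptions, not the patented hardware.

```python
# Minimal sketch of claim 3's data-availability-triggered execution.
# Nothing is "launched" as a kernel: data availability alone drives
# each stage, and intermediate values never round-trip through memory.
import math

def linear_stage(rows):
    """Linear operator: a fixed dot product standing in for a matrix block."""
    weights = [0.5, -0.25, 1.0]
    for row in rows:
        yield sum(w * x for w, x in zip(weights, row))

def exp_stage(values):
    """Nonlinear operator fused into the same systolic stream."""
    for v in values:
        yield math.exp(v)

stream = exp_stage(linear_stage([[1, 2, 3], [0, 1, 0]]))
print(list(stream))  # computation was triggered purely by arriving data
```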
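Next, a minimal sketch of the programmable fluid core mask of claim 4. The names (FluidCore, FluidProcessingBlock, bind_cores) are hypothetical; the claim specifies only that any number of fluid cores can be bound at run time into a pipeline tailored to a specific operator sequence.

```python
# Sketch of claim 4's run-time core binding via a programmable bitmask.
# All identifiers are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class FluidCore:
    core_id: int
    busy: bool = False

@dataclass
class FluidProcessingBlock:
    """Basic execution unit aggregated from dynamically bound fluid cores."""
    block_id: int
    cores: list[FluidCore] = field(default_factory=list)
    operator_sequence: list[str] = field(default_factory=list)

def bind_cores(pool: list[FluidCore], core_mask: int, block_id: int,
               operators: list[str]) -> FluidProcessingBlock:
    """Bind every idle core whose bit is set in core_mask into one block."""
    bound = [c for c in pool if (core_mask >> c.core_id) & 1 and not c.busy]
    for c in bound:
        c.busy = True
    return FluidProcessingBlock(block_id, bound, operators)

# Example: bind cores 0, 1, and 3 into a pipeline specialized for an
# attention sub-sequence (matmul -> softmax -> matmul).
pool = [FluidCore(i) for i in range(8)]
block = bind_cores(pool, core_mask=0b1011, block_id=0,
                   operators=["matmul", "softmax", "matmul"])
print(sorted(c.core_id for c in block.cores))
```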
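The third sketch models the two execution paths of claim 5 with NumPy dtypes. The trigger condition for the fast path is an assumption here; per the claim it is either scheduler-specified or adaptively chosen from monitored intermediate-data statistics.

```python
# Sketch of claim 5's dual execution paths, modeled with NumPy dtypes.
import numpy as np

def standard_path(a16: np.ndarray, b16: np.ndarray) -> np.ndarray:
    """FP16 multiply, format-convert, then FP32 accumulation-tree reduction."""
    prod16 = a16 * b16                                # 16-bit multiplier array
    return prod16.astype(np.float32).sum(axis=-1)     # 32-bit accumulation tree

def fast_path(q16: np.ndarray, k16: np.ndarray) -> np.ndarray:
    """FP16 multiply-accumulate bypassing the FP32 tree (QK^T in attention);
    the result would feed the nonlinear unit directly."""
    return (q16 * k16).sum(axis=-1, dtype=np.float16)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64)).astype(np.float16)
k = rng.standard_normal((4, 64)).astype(np.float16)
# The scheduling unit would pick one path per operator; here we compare both.
print(standard_path(q, k))
print(fast_path(q, k))
```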
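The fourth sketch traces the query merge, interleaved-bank lookup, and hot-spot counting flow of claim 7 for exp() over quantized inputs. The bank count, the quantization step, and the use of math.exp as a miss handler (standing in for the Newton-iteration fallback module) are assumptions for illustration.

```python
# Sketch of claim 7's merge -> interleaved-bank lookup -> broadcast flow.
import math
from collections import Counter

NUM_BANKS = 8
STEP = 1.0 / 256           # assumed quantization granularity of the LUT

def quantize(x: float) -> int:
    return round(x / STEP)

def bank_of(key: int) -> int:
    # Interleaved address mapping: adjacent keys land in different banks,
    # so a Gaussian-centered burst of queries spreads across all banks.
    return key % NUM_BANKS

class GlobalLUT:
    def __init__(self):
        self.cache: dict[int, float] = {}   # shared adaptive cache
        self.bank_hits = Counter()          # run-time hot-spot counters;
                                            # these would drive bank replication

    def lookup_batch(self, xs: list[float]) -> dict[float, float]:
        keys = {quantize(x) for x in xs}    # global dedup / query merging
        merged = {}
        for key in sorted(keys):
            self.bank_hits[bank_of(key)] += 1
            if key not in self.cache:       # miss -> fallback path
                self.cache[key] = math.exp(key * STEP)
            merged[key] = self.cache[key]
        # "Broadcast" each merged result back to every source request.
        return {x: merged[quantize(x)] for x in xs}

lut = GlobalLUT()
queries = [0.01, 0.011, 0.012, -0.5, 0.01]  # near-duplicates merge to few keys
print(lut.lookup_batch(queries))
print(lut.bank_hits)
```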
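Finally, a minimal sketch of the allocation order in the method of claim 10: regular cores first, then the reserved redundant pool, then work stealing from the lightest-loaded concurrent task. The pool sizes and the "lightest load" victim policy are assumptions; the claim does not fix a victim-selection rule.

```python
# Sketch of claim 10's allocation: regular pool -> redundant pool -> stealing.
def allocate_cores(task_id, need, regular, redundant, task_cores):
    """Return the core ids granted to task_id, recording ownership."""
    granted = []
    while len(granted) < need and regular:
        granted.append(regular.pop())       # 1) exhaust regular cores
    while len(granted) < need and redundant:
        granted.append(redundant.pop())     # 2) activate the redundant pool
    while len(granted) < need:
        victim = min((t for t in task_cores if t != task_id and task_cores[t]),
                     key=lambda t: len(task_cores[t]), default=None)
        if victim is None:
            break                           # fully saturated: caller must wait
        granted.append(task_cores[victim].pop())  # 3) steal from light task
    task_cores.setdefault(task_id, []).extend(granted)
    return granted

tasks = {"low_load": [0, 1, 2, 3]}          # a concurrent, lightly loaded task
print(allocate_cores("high_load", 6, regular=[4, 5], redundant=[6, 7],
                     task_cores=tasks))
print(tasks)  # stolen cores would later return via context recovery
```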
Description
Liquid systolic accelerator architecture for large language model reinforcement learning

Technical Field
The invention relates to the field of artificial intelligence hardware accelerators, in particular to a dynamically reconfigurable Liquid Systolic Array (LSA) hardware architecture, system, and working method specifically designed for Reinforcement Learning from Human Feedback (RLHF) workloads based on Large Language Models (LLMs). The invention is the first open-source hardware accelerator architecture disclosed worldwide that is specifically designed for RLHF workloads.

Background
With the breakthrough progress of Large Language Models (LLMs) in complex reasoning tasks, Reinforcement Learning from Human Feedback (RLHF) has become a key technology for improving LLM alignment and performance. However, the RLHF training process, particularly the policy optimization stage at its core (e.g., the GRPO and GHPO algorithms), requires multiple candidate responses (e.g., 4-64 samples) to be generated and evaluated in parallel for each prompt, which incurs extremely high computational cost. Detailed performance profiling suggests that the response generation (i.e., LLM inference) phase consumes over 90% of the overall policy optimization time and is the fundamental bottleneck.

Existing mainstream computing hardware, such as graphics processing units (GPUs), faces three fundamental architecture-mismatch problems when handling RLHF workloads that cannot be resolved by software optimization, owing to its von Neumann nature:

1. Paradigm conflict. The GPU adopts an instruction-driven execution model. Even with advanced software optimizations such as the KV cache and FlashAttention-2/3, the execution of each operator (e.g., matrix multiplication, softmax) still requires instruction front-end decoding, register scheduling, and forced write-back of intermediate results to memory. These overheads cut the continuous data stream inherent in the Transformer model into discrete "kernel sandwich" patterns, making memory movement (consuming >30% of latency) and nonlinear computation (such as the exponential operations in softmax) the major bottlenecks. Experiments show that even after optimization, the LLM inference compute-to-memory time ratio (C/M) of the GPU is only between 0.35 and 0.46, severely limited by memory bandwidth.

2. Rigid parallel granularity. The GPU performs parallel scheduling based on fixed-size thread bundles (warps, typically 32 threads), whereas RLHF-driven LLM inference workloads are highly dynamic and heterogeneous: sequence lengths vary continuously from 128 to over 1024 tokens, batch sizes are adjusted due to sample parallelism, and linear operators (matrix multiplication) and nonlinear operators (exp, log, sqrt) are interleaved frequently and at fine granularity within the attention layers. The fixed warp granularity cannot simultaneously optimize for short sequences (which require fine-grained parallelism) and long sequences (which require coarse-grained parallelism), resulting in serious resource inefficiency. Experimental measurements showed that for short sequences of 128 tokens, warp execution efficiency was as low as 38.2%, meaning that more than 60% of the computing resources were wasted, while the load variance between streaming multiprocessors (SMs) was as high as 0.42, indicating severe load imbalance.

3. No perception of data statistics. General-purpose architectures such as GPUs lack the ability to perceive the statistical characteristics of the data being computed. Analysis of intermediate tensors in the LLM inference process shows that key data (such as the sub-block products of Q × K^T at tensor-core computation granularity in the attention mechanism) follow a highly concentrated Gaussian distribution. However, current hardware cannot exploit this statistical prior knowledge: the isolated thread execution model can neither share partial results across threads nor sense changes in data distribution at run time to trigger adaptive micro-architectural adjustment, so a great deal of redundant computation is performed on nonlinear functions, such as exponentials and logarithms, of identical or similar input values.

While the conventional systolic array architecture can eliminate instruction overhead through contiguous data movement between processing elements (PEs), its rigid, homogeneous computing-unit design centered on matrix multiplication is inefficient at processing the dynamically mixed operator sequences described above: it still interrupts the data flow for nonlinear computation, cannot adapt to dynamic parallel granularity, and likewise has no data-statistics perception capability. Therefore, the prior art schemes are all incremental optimizations within the traditional architecture paradigm and cannot fundamentally resolve the above mismatches.
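To make the scale of this redundancy concrete, the following is a minimal sketch, assuming the Gaussian concentration described above: quantized exp() inputs drawn from a standard normal distribution collapse onto a small set of distinct table keys, which is exactly what the merged-query lookup exploits. The quantization step is an assumption for illustration.

```python
# Sketch: how Gaussian-concentrated inputs make query merging pay off.
import numpy as np

rng = np.random.default_rng(42)
inputs = rng.standard_normal(4096)          # stand-in for QK^T sub-block products
step = 1.0 / 64                             # assumed lookup-table granularity
keys = np.unique(np.round(inputs / step))   # distinct merged lookup queries
print(f"{inputs.size} raw queries -> {keys.size} merged lookups "
      f"({inputs.size / keys.size:.1f}x reduction)")
```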