CN-121979488-A - Matrix and vector operation oriented systolic array structure

Abstract

The application relates to the technical field of digital integrated circuits, and in particular to a systolic array structure for matrix and vector operations. The structure comprises a plurality of processing units arranged in a matrix, wherein the processing units are connected in the transverse direction through a first multiplexer and connected in series in the longitudinal direction, each column takes a preset number of processing units as a group, and the groups of processing units are connected through a second multiplexer. The application can improve computation density and parallelism, and can improve hardware resource utilization and the efficiency of computing multiple groups of data during matrix-vector operations.

Inventors

LI YUEHANG
HUANG ZHIHONG
CAI GANG
WEI YUCHENG

Assignees

北京中科亿海微电子技术研究院有限公司

Dates

Publication Date: 20260505
Application Date: 20260123

Claims (10)

1. A systolic array structure for matrix and vector operations, characterized by comprising a plurality of processing units arranged in a matrix, wherein the processing units are connected in the transverse direction through a first multiplexer and connected in series in the longitudinal direction, each column takes a preset number of processing units as a group, and the groups of processing units are connected through a second multiplexer.
2. The systolic array structure for matrix and vector operations according to claim 1, wherein the two input signals of the first multiplexer are, respectively, broadcast activation data and activation data output by the previous processing unit, and the two input signals of the second multiplexer are, respectively, the output of the previous group of processing units and another input data path.
3. The systolic array structure for matrix and vector operations according to claim 2, wherein the first multiplexer is configured to implement a broadcast connection of the activation data in the transverse direction.
4. The systolic array structure for matrix and vector operations according to claim 1, wherein the processing unit comprises a weight register for loading weight data, an activation register for storing activation data, and an output register for registering the output data of the operation result.
5. The systolic array structure for matrix and vector operations according to claim 4, wherein the weight registers of each group of processing units are connected in series in the longitudinal direction, the weight data are input serially, and different groups of processing units use different input data paths for their weight data.
6. The systolic array structure for matrix and vector operations according to claim 5, wherein the activation data flows from left to right across the array and is operated on with the weight data in the processing unit to obtain partial-sum data, the partial-sum data flows longitudinally to an accumulator of the next processing unit, and the accumulator accumulates the partial-sum data.
7. The systolic array structure for matrix and vector operations according to claim 4, wherein the activation data and the weight data of the same group of processing units flow into the corresponding processing units simultaneously, and the activation data and the weight data of adjacent groups of processing units flow into the corresponding processing units at a time interval of one clock cycle.
8. The systolic array structure for matrix and vector operations according to claim 7, wherein the output registers are connected in series in the longitudinal direction.
9. The systolic array structure for matrix and vector operations according to claim 4, wherein the weight registers include a first weight register and a second weight register, the processing unit includes a third multiplexer for determining the weight data used in the operation, and the two input signals of the third multiplexer are, respectively, the first weight data from the first weight register and the second weight data from the second weight register.
10. The systolic array structure for matrix and vector operations according to claim 9, wherein the processing unit includes a fourth multiplexer for configuring a data flow mode and a fifth multiplexer for determining an output result, the two input signals of the fourth multiplexer are, respectively, the operation result data of the previous processing unit and the operation result data of the current stage, and the two input signals of the fifth multiplexer are, respectively, the operation result data of the current stage and the operation result data output by the next processing unit.
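
The multiplexer selections recited in claims 2 and 9 can be illustrated with a minimal behavioral sketch in Python. It is not the claimed circuit: the function names (`mux2`, `step_row`, `pe_mac`), the polarity of the select signals, and the use of a multiply-accumulate as the "data operation" are assumptions made purely for illustration.

```python
def mux2(select, in0, in1):
    """Two-input multiplexer: pass in1 when select is true, otherwise in0."""
    return in1 if select else in0


def step_row(act_regs, act_in, broadcast):
    """Advance the activation registers of one row of processing units by one clock.

    act_regs  -- current activation register values, left to right
    act_in    -- activation value presented at the row input this cycle
    broadcast -- select signal of the first multiplexers (polarity is an assumption)
    """
    next_regs = []
    for i in range(len(act_regs)):
        # Systolic path: the value held by the left neighbour (or the row input).
        from_previous = act_in if i == 0 else act_regs[i - 1]
        # First multiplexer: broadcast activation vs. previous unit's activation.
        next_regs.append(mux2(broadcast, from_previous, act_in))
    return next_regs


def pe_mac(activation, weight0, weight1, weight_select, partial_sum_in):
    """One processing-unit operation with two weight registers.

    The third multiplexer picks which weight register feeds the operation;
    a multiply-accumulate is assumed as the operation itself.
    """
    weight = mux2(weight_select, weight0, weight1)
    return partial_sum_in + activation * weight
```

With `broadcast` false, a new input value enters only the leftmost register and earlier values shift one unit to the right per clock; with `broadcast` true, every register in the row loads the input value in the same cycle.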

Description

Matrix and vector operation oriented systolic array structure

Technical Field

The application belongs to the technical field of digital integrated circuits, and particularly relates to a systolic array structure for matrix and vector operations.

Background

In recent years, the field of artificial intelligence has undergone a revolution, and one of its core driving forces is the rise of the large language model (LLM). These models exhibit unprecedented capabilities in natural language understanding, generation, translation, and reasoning. Current state-of-the-art LLMs are based on the Transformer architecture, which was proposed in 2017 and features a self-attention mechanism and a high degree of parallelism. Most of the computation in the Transformer algorithm consists of matrix-matrix and matrix-vector operations, which are characterized by very large matrix sizes and data volumes, so the efficiency of these operations directly determines the throughput and energy efficiency of the algorithm when it is deployed on a system. Conventional von Neumann architecture processors must frequently read operands from memory and write back intermediate results, creating a "memory wall" bottleneck. To alleviate this problem, the industry commonly employs on-chip multi-level caches, high-bandwidth memory (HBM), and dedicated acceleration units, but these are still limited by off-chip memory bandwidth and power consumption.

Inference with a Transformer-based large language model is divided into two stages: prefill and decode. The prefill stage feeds the input prompt into the model in a single pass and prepares for subsequent decoding; this stage is compute-intensive and highly parallel. The decode stage generates new tokens one by one in an autoregressive manner, based on the KV cache produced in the prefill stage and the first token. This stage is memory-bandwidth-intensive: its operations are "small batch, high concurrency" matrix-vector multiplications (GEMV), and only the attention for a single token needs to be computed each time. Because the process is serial, it cannot be massively parallelized on a GPU, and the current industry solution is to compute several values together by batching, which alleviates the low utilization of hardware units.

A systolic array is an array formed by a plurality of identical processing units. Its core idea is to let data flow through the array, so that memory access cost is converted into on-chip data reuse and the number of memory accesses is reduced; the structure is also more regular and the wiring more uniform, which raises the achievable clock frequency and allows computation and communication to overlap. However, the decode stage requires matrix-vector operations, and multiple groups of data may need to be operated on simultaneously. Systolic arrays suffer from low hardware utilization when computing matrix-vector operations, and cannot operate on multiple groups of data simultaneously when batching is used, resulting in low computational efficiency.

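The low utilization of a systolic array on matrix-vector work can be made concrete with a rough estimate. The sketch below assumes a conventional weight-stationary systolic array of `rows` x `cols` processing units with skewed activation input, ignores weight loading time, and counts pipeline fill and drain cycles. The function name and the timing model are illustrative assumptions, not figures taken from the application.

```python
from fractions import Fraction


def systolic_utilization(rows, cols, num_vectors):
    """Estimate the fraction of PE-cycles doing useful work.

    Assumed model: vector b reaches the processing unit at (row r, column c)
    in cycle b + r + c, so the last operation finishes after
    num_vectors + rows + cols - 2 cycles. Weight loading is ignored.
    """
    total_cycles = num_vectors + rows + cols - 2
    useful_ops = num_vectors * rows * cols
    available_ops = rows * cols * total_cycles
    return Fraction(useful_ops, available_ops)


if __name__ == "__main__":
    # Compare a single vector (GEMV) with progressively larger batches.
    for batch in (1, 8, 64):
        print(batch, float(systolic_utilization(16, 16, batch)))
```

Under this model, a single vector on a 16 x 16 array keeps the units busy for only 1/31 of the available PE-cycles, and the fraction rises as more vectors are streamed back to back, which is why batching helps only when several groups of data can actually be processed together.
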
Disclosure of Invention

The application discloses a systolic array structure for matrix and vector operations, which can improve computation density and parallelism, and can improve hardware resource utilization and the efficiency of computing multiple groups of data during matrix-vector operations. Other objects and advantages of the present application will be further appreciated from the technical features disclosed in the present application.

To achieve one, some, or all of the above or other objects, in a first aspect, the present application provides a systolic array structure for matrix and vector operations. The systolic array structure includes a plurality of processing units arranged in a matrix, wherein the processing units are connected in the transverse direction by a first multiplexer and connected in series in the longitudinal direction, each column uses a preset number of processing units as a group, and the groups of processing units are connected by a second multiplexer.

Further, the two input signals of the first multiplexer are, respectively, broadcast activation data and activation data output by the previous processing unit, and the two input signals of the second multiplexer are, respectively, the output of the previous group of processing units and another input data path.

Further, the first multiplexer is configured to implement a broadcast connection of the activation data in the transverse direction.

Further, the processing unit comprises a weight register for loading weight data, an activation register for storing activation data, and an output register for registering the output data of the operation result.

Further, the weight registers of each group of processing units are connected in series in the longitu