CN-121979489-A - Multi-precision MAC tree type processing unit and systolic array structure

Abstract

The application relates to the technical field of digital integrated circuits, and in particular to a multi-precision MAC tree-type processing unit and a systolic array structure. The processing unit comprises a MAC tree structure, which in turn comprises a multi-precision multiplication module, an addition tree module and a node structure. The multi-precision multiplication module multiplies a plurality of groups of first precision data and second precision data to determine the corresponding first product results and second product results. The addition tree module comprises a first hierarchical structure and a second hierarchical structure used for determining the product accumulation result. The first hierarchical structure comprises N node structures, each comprising an adder and a first data selector: the adder receives the first product results and adds them to generate a first addition result, and the two input signals of the first data selector are the first addition result and the second product result, respectively. The application can mitigate operation delay and storage bottlenecks in matrix operations.

Inventors

LI YUEHANG
HUANG ZHIHONG
CAI GANG
WEI YUCHENG

Assignees

北京中科亿海微电子技术研究院有限公司

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-23

Claims (10)

1. A multi-precision MAC tree processing unit, characterized in that it is applied to a systolic array structure and comprises a MAC tree structure for realizing multiply-accumulate operations, an accumulator for executing the accumulate operation, and an output register for outputting operation results, wherein the MAC tree structure comprises: a multi-precision multiplication module for multiplying a plurality of groups of first precision data and second precision data and determining corresponding first product results and second product results, wherein the number of bits of the first precision data is smaller than that of the second precision data; and an addition tree module comprising a first hierarchical structure and a second hierarchical structure used for determining a product accumulation result, wherein the first hierarchical structure comprises N node structures, each node structure comprises an adder and a first data selector, the adder is used for receiving the first product results to carry out an addition operation and generate a first addition result, and the two input signals of the first data selector are the first addition result and the second product result, respectively.
2. The multi-precision MAC tree processing unit according to claim 1, wherein the multi-precision multiplication module comprises a data preprocessing unit, 2N encoders and 2N compressors, wherein the data preprocessing unit is used for determining the precision of the input data and the mantissas of the input data, dividing the mantissas into data segments and sending the segments to the 2N encoders respectively; the 2N encoders are used for performing an encoding operation to generate 2N partial products and sending the 2N partial products to the 2N compressors respectively; and the compressors are used for performing partial product compression on the partial products to generate carry chain data and sum chain data, and sending the carry chain data into the addition tree module for addition operation.
3. The multi-precision MAC tree processing unit according to claim 2, wherein the number N is related to the precision of the input data.
4. The multi-precision MAC tree processing unit of claim 2, wherein an encoding register is provided between the encoder and the compressor, and a compression register is provided between the compressor and the addition tree module.
5. The multi-precision MAC tree processing unit of claim 1, wherein the second hierarchy comprises N-1 adders arranged in a tree topology cascade.
6. The multi-precision MAC tree processing unit of claim 5, wherein each adder is followed by a register.
7. The multi-precision MAC tree processing unit of claim 6, wherein the first data selector has one input coupled to a register in the same node structure and another input coupled to the multi-precision multiplication module.
8. The multi-precision MAC tree processing unit as claimed in claim 1, wherein the processing unit further comprises a weight register for registering weight data, an activation register for registering activation data, a pipeline register for buffering intermediate operation results, a second data selector, and a third data selector, wherein an input signal of the second data selector comprises an operation result of a previous stage processing unit and an operation result of a present stage processing unit, and an input signal of the third data selector comprises an operation result of a next stage processing unit and an operation result of a present stage processing unit.
9. A systolic array structure comprising a plurality of the multi-precision MAC tree processing units as claimed in any one of claims 1 to 7 arranged in a matrix.
10. The systolic array structure according to claim 9, wherein the systolic array structure comprises a weight buffer connected to the processing unit for inputting weight data to the processing unit, and an input buffer for inputting activation data to the processing unit, and the weight buffer and the input buffer are respectively used for arranging and temporarily storing multiple sets of weight data and activation data according to data precision.
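
As a rough behavioral illustration of the addition tree in claims 1 and 5, the following Python sketch models the first hierarchical structure as N nodes (an adder followed by a two-input selector) and the second hierarchical structure as a tree of N-1 adders. It is not the claimed circuit: the names `node` and `mac_tree`, the `high_precision` flag, and the assumption that each node's adder sums a pair of first product results are hypothetical choices made for this example, and registers and bit widths are ignored.

```python
from typing import List, Sequence, Tuple


def node(first_pair: Sequence[int], second_product: int, high_precision: bool) -> int:
    """One first-level node: an adder followed by a two-input selector.

    Assumption (not stated in the claims): the adder sums a pair of
    first (low-precision) product results.
    """
    first_addition = first_pair[0] + first_pair[1]
    # The selector chooses between the first addition result and the
    # second (high-precision) product result.
    return second_product if high_precision else first_addition


def mac_tree(first_products: Sequence[int],
             second_products: Sequence[int],
             high_precision: bool) -> Tuple[int, int]:
    """Return (product accumulation result, number of second-level adders used)."""
    n = len(second_products)
    assert len(first_products) == 2 * n, "this sketch assumes 2N first products"

    # First hierarchical structure: N node structures.
    level: List[int] = [
        node(first_products[2 * i:2 * i + 2], second_products[i], high_precision)
        for i in range(n)
    ]

    # Second hierarchical structure: pairwise reduction in a tree.
    adders = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])
            adders += 1
        if len(level) % 2:  # an odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0], adders


# Usage with N = 4: both modes go through the same N-1 = 3 tree adders.
print(mac_tree([1, 2, 3, 4, 5, 6, 7, 8], [10, 20, 30, 40], high_precision=False))  # (36, 3)
print(mac_tree([1, 2, 3, 4, 5, 6, 7, 8], [10, 20, 30, 40], high_precision=True))   # (100, 3)
```

Under these assumptions, the selector is what lets one tree serve both precisions: in low-precision mode the nodes pre-add pairs of products, while in high-precision mode the products bypass the node adders and enter the tree directly.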

Description

Multi-precision MAC tree type processing unit and systolic array structure

Technical Field

The application belongs to the technical field of digital integrated circuits, and particularly relates to a multi-precision MAC tree-type processing unit and a systolic array structure.

Background

With the rapid development of deep learning technology, large language models (LLMs) exhibit unprecedented semantic understanding and generation capabilities in natural language processing, dialogue systems, code generation, writing assistance and other fields. From early architectures based on recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), to the advent of the Transformer, to the very-large-scale pre-trained models of recent years represented by GPT, LLaMA, DeepSeek and others, parameter counts have rapidly expanded from the hundreds of millions to the billions, hundreds of billions, and even trillions. At the same time, model structures have extended from a single decoder to diverse topologies including encoder-decoder and sparse mixture of experts (MoE). However, the explosive growth in parameter scale brings massive computing and storage requirements: a single forward inference pass often requires hundreds of billions of matrix multiplication operations to be completed within milliseconds, and frequently accesses off-chip high-bandwidth memory (HBM) to move weights, activations and the key-value cache. As a result, traditional general-purpose CPU and GPU architectures are gradually exposing bottlenecks in energy efficiency, latency and scalability. To alleviate these problems, the industry commonly adopts algorithm-hardware co-optimization.
At the algorithm level, researchers reduce the amount of computation and memory access through techniques such as sparsification, quantization, distillation, dynamic batching and KV-Cache compression. At the hardware level, architectural innovations such as tensor cores, systolic arrays, reconfigurable dataflow accelerators, near-memory computing and optical interconnects have emerged. The systolic array has become one of the main hardware paradigms for matrix multiplication and convolution because of its high parallelism, regular data flow, low control overhead and good scalability. Its core idea is to realize pipelined transmission and in-place computation of data through processing elements (PEs) regularly arranged in one or two dimensions, so that memory-access overhead is converted into on-chip shifting and the dependence on high-bandwidth off-chip storage is significantly reduced.

However, existing systolic array designs mainly target convolutional neural networks (CNNs), whose workload characteristics differ markedly from those of large language models. First, LLM inference is divided into two stages, prefill and decoding: prefill is dominated by high-dimensional matrix-matrix multiplication (GEMM), whereas decoding is dominated by small-batch, high-concurrency matrix-vector multiplication (GEMV), and the two have different operational characteristics. Second, with the introduction of the KV-Cache, on-chip storage must simultaneously accommodate weights, activations and the cache for very long sequences, which greatly increases the capacity and bandwidth requirements of on-chip static random-access memory (SRAM).
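
The systolic data flow described above, in which operands shift between neighbouring PEs while each PE computes in place, can be illustrated with a small cycle-by-cycle simulation. This is only a sketch under stated assumptions: it assumes an output-stationary arrangement (each PE keeps its own accumulator while activations shift right and weights shift down), which the text does not specify, and `systolic_matmul` and its register names are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def systolic_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Simulate C = A x B on an m-by-n grid of PEs, one clock cycle per loop step.

    Assumed (hypothetical) data flow: rows of A enter from the left edge and
    columns of B enter from the top edge, each skewed by one cycle per row or
    column, and every PE accumulates its own output element.
    """
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]    # per-PE accumulators
    a_reg = [[0] * n for _ in range(m)]  # operand registers shifting right
    b_reg = [[0] * n for _ in range(m)]  # operand registers shifting down

    for t in range(m + n + k - 2):       # cycles needed to drain the array
        # Visit PEs from bottom-right to top-left so each PE still sees the
        # value its left/upper neighbour held in the previous cycle.
        for i in reversed(range(m)):
            for j in reversed(range(n)):
                if j > 0:
                    a_in = a_reg[i][j - 1]
                else:                    # left edge: feed row i, skewed by i cycles
                    a_in = a[i][t - i] if 0 <= t - i < k else 0
                if i > 0:
                    b_in = b_reg[i - 1][j]
                else:                    # top edge: feed column j, skewed by j cycles
                    b_in = b[t - j][j] if 0 <= t - j < k else 0
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in  # in-place multiply-accumulate
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

In this model each operand is read from the edge buffers once and then reused by shifting between PEs, which is the sense in which memory accesses are traded for on-chip shifts.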
Therefore, when a traditional systolic array is applied to large language model inference, it is difficult to balance utilization, energy efficiency and latency.

Disclosure of Invention

The application discloses a multi-precision MAC tree-type processing unit and a systolic array structure, which can mitigate operation delay and storage bottlenecks in matrix operations and realize inference acceleration with high throughput, low latency and high energy efficiency. Other objects and advantages of the present application will be further appreciated from the technical features disclosed in the present application. To achieve one, some or all of the above or other objects, in a first aspect, the present application provides a multi-precision MAC tree processing unit, applied to a systolic array structure, where the processing unit includes a MAC tree structure for implementing a multiply-accumulate operation, an accumulator for performing the accumulate operation, and an output register for outputting an operation result, where the MAC tree structure includes: a multi-precision multiplication module for multiplying a plurality of groups of first precision data and second precision data and determining corresponding first product results and second product results, wherein the number of bits of the first precision data is smaller than that