CN-122018851-A - Multi-precision sparse tensor calculation system and method
Abstract
The invention discloses a multi-precision sparse tensor computing system and method. Through low-bit-width multiplier splicing, a two-dimensional broadcast network, and a sparse indexing technique, the design realizes a tensor computing unit supporting multiple matrix formats, multiple data precisions (INT4 to FP32), and structured sparsity (1:4, 2:4, and 3:4). The hardware structure of the tensor computing unit comprises a bus transmission module, a data scheduling module, a cache module, and a matrix computation module; it efficiently executes matrix multiply-accumulate operations and provides mixed-precision computation and overflow protection. The invention markedly reduces power consumption while improving computational parallelism and resource utilization, and is applicable to the fields of artificial intelligence and high-performance computing.
Inventors
- Liu Qiang
- Hua Yihao
- Shu Mingyu
- Dong Ran
Assignees
- Tianjin University (天津大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-02
Claims (10)
- 1. A multi-precision sparse tensor computing system, comprising: a bus data transmission module (1) for receiving control signals from the APB bus and data to be computed from the AXI bus, and for sending computed data to the AXI bus; a data scheduling module (2), connected to the bus data transmission module (1), for scheduling input data, storing it into the input/output data cache module (3) in a preset order, and scheduling the storage of computed data; an input/output data cache module (3), connected to the data scheduling module (2), for caching input data, feeding it cycle by cycle into the matrix computation module (4), receiving computation results, and outputting them after rearrangement; and a matrix computation module (4), connected to the input/output data cache module (3), for performing sparse data processing and matrix multiply-accumulate computation and transmitting the results to the input/output data cache module (3).
- 2. The multi-precision sparse tensor computing system of claim 1, wherein the matrix multiply-accumulate computation is performed in parallel by a two-dimensional broadcast network formed by routing modules in cooperation with an array of processing units, wherein along the path of matrix A data is broadcast in the column direction, and along the path of matrix B data is broadcast in the row direction, ensuring that each computing unit receives a distinct data combination and thereby implementing parallel multiply-accumulate operations.
- 3. The multi-precision sparse tensor computing system according to claim 1 or 2, wherein the matrix computation module (4) comprises 64 independent multi-precision computing units (processing units, PEs) organized in a two-dimensional broadcast network for parallel computation; each multi-precision computing unit consists of a multi-precision multiplier and a multi-precision adder, the multi-precision multiplier being assembled from 9 INT4 multipliers in a four-stage fully pipelined structure that supports both integer and floating-point computation; and each multi-precision computing unit is independently responsible for the multiply-accumulate operations of 4 channels, corresponding to 4 elements of the output matrix D.
- 4. The multi-precision sparse tensor computing system according to claim 3, wherein the data formats supported by the multi-precision computing unit include INT4, UINT4, INT8, UINT8, FP16, BF16, TF32, and FP32, and wherein the supported mixed-precision computation modes include INT4-INT32, INT8-INT32, UINT4-UINT32, UINT8-UINT32, FP16-FP32, BF16-FP32, and TF32-FP32.
- 5. The multi-precision sparse tensor computing system of claim 1, wherein the sparse data processing supports structured sparse formats comprising 1:4, 2:4, and 3:4 sparsity, wherein the valid-element index data of matrix A stored by routing cluster A in the input/output data cache module is recorded sequentially through 8 sparse-index FIFOs, and wherein, based on this index data, only non-zero elements are matched with the corresponding elements of matrix B in routing cluster B of the input/output data cache module and scheduled to participate in the sparse matrix multiply-accumulate computation.
- 6. The multi-precision sparse tensor computing system of claim 1, wherein the input/output data cache module (3) comprises routing clusters, wherein: routing cluster A adopts a row-wise grouping strategy and stores the data of matrix A; routing cluster B adopts a column-wise partitioning strategy and stores the data of matrix B; and routing clusters C and D buffer the data of the input matrix C and the output matrix D, respectively.
- 7. A multi-precision sparse tensor computation method, comprising the following steps: receiving control signals from an APB bus and data to be computed from an AXI bus; scheduling the data to be computed and storing it into an input/output data cache module in a preset order; reading data from the input/output data cache module cycle by cycle and feeding it into a matrix computation module; performing, in the matrix computation module, sparse data processing and matrix multiply-accumulate computation through a sparse indexing technique, wherein the matrix multiply-accumulate computation supports multiple matrix formats, multiple data formats, and multiple sparse formats; and transmitting the computation results to the input/output data cache module for rearrangement and outputting them to the AXI bus in row-major order.
- 8. The multi-precision sparse tensor computation method of claim 7, wherein the matrix multiply-accumulate computation supports matrix sizes of 16 × 16, 32 × 8 × 16, and 8 × 32 × 16, implements overflow protection for integer computation, and handles denormal-number, positive-overflow, negative-overflow, and NaN cases for floating-point computation to ensure agreement with software computation results.
- 9. The multi-precision sparse tensor computation method of claim 7, wherein in the matrix multiply-accumulate computation step, overflow protection is applied to integer data, and denormal-number, positive-overflow, negative-overflow, and NaN handling is applied to floating-point data, ensuring that the computation results conform to the IEEE 754 standard.
- 10. The method according to claim 7, wherein the step of performing sparse data processing by the sparse indexing technique comprises distributing the data blocks of matrix A in a row-cyclic manner and synchronously selecting, according to pre-stored sparse indexes, the elements of matrix B corresponding to the valid data positions of matrix A for matched computation.
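As an illustration of the multiplier splicing recited in claim 3, the following is a minimal Python sketch (not from the patent; all names are hypothetical). Claim 3 builds each multi-precision multiplier from 9 INT4 multipliers, enough to cover wider operands such as floating-point mantissas; the simpler case shown here composes one signed 8-bit product from four 4-bit partial products.

```python
def int8_mul_from_int4(a, b):
    # Split each signed 8-bit operand into a signed high nibble and an
    # unsigned low nibble, so that x = xh * 16 + xl.
    ah, al = a >> 4, a & 0xF
    bh, bl = b >> 4, b & 0xF
    # Four 4-bit partial products, shifted into position and summed,
    # reconstruct the full 8-bit x 8-bit product.
    return (ah * bh << 8) + ((ah * bl + al * bh) << 4) + al * bl
```

With 9 INT4 multipliers a 3 × 3 grid of such partial products is available, which is why (under this reading) the same hardware can serve both wider integers and floating-point mantissa multiplication.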
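Claim 4's mixed-precision modes pair low-precision inputs with a wider accumulator (e.g. FP16 operands, FP32 accumulation). A hedged Python sketch of the idea (names are hypothetical; Python floats are doubles, so the FP32 rounding is emulated with `struct`):

```python
import struct

def to_f32(x):
    # Round a Python float to IEEE 754 single precision.
    return struct.unpack('<f', struct.pack('<f', x))[0]

def mixed_mac(acc_f32, a_f16, b_f16):
    # Multiply low-precision operands, accumulate in FP32: widening the
    # accumulator avoids losing small contributions that a narrow
    # accumulator would round away across long dot products.
    return to_f32(acc_f32 + a_f16 * b_f16)
```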
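The index-matched selection in claims 5 and 10 can be sketched as follows (a hypothetical illustration, not the patent's implementation): a row of matrix A is stored compressed as its non-zero values plus their positions (the sparse-index FIFO), and only the B elements at those positions are fetched and multiplied.

```python
def sparse_dot(a_vals, a_idx, b_col):
    # a_vals: non-zero values kept from a row of A;
    # a_idx:  their positions in the original row (the sparse index data).
    # Only the elements of B at those positions participate, so the
    # zero entries of A cost no multiplications.
    return sum(v * b_col[i] for v, i in zip(a_vals, a_idx))

# 2:4 structured sparsity: two non-zeros kept per group of four.
dense_row = [0, 3, 0, 5, 2, 0, 0, 7]   # original row of A
a_vals    = [3, 5, 2, 7]               # compressed non-zero values
a_idx     = [1, 3, 4, 7]               # recorded positions
b_col     = [1, 2, 3, 4, 5, 6, 7, 8]
assert sparse_dot(a_vals, a_idx, b_col) == sum(x * y for x, y in zip(dense_row, b_col))
```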
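The integer overflow protection of claims 8 and 9 amounts to clamping the accumulator instead of letting it wrap. A minimal sketch under that assumption (function name is hypothetical):

```python
INT32_MAX, INT32_MIN = 2**31 - 1, -(2**31)

def saturating_acc(acc, product):
    # Clamp the accumulated sum to the INT32 range instead of wrapping,
    # so hardware results stay consistent with a saturating software model.
    s = acc + product
    return max(INT32_MIN, min(INT32_MAX, s))
```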
Description
Multi-precision sparse tensor calculation system and method

Technical Field

The invention belongs to the field of parallel computing and customized circuit design, in particular to customized hardware accelerators for tensor computing, and particularly relates to a sparse tensor computing method and system supporting multiple data precisions and multiple sparse formats.

Background

Tensor computation is a high-performance computing paradigm widely applied in recent years in fields such as deep learning, scientific computing, and robot optimization. Tensors generalize matrices to higher-dimensional spaces and naturally represent complex relationships among multidimensional data; a matrix is essentially a second-order tensor, so matrix computation is a special case of tensor computation. In computer vision, semantic understanding, and neural networks, data often carries multidimensional features such as time, space, channels, and dimensions, so using tensors as the underlying data structure aids unified representation and efficient computation. Tensor computation models and optimizes the relationships between data through high-dimensional linear-algebraic operations on tensors, such as tensor products, contractions, and decompositions. Compared with traditional vector or matrix computation, it can reduce redundant computation and improve memory-access efficiency while maintaining the integrity of the data structure. At the hardware level, tensor computation typically relies on parallel architectures such as GPUs and TPUs to meet its high-throughput requirements. GPUs are capable of general-purpose accelerated computation, but in the dedicated tensor-computation domain their speed and power consumption are often inferior to TPU-like customized circuit designs.
Customized circuit design shows remarkable performance and energy-efficiency advantages in tensor computation: it can be tailored and optimized to the characteristics of tensor operations (such as high-dimensional data parallelism, sparse-structure access, and dense matrix multiplication), reducing storage-access latency while guaranteeing high throughput. Implementing the tensor computation accelerator as a customized circuit fully exploits logic resources and on-chip storage that can operate in parallel, realizes stream processing and parallel computation of data, meets real-time computation requirements, and, by fully multiplexing circuit resources, supports tensor computation across multiple data precisions, sparsity types, and matrix sizes.

Disclosure of Invention

The invention aims to solve the insufficient flexibility of existing tensor computation technology and to meet the acceleration demands of high-throughput, low-energy matrix multiply-accumulate operations in high-performance computing and artificial-intelligence tasks, thereby providing a multi-precision sparse tensor computation system and method, and realizing a tensor computing unit design supporting multiple matrix formats, data precisions (INT4 to FP32), and structured sparsity (1:4, 2:4, and 3:4).
In order to achieve the above object, the present invention proposes the following technical solutions. In a first aspect, the invention proposes a multi-precision sparse tensor computing system comprising: a bus data transmission module 1 for receiving control signals from the APB bus and data to be computed from the AXI bus, and for sending computed data to the AXI bus; a data scheduling module 2, connected to the bus data transmission module 1, for scheduling input data, storing it into the input/output data cache module 3 in a preset order, and scheduling the storage of computed data; an input/output data cache module 3, connected to the data scheduling module 2, for caching input data, feeding it cycle by cycle into the matrix computation module 4, receiving computation results, and outputting them after rearrangement; and a matrix computation module 4, connected to the input/output data cache module 3, configured to perform sparse data processing and matrix multiply-accumulate computation and to transmit the results to the input/output data cache module 3. In some embodiments, the matrix multiply-accumulate computation is performed in parallel by a two-dimensional broadcast network formed by a routing module in conjunction with an array of processing units: along the path of matrix A, data is broadcast in the column direction, and along the path of matrix B, data is broadcast in the row direction, ensuring that each computing unit receives a distinct data combination.
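The row/column broadcast dataflow described above can be sketched in a few lines of Python (a hypothetical illustration, not the patent's circuit): on each broadcast step, one column slice of A and one row of B reach the whole PE array, and PE (i, j) accumulates its own element of the output, so all output elements are computed in parallel from distinct data combinations.

```python
def broadcast_matmul(A, B):
    m, k = len(A), len(A[0])
    n = len(B[0])
    # One accumulator per processing element (output-stationary dataflow).
    D = [[0] * n for _ in range(m)]
    for t in range(k):                       # one broadcast step per cycle
        a_col = [A[i][t] for i in range(m)]  # A broadcast along the column direction
        b_row = B[t]                         # B broadcast along the row direction
        for i in range(m):
            for j in range(n):
                D[i][j] += a_col[i] * b_row[j]  # multiply-accumulate in PE(i, j)
    return D
```

In hardware the two inner loops run concurrently across the PE array; only the loop over t is sequential, which is what makes the broadcast organization high-throughput.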