CN-121979483-A - Tensor operator based on RISC-V instruction set and intelligent processor
Abstract
The invention provides a tensor operator based on the RISC-V instruction set, comprising a micro-instruction splitting and scheduling unit, a reconfigurable computing array, a blocked register array, and a hardware multi-buffer unit. The micro-instruction splitting and scheduling unit receives a decoded macro tensor instruction and its operation-size parameters, splits the macro tensor instruction into a micro-instruction sequence according to the fixed physical scale of the reconfigurable computing array, realizes a zigzag traversal order and micro-instruction scheduling through a five-level hardware loop, and plans the loading, use, and replacement order of data blocks in the blocked register array. The reconfigurable computing array performs matrix multiply-accumulate operations in multiple data formats by fusing and reusing floating-point multipliers and multi-precision accumulation trees across data paths of different bit widths. A hardware-managed multi-buffer architecture, matched with the zigzag traversal order, overlaps data prefetching with compute execution (pipelining). The invention also provides an intelligent processor. The invention can thus efficiently execute variable-precision and variable-scale tensor operations.
Inventors
- GUO QI
- WEN YUANBO
- WANG ZHE
- XU GUANGLIN
Assignees
- Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
Dates
- Publication Date
- 20260505
- Application Date
- 20260228
Claims (10)
- 1. A tensor operator based on the RISC-V instruction set, characterized by comprising a micro-instruction splitting and scheduling unit, a reconfigurable computing array, a blocked register array, and a hardware multi-buffer unit; the micro-instruction splitting and scheduling unit is configured to receive a decoded macro tensor instruction and its operation-size parameters, split the macro tensor instruction into a micro-instruction sequence according to the fixed physical scale of the reconfigurable computing array, realize a zigzag traversal order and micro-instruction scheduling through a five-level hardware loop, plan the loading, use, and replacement order of data blocks in the blocked register array, and allow adjacent micro-operations to reuse data already resident in the blocked register array; the reconfigurable computing array is configured to perform matrix multiply-accumulate operations in multiple data formats through a reconfigurable data path that fuses and reuses floating-point multipliers and multi-precision accumulation trees across data paths of different bit widths; the hardware multi-buffer unit is arranged between the blocked register array and the reconfigurable computing array, is a multi-buffer architecture managed automatically by hardware, and cooperates with the zigzag traversal order to overlap data prefetching with compute execution (pipelining).
- 2. The tensor operator based on the RISC-V instruction set according to claim 1, wherein the micro-instruction splitting and scheduling unit adopts a distributed modular design comprising a first load-splitting module, a second load-splitting module, a compute-splitting module, and a write-back splitting module that cooperate with one another, each module internally containing a loop state machine matched to the zigzag traversal order; the compute-splitting module divides the output matrix corresponding to the macro tensor instruction into output sub-blocks fitting the physical size of the reconfigurable computing array and generates the micro-instruction sequence; the first and second load-splitting modules generate memory-access address sequences matched to the zigzag traversal order and manage the corresponding blocked registers; the write-back splitting module generates write-back addresses and control signals for computed output sub-blocks; and the modules cooperate through hardware synchronization signals: when data depended on by a micro-instruction generated by the compute-splitting module is not ready, hardware automatically stalls the compute-splitting module until the data is ready.
- 3. The tensor operator based on the RISC-V instruction set according to claim 1, wherein the reconfigurable computing array is a 3D matrix computing array of fixed physical scale composed of cM×cN processing units, each processing unit able to process cK consecutive multiply-accumulate operations in depth; the core of each processing unit is a configurable floating-point multiplier that performs multiplication in multiple data formats; the multi-precision accumulation tree is a high-precision Kulisch accumulation tree used to losslessly accumulate partial products or intermediate results output by floating-point multipliers of different precisions, after which rounding of the output is completed according to a target precision.
- 4. The tensor operator based on the RISC-V instruction set according to claim 3, wherein in brain-floating-point 16-bit (BF16) mode the mantissa multiplier of the floating-point multiplier operates in 8×8-bit mode and performs standard BF16-format multiplication, and in mixed-precision 8-bit floating-point (MXFP) mode the mantissa multiplier is split and reused as two independent 4×4-bit sub-mantissa multipliers, performing two MXFP-format multiplications in parallel.
- 5. The tensor operator according to claim 1, wherein the matrix multiply-accumulate operation performed by the reconfigurable computing array takes an input tensor matrix A of M×K size and an input tensor matrix B of K×N size, as specified by the macro tensor instruction, as operation inputs, and an output tensor matrix C of M×N size as the operation output; during operation each element of the output tensor matrix C is iteratively updated by multiply-accumulation, and full-dimension multiply-accumulation based on the element's initial value or intermediate partial-accumulation results yields the element's final result; during a complete matrix multiply-accumulate operation, each data block of input tensor matrix A can be reused N times without reloading, and each data block of input tensor matrix B can be reused sM times, where M, K, N are the operation-size parameters specified by the macro tensor instruction, sM is the block-level parameter of the M dimension, and cM is the M-dimension array parameter of the reconfigurable computing array.
- 6. The tensor operator based on the RISC-V instruction set according to claim 1, wherein the five-level hardware loop is mapped into a six-dimensional nested hardware loop adapted to complete the matrix multiply-accumulate operation; from outermost to innermost, the six-dimensional nested hardware loop consists of a K-dimension partition loop, an M-dimension partition loop, an N-dimension partition loop, a K-dimension stripe loop, an N-dimension column-block loop, and an M-dimension row-block loop; generation and scheduling of the micro-instruction sequence are completed in combination with the control logic of the K-dimension-related loops, realizing the zigzag traversal order over the output matrix.
- 7. The tensor operator based on the RISC-V instruction set according to claim 1, wherein the hardware multi-buffer unit has a double-buffer structure: for each set of input tensor data, two sets of physical registers are configured as buffers, a current working buffer and a preparation buffer; while the reconfigurable computing array executes micro-instructions on data in the current working buffer, data-loading logic preloads the next data block to be used into the preparation buffer according to the zigzag traversal order; the organization and supply of tensor data inside each set of buffers follows the zigzag traversal order, ensuring that a reusable data block remains resident in its buffer across multiple consecutively executed micro-operations.
- 8. The tensor operator of claim 1, wherein the blocked register array matches both the physical size parameters and the tensor blocking parameters of the reconfigurable computing array and comprises: a first set of blocked registers comprising sM×sK first registers, each storing a tensor data block of cM×cK size; a second set of blocked registers comprising sK×sN second registers, each storing a tensor data block of cK×cN size; and a third set of blocked registers comprising sM×sN third registers, each storing a tensor data block of cM×cN size for temporarily holding partial and/or final output tensor results.
- 9. The tensor operator based on the RISC-V instruction set according to claim 1, wherein the macro tensor instruction processed by the micro-instruction splitting and scheduling unit is a 64-bit fixed-length RISC-V tensor-extension instruction; its low 32 bits are fully aligned with the encoding specification of standard 32-bit RISC-V instructions, and its high 32 bits carry extension encoding information for the tensor operation; a 6-bit precision-encoding field is provided in the macro tensor instruction, the high 3 bits of which form code value x and the low 3 bits code value y, each ranging from 0 to 7; the exact data format of the current tensor operation is determined through a preset 8×8 floating-point format encoding table jointly indexed by x and y, and a corresponding precision configuration signal is generated and sent to the reconfigurable computing array.
- 10. An intelligent processor, comprising the tensor operator based on the RISC-V instruction set according to any one of claims 1-8.
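The zigzag scheduling of claims 1 and 6 can be illustrated in software. The sketch below is a hypothetical simplification (the outer three of the six loops only), assuming the block-level loop bounds sK, sM, sN named in the claims; the point is that reversing the innermost sweep direction on alternate passes keeps the block loaded last resident for the next micro-operation:

```python
def microinstruction_order(sK, sM, sN):
    """Yield (k, m, n) micro-operation coordinates in a zigzag order:
    the N sweep alternates direction on each M step, so the data block
    used by the last micro-operation of one row is reused first by the
    next row, instead of being evicted and reloaded."""
    ops = []
    for k in range(sK):                      # K-dimension partition loop (outermost)
        for m in range(sM):                  # M-dimension partition loop
            # reverse the N sweep on odd M steps -> zigzag traversal
            n_order = range(sN) if m % 2 == 0 else range(sN - 1, -1, -1)
            for n in n_order:                # N-dimension partition loop
                ops.append((k, m, n))
    return ops
```

With sK=1, sM=2, sN=3 the order is (0,0,0), (0,0,1), (0,0,2), (0,1,2), (0,1,1), (0,1,0): the turn at the row boundary repeats n=2, which is exactly the reuse opportunity the claims describe.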
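The multiplier reuse of claim 4 can be sketched arithmetically. The following is an illustrative model, not the patent's circuit: in BF16 mode one 8×8-bit unsigned multiply is performed, while in MXFP mode the same operand width is treated as two packed 4-bit sub-mantissas and two independent 4×4-bit multiplies are produced:

```python
def mantissa_multiply(mode, a, b):
    """Model of one reconfigurable mantissa multiplier (illustrative).
    Returns a tuple of products: one in BF16 mode, two in MXFP mode."""
    assert 0 <= a < 256 and 0 <= b < 256     # 8-bit operand registers
    if mode == "BF16":
        return (a * b,)                      # single 8x8-bit multiply
    if mode == "MXFP":
        # split each 8-bit input into two 4-bit sub-mantissas and
        # perform two independent 4x4-bit multiplies in parallel
        a_hi, a_lo = a >> 4, a & 0xF
        b_hi, b_lo = b >> 4, b & 0xF
        return (a_hi * b_hi, a_lo * b_lo)
    raise ValueError(mode)
```

This is why MXFP mode doubles the multiplication throughput per processing unit without additional multiplier area.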
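The double-buffer overlap of claim 7 amounts to a ping-pong scheme: compute consumes the working buffer while the load path fills the preparation buffer, and the roles swap each micro-operation. A minimal software sketch (sequential here; in hardware the load and compute steps run concurrently, which is the whole point):

```python
def run_pipeline(blocks, compute):
    """Ping-pong double buffering (illustrative): prefetch block i+1
    into the preparation buffer while block i is being computed."""
    buffers = [None, None]
    work = 0
    buffers[work] = blocks[0]                # initial fill of working buffer
    results = []
    for i in range(len(blocks)):
        prep = 1 - work
        if i + 1 < len(blocks):
            buffers[prep] = blocks[i + 1]    # prefetch next data block
        results.append(compute(buffers[work]))
        work = prep                          # swap working/preparation roles
    return results
```

Because the prefetch order follows the zigzag traversal, the block placed in the preparation buffer is always the one the very next micro-operation needs, so the compute array never stalls waiting for data.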
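The 6-bit precision field of claim 9 decodes mechanically: high 3 bits give x, low 3 bits give y, and (x, y) index an 8×8 format table. A sketch of the field extraction follows; the patent does not publish the table's contents, so only the index decode is shown:

```python
def decode_precision_field(field6):
    """Split the 6-bit precision-encoding field of claim 9 into the
    (x, y) pair that jointly indexes the 8x8 format table."""
    assert 0 <= field6 < 64          # field is 6 bits wide
    x = (field6 >> 3) & 0x7          # high 3 bits: code value x, range 0-7
    y = field6 & 0x7                 # low 3 bits: code value y, range 0-7
    return x, y
```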
Description
Tensor operator based on RISC-V instruction set and intelligent processor
Technical Field
The invention relates to the technical field of computer architecture, and in particular to a tensor operator and an intelligent processor based on the RISC-V instruction set.
Background
In a high-performance intelligent processor, the tensor operator is the core component for executing computation tasks such as deep learning. To improve performance and energy efficiency, the industry commonly employs specialized instruction sets and custom hardware. However, as artificial-intelligence models diversify, demands on computation precision and computation scale become increasingly complex and variable. The prior art mainly follows two design approaches. First, a dedicated operation unit of fixed precision and fixed size is designed: such hardware is structurally simple and extremely efficient on specific tasks, but cannot flexibly adapt to different precision and scale requirements, leading either to low hardware utilization or to systems forced to provision multiple operation units, increasing chip area and power consumption. Second, flexibility is achieved through software configuration using fully programmable vector or tensor processors, but these typically control complex data paths with fine-grained instruction streams, resulting in significantly lower computational density and energy efficiency than dedicated units. In recent years, with the rise of the RISC-V open-source ecosystem, RISC-V-based intelligent-processor extension schemes have attracted attention. For example, the RISC-V vector extension (RVV) and matrix extension (AME) provide underlying parallel computing instructions, but they do not specify a concrete implementation of the underlying arithmetic unit.
In practical hardware implementation, the key challenge is designing a tensor operator that inherits the upper instruction set's flexibility in supporting variable precision and operation sizes while retaining efficiency close to a dedicated circuit. In particular, when supporting coarse-grained tensor instructions such as matrix multiply-accumulate, efficient instruction-splitting and data-scheduling mechanisms are required whenever the operation scale described by an instruction exceeds the physical size of the arithmetic units. The prior art struggles to reconcile flexibility, efficiency, and complexity when implementing a tensor operator supporting variable precision and size: fixed-function operation units cannot adapt to diverse computation precisions and sizes, while fully programmable schemes are flexible but incur large instruction overhead and complex control, making high computational density and energy efficiency difficult to achieve. To support coarse-grained tensor instructions (e.g., instructions describing large-scale matrix multiplication), the tensor operator must be able to split macroscopic operations into micro-operations that fit the physical hardware size. However, conventional instruction-splitting methods such as sequential blocking tend to produce irregular reuse patterns for data held in the on-chip tile registers (TILE REGISTER), raising demands on register capacity and bandwidth and thereby limiting the data reuse rate and overall performance.
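The data-reuse argument above can be made concrete with a pure-Python blocked matrix multiply. This is an illustrative sketch, not the patent's hardware: in C[M,N] += A[M,K] · B[K,N], tiling the loops lets each A tile serve every N tile (and each B tile every M tile) while it stays resident, which is the reuse that an irregular traversal order forfeits:

```python
def blocked_matmul(A, B, M, K, N, tile):
    """Tiled matrix multiply C = A @ B on plain Python lists.
    The n0 loop reuses the current A tile (m0, k0) for every N tile
    without reloading it -- the data-reuse pattern the text describes."""
    C = [[0] * N for _ in range(M)]
    for k0 in range(0, K, tile):             # K-dimension partition
        for m0 in range(0, M, tile):         # M-dimension partition
            for n0 in range(0, N, tile):     # A tile (m0, k0) reused here
                for m in range(m0, min(m0 + tile, M)):
                    for n in range(n0, min(n0 + tile, N)):
                        acc = C[m][n]        # multiply-accumulate update
                        for k in range(k0, min(k0 + tile, K)):
                            acc += A[m][k] * B[k][n]
                        C[m][n] = acc
    return C
```

Any tile size yields the same numerical result as an untiled multiply; only the order of memory accesses, and hence the register-capacity and bandwidth pressure, changes.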
Some designs also lack efficient mechanisms, such as double buffering, for overlapping hardware data movement with computation during split execution, so the compute unit is easily left idle waiting for data and cannot be fully utilized, making it difficult to exploit the hardware's potential performance while maintaining a simple programming model. In summary, the prior art has inconveniences and defects in practical use, and improvement is needed.
Disclosure of Invention
In view of the above drawbacks, an object of the present invention is to provide a tensor operator and an intelligent processor based on the RISC-V instruction set that can efficiently perform tensor operations of variable precision and variable scale. To solve the technical problem, the invention is realized as follows. In a first aspect, an embodiment of the present invention provides a tensor operator based on the RISC-V instruction set, including a micro-instruction splitting and scheduling unit, a reconfigurable computing array, a blocked register array, and a hardware multi-buffer unit; the micro-instruction splitting and scheduling unit is configured to receive the decoded macro tensor instruction and its operation-size parameters, split the macro tensor instruction into a micro-instruction sequence according to the fixed physical scale of the reconfigurable computing array, realizing the zigzag traversal order and micro-instruction