CN-122018850-A - Matrix operation realization device, method, processor architecture, chip and equipment
Abstract
The application relates to the technical field of processors and provides a matrix operation implementation device, method, processor architecture, chip, and equipment. The device is part of a central processing unit (CPU) core and comprises a vector physical register set and a matrix operation unit; the matrix operation unit comprises a matrix buffer structure, a vector dot product array, and a control unit, wherein the control unit is used to load matrix data corresponding to a matrix operation instruction from the vector physical register set into the matrix buffer structure, read the matrix data from the matrix buffer structure, and provide it to the vector dot product array to execute the matrix operation. Embodiments of the application can avoid or alleviate the access conflicts that matrix multiplication imposes on the register file at low hardware and other overheads, thereby improving the energy efficiency and performance of the CPU when executing matrix-intensive workloads such as AI inference.
Inventors
- XUE DAQING
Assignees
- 成都群芯微电子科技有限公司
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2025-12-15
Claims (18)
- 1. A matrix operation implementation device, the device being part of a central processing unit core, the device comprising: a vector physical register set; and a matrix operation unit; wherein the matrix operation unit comprises: a matrix buffer structure; a vector dot product array; and a control unit configured to load matrix data corresponding to a matrix operation instruction from the vector physical register set into the matrix buffer structure, read the matrix data from the matrix buffer structure, and provide the matrix data to the vector dot product array to execute a matrix operation.
- 2. The matrix operation implementation device of claim 1, wherein the device is part of a vector execution unit of the central processing unit core.
- 3. The matrix operation implementation device of claim 1, wherein the matrix buffer structure comprises a matrix erasing board.
- 4. The matrix operation implementation device of claim 3, wherein the control unit is further configured to decompose the matrix operation instruction into at least one load micro-instruction and at least one compute micro-instruction adapted to the matrix erasing board.
- 5. The matrix operation implementation device of claim 4, wherein there are at least two matrix erasing boards, and the control unit is configured to have at least two of the matrix erasing boards alternately and concurrently participate in matrix data loading and matrix data computation.
- 6. The matrix operation implementation device of claim 4, wherein the control unit comprises a matrix multiplication controller, the matrix multiplication controller comprising: a matrix instruction queue for buffering received matrix multiplication instructions; a micro-instruction scheduling queue for storing the decomposed micro-instructions; and a finite state machine controller for dispatching the micro-instructions and managing the states of the matrix erasing boards.
- 7. The matrix operation implementation device of claim 1, wherein the vector dot product array comprises a plurality of vector dot product operation units supporting concurrent operations on multi-channel vector data.
- 8. The matrix operation implementation device of claim 7, wherein each of the vector dot product operation units comprises a two-stage computation structure, comprising: a first-stage computation structure for computing dot products of the multi-channel vector data to obtain vector dot product results; and a second-stage computation structure for accumulating the vector dot product results of the multi-channel vector data and outputting the accumulated result to a matrix physical register set partitioned from the vector physical register set.
- 9. The matrix operation implementation device of claim 6, wherein the finite state machine controller internally maintains a list of free global identifiers, the bit width of a global identifier in the list being sufficient to index all distinct matrix multiplication instructions pending execution in the matrix instruction queue and the micro-instruction scheduling queue, and a target bit of the global identifier indicating the matrix erasing board associated with it.
- 10. The matrix operation implementation device of claim 9, wherein the matrix multiplication controller is further configured to: when the matrix instruction queue is not full and a free global identifier exists in the list, receive a matrix multiplication instruction output by the dispatch unit of the central processing unit core, allocate a global identifier for the matrix multiplication instruction from the list, and write the matrix multiplication instruction, associated with the global identifier, to the tail of the matrix instruction queue.
- 11. The matrix operation implementation device of claim 9, wherein the matrix multiplication controller is further configured to: when the micro-instruction scheduling queue has a free entry, decompose the matrix multiplication instruction at the head of the matrix instruction queue into at least one load micro-instruction and at least one compute micro-instruction, write the load micro-instruction, the compute micro-instruction, and the global identifiers associated with them into the corresponding free entries in the micro-instruction scheduling queue, and, when all micro-instructions decomposed from the matrix multiplication instruction have been written into free entries of the micro-instruction scheduling queue, release the entry occupied by the matrix multiplication instruction in the matrix instruction queue.
- 12. The matrix operation implementation device of claim 9, wherein the matrix multiplication controller is further configured such that: the finite state machine controller monitors a first ready signal and a second ready signal corresponding to a load micro-instruction, the first ready signal indicating that the matrix physical register set corresponding to a source operand of the load micro-instruction is ready, and the second ready signal indicating that the target matrix erasing board corresponding to a destination operand of the load micro-instruction is ready; when the first ready signal and the second ready signal corresponding to the load micro-instruction have been received, and the global identifier of the target matrix erasing board of the load micro-instruction matches the global identifier held by the finite state machine controller, the finite state machine controller outputs an issue signal for the load micro-instruction to the micro-instruction scheduling queue, so that the load micro-instruction is taken out of the micro-instruction scheduling queue and dispatched to a corresponding load execution unit for execution; and when the load micro-instruction has been executed, the micro-instruction scheduling queue entry it occupied is released.
- 13. The matrix operation implementation device of claim 9, wherein the matrix multiplication controller is further configured such that: the finite state machine controller monitors a third ready signal and a fourth ready signal corresponding to a compute micro-instruction, the third ready signal indicating that a source operand register of the compute micro-instruction is ready, and the fourth ready signal indicating that the vector dot product array corresponding to a destination operand of the compute micro-instruction is ready; when the third ready signal and the fourth ready signal corresponding to the compute micro-instruction have been received, and the global identifier of the source operand register of the compute micro-instruction matches the global identifier held by the finite state machine controller, the finite state machine controller outputs an issue signal for the compute micro-instruction to the micro-instruction scheduling queue, so that the compute micro-instruction is taken out of the micro-instruction scheduling queue and dispatched to a corresponding compute execution unit for execution; and when the compute micro-instruction has been executed, the micro-instruction scheduling queue entry it occupied is released.
- 14. A processor architecture comprising the matrix operation implementation device of any one of claims 1 to 13.
- 15. A chip comprising the matrix operation implementation device of any one of claims 1 to 13.
- 16. A matrix operation implementation method, applied to the device of any one of claims 1 to 13, the method comprising: receiving a matrix operation instruction; loading matrix data corresponding to the matrix operation instruction from a vector physical register set into a matrix buffer structure; and reading the matrix data from the matrix buffer structure and providing it to a vector dot product array to execute a matrix operation.
- 17. The matrix operation implementation method of claim 16, wherein the matrix buffer structure comprises a matrix erasing board, and the method further comprises, after receiving the matrix operation instruction: decomposing the matrix operation instruction into at least one load micro-instruction and at least one compute micro-instruction adapted to the matrix erasing board.
- 18. A computer device comprising a memory, a processor, and a computer program stored in the memory, wherein the computer program, when executed by the processor, carries out the method of claim 16 or 17.
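The queue and identifier management described in claims 5, 6, 10 and 11 can be sketched as a toy software model. This is an illustrative sketch only, not the patented hardware: the class and method names, the queue depths, the three-micro-op decomposition, and the round-robin choice of erasing board are all assumptions made for the example.

```python
from collections import deque

NUM_BOARDS = 2  # claim 5: at least two erasing boards, used alternately


class MatrixMultiplyController:
    """Toy model of the matrix multiplication controller of claim 6:
    a matrix instruction queue, a micro-instruction scheduling queue,
    and a list of free global identifiers (claim 9)."""

    def __init__(self, num_ids=4, queue_depth=4):
        self.free_ids = list(range(num_ids))            # free-identifier list
        self.instr_queue = deque(maxlen=queue_depth)    # matrix instruction queue
        self.uop_queue = deque(maxlen=queue_depth * 3)  # micro-instruction queue
        self.next_board = 0

    def accept(self, instr):
        # claim 10: accept only if the queue is not full and a free id exists
        if len(self.instr_queue) == self.instr_queue.maxlen or not self.free_ids:
            return False
        gid = self.free_ids.pop(0)
        self.instr_queue.append((gid, instr))
        return True

    def decompose(self):
        # claim 11: split the head instruction into load + compute micro-ops,
        # then release its matrix-instruction-queue entry
        if not self.instr_queue:
            return
        gid, instr = self.instr_queue[0]
        board = self.next_board
        self.next_board = (self.next_board + 1) % NUM_BOARDS  # alternate boards
        uops = [("load_a", gid, board), ("load_b", gid, board),
                ("compute", gid, board)]
        if len(self.uop_queue) + len(uops) <= self.uop_queue.maxlen:
            self.uop_queue.extend(uops)
            self.instr_queue.popleft()

    def issue(self):
        # simplified in-order issue; recycle the global id once the final
        # (compute) micro-op of an instruction has been issued
        if not self.uop_queue:
            return None
        kind, gid, board = self.uop_queue.popleft()
        if kind == "compute":
            self.free_ids.append(gid)
        return (kind, gid, board)
```

In the real design, `issue` would additionally gate on the ready signals and identifier matching of claims 12 and 13; the model issues strictly in order to keep the sketch short.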
Description
Matrix operation realization device, method, processor architecture, chip and equipment

Technical Field

The present application relates to the field of processor technologies, and in particular to a matrix operation implementation device, a matrix operation implementation method, a processor architecture, a chip, and a device.

Background

Supporting matrix operations on the CPU side keeps the programming model simple and requires no additional heterogeneous accelerator card, which is a great advantage when deploying AI inference workloads. Consequently, with the widespread use of artificial intelligence technology, neural-network-based AI inference has been widely deployed on central processing units (CPUs). The core of such workloads is a large number of matrix multiplication operations. To perform these computations efficiently on a general-purpose CPU, modern CPU instruction set architectures (ISAs) typically incorporate specialized matrix computing instructions and integrate corresponding matrix execution units in the hardware pipeline. The instruction set provides a group of matrix registers as the software programming interface, allows matrix data to be loaded from main memory into those registers, and completes the multiply-add operations directly in the core, thereby avoiding the memory-access bottleneck of traditional approaches and improving computational efficiency. In a CPU microarchitecture that supports matrix operations, one key design challenge is how to efficiently supply operand data to the matrix multiplication units, and in particular how to manage access to the register file.
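The multiply-add operation that such matrix instructions accelerate is, at its core, one vector dot product per output element. A minimal reference model in Python (illustrative only; plain nested lists stand in for matrix registers):

```python
def matmul_via_dot_products(A, B):
    """Matrix multiply expressed as row-of-A x column-of-B dot products --
    the operation the matrix execution units described here accelerate.
    A is n x k, B is k x m; returns the n x m product C."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # one vector dot product per output element C[i][j]
            C[i][j] = sum(A[i][t] * B[t][j] for t in range(k))
    return C
```

A hardware matrix unit computes many of these dot products per cycle, which is precisely why it must read one row of A and many columns of B simultaneously.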
Currently, many modern CPUs construct matrix registers by multiplexing the existing vector physical register set (Vector Physical Register File, VPRF): when the CPU core enters matrix operation mode, a predefined subset of the vector physical registers is logically bound together to form the required matrix registers. A typical microarchitecture that implements matrix multiplication by multiplexing a VPRF that also supports vector operations is shown in FIG. 1. Referring to FIG. 1, a matrix memory-access instruction (such as MatrixLoad or MatrixStore in FIG. 1) may be dispatched by the dispatch unit to the load/store unit (LSU) to load matrix data into the matrix physical register set (Matrix PRF, MPRF, in FIG. 1); a matrix multiplication instruction (such as MatrixMad in FIG. 1) is dispatched to a schedule queue (ScheduleQueue), and once the matrix data is ready in the MPRF, a matrix multiplication controller (MMC) dispatches it to the matrix multiplication execution unit (the VDPB array) to complete the multiply-add operation, with the result eventually written back to the MPRF; finally, a matrix store instruction (MatrixStore) writes the result back to main memory through the LSU. However, this approach has a significant performance bottleneck: when performing a matrix multiply-add, the computation unit (e.g., a vector dot product array) typically needs to read one row vector of matrix A and multiple column vectors of matrix B simultaneously in a single cycle. For example, for a result matrix producing 16 elements per cycle, it may be necessary to read 1 row vector of matrix A and 16 column vectors of matrix B simultaneously, meaning the vector physical register set must support up to 17 read-port operations in one cycle.
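The port count quoted above follows from simple counting: an output tile of N elements per cycle requires reading N column vectors of B plus 1 row vector of A in the same cycle. A small sketch of that arithmetic (the function name is an assumption made for this example, not a term from the patent):

```python
def vprf_read_ports(out_elems_per_cycle: int) -> int:
    """Simultaneous VPRF read ports needed per cycle in the
    multiplexed-VPRF scheme described above: one row vector of A
    plus one column vector of B per output element."""
    rows_of_a = 1
    cols_of_b = out_elems_per_cycle
    return rows_of_a + cols_of_b

# A 16-element-per-cycle result tile needs 1 + 16 = 17 read ports.
```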
Implementing such a multi-port register file poses significant design challenges. On the one hand, a multi-port register file dramatically increases chip area and routing complexity; on the other hand, it places great stress on timing closure and may limit increases in CPU clock frequency. Moreover, the continuous, high-intensity occupation of register ports by matrix operations can severely block the scheduling of other vector instructions (especially matrix data load/store instructions), making it difficult to overlap computation with data movement, and thereby reducing the overall execution efficiency and hardware utilization of matrix multiplication. An innovative microarchitecture scheme is therefore urgently needed that, while retaining the advantages of the CPU's general programming model, avoids or alleviates the access conflicts of matrix multiplication on the register file at low hardware and other overheads, so as to improve the energy efficiency and performance of the CPU when executing matrix-intensive workloads such as AI inference.

Disclosure of Invention

The embodiments of the application aim to provide a matrix operation implementation device, a matrix operation implementation method, a processor architecture, a chip, and a device, so as to avoid or alleviate the access conflicts of matrix multiplication on the register file with lower hardware and other overheads.