CN-121996609-A - AI processor based on compute-in-memory, three-dimensional integration and PCIe remapping
Abstract
The application relates to an AI processor based on compute-in-memory, three-dimensional integration and PCIe remapping, comprising at least one storage unit, at least one computing unit and a PCIe interface device. The storage unit comprises a storage circuit formed of a plurality of storage arrays and is configured to store input feature data of the input neural network model, neural network weight data, feature data and intermediate calculation results. The at least one computing unit is communicatively coupled with the at least one storage unit through three-dimensional integration technology, comprises a compute-in-memory-array-based neural network processing unit, and is configured to execute neural-network-related computation using data from the storage unit. The PCIe interface device is compatible with at least one PCIe memory standard protocol and is configured to parse the header information of transaction layer packets and remap it into computing instructions recognizable by the computing unit according to a preset remapping mechanism. Through this remapping mechanism over the standard PCIe interface, low-cost, highly integrated and efficient neural network computation is realized.
Inventors
- CHEN PEIYU
- GE XIAOHUAN
- WANG ZHIXUAN
- LIU YING
Assignees
- 无锡微纳核芯电子科技有限公司
- 杭州微纳核芯电子科技有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-02-02
Claims (20)
- 1. An AI processor based on compute-in-memory, three-dimensional integration and PCIe remapping, comprising: at least one storage unit, comprising a storage circuit formed of a plurality of storage arrays and configured to store input feature data of the input neural network model, neural network weight data, feature data and intermediate calculation results; at least one computing unit, communicatively coupled to the at least one storage unit by a three-dimensional integration technique, the computing unit including a compute-in-memory-array-based neural network processing unit for performing neural-network-related computations using data of the storage unit; and a PCIe interface device, compatible with at least one PCIe memory standard protocol and capable of parsing the header information of transaction layer packets and remapping it into computing instructions recognizable by the computing unit based on a preset remapping mechanism.
- 2. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of claim 1, wherein the PCIe interface device is configured to: declare a contiguous reserved address space as an instruction space through a base address register of the PCIe configuration space; monitor received transaction layer packets conforming to the PCIe specification in real time, comparing the address field in each packet header with the address range of the instruction space; and, when the address field in a packet header is detected to fall within the instruction space, intercept that packet and extract the content of its data segment as a computation control instruction.
- 3. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of claim 2, wherein the computing unit further comprises an instruction parsing unit; the PCIe interface device is configured to send the extracted computation control instruction to the instruction parsing unit; and the instruction parsing unit is configured to parse out control information comprising an instruction type, computation parameters and data addresses, the control information driving the computing unit of the AI processor to execute the corresponding neural network operation.
- 4. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of claim 1, wherein the PCIe interface device is further configured to: monitor the format field and the type field in the transaction layer packet header; and, when the format field is detected to equal a first preset value and the type field a second preset value, identify the monitored packet as a custom computation control instruction; wherein the combination of the first preset value and the second preset value falls within fields left undefined by the PCIe standard protocol, there are a plurality of second preset values, and different second preset values are respectively mapped to a computation task issuing instruction, a computation state query instruction, a computation result reading instruction and a device configuration instruction.
- 5. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of claim 1, wherein the PCIe interface device is further configured to: extract computation parameters from a custom control parameter area of the transaction layer packet header, the computation parameters comprising a computation type identifier, computation precision, matrix dimension information, the physical address of the input data in the storage unit, and an output data address; wherein the monitored transaction layer packet header accommodates the custom control parameter area through an extended bit width.
- 6. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of claim 1, wherein the protocol remapping mechanism includes mapping first memory data to computing instructions recognizable by the computing unit, the first memory data being memory data read from or written to a specific address space of the storage circuit of the storage unit.
- 7. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of claim 1, wherein the protocol remapping mechanism comprises: defining reserved fields in the standard PCIe instruction set as instructions recognizable by the computing unit; or reusing part of the standard PCIe instructions so that they simultaneously serve as instructions recognizable by the computing unit.
- 8. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of any one of claims 1-7, wherein the compute-in-memory-array-based neural network processing unit is configured to: receive input feature data of the neural network and weight data from the storage unit; perform neural network computation based on the weight data and the input feature data; and output the computation result to the storage unit or an external device.
- 9. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of any one of claims 1-7, wherein the PCIe interface device is compliant with a PCIe memory standard protocol and is capable of being recognized as a standard storage device and accessed in memory format by a host system or by a memory controller supporting the PCIe standard.
- 10. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of any one of claims 1-7, wherein the three-dimensional integration technology includes at least one of through-silicon via technology, flip-chip technology, hybrid bonding technology, and micro-bump connection technology.
- 11. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of any one of claims 1-7, wherein the PCIe memory standard protocol supported by the PCIe interface device is selected from at least one of PCIe 4.0, PCIe 5.0, PCIe 6.0 and PCIe 7.0, to adapt to different bandwidth requirements.
- 12. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of any one of claims 1-7, wherein the stacking order of the storage unit and the computing unit is interchangeable to accommodate different heat dissipation and stress requirements.
- 13. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of any one of claims 1-7, further comprising: an interface checking circuit, configured to perform error detection and correction on the command and data information exchanged with the external main control chip through the PCIe interface device, to prevent erroneous data from entering the storage unit or the computing unit; and an on-chip checking circuit, configured to perform error detection and correction on the data stored in the storage unit and on intermediate calculation results in the computing unit, to ensure the reliability of data within the AI processor during cross-layer transmission and computation.
- 14. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of claim 13, wherein the PCIe interface device is disposed at the storage unit, and the interface checking circuit and the on-chip checking circuit are disposed at the storage unit.
- 15. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of claim 13, wherein the PCIe interface device is disposed at the computing unit, the interface checking circuit is disposed at the computing unit, and the on-chip checking circuit is disposed at the storage unit or the computing unit.
- 16. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of any one of claims 1-7, wherein the compute-in-memory array is implemented based on at least one of SRAM, ReRAM, MRAM or FeFET technologies.
- 17. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of any one of claims 1-7, wherein the computation results output by the compute-in-memory-array-based neural network processing unit are transmitted through at least one of: direct writing to a designated storage area of the storage unit; transmission to an external host chip through the PCIe interface device; and transmission to other expansion storage units through the three-dimensional integrated connection structure.
- 18. The compute-in-memory, three-dimensional integration and PCIe remapping-based AI processor of claim 7, wherein "reusing part of the standard PCIe instructions" specifically means: selecting a less frequently used instruction or a reserved instruction in the PCIe memory standard protocol and redefining it as a control instruction of the computing unit, so that the instruction simultaneously carries its original storage unit control function and the computing unit control function.
- 19. The AI processor of any one of claims 1-7, further comprising a transmission unit stacked with the storage unit and the computing unit by a three-dimensional integration technique, the PCIe interface device being disposed at the transmission unit.
- 20. The memory-integration, three-dimensional integration, and PCIe remapping-based AI processor of claim 19, wherein the PCIe interface device includes a PCIe protocol parser disposed at the transmission unit.
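The remapping flow of claims 2 and 4 can be illustrated with a minimal Python sketch. All concrete values here are assumptions for illustration only: the instruction-window base and size (which the patent says are declared via a base address register), the first preset value for the format field, and the mapping from second preset values to the four instruction kinds are not specified in the patent.

```python
# Hypothetical sketch of the PCIe TLP remapping mechanism (claims 2 and 4).
# Field values, window addresses, and opcode names are illustrative assumptions.

INSTR_WINDOW_BASE = 0xF000_0000      # instruction space declared via a BAR (assumed)
INSTR_WINDOW_SIZE = 0x0010_0000      # 1 MiB reserved instruction space (assumed)

FMT_CUSTOM = 0b011                   # assumed first preset value for the Fmt field
TYPE_TO_OPCODE = {                   # assumed second preset values -> instructions
    0b10000: "TASK_ISSUE",           # computation task issuing instruction
    0b10001: "STATUS_QUERY",         # computation state query instruction
    0b10010: "RESULT_READ",          # computation result reading instruction
    0b10011: "DEVICE_CONFIG",        # device configuration instruction
}

def remap_tlp(fmt, tlp_type, address, payload):
    """Return a compute instruction dict if the TLP header matches either
    remapping rule (address window of claim 2, or custom Fmt/Type of claim 4);
    return None to forward the TLP as ordinary memory traffic."""
    in_window = INSTR_WINDOW_BASE <= address < INSTR_WINDOW_BASE + INSTR_WINDOW_SIZE
    custom = fmt == FMT_CUSTOM and tlp_type in TYPE_TO_OPCODE
    if not (in_window or custom):
        return None                  # ordinary PCIe memory read/write
    # Data segment content becomes the computation control instruction operands.
    opcode = TYPE_TO_OPCODE.get(tlp_type, "TASK_ISSUE")
    return {"opcode": opcode, "operands": payload}
```

In this sketch a write landing inside the reserved window is intercepted even when its format and type fields are standard, matching the claim-2 path, while a custom format/type combination is recognized regardless of address, matching the claim-4 path.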
Description
AI processor based on compute-in-memory, three-dimensional integration and PCIe remapping

Technical Field

The application relates to the technical field of artificial intelligence, in particular to an AI processor based on compute-in-memory, three-dimensional integration and PCIe remapping.

Background

With the rapid development of artificial intelligence technology, computing models represented by deep neural networks (particularly large models based on the Transformer structure, such as Llama 2-7B) have been widely applied in fields such as image recognition and natural language processing. The self-attention mechanisms and encoder-decoder structures of such models place extremely high demands on the computational power and data bandwidth of the hardware: a single large-model inference may require trillions of multiply-accumulate operations, which traditional hardware architectures struggle to meet. In a traditional von Neumann architecture processor, the computing unit and the memory unit are physically separated, so data is frequently carried over the bus; limited by the bus bandwidth, this forms the "memory wall" bottleneck. For neural network computation, moving large amounts of weight data and feature data not only consumes energy but also severely limits hardware performance, making computing density and bandwidth a common constraint. To break this bottleneck, Compute-in-Memory (CIM) technology has been developed. Its core idea is to fuse the memory with the arithmetic unit: weight data is pre-stored in a memory array, and once input feature vectors are applied to the array, matrix-vector multiplication is executed directly in the memory, fundamentally reducing data movement and improving computing density and energy efficiency (TOPS/W).
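The CIM operation described above reduces, functionally, to a matrix-vector multiply performed where the weights reside. A minimal Python sketch of that functional behavior follows; the class name and interface are illustrative assumptions, and the in-array analog/digital accumulation of real CIM hardware is modeled here as an ordinary NumPy product.

```python
import numpy as np

class CIMArraySketch:
    """Functional model of a compute-in-memory array (illustrative only):
    weights stay stationary in the array; each input feature vector applied
    to the array yields the matrix-vector product in place, so no weight
    data crosses a bus per inference."""

    def __init__(self, weights):
        self.weights = np.asarray(weights)   # pre-stored, stationary weight matrix

    def mvm(self, features):
        # In hardware this is an in-array dot product along bitlines;
        # here it is modeled as a plain matrix-vector multiply.
        return self.weights @ np.asarray(features)
```

Keeping the weight matrix stationary is the point: only the (much smaller) feature vector and result move, which is what yields the TOPS/W advantage the text describes.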
However, CIM technology still requires external storage to provide feature data, and if the inter-chip communication bandwidth is insufficient, overall system performance remains limited. Three-dimensional integration (3D Integration) technology realizes vertical stacking and interconnection of chips through Through-Silicon Vias (TSVs), hybrid bonding and the like. With short paths and high interconnection density, it can provide bandwidth far exceeding that of traditional two-dimensional packaging and effectively relieves the memory wall between chips; High Bandwidth Memory (HBM) is a commercial embodiment of this technology and provides ample bandwidth support for high-performance computing. However, three-dimensional integration only solves the "data transmission pipeline" problem: if the computing engine remains an inefficient architecture, the value of the high-bandwidth link is wasted. The compute-in-memory technology and the three-dimensional integration technology are therefore complementary, and applying both at once can break through the bottleneck of the traditional architecture and greatly improve the performance of the chip system. However, the flexibility of interface interconnection remains another core pain point: the neural network processor must communicate with external devices (such as a CPU, a GPU or a mobile phone main chip) through an interface to form a complete system, yet most existing schemes adopt custom communication interfaces.
Such an interface requires co-design with the interconnected equipment, which narrows the device selection space: for example, an AI accelerator with a certain custom interface can only be matched with a main control chip from a specific manufacturer, cannot be made compatible with mainstream GPUs or mobile phone main chips on the market, and thus offers poor system integration flexibility and difficult overall efficiency optimization. Accordingly, there is a need to improve existing neural network processors.

Disclosure of Invention

In view of the above, the present application provides an AI processor (hereinafter also referred to as a neural network processor) based on compute-in-memory, three-dimensional integration and PCIe remapping, comprising: at least one storage unit, including a storage circuit composed of a plurality of storage arrays and configured to store input feature data of the input neural network model, neural network weight data, feature data and intermediate calculation results; at least one computing unit, communicatively coupled to the at least one storage unit by a three-dimensional integration technique, the computing unit including a compute-in-memory-array-based neural network processing unit configured to perform neural-network-related computations using data of the storage unit; and a PCIe interface device, compatible with at least one PCIe memory standard protocol, capable of parsing the header information of the transaction layer
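The custom control parameter area of claim 5 carries a computation type identifier, precision, matrix dimensions, and input/output addresses. The patent does not define a binary layout; the fixed little-endian layout below is a purely hypothetical sketch showing how such a parameter area could be packed into and recovered from the extended header bits.

```python
import struct

# Hypothetical layout for the custom control parameter area (claim 5).
# Field widths and ordering are assumptions for illustration only:
# type id (u8), precision in bits (u8), rows (u16), cols (u16),
# input physical address (u64), output address (u64) -> 22 bytes total.
PARAM_FMT = "<BBHHQQ"

def pack_params(calc_type, precision, rows, cols, in_addr, out_addr):
    """Pack computation parameters into the (assumed) header parameter area."""
    return struct.pack(PARAM_FMT, calc_type, precision, rows, cols, in_addr, out_addr)

def unpack_params(blob):
    """Recover the computation parameters the computing unit would consume."""
    calc_type, precision, rows, cols, in_addr, out_addr = struct.unpack(PARAM_FMT, blob)
    return {"calc_type": calc_type, "precision": precision,
            "rows": rows, "cols": cols,
            "input_addr": in_addr, "output_addr": out_addr}
```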