CN-122021755-A - AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism
Abstract
The application relates to an AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism, and to a control method for it. The processor comprises a memory unit, a computing unit, an interface device, and a control module. The memory unit comprises a plurality of memory arrays. The computing unit is connected to the memory unit by three-dimensional integration and comprises a neural network processing unit. The interface device carries out data transmission and instruction interaction with an external device. The control module is arranged in the memory unit and controls the memory unit to execute a memory request of the external device and a computation request of the computing unit in the same clock cycle: the memory unit allows the computing unit to access a first subset of the memory arrays through the three-dimensional integration connection channel so as to execute the computation request, and allows the external device to access a second subset of the memory arrays through the interface device so as to execute the memory request. Memory read-write and computation tasks are thereby executed simultaneously, memory access conflicts are avoided, and the throughput and bandwidth utilization of the system are improved.
Inventors
- WANG ZHIXUAN
- GE XIAOHUAN
- CHEN PEIYU
- LIU YING
Assignees
- 无锡微纳核芯电子科技有限公司
- 杭州微纳核芯电子科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-03
Claims (16)
- 1. An AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism, comprising: a memory unit comprising a plurality of memory arrays that can be independently addressed and accessed in parallel, for storing neural network weight data, neural network input feature data, neural network intermediate computation results, and general system data; a computing unit connected to the memory unit by three-dimensional integration, forming a three-dimensional integration connection channel, and comprising at least one neural network processing unit configured to receive externally input feature data and weight data read from the memory unit and to execute neural-network-related computation; an interface device for carrying out data transmission and instruction interaction with an external device; and a control module arranged in the memory unit and used for controlling the memory unit to execute a memory request of the external device and a computation request of the computing unit in the same clock cycle, wherein the memory unit allows the computing unit to access a first subset formed by the plurality of memory arrays through the three-dimensional integration connection channel so as to execute the computation request of the computing unit, and the memory unit allows the external device to access a second subset formed by the plurality of memory arrays through the interface device so as to execute the memory request of the external device, the first subset being different from the second subset.
- 2. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the first subset and the second subset are non-overlapping sets of memory arrays, and conflict-free parallel access to the memory unit by the external device and the computing unit is achieved through physical isolation.
- 3. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the control module includes at least one of a multi-port memory controller, an internal crossbar, or an arbiter circuit, for arbitrating between access requests from the neural network processing unit and access requests from the external device and for routing the two types of requests to the corresponding memory arrays of the first subset or the second subset, respectively.
- 4. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the three-dimensional integration connection employs a three-dimensional integration technique selected from at least one of hybrid bonding, through-silicon vias, flip-chip, or micro-bump connections.
- 5. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the interface device complies with a JEDEC memory standard protocol selected from at least one of LPDDR4, LPDDR4X, LPDDR5, LPDDR5X, LPDDR5T, LPDDR6, DDR4, DDR5, DDR6, GDDR5, GDDR6, GDDR7, HBM2, HBM3, or HBM4.
- 6. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the control module supports a fixed ratio of the number of memory arrays in the first subset to the number in the second subset, so as to adapt to the general memory requirements of the external device and the neural network computation requirements of the computing unit in different application scenarios.
- 7. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the control module supports dynamic adjustment of the ratio of the number of memory arrays in the first subset to the number in the second subset through external instructions or configuration parameters, so as to adapt to the general memory requirements of the external device and the neural network computation requirements of the computing unit in different application scenarios.
- 8. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the control module is further configured with a task priority management mechanism for numbering memory access tasks initiated by the external device and memory access tasks initiated by the computing unit and for supporting the user in pre-coding priorities onto the task numbers, and the control module allocates memory array resources according to task priority, allocating a larger number of memory arrays to tasks with higher priority.
- 9. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the control module is further configured with an automatic task load recognition mechanism: memory access tasks initiated by the external device and memory access tasks initiated by the computing unit are numbered, and the control module automatically judges the memory load pressure of each task according to the task number and the task execution stage and allocates more memory arrays to tasks with higher memory load pressure.
- 10. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the control module implements memory array partition management by mapping logical addresses to physical addresses: the physical memory addresses of the memory unit are mapped to logical addresses identifiable by the external device and the computing unit, modification of the mapping table between logical and physical addresses is supported, the logical addresses of the external device are mapped only to the physical addresses corresponding to the second subset, and the logical addresses of the computing unit are mapped only to the physical addresses corresponding to the first subset.
- 11. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the control module implements memory array partition management via a predefined instruction mechanism: the external device is supported in sending predefined instructions to the AI processor or writing data to a specific address segment through the interface device, and the control module identifies the partition configuration requirement corresponding to the predefined instruction or the specific address segment data and partitions the memory arrays accordingly.
- 12. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the physical implementation of the memory unit is one or more DRAM chips, SRAM chips, or non-volatile memory chips, and the plurality of memory arrays are distributed within the memory unit as independent logical units capable of independently responding to access requests.
- 13. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the neural network processing unit of the computing unit performs neural network computation directly inside the first subset of the memory unit through compute-in-memory arrays, without carrying the weight data in the first subset to the computing unit, the compute-in-memory implementation being based on one or more of SRAM, ReRAM, DRAM, MRAM, or IGZO devices.
- 14. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the AI processor is configured such that, when the external device accesses the second subset through the interface device, the response process is not blocked by the access operations of the computing unit on the first subset, thereby meeting the real-time memory access requirements of the external device.
- 15. The AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism of claim 1, wherein the AI processor is configured to provide two functions simultaneously: performing neural network computation as a neural network accelerator through the first subset, and providing a general memory service to the external device as a standard system memory module through the second subset.
- 16. A control method for an AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism, applied to the AI processor of any one of claims 1-15, characterized by comprising the steps of: dynamically judging the memory load pressure of accesses by the external device and the computing unit, and allocating the number of memory arrays within the first subset and the second subset based on dynamic changes in the memory load pressure.
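To make the partitioning and arbitration scheme of claims 1-3 and 14 concrete, the following is a minimal behavioral sketch in Python. It is illustrative only: the class names, the bank count of eight arrays, and the fixed 4/4 split between the subsets are assumptions made here for clarity and are not specified by the patent.

```python
# Minimal behavioral model of the claimed partitioned memory unit.
# Assumptions (not from the patent): 8 memory arrays; arrays 0-3 form the
# first subset (compute side), arrays 4-7 form the second subset (host side).

class MemoryArray:
    """One independently addressable memory array (bank)."""
    def __init__(self, size_words=1024):
        self.data = [0] * size_words

    def read(self, addr):
        return self.data[addr]

    def write(self, addr, value):
        self.data[addr] = value


class PartitionedMemoryUnit:
    """Memory unit whose control module routes the two request sources to
    disjoint subsets, so both can be served in the same clock cycle."""
    def __init__(self, num_arrays=8, first_subset=range(0, 4), second_subset=range(4, 8)):
        self.arrays = [MemoryArray() for _ in range(num_arrays)]
        self.first_subset = set(first_subset)    # accessed by the computing unit (3D channel)
        self.second_subset = set(second_subset)  # accessed by the external device (interface)
        assert self.first_subset.isdisjoint(self.second_subset)

    def _route(self, array_id, allowed, op, addr, value):
        if array_id not in allowed:
            raise PermissionError(f"array {array_id} is not in the allowed subset")
        arr = self.arrays[array_id]
        return arr.write(addr, value) if op == "write" else arr.read(addr)

    def compute_access(self, array_id, op, addr, value=None):
        # Request arriving over the three-dimensional integration channel.
        return self._route(array_id, self.first_subset, op, addr, value)

    def external_access(self, array_id, op, addr, value=None):
        # Request arriving over the standard memory interface.
        return self._route(array_id, self.second_subset, op, addr, value)

    def same_cycle(self, compute_req, external_req):
        """Serve one compute request and one external request 'in the same
        clock cycle'; no conflict is possible because the two requests
        target physically disjoint subsets."""
        return (self.compute_access(*compute_req), self.external_access(*external_req))


mem = PartitionedMemoryUnit()
mem.compute_access(0, "write", 10, 42)   # weight written on the compute side
mem.external_access(5, "write", 3, 7)    # host data written on the host side
print(mem.same_cycle((0, "read", 10), (5, "read", 3)))  # -> (42, 7)
```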
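Claims 7, 10, and 11 describe partition management through a logical-to-physical mapping table that can be reconfigured by external instructions or by writes to a specific address segment. The sketch below models one plausible form of such a directed mapping; the 4 KiB page size, the control page number 0xFFF, and the helper names are hypothetical details invented here for illustration.

```python
# Hypothetical sketch of logical-to-physical partition management.
# Assumptions (not from the patent): 4 KiB pages; a host write to logical
# page 0xFFF acts as the "predefined instruction" that re-partitions arrays.

PAGE = 4096

class PartitionMapper:
    def __init__(self, num_arrays=8, compute_arrays=4):
        self.num_arrays = num_arrays
        self._build_tables(compute_arrays)

    def _build_tables(self, compute_arrays):
        # First subset (compute unit) gets the low-numbered arrays,
        # second subset (external device) gets the remaining ones.
        self.compute_map = {lp: lp for lp in range(compute_arrays)}
        self.host_map = {lp: compute_arrays + lp
                         for lp in range(self.num_arrays - compute_arrays)}

    def translate(self, source, logical_addr):
        """Translate a requester's logical address into (physical array id,
        offset); the mapping is 'directed', i.e. each requester can only
        ever reach its own subset."""
        table = self.compute_map if source == "compute" else self.host_map
        logical_page, offset = divmod(logical_addr, PAGE)
        if logical_page not in table:
            raise ValueError(f"{source} logical page {logical_page} is unmapped")
        return table[logical_page], offset

    def handle_host_write(self, logical_addr, value):
        """A write to the assumed control page re-partitions the arrays
        instead of storing data; other writes translate normally (the
        actual data write is omitted in this sketch)."""
        if logical_addr // PAGE == 0xFFF:
            self._build_tables(compute_arrays=value)  # value = new compute-side array count
            return "repartitioned"
        return self.translate("host", logical_addr)


mapper = PartitionMapper()
print(mapper.translate("compute", 0x1010))        # -> (1, 16): compute side, array 1
print(mapper.translate("host", 0x1010))           # -> (5, 16): host side, array 5
print(mapper.handle_host_write(0xFFF * PAGE, 6))  # shift ratio to 6 compute / 2 host arrays
```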
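Claims 8, 9, and 16 describe allocating more memory arrays to tasks with higher priority or higher memory load pressure. The patent does not fix a particular allocation formula; the sketch below shows one assumed proportional policy (weight = priority times observed load) purely to illustrate how such a control method could behave.

```python
# Hedged sketch of the dynamic allocation policy of claims 8, 9 and 16.
# The weighting scheme (priority * measured load) is an assumption made
# here for illustration; the patent does not specify a formula.

def allocate_arrays(tasks, total_arrays=8):
    """tasks: list of dicts with 'id', 'source' ('compute' or 'host'),
    'priority' (user pre-coded, higher = more important) and
    'load' (observed memory load pressure, e.g. outstanding requests).
    Returns the number of arrays granted to each task (at least one)."""
    assert len(tasks) <= total_arrays, "need at least one array per task"
    weights = {t["id"]: max(1, t["priority"]) * max(1, t["load"]) for t in tasks}
    total_weight = sum(weights.values())
    grants, handed_out = {}, 0
    for t in tasks:
        share = max(1, round(total_arrays * weights[t["id"]] / total_weight))
        grants[t["id"]] = share
        handed_out += share
    # Trim or pad so the grants exactly cover the available arrays.
    ids_by_weight = sorted(grants, key=lambda i: weights[i])
    while handed_out > total_arrays:
        victim = next(i for i in ids_by_weight if grants[i] > 1)
        grants[victim] -= 1
        handed_out -= 1
    while handed_out < total_arrays:
        grants[ids_by_weight[-1]] += 1
        handed_out += 1
    return grants


tasks = [
    {"id": "ui_render", "source": "host", "priority": 3, "load": 5},
    {"id": "npu_infer", "source": "compute", "priority": 2, "load": 9},
]
print(allocate_arrays(tasks))  # -> {'ui_render': 4, 'npu_infer': 4}
```

In a real controller this decision would be re-evaluated periodically as the load pressure of each numbered task changes, which is the dynamic adjustment that claim 16 recites.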
Description
AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism

Technical Field
The application relates to the technical field of artificial intelligence, and in particular to an AI processor based on memory-compute integration, three-dimensional integration, and memory-compute parallelism, and to a control method thereof.

Background
When a neural network processor is used to perform neural network computation, it exclusively occupies a system memory channel. For the main SoC this means that the number of devices available in the system for general memory functions is reduced, which directly leads to two negative consequences: a reduction in the total memory capacity of the system (because of the reduced number of memory channels), and a reduction in the total memory access bandwidth of the system (because addresses can no longer be interleaved across the full memory address range). This resource conflict seriously affects overall system performance when non-AI tasks are running: the system can either efficiently accelerate AI computation or smoothly run general application programs, but it is difficult to achieve both at once without sacrificing the performance of either side. This degradation of system performance caused by exclusive resource occupation becomes a key technical hurdle that prevents the wide adoption of such high-performance PIM processors.

Thus, while the use of standard memory interfaces solves the compatibility and flexibility problems, this design choice introduces a new and trickier problem at the system level, namely competition for memory resources. When the neural network processor performs neural network computation, the memory arrays of the memory layer are accessed by the computation layer, so the system main-control SoC cannot access them. As a result, when an AI task occupies one memory channel of the system for computation, that channel can no longer be connected to conventional system memory (such as DRAM devices), which reduces the number of devices available for general memory functions in the system and directly affects the overall memory capacity and memory access bandwidth of the system.

More fundamentally, this is not merely a problem of bandwidth being split; it is a serious problem of concurrency and quality of service (QoS). If the host system and the compute layer share the memory layer in a time-division multiplexing manner, i.e., the memory layer can only support either the memory mode or the compute mode at any one time, then the device can only respond to one task at a time. When the neural network processing device performs a computation task, any access requests to system memory from the host CPU are blocked, and vice versa. Such blocking behavior is unacceptable in modern interactive computing systems. For example, on a smartphone the UI rendering thread needs frequent, low-latency memory access while the user is sliding the screen, watching video, and so on. If a background AI task (such as image recognition) occupies the memory interface at that moment, the UI thread is blocked, causing screen stutter and frame drops and seriously degrading the user experience.

Thus, the essence of the problem is not a reduction in overall bandwidth, but that host access to critical memory resources becomes unpredictable and extremely delayed, severely compromising the real-time responsiveness and stability of the system. The prior art does not provide a mechanism that allows simultaneous, non-blocking access to both the AI acceleration function and the general-purpose memory function through a single physical memory interface. Accordingly, there is a need to improve existing neural network processors.

Disclosure of Invention
In view of the above, the present application provides an AI processor (also referred to hereinafter as a neural network processor) based on memory-compute integration, three-dimensional integration, and memory-compute parallelism, comprising: a memory unit comprising a plurality of memory arrays that can be independently addressed and accessed in parallel, for storing neural network weight data, neural network input feature data, neural network intermediate computation results, and general system data; a computing unit connected to the memory unit by three-dimensional integration, forming a three-dimensional integration connection channel, and comprising at least one neural network processing unit configured to receive externally input feature data and weight data read from the memory unit and to execute neural-network-related computation; an interface device for carrying out data transmission and instruction interaction with an external device;