CN-121998006-A - Compute-in-memory neural network processor based on three-dimensionally integrated heterogeneous storage media
Abstract
The application relates to a compute-in-memory neural network processor based on three-dimensionally integrated heterogeneous storage media, and to a control method thereof. The processor comprises an off-chip nonvolatile memory chip, an off-chip volatile memory chip and a logic chip. The off-chip nonvolatile memory chip is connected with the logic chip for information interaction with the logic chip, and the off-chip volatile memory chip is likewise connected with the logic chip for information interaction with the logic chip. The logic chip is configured to realize any combination of the following functions: executing a computing task, performing data caching, and realizing information interaction. The off-chip volatile memory chip, the logic chip and the off-chip nonvolatile memory chip are integrated by a three-dimensional stacking technology, with the off-chip volatile memory chip positioned between the off-chip nonvolatile memory chip and the logic chip. Serving as an efficient cache layer, the off-chip volatile memory chip hides the high read latency of the off-chip nonvolatile memory chip inside a parallel window of prefetching and real-time computation, thereby eliminating the efficiency loss that this latency would otherwise cause.
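The latency-hiding mechanism in the abstract is, in effect, a double-buffered prefetch pipeline: while the logic chip computes on one tile of data, the DRAM layer fetches the next tile from the nonvolatile layer. As an illustration only (the function names and timing parameters below are ours, not the patent's), the timing arithmetic can be sketched in Python:

```python
def serial_time(n_tiles: int, t_read: int, t_compute: int) -> int:
    """No overlap: every tile's compute waits for its NVM read."""
    return n_tiles * (t_read + t_compute)

def pipelined_time(n_tiles: int, t_read: int, t_compute: int) -> int:
    """Double buffering: while the logic chip computes on tile i, the
    DRAM cache layer prefetches tile i+1 from the slow NVM layer, so
    the steady-state cost per tile is max(t_read, t_compute)."""
    if n_tiles == 0:
        return 0
    # initial NVM fill + (n-1) overlapped stages + final compute
    return t_read + (n_tiles - 1) * max(t_read, t_compute) + t_compute
```

With t_read == t_compute == 10 and 8 tiles, the serial schedule costs 160 time units while the pipeline needs 10 + 7*10 + 10 = 90; whenever t_compute >= t_read, the NVM latency is fully hidden after the first read.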
Inventors
- WANG ZHIXUAN
- YE LE
- CHEN PEIYU
- LIU YING
Assignees
- 杭州微纳核芯电子科技有限公司
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-01-30
Claims (20)
- 1. A compute-in-memory neural network processor based on a three-dimensionally integrated heterogeneous storage medium, comprising: an off-chip nonvolatile memory chip connected with a logic chip for information interaction with the logic chip; an off-chip volatile memory chip connected with the logic chip for information interaction with the logic chip; and the logic chip, configured to realize any combination of the following functions: executing a computing task, performing data caching, and realizing information interaction; wherein the off-chip volatile memory chip, the logic chip and the off-chip nonvolatile memory chip are integrated by a three-dimensional stacking technology, the off-chip volatile memory chip being positioned between the off-chip nonvolatile memory chip and the logic chip in the vertical direction; and wherein the off-chip nonvolatile memory chip is connected with the off-chip volatile memory chip, and the off-chip volatile memory chip can temporarily store data held by the off-chip nonvolatile memory chip for the logic chip to call.
- 2. The compute-in-memory neural network processor according to claim 1, wherein the off-chip nonvolatile memory chip, the off-chip volatile memory chip and the logic chip are integrated in sequence from top to bottom in the vertical direction by a three-dimensional stacking technology.
- 3. The compute-in-memory neural network processor according to claim 1, wherein the logic chip, the off-chip volatile memory chip and the off-chip nonvolatile memory chip are integrated in sequence from top to bottom in the vertical direction by a three-dimensional stacking technology.
- 4. The compute-in-memory neural network processor according to any of claims 1-3, wherein the logic chip is a computing chip that performs computing tasks, the logic chip being integrated with a digital in-memory computing unit.
- 5. The compute-in-memory neural network processor according to any of claims 1-3, wherein the logic chip is a data cache chip for data caching.
- 6. The compute-in-memory neural network processor according to any of claims 1-3, wherein the logic chip is an interface chip for realizing information interaction.
- 7. The compute-in-memory neural network processor according to claim 4, wherein at least one off-chip volatile memory chip controller, at least one off-chip nonvolatile memory chip controller and a global controller are integrated on the logic chip, wherein: the off-chip volatile memory chip controller is used for accessing the off-chip volatile memory chip; the off-chip nonvolatile memory chip controller is used for accessing the off-chip nonvolatile memory chip; and the global controller is used for managing the off-chip nonvolatile memory chip controller, the off-chip volatile memory chip controller and the digital in-memory computing unit, realizing dynamic adaptation between the data storage scheme and the computing data stream so as to globally optimize storage capacity, data bandwidth, access latency and computing efficiency.
- 8. The compute-in-memory neural network processor according to any of claims 1-3, wherein the off-chip volatile memory chip is a DRAM chip.
- 9. The compute-in-memory neural network processor according to claim 8, wherein, in a large language model computing scenario, the off-chip volatile memory chip is configured to store key-value (KV) cache data, activation values and intermediate computation results.
- 10. The compute-in-memory neural network processor according to any of claims 1-3, wherein the off-chip nonvolatile memory chip is Flash.
- 11. The compute-in-memory neural network processor according to claim 8, wherein, in a large language model computing scenario, the off-chip nonvolatile memory chip is configured to store Q-projection weights, K-projection weights, V-projection weights, attention weights and FFN weights.
- 12. The compute-in-memory neural network processor according to any of claims 1-3, wherein the off-chip volatile memory chip is composed of one off-chip volatile memory module, or is integrated from at least two off-chip volatile memory modules by the three-dimensional stacking technology, the off-chip volatile memory modules being integrated face-to-face, wherein the side of the structure formed by three-dimensional stacking integration of the at least two off-chip volatile memory modules that is close to a logic control layer is defined as the front of the off-chip volatile memory chip, and the side close to a memory array layer is defined as the back of the off-chip volatile memory chip.
- 13. The compute-in-memory neural network processor according to any of claims 1-3, wherein the off-chip nonvolatile memory chip is composed of one off-chip nonvolatile memory module, or is integrated from at least two off-chip nonvolatile memory modules by the three-dimensional stacking technology, the off-chip nonvolatile memory modules being integrated face-to-face, wherein the side of the structure formed by three-dimensional stacking integration of the at least two off-chip nonvolatile memory modules that is close to a logic control layer is defined as the front of the off-chip nonvolatile memory chip, and the side close to a memory array layer is defined as the back of the off-chip nonvolatile memory chip.
- 14. The compute-in-memory neural network processor according to claim 4, wherein the logic chip is composed of one logic module, or is integrated from at least two logic modules by the three-dimensional stacking technology.
- 15. The compute-in-memory neural network processor according to claim 1, wherein the three-dimensional stacking technology is one or more of through-silicon via (TSV), hybrid bonding and flip-chip technologies.
- 16. The compute-in-memory neural network processor according to claim 4, wherein the digital in-memory computing unit is integrated from an on-chip memory and a computing unit, the on-chip memory being one or more of SRAM, eDRAM, DRAM, Flash, MRAM and ReRAM.
- 17. The compute-in-memory neural network processor according to claim 7, wherein the global controller is configured to perform fine-grained dynamic scheduling of data based on the interconnection characteristics of the bonding scheme of the three-dimensional stacking technology and the inherent properties of each storage medium.
- 18. The compute-in-memory neural network processor according to claim 7, wherein the global controller is configured to allow the data storage mapping rules to be configured by upper-layer software.
- 19. The compute-in-memory neural network processor according to claim 7, wherein the off-chip volatile memory chip comprises an interface conforming to JEDEC standards and adapted to the off-chip volatile memory chip controller, such that the neural network processor is accessible by external devices as a standard DRAM memory chip, the interface protocol satisfying one or more of LPDDR5, LPDDR6, HBM2, HBM3e, HBM4, GDDR5, GDDR6, GDDR7, DDR5 and DDR6.
- 20. The compute-in-memory neural network processor according to claim 10, wherein the Flash comprises an interface conforming to JEDEC or industry-standard storage interface specifications, such that the Flash of the neural network processor is accessible by external devices as a standard Flash memory chip, the interface protocol satisfying one or more of SPI, QSPI, octal SPI, eMMC, UFS, NVMe and parallel Flash.
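Claims 7, 9, 11 and 18 together describe a global controller that maps each class of data to a storage tier (weights in Flash; KV cache, activations and intermediate results in DRAM) under mapping rules configurable by upper-layer software. A minimal sketch of such a placement policy, where the class names, tier names and default table are our own illustrative assumptions rather than the patent's:

```python
# Illustrative default tier mapping (our assumption, loosely following
# claims 9 and 11): read-mostly weights live in Flash, dynamic data in
# the DRAM cache layer, and hot operands in the on-chip memory of the
# digital in-memory computing unit.
DEFAULT_RULES = {
    "weight":       "flash",
    "kv_cache":     "dram",
    "activation":   "dram",
    "intermediate": "dram",
    "operand":      "sram",
}

class GlobalController:
    """Toy model of the claim-7 global controller's placement decision."""

    def __init__(self, rules=None):
        self.rules = dict(DEFAULT_RULES)
        if rules:
            # Claim 18: mapping rules are configurable by upper-layer software.
            self.rules.update(rules)

    def place(self, tensor_kind: str) -> str:
        # Unknown classes fall back to the DRAM cache tier (our choice).
        return self.rules.get(tensor_kind, "dram")
```

A usage example: `GlobalController({"activation": "sram"})` overrides one rule while inheriting the rest, mirroring how software-configured mapping would coexist with hardware defaults.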
Description
Compute-in-memory neural network processor based on three-dimensionally integrated heterogeneous storage media
Technical Field
The application relates to the technical field of semiconductor chips, and in particular to a compute-in-memory neural network processor based on three-dimensionally integrated heterogeneous storage media and a control method thereof.
Background
Large language models (LLMs) have become a core evolution direction of natural language processing (NLP) technology in the field of artificial intelligence (AI), and have shown exponential development in recent years. Their technical maturity and breadth of application continue to advance, and they hold irreplaceable advantages in a number of key fields. Meanwhile, the development of LLMs has gradually revealed a core rule widely verified by academia and industry, namely the scaling law. Its core meaning is that, given sufficient high-quality training data and matching computing resources, the core performance indices of LLMs (such as the perplexity of language generation and the accuracy on downstream tasks) show a marked positive correlation with model parameter scale. As shown in fig. 1, the model performance score increases continuously as the number of parameters grows exponentially. Conventional computer systems trade off capacity, speed and cost through a storage hierarchy (register → on-chip cache → main memory → local disk → remote storage), as shown in fig. 3, with each tier exhibiting significant differences in relative cost, storage density, bandwidth, access latency and power consumption.
In the existing hardware architecture, to meet the requirements of large capacity and nonvolatile storage, an off-chip nonvolatile memory chip (such as Flash) is generally adopted as the main storage carrier. However, compared with the operating speed of the logic computation unit, the read/write performance of the off-chip nonvolatile memory chip is a significant bottleneck. In particular, the single-read latency of nonvolatile memory is relatively high and often involves complex underlying operations (e.g., block erase, page read), so its data throughput struggles to keep pace with the demands of high-performance logic chips. Meanwhile, with the application of compute-in-memory technology, the computing power density of the logic chip has greatly improved: core operations such as multiply-accumulate can be completed in parallel inside the memory unit, greatly shortening computation time. The contradiction between fast computation and slow data fetch is thus increasingly pronounced: during large-model inference, the logic chip is often forced into an idle state while waiting for data to load from the off-chip nonvolatile memory chip, and an efficient pipeline cannot be formed. This computational stalling caused by data-supply delay severely constrains the overall energy efficiency and inference speed of the neural network processor. In addition, conventional planar interconnects and non-targeted stacking architectures fail to effectively shorten the physical path and complexity of data movement, further exacerbating the complexity and cost of system design. Accordingly, there is a need to improve existing neural network processors.
Disclosure of Invention
The application provides a compute-in-memory neural network processor based on a three-dimensionally integrated heterogeneous storage medium, comprising an off-chip nonvolatile memory chip, an off-chip volatile memory chip and a logic chip. The off-chip nonvolatile memory chip is connected with the logic chip for information interaction with the logic chip; the off-chip volatile memory chip is connected with the logic chip for information interaction with the logic chip. The logic chip is configured to realize any combination of the following functions: executing a computing task, performing data caching, and realizing information interaction. The off-chip volatile memory chip, the logic chip and the off-chip nonvolatile memory chip are integrated by a three-dimensional stacking technology, the off-chip volatile memory chip being positioned between the off-chip nonvolatile memory chip and the logic chip in the vertical direction; the off-chip nonvolatile memory chip is connected with the off-chip volatile memory chip, and the off-chip volatile memory chip can temporarily store data held by the off-chip nonvolatile memory chip for the logic chip to call. In consideration of the fact that the single-read latency of an off-chip nonvolatile memory chip is relatively high (operations such as block erasure, page reading and the li
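The parallel window described in the disclosure can be modeled as a producer/consumer pair: a prefetch path streams tiles from the slow nonvolatile layer into a small DRAM-like staging buffer while the compute path drains it. The following toy model (thread structure, names and delays are illustrative assumptions, not the patent's implementation) shows the shape of that overlap:

```python
import queue
import threading
import time

def run(num_tiles: int, buffer_depth: int = 2) -> list[str]:
    """Overlap slow 'NVM' reads with compute via a bounded staging buffer."""
    buf: queue.Queue = queue.Queue(maxsize=buffer_depth)  # DRAM staging buffer
    results: list[str] = []

    def prefetch() -> None:
        # Producer: emulate slow nonvolatile reads feeding the buffer.
        for i in range(num_tiles):
            time.sleep(0.001)          # stand-in for NVM read latency
            buf.put(f"tile{i}")
        buf.put(None)                  # end-of-stream marker

    def compute() -> None:
        # Consumer: drain the buffer; uppercasing stands in for MAC work.
        while (tile := buf.get()) is not None:
            results.append(tile.upper())

    t = threading.Thread(target=prefetch)
    t.start()
    compute()                          # compute runs concurrently with prefetch
    t.join()
    return results
```

The bounded buffer depth plays the role of the DRAM cache layer's capacity: as long as compute keeps pace with prefetch, the logic path never observes the full nonvolatile read latency after the initial fill.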