
CN-121996611-A - AI processor and method based on memory and calculation integration, three-dimensional integration and operator separation

CN 121996611 A

Abstract

The application relates to an AI processor and method based on memory-computation integration, three-dimensional integration and operator separation, comprising a storage layer, a computing layer, a standardized memory interface, a scheduling module and a control module. The computing layer is stacked with the storage layer in the vertical direction through a three-dimensional integrated connection structure, forming a three-dimensional integrated high-speed channel, and executes computing tasks of a large language model. The standardized memory interface connects to an external main control chip. The computing tasks comprise a pre-filling stage and a decoding stage: the scheduling module configures the main control chip to exchange data with the storage layer through the standardized memory interface to execute the pre-filling stage, and configures the computing layer to exchange data with the storage layer through the three-dimensional integrated high-speed channel to execute the decoding stage. The pre-filling stage is thus handled by the high computing power of the main control chip, and the decoding stage by the high bandwidth of the three-dimensional stack, so that computing power and bandwidth are precisely matched and the reasoning efficiency of the large model is significantly improved.
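The two-mode hand-off described in the abstract can be pictured with a minimal Python sketch. All names here (PhaseArbiter, supplemental_refresh, on_prefill_done, the storage object) are illustrative assumptions, not identifiers from the patent:

```python
from enum import Enum, auto

class Phase(Enum):
    PREFILL = auto()   # main control chip computes, writes KV cache
    DECODE = auto()    # stacked compute-in-memory layer reads KV cache

class Owner(Enum):
    HOST = auto()           # external main control chip (standardized memory interface)
    COMPUTE_LAYER = auto()  # stacked computing layer (3D high-speed channel)

class PhaseArbiter:
    """Sketch of the scheduling module: exclusive storage-layer access per phase."""
    def __init__(self, storage):
        self.storage = storage
        self.owner = Owner.HOST  # first mode: pre-filling on the host

    def on_prefill_done(self):
        # A supplemental refresh during the mode-switch window keeps the
        # KV cache intact while access rights change hands (see claims 3 and 8).
        self.storage.supplemental_refresh()
        self.owner = Owner.COMPUTE_LAYER  # second mode: decoding in-stack

    def check_access(self, requester: Owner):
        if requester is not self.owner:
            raise PermissionError(f"{requester.name} has no access in this phase")
```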

Inventors

  • WANG ZHIXUAN
  • GE XIAOHUAN
  • CHEN PEIYU
  • LIU YING

Assignees

  • 无锡微纳核芯电子科技有限公司
  • 杭州微纳核芯电子科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-02-02

Claims (14)

  1. An AI processor based on memory-computation integration, three-dimensional integration and operator separation, wherein the AI processor comprises at least: a storage layer, comprising at least one storage chip, for storing parameters, key-value caches and intermediate data of a large language model; a computing layer, stacked with the storage layer in the vertical direction through a three-dimensional integrated connection structure so as to form a three-dimensional integrated high-speed channel, the computing layer comprising a processing unit based on compute-in-memory technology for executing computing tasks of the large language model; a standardized memory interface for connecting to an external main control chip to realize data communication and control-signal interaction between the AI processor and the main control chip; and a scheduling module for coordinating the access rights of the main control chip and the computing layer to the storage layer, and for controlling the distribution and execution of computing tasks between the main control chip and the AI processor according to the computing stage of large language model reasoning; wherein the computing tasks comprise a pre-filling stage and a decoding stage; the scheduling module configures the main control chip to exchange data with the storage layer through the standardized memory interface so as to execute the pre-filling stage; and the scheduling module configures the computing layer to exchange data with the storage layer through the three-dimensional integrated high-speed channel so as to execute the decoding stage.
  2. The AI processor of claim 1, wherein the scheduling module comprises a phase arbiter that generates a first access signal allowing the external main control chip to read from and write to the storage layer during the pre-filling stage, and generates a second access signal allowing the computing layer to read from and write to the storage layer during the decoding stage.
  3. The AI processor of claim 1, wherein the scheduling module further comprises a refresh controller that manages data refresh operations for the storage layer and performs a supplemental refresh operation on the storage layer during the mode-switch time window between the pre-filling stage and the decoding stage, so as to maintain data integrity of the key-value cache.
  4. The AI processor of claim 1, wherein the key-value cache computed by the main control chip during the pre-filling stage is written to the storage layer via the standardized memory interface; and the computing layer reads the key-value cache from the storage layer through the three-dimensional integrated high-speed channel during the decoding stage and performs decoding computation using compute-in-memory technology to generate output Tokens.
  5. The AI processor of any of claims 1-4, wherein the three-dimensional integration technique employed by the three-dimensional integrated connection structure comprises at least one of through-silicon vias, hybrid bonding, and micro-bumps.
  6. The AI processor of any of claims 1-4, wherein the memory type employed by the computing array of the compute-in-memory processing unit comprises volatile or non-volatile memory, including at least one of dynamic random access memory, static random access memory, resistive random access memory, magnetoresistive random access memory, and ferroelectric random access memory.
  7. A large language model processing method based on memory-computation integration, three-dimensional stacking and operator separation, applied to a system comprising a main control chip and an AI processor, wherein the AI processor adopts the structure of any one of claims 1-6, and the method comprises the following steps: in response to a reasoning request initiated by the main control chip, a scheduling module of the AI processor configures the access rights into a first mode; in the first mode, the main control chip executes first-stage computation of a large language model and writes the generated first intermediate data into a storage layer of the AI processor through an interface; the scheduling module receives a first-stage completion signal sent by the main control chip, switches the access rights into a second mode and triggers start-up of a computing layer of the AI processor; and in the second mode, the computing layer directly reads the first intermediate data from the storage layer, executes second-stage computation of the large language model, and writes the computation result back to the storage layer or feeds it back to the main control chip.
  8. The method of claim 7, wherein after the scheduling module receives the first-stage completion signal sent by the main control chip and before switching the access rights to the second mode, the method further comprises: the scheduling module controlling a supplemental refresh operation on the storage layer, the supplemental refresh operation serving to prevent data loss in the storage layer due to refresh interruption during the rights switch.
  9. The method of claim 7, wherein the first-stage computation is a pre-filling computation, the second-stage computation is a decoding computation, and the first intermediate data is a key-value cache; in the second mode, the computing layer performs attention-mechanism operations and Token-generation operations using a memory array, and while the computing layer performs computation, the main control chip is in a standby state or processes non-inference tasks.
  10. The method of claim 7, wherein the first-stage computation is an attention-mechanism computation and the second-stage computation is a feed-forward neural network computation; the main control chip is responsible for the highly parallel attention-mechanism computation, and the AI processor is responsible for the feed-forward neural network computation with its high bandwidth requirement.
  11. The method of claim 9, wherein the step of the computing layer performing the second-stage computation comprises: the computing layer reading the key-value cache and the current Token feature data from the storage layer and performing matrix-vector multiplication through an in-memory computing matrix; and executing the above steps in a loop until an end symbol is generated, then sending an interrupt request or the final result to the scheduling module.
  12. A server comprising a memory and one or more processors, the memory storing instructions for execution by the processors, and the processors performing the method of any one of claims 7-11.
  13. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the method of any of claims 7-11.
  14. A computer-readable storage medium having instructions stored thereon which, when executed on a computer, cause the computer to perform the method of any of claims 7-11.
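The decoding loop of claim 11 can be illustrated with a short, hedged Python/NumPy sketch. Every name here (read_kv_cache, imc_matvec, EOS_ID, the storage and model objects, and the greedy argmax step) is an assumption made for illustration, not something specified by the claims:

```python
import numpy as np

EOS_ID = 2  # assumed end-of-sequence token id

def imc_matvec(weights: np.ndarray, vec: np.ndarray) -> np.ndarray:
    """Stand-in for the in-memory matrix-vector multiply; plain NumPy here."""
    return weights @ vec

def decode_loop(storage, model, max_tokens=256):
    """Per claim 11: read the KV cache and current Token features, perform
    matrix-vector multiplication in the compute-in-memory array, and loop
    until an end symbol is generated."""
    outputs = []
    for _ in range(max_tokens):
        kv_cache = storage.read_kv_cache()       # read over the 3D channel
        feats = storage.read_token_features()
        hidden = model.attend(kv_cache, feats)   # attention-mechanism step
        logits = imc_matvec(model.output_weights, hidden)
        token = int(np.argmax(logits))           # greedy choice (assumption)
        outputs.append(token)
        storage.append_kv(model.new_kv(feats))   # extend the KV cache
        if token == EOS_ID:                      # end symbol terminates loop
            break
    return outputs  # then an interrupt/result goes to the scheduling module
```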
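Claim 10 describes an alternative operator split: attention on the main control chip, the feed-forward network in the stack. A few hedged lines make the division of labor concrete; host and stack are assumed device handles, not names from the patent:

```python
def transformer_layer(x, layer, host, stack):
    """Operator-separation variant of claim 10 (sketch): the main control
    chip runs the highly parallel attention block, while the 3D-stacked
    compute-in-memory layer runs the bandwidth-heavy feed-forward network."""
    attn_out = host.attention(x, layer.qkv_w, layer.o_w)   # compute-bound
    return stack.ffn(attn_out, layer.up_w, layer.down_w)   # bandwidth-bound
```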

Description

AI processor and method based on memory and calculation integration, three-dimensional integration and operator separation

Technical Field

The present application relates to the field of computer technology, and in particular to an AI processor and processing method, a server, a computer program product, and a computer-readable storage medium based on memory-computation integration, three-dimensional integration, and an operator-separation computing method.

Background

With the proliferation of local intelligent-interaction demands on end-side devices such as smartphones, notebook computers and intelligent robots, reasoning applications of large language models (Large Language Model, LLM) are gradually migrating from the cloud to the edge. Under the constraints of limited device volume, power consumption and cost, achieving low-latency, smooth Token generation has become the key to meeting application requirements such as dialogue interaction and real-time content generation.

The reasoning process of a large language model shows a clear staged difference; at its core it divides into two key stages, pre-filling (Prefill) and decoding (Decode), which place inherently conflicting demands on hardware resources. In the pre-filling stage, the model must process the prompt words (Prompt) input by the user in parallel to generate the initial key-value cache (KV Cache). The computation of this stage scales quadratically with the input Token length, making it a typical "compute-intensive" task. Taking the Llama3-8B int8 quantized model as an example, about 1.5 TOPS of computing power must be delivered at this stage to ensure processing efficiency, but the sustained pressure on memory bandwidth is relatively small. In the decoding stage, the model generates output Tokens one by one in an autoregressive manner based on the generated KV Cache. Here the single-step computation is extremely small (only a few percent of the pre-filling stage), but every time a new Token is generated, the entire historical KV cache must be read from memory to perform the attention-mechanism operation. This places extremely high demands on memory bandwidth (e.g., up to 800 GB/s), making decoding a typical "bandwidth-intensive" task.

However, the existing end-side hardware architecture can hardly adapt to the differing requirements of the two stages at the same time, and suffers a serious "resource supply and demand mismatch". On the one hand, the main control chip (such as an SoC, CPU or GPU) in the end-side device, which bears the main computing task, usually has relatively strong computing capability and is suitable for executing the computing task of the pre-filling stage. But it is limited by the number of chip pins and interface design standards (e.g., PCIe or LPDDR interfaces), and its external data-interaction bandwidth is generally small (typically 150-300 GB/s), well below the high bandwidth requirement of the decoding stage. If the main control chip is forced to execute full-flow reasoning, it faces a serious bandwidth bottleneck in the decoding stage, so that Token generation latency is high, stuttering is obvious, and real-time interaction cannot be achieved. On the other hand, conventional memory chips (e.g., LPDDR, GDDR) are small and easy to integrate, but provide only data storage without efficient computing capability.
They are further constrained by traditional 2D packaging technology, so access bandwidth and interconnect density can hardly break through the bottleneck; they can neither meet the bandwidth requirement of the decoding stage nor share the computing pressure. If a large-area dedicated AI acceleration chip is additionally integrated to raise computing power, device cost and volume exceed their limits, failing the stringent requirements of end-side devices. Although compute-in-memory technology and three-dimensional integration technology (such as TSV and hybrid bonding) offer the possibility of breaking through the bandwidth wall, most existing three-dimensional stacking schemes use a single unified hardware architecture to process the whole reasoning flow, and cannot effectively divide the work according to the "large computing power, small bandwidth" characteristic of the main control chip and the "small area, high bandwidth" characteristic of the memory chip. In practical application, idle bandwidth and wasted computing power in the pre-filling stage and insufficient bandwidth in the decoding stage still occur, so overall reasoning efficiency remains low and power consumption high. Therefore, a processor architecture and processing method that can adapt to end-side hardware constraints and precisely match the two-stage resource requirements of LLMs is needed to solve the above problems.
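A back-of-envelope calculation makes the bandwidth mismatch above concrete. The traffic model (each decoding step streams all int8 weights of an 8B-parameter model plus an illustrative KV cache from memory) is a simplifying assumption, not a figure stated in the text:

```python
# Rough decode-throughput estimate under an assumed per-step traffic model.
weights_gb = 8.0    # Llama3-8B at int8: ~8 GB of weights (1 byte/parameter)
kv_cache_gb = 0.5   # illustrative KV-cache size for a long context

bytes_per_token_gb = weights_gb + kv_cache_gb  # read once per generated Token

for name, bw_gbps in [("host interface (LPDDR/PCIe)", 300),
                      ("3D-stacked channel", 800)]:
    tokens_per_s = bw_gbps / bytes_per_token_gb
    print(f"{name}: ~{tokens_per_s:.0f} tokens/s at {bw_gbps} GB/s")

# Under this model, ~300 GB/s caps the host near 35 tokens/s, while the
# 800 GB/s in-stack bandwidth lifts the ceiling to roughly 94 tokens/s.
```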