CN-122019084-A - Chip design method for NPU and DSP combined acceleration AI model

CN122019084A

Abstract

The application relates to the technical field of chip architecture design and discloses a chip design method for an NPU and DSP combined acceleration AI model, comprising the following steps: executing chip-internal permission initialization; executing permission verification and access scheduling based on a synchronization-aware cross-core access control mechanism; constructing a synchronous cooperation relationship between the NPU and the DSP; reasonably distributing subgraph tasks to heterogeneous computing units; and performing optimal scheduling through heterogeneous-core load similarity mapping and an adaptive migration decision mechanism. Compared with the prior art, in which heterogeneous-core task partitioning is carried out only on the basis of a static schedule or a preset computational load, the method solves the technical problem that task-granularity cross-core migration scheduling is difficult to realize in complex AI model inference scenarios with frequent dynamic changes in operator distribution or execution congestion.

Inventors

  • SU CHEN
  • GUO KAIYUAN
  • CHEN ZHONGMIN
  • LIANG SHUANG

Assignees

  • Beijing Chaoxing Future Technology Co., Ltd. (北京超星未来科技有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-01-15

Claims (10)

  1. A chip design method for an NPU and DSP combined acceleration AI model, characterized by comprising the following steps: step S10, executing an internal initialization task inside the chip through an on-chip control register by adopting a segmented permission binding and dynamic boundary curing mechanism, and outputting a permission configuration table; step S20, based on the permission configuration table, executing collaborative instruction parsing and permission verification tasks by adopting a synchronization-aware cross-core access control mechanism, and outputting a cross-core synchronization instruction set; step S30, based on the cross-core synchronization instruction set, executing a synchronization handshake task between the NPU and the DSP by adopting an event-association-based dual-path handshake mechanism, and outputting synchronization result information; step S40, acquiring the ONNX graph structure of an AI model, executing subtask dispatch tasks of the AI model by adopting a task-scheduling-aware selection mechanism based on the ONNX graph structure and the synchronization result information, and outputting a mapped task structure queue; step S50, based on the mapped task structure queue, executing mapping task scheduling tasks by adopting a heterogeneous-core load similarity mapping and adaptive migration decision mechanism, and outputting a final scheduling result set.
  2. The chip design method for an NPU and DSP combined acceleration AI model as set forth in claim 1, wherein in step S10 the execution of internal initialization tasks inside the chip through the on-chip control register by adopting the segmented permission binding and dynamic boundary curing mechanism, and the output of the permission configuration table, specifically comprise the following steps: step S101, generating, by the on-chip control register, a three-dimensional binding relation for a shared static random access memory, wherein the three-dimensional binding relation comprises an NPU-exclusive partition, a DSP-exclusive partition, a bidirectional interaction partition and a configuration state partition; step S102, executing a dynamic boundary validity check for the bidirectional interaction partition in the three-dimensional binding relation, and outputting a dynamic boundary validity index; step S103, based on the dynamic boundary validity index, writing the legal interval boundaries of the bidirectional interaction partition into a one-time curing register to form a curing result set, and generating the permission configuration table based on the curing result set. The permission configuration table comprises an access subject identification (ID), an address boundary range and an operation-rights vector, wherein the access subject ID is used for distinguishing whether the access originates from the NPU or the DSP, the address boundary range represents the shared-storage physical address range accessible to the access subject, and the operation-rights vector defines the set of access types of the access subject for a target address field, the set of access types comprising a read type (bit0), a write type (bit1) and an execution type (bit2).
  3. The chip design method for an NPU and DSP combined acceleration AI model as set forth in claim 1, wherein in step S20 the execution of collaborative instruction parsing and permission verification tasks based on the permission configuration table by adopting the synchronization-aware cross-core access control mechanism, and the output of the cross-core synchronization instruction set, specifically comprise the following steps: step S201, first constructing an access request Access, wherein the access request Access comprises a subject ID, a target address, an operation type and a synchronization dependency tag; step S202, executing a three-dimensional combined validity check based on the access request Access and the permission configuration table, wherein the three-dimensional combined validity check comprises an address-boundary validity judgment, an operation-rights validity judgment and a synchronization-state validity judgment; step S203, when any one of the judgment results in the three-dimensional combined validity check is illegal, executing the following actions: writing a validity check code into an error state register, setting the bit of the corresponding error type in the error state register to 1, and simultaneously blocking the access request Access; and when the judgment results of all items in the three-dimensional combined validity check are legal, outputting the cross-core synchronization instruction set, wherein the cross-core synchronization instruction set comprises a master-slave mode trigger instruction set, a layer data sharing instruction set and a synchronization mode instruction set.
  4. The chip design method for an NPU and DSP combined acceleration AI model of claim 1, wherein in step S30 the event-association-based dual-path handshake mechanism is adopted to perform the synchronization handshake task between the NPU and the DSP based on the cross-core synchronization instruction set, and the step of outputting the synchronization result information specifically comprises: step S301, acquiring the synchronization mode instruction set from the cross-core synchronization instruction set; step S302, if the synchronization mode instruction set indicates that a hard-wire handshake mode is currently in use, the NPU sends a synchronization request to the DSP by setting a handshake request signal hsk[0]; if the synchronization mode instruction set indicates that an interrupt handshake mode is currently in use, the NPU triggers the DSP to enter an interrupt service flow by sending an interrupt trigger instruction to the DSP, and the DSP returns an interrupt response status signal; step S303, based on the handshake response signal hsk[1] and the interrupt response status signal, judging whether the synchronization handshake task is executed successfully, and outputting the synchronization result information.
  5. The chip design method for an NPU and DSP combined acceleration AI model of claim 4, wherein in step S30 the success determination condition of the synchronization handshake task is: in the case that the hard-wire handshake mode is currently in use, the NPU receives the handshake response signal hsk[1] with hsk[1] = 1, and simultaneously the handshake timeout count does not exceed a preset timeout count threshold; in the case that the interrupt handshake mode is currently in use, the NPU monitors the interrupt response status signal and the interrupt response status signal is zero.
  6. The chip design method for an NPU and DSP combined acceleration AI model according to claim 1, wherein in step S40 the ONNX graph structure of the AI model is obtained, the task-scheduling-aware selection mechanism is adopted to execute the subtask assignment tasks of the AI model based on the ONNX graph structure and the synchronization result information, and the step of outputting the mapped task structure queue specifically includes: step S401, acquiring the ONNX graph structure of the AI model and the set of available computing cores based on the synchronization result information, constructing a candidate subgraph set from the ONNX graph structure using a GraphCuts algorithm, and, for the i-th candidate subgraph in the candidate subgraph set, extracting an operator set by adopting a topology-attribute-based structure vector extraction method, the operator set comprising load parameters and communication topology parameters; step S402, based on the operator set, calculating an operator structure coupling score by adopting a structure similarity measurement method; when the operator structure coupling score is greater than an operator structure coupling score threshold, preferentially assigning the i-th candidate subgraph to the NPU for processing, otherwise assigning the i-th candidate subgraph to the DSP for processing by default, and outputting a dispatch structure set; and step S403, sequentially mapping and arranging within the computing core set based on the dispatch structure set, and finally outputting the mapped task structure queue.
  7. The chip design method for an NPU and DSP combined acceleration AI model as set forth in claim 1, wherein in step S50 the execution of mapping task scheduling tasks based on the mapped task structure queue by adopting the heterogeneous-core load similarity mapping and adaptive migration decision mechanism, and the output of the final scheduling result set, specifically comprise the following steps: step S501, a periodic in-core load perception vector construction stage: acquiring, with a 10 ns period, the number of currently active NPU operators and the total number of queued NPU operators, and defining an NPU core load factor based on these; simultaneously acquiring, with a 10 ns period, the number of currently active DSP operators and the total number of queued DSP operators, and defining a DSP core load factor based on these; and generating a periodic in-core load perception vector from the NPU core load factor and the DSP core load factor; step S502, a rescheduling determination stage: based on the periodic in-core load perception vector, calculating a first load similarity score against a preset NPU operator structure template by adopting a similarity analysis principle based on normalized cosine distance, and calculating a second load similarity score against a preset DSP operator structure template by adopting a similarity analysis principle based on Manhattan distance; calculating a load similarity score difference based on the first load similarity score and the second load similarity score; and when the load similarity score difference is greater than a preset load similarity score difference threshold, executing task migration between the NPU and the DSP in the direction of the smaller similarity score; step S503, reordering the original mapped task structure queue after completing the task migration, and outputting the final scheduling result set.
  8. The chip design method for an NPU and DSP combined acceleration AI model of claim 2, wherein in step S102 the dynamic boundary validity check includes an interval overlap judgment, a subject consistency judgment and an operation closed-loop judgment.
  9. The chip design method for an NPU and DSP combined acceleration AI model as claimed in claim 3, wherein in step S201, when the synchronization dependency tag is 1, it indicates that the access request Access can be granted only after synchronization is completed; when the synchronization dependency tag is 0, it indicates that direct construction of the access request Access is supported.
  10. The method of claim 6, wherein in step S40 the finally output mapped task structure queue includes a subgraph number, a target core ID, a predicted start timestamp and a dependency list.
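To make the access-control machinery of claims 2 and 3 concrete, the following is a minimal, hypothetical Python sketch of a permission configuration table entry (subject ID, address boundary range, operation-rights vector with read = bit0, write = bit1, execute = bit2) and the three-way combined validity check over an access request. All names, the error-bit encoding, and the table contents are illustrative assumptions, not the claimed hardware implementation:

```python
# Hypothetical software model of the permission table and three-way
# access check of claims 2-3.  Field names and error codes are assumed.
READ, WRITE, EXEC = 1 << 0, 1 << 1, 1 << 2   # operation-rights vector bits

class PermEntry:
    def __init__(self, subject, lo, hi, rights):
        self.subject = subject       # access subject ID: "NPU" or "DSP"
        self.lo, self.hi = lo, hi    # accessible physical address range
        self.rights = rights         # bitmask of READ | WRITE | EXEC

def check_access(table, subject, addr, op, sync_done):
    """Return (ok, error_bits) for an access request.

    error_bits (assumed encoding): bit0 = address-boundary check failed,
    bit1 = operation-rights check failed, bit2 = synchronization-state
    check failed.  A failed request would be blocked and the bits
    written to the error state register.
    """
    err = 0
    entry = next((e for e in table if e.subject == subject), None)
    if entry is None or not (entry.lo <= addr <= entry.hi):
        err |= 1 << 0                # address out of the bound range
    if entry is None or not (entry.rights & op):
        err |= 1 << 1                # operation type not permitted
    if not sync_done:
        err |= 1 << 2                # synchronization dependency unmet
    return err == 0, err

table = [PermEntry("NPU", 0x0000, 0x3FFF, READ | WRITE),
         PermEntry("DSP", 0x4000, 0x7FFF, READ)]
ok, err = check_access(table, "DSP", 0x4100, WRITE, sync_done=True)
# DSP has read-only rights in this example table, so the write is
# rejected with the operation-rights bit (bit1) set.
```

In the claimed design these checks would be performed in hardware against boundaries frozen in a one-time curing register; the sketch only mirrors the decision logic.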
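The success condition of the dual-path handshake in claims 4 and 5 can likewise be sketched as a small decision function. The parameter names, the default timeout threshold, and the representation of hsk as a two-element list are assumptions for illustration only:

```python
def handshake_ok(mode, hsk=None, timeout_count=0, timeout_limit=100,
                 irq_status=None):
    """Success test for the dual-path handshake of claims 4-5 (sketch).

    Hard-wire mode: the response line hsk[1] must be 1 and the handshake
    timeout count must not exceed the preset threshold.  Interrupt mode:
    the DSP's interrupt response status signal must read back as zero.
    """
    if mode == "hardwire":
        return hsk[1] == 1 and timeout_count <= timeout_limit
    if mode == "interrupt":
        return irq_status == 0
    raise ValueError("unknown handshake mode")
```

In hardware, hsk[0] carries the NPU's request and hsk[1] the DSP's response; the function above only captures the claimed success determination, not the signalling itself.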
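The dispatch rule of claim 6 (step S402) reduces to a threshold comparison per candidate subgraph: a subgraph whose operator structure coupling score exceeds the threshold goes to the NPU, otherwise it defaults to the DSP. A sketch, assuming scores have already been computed by the (unspecified) structure similarity measurement:

```python
def dispatch(subgraphs, threshold):
    """Assign each candidate subgraph to NPU or DSP per claim 6.

    `subgraphs` is a list of (name, coupling_score) pairs; computing the
    coupling score from topology attributes is outside this sketch.
    """
    return [(name, "NPU" if score > threshold else "DSP")
            for name, score in subgraphs]
```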
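Finally, the rescheduling decision of claim 7 (step S502) compares two similarity scores, one cosine-based against an NPU template and one Manhattan-distance-based against a DSP template, and migrates toward the core with the smaller score when their difference exceeds a threshold. The claim does not give the load-factor formula or how Manhattan distance is converted into a similarity, so both are stated assumptions here:

```python
import math

def cosine_sim(a, b):
    # Normalized-cosine similarity (first load-similarity score).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def manhattan_sim(a, b):
    # Manhattan distance mapped into (0, 1] (second load-similarity
    # score); the mapping 1/(1+d) is an assumption.
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))

def migration_direction(load_vec, npu_template, dsp_template, diff_threshold):
    """Claim-7 decision rule (sketch): when the score difference exceeds
    the threshold, migrate tasks toward the core with the smaller
    similarity score; otherwise keep the current mapping."""
    s_npu = cosine_sim(load_vec, npu_template)
    s_dsp = manhattan_sim(load_vec, dsp_template)
    if abs(s_npu - s_dsp) <= diff_threshold:
        return None                      # no rescheduling needed
    return "to_NPU" if s_npu < s_dsp else "to_DSP"
```

The periodic in-core load perception vector feeding this function would be rebuilt every 10 ns from the active and queued operator counts of each core, per step S501.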

Description

Chip design method for NPU and DSP combined acceleration AI model

Technical Field

The invention relates to the technical field of chip architecture design, in particular to a chip design method for an NPU and DSP combined acceleration AI model.

Background

Currently, with the urgent demand of edge computing and intelligent terminals for complex AI inference capability, more and more AI chip architectures adopt integrated heterogeneous acceleration schemes, achieving a better balance among power consumption, computing power and on-chip resources through the collaborative deployment of a Neural Network Processing Unit (NPU) and a Digital Signal Processor (DSP). Typical schemes, such as ARM and NPU collaborative systems and the Qualcomm Hexagon architecture, have been widely used in mobile, vehicle-mounted and industrial devices. However, in prior-art architectures the inference process of an AI model generally faces the following key problems and technical bottlenecks. First, in the process of cross-core deployment of model operators, existing systems distribute operator subgraphs by relying on static rules or offline strategies, and lack linked perception of operator structure topology, data communication coupling degree and real-time in-core load state. In particular, when the model structure is complex (such as a multi-branch deep residual network) or task dynamics change frequently, efficient inter-core task allocation and scheduling are difficult to realize: some subgraphs are scheduled onto computing units with mismatched resources, producing problems such as load bottlenecks, cache jitter and increased inference latency. Second, in terms of cross-core communication control, the synchronization mechanism between a traditional NPU and DSP mostly adopts simple event flags or interrupt triggering, and lacks the capability of dynamically modeling event context and dependency paths.
In addition, the inter-core access control of existing chips mostly adopts fixed address mapping and static permission configuration, making it difficult to accurately constrain access behavior under a dynamic scheduling structure; in particular, in scenarios with higher safety requirements there is a risk of unauthorized access or illegal rewriting of instructions. Therefore, it is necessary to propose a chip design method for the NPU and DSP combined acceleration AI model that integrates ONNX model structure understanding, dynamic inter-core permission curing, synchronization event path modeling and load-aware scheduling mechanisms, so as to realize efficient, safe and stable operation of complex AI models under a heterogeneous chip architecture.

Disclosure of Invention

In view of the above technical defects, the invention aims to provide a chip design method for an NPU and DSP combined acceleration AI model, so as to solve the technical problem that, in the prior art, heterogeneous-core task partitioning is carried out only on the basis of a static schedule or a preset computational load, and task-granularity cross-core migration scheduling is therefore difficult to realize, especially in complex AI model inference scenarios with dynamic changes in operator distribution or frequent execution congestion.
In order to solve the above technical problems, the invention adopts the following technical scheme. The invention provides a chip design method for an NPU and DSP combined acceleration AI model, comprising the following steps: step S10, executing an internal initialization task inside the chip through an on-chip control register by adopting a segmented permission binding and dynamic boundary curing mechanism, and outputting a permission configuration table; step S20, based on the permission configuration table, executing collaborative instruction parsing and permission verification tasks by adopting a synchronization-aware cross-core access control mechanism, and outputting a cross-core synchronization instruction set; step S30, based on the cross-core synchronization instruction set, executing a synchronization handshake task between the NPU and the DSP by adopting an event-association-based dual-path handshake mechanism, and outputting synchronization result information; step S40, acquiring the ONNX graph structure of an AI model, executing subtask dispatch tasks of the AI model by adopting a task-scheduling-aware selection mechanism based on the ONNX graph structure and the synchronization result information, and outputting a mapped task structure queue; step S50, based on the mapped task structure queue, executing mapping task scheduling tasks by adopting a heterogeneous-core load similarity mapping and adaptive migration decision mechanism, and outputting a final scheduling result set. Preferably, in step S10, the on-chip control register performs an internal initialization task on the chi