CN-121979532-A - Model processing method and device
Abstract
The application provides a model processing method and device, and relates to the technical field of artificial intelligence. The method comprises: constructing a directed acyclic graph that abstracts a target neural network model, to serve as a high-level IR of the model; performing optimization on the basis of the high-level IR, including operator fusion, reordering, sub-graph division and node multiplexing, to obtain a graph optimization result that serves as a low-level IR of the model; and performing optimization on the basis of the low-level IR, including on-chip operator fusion, instruction side-by-side labeling and hardware instruction mapping, to obtain optimized code for the target neural network model. By processing the target neural network model in layers and applying the corresponding compilation optimization at each layer, the application balances the usability of upper-layer applications against the extreme performance of the underlying hardware: the computational potential of the data flow architecture is fully released while the user need not attend to hardware details, so that programming of the model with high efficiency and high hardware utilization is realized.
Inventors
- Li Ming
- Cai Quanxiong
- Niu Xinyu
Assignees
- 深圳鲲云信息科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20251229
Claims (11)
- 1. A method of processing a model, comprising: generating a directed acyclic graph from a target neural network model based on a programming paradigm, wherein the directed acyclic graph comprises operator nodes and tensor edges; based on the directed acyclic graph, identifying consecutive operator nodes conforming to a fusion rule and fusing them into new operator nodes, to obtain an operator fusion result; sorting the operator nodes in the operator fusion result and performing sub-graph division based on the sorting result, to obtain a sub-graph division result; multiplexing operator nodes requiring repeated calculation based on the sub-graph division result, to obtain a graph optimization result; extracting multiple-input operator nodes based on the graph optimization result, and performing on-chip operator fusion labeling on the graph optimization result according to the inputs of the multiple-input operator nodes; in the graph optimization result after on-chip operator fusion labeling, performing instruction side-by-side labeling on operator node pairs which are located in the same sub-graph and have no data dependence; and mapping the graph optimization result after instruction side-by-side labeling into hardware instructions according to predefined unified operation primitives, thereby obtaining optimized code for the target neural network model.
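The instruction side-by-side labeling step of claim 1 pairs operator nodes that share a sub-graph but have no data dependence. A minimal sketch of that dependence test, assuming a simple edge-list DAG representation (the node and edge names are invented for illustration):

```python
# Hypothetical sketch of the instruction side-by-side labeling step of
# claim 1: two operator nodes in the same sub-graph may be labeled for
# side-by-side execution only if neither can reach the other in the DAG.
from collections import defaultdict

def reachable(adj, src, dst):
    """Depth-first reachability test on an adjacency map."""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(adj.get(n, ()))
    return False

def side_by_side_pairs(edges, subgraph_nodes):
    """Return node pairs in the same sub-graph with no data dependence."""
    adj = defaultdict(list)
    for s, d in edges:
        adj[s].append(d)
    nodes = sorted(subgraph_nodes)
    pairs = []
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if not reachable(adj, a, b) and not reachable(adj, b, a):
                pairs.append((a, b))
    return pairs

# Diamond graph: b and c are independent; a feeds both, both feed d.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(side_by_side_pairs(edges, {"a", "b", "c", "d"}))  # [('b', 'c')]
```

The pairwise reachability check is quadratic; a production compiler would more likely precompute a transitive closure or use topological levels, but the labeling criterion is the same.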
- 2. The method of claim 1, wherein generating the directed acyclic graph from the target neural network model based on the programming paradigm comprises: based on the programming paradigm, identifying the computation operations of the target neural network model to obtain operators, and taking the operators as operator nodes; and representing the input and output data between layers of the target neural network model as tensors, and adding tensor edges between the operator nodes according to the tensors, to obtain the directed acyclic graph.
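The graph construction of claim 2 can be sketched as follows: operators become nodes, and a tensor edge is added wherever one operator's output tensor is another operator's input. The operator and tensor names below are invented for illustration:

```python
# Hypothetical sketch of claim 2: operators as nodes, tensors as directed
# edges between the operators that produce and consume them.
from dataclasses import dataclass, field

@dataclass
class OperatorNode:
    name: str                                    # e.g. "conv1"
    op_type: str                                 # e.g. "Conv2D"
    inputs: list = field(default_factory=list)   # tensor names consumed
    outputs: list = field(default_factory=list)  # tensor names produced

def build_dag(nodes):
    """Add a tensor edge wherever one node's output tensor is
    another node's input tensor."""
    producers = {t: n.name for n in nodes for t in n.outputs}
    edges = []
    for n in nodes:
        for t in n.inputs:
            if t in producers:
                edges.append((producers[t], n.name, t))  # (src, dst, tensor)
    return edges

ops = [
    OperatorNode("conv1", "Conv2D", inputs=["x"], outputs=["t0"]),
    OperatorNode("relu1", "ReLU", inputs=["t0"], outputs=["t1"]),
]
edges = build_dag(ops)  # [('conv1', 'relu1', 't0')]
```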
- 3. The method of claim 1, wherein sorting the operator nodes in the operator fusion result and performing sub-graph division based on the sorting result to obtain a sub-graph division result comprises: S131, classifying the operator nodes to obtain a classification result; S132, grouping linearly connected operator nodes based on the classification result, to obtain a plurality of operator node groups; S133, sorting the operator nodes according to a depth-first traversal strategy, to obtain a depth ordering result; S134, obtaining the dependency relationships of the operator nodes according to the depth ordering result; S135, sorting the current operator nodes according to a breadth-first traversal strategy, to obtain a breadth ordering result; S136, based on the breadth ordering result, extracting a reference operator node and an operator node to be inserted from the operator nodes of the same breadth, inserting the operator node to be inserted, together with the operator nodes it depends on, at the position corresponding to a target operator node meeting a preset condition in the depth ordering result, and inserting a sub-graph segmentation mark, wherein the preset condition comprises that the operator node to be inserted and the target operator node are located in the same operator node group, and that the reference operator node depends on the target operator node; S137, repeating steps S135-S136 until all operator nodes of the same breadth have been traversed, to obtain the sorting result of the operator nodes; and S138, dividing the directed acyclic graph according to the sub-graph segmentation marks based on the sorting result, to obtain a plurality of sub-graphs as the sub-graph division result.
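The ordering-and-division pipeline of claim 3 can be sketched in a greatly simplified form: produce a depth-first topological order, then cut it into sub-graphs wherever consecutive nodes belong to different linear groups. Group assignment (S131-S132), the breadth-first insertion loop (S135-S137) and the preset conditions are abstracted away here; the group map and node names are assumptions:

```python
# Greatly simplified sketch of steps S131-S138: depth-first topological
# ordering followed by sub-graph segmentation at group boundaries.
def dfs_order(adj, roots):
    """Reversed post-order DFS, i.e. a topological order of the DAG."""
    order, seen = [], set()
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for child in adj.get(n, ()):
            visit(child)
        order.append(n)
    for r in roots:
        visit(r)
    return order[::-1]

def split_subgraphs(order, group_of):
    """Insert a segmentation mark whenever the linear group changes."""
    subgraphs, current = [], [order[0]]
    for n in order[1:]:
        if group_of[n] != group_of[current[-1]]:
            subgraphs.append(current)
            current = [n]
        else:
            current.append(n)
    subgraphs.append(current)
    return subgraphs

adj = {"a": ["b"], "b": ["c"], "c": []}
order = dfs_order(adj, ["a"])            # ['a', 'b', 'c']
groups = {"a": 0, "b": 0, "c": 1}
print(split_subgraphs(order, groups))    # [['a', 'b'], ['c']]
```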
- 4. The method of claim 1, wherein extracting multiple-input operator nodes based on the graph optimization result and performing on-chip operator fusion labeling on the graph optimization result according to the inputs of the multiple-input operator nodes comprises: S151, extracting a multiple-input operator node from the graph optimization result to serve as the current operator node; S152, reversely searching the input branches of the current operator node, and storing the operator nodes on the input branches into an operator list of the current operator node; S153, repeating steps S151-S152 until the operator nodes in the current sub-graph have been traversed; S154, cutting the operator list by reverse traversal according to a maximum block-cutting strategy, to obtain a plurality of fusion blocks; and S155, writing the information of the fusion blocks back to the operator nodes of the graph optimization result, to serve as the on-chip operator fusion labels.
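Step S154 can be sketched as a reverse-traversal cut of the operator list into fusion blocks. The capacity limit `max_block` is an assumption standing in for whatever on-chip resource budget the strategy actually enforces; the patent does not fix a concrete criterion here:

```python
# Hypothetical sketch of S154: cut an operator list into fusion blocks by
# reverse traversal under a maximum-block-size strategy. max_block is an
# assumed capacity limit (e.g. an on-chip buffer budget).
def cut_fusion_blocks(op_list, max_block):
    """Walk the list from the end and greedily fill blocks of at most
    max_block operators, preserving the original order inside each block."""
    blocks = []
    i = len(op_list)
    while i > 0:
        start = max(0, i - max_block)
        blocks.append(op_list[start:i])
        i = start
    return blocks[::-1]  # restore front-to-back order

print(cut_fusion_blocks(["conv", "bn", "relu", "add", "pool"], 2))
# [['conv'], ['bn', 'relu'], ['add', 'pool']]
```

Cutting from the tail rather than the head means any leftover, smaller-than-maximum block lands at the front of the list, which matches the "reverse traversal" wording of S154.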
- 5. The method according to claim 4, wherein step S151 comprises: selecting an input operator node of the graph optimization result as a starting node, and searching a branch from the starting node as a first branch; based on the first branch, obtaining a child node of the starting node as a target node; and in the case that the target node is a multiple-input operator node, taking the target node as the current operator node.
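The multiple-input test underlying S151 and claim 5 reduces to an in-degree check: a node qualifies when more than one tensor edge feeds it (an `Add` on a residual connection, for example). A minimal sketch, with invented node names:

```python
# Hypothetical sketch of the multiple-input test of S151/claim 5: a node
# is a multiple-input operator node when at least two edges feed it.
from collections import Counter

def multi_input_nodes(edges):
    """edges: (src, dst) pairs; return dst nodes with >= 2 incoming edges."""
    indegree = Counter(dst for _, dst in edges)
    return sorted(n for n, d in indegree.items() if d >= 2)

# 'add' receives two branches (main path and skip connection).
edges = [("conv", "add"), ("skip", "add"), ("add", "relu")]
print(multi_input_nodes(edges))  # ['add']
```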
- 6. The method according to claim 1, wherein mapping the graph optimization result after instruction side-by-side labeling into hardware instructions according to the predefined unified operation primitives, thereby obtaining optimized code for the target neural network model, comprises: identifying specific sub-graph patterns and/or operator node sequences in the labeled graph optimization result according to the unified operation primitives, to obtain a plurality of operation primitive sequences; and converting the plurality of operation primitive sequences into corresponding hardware instructions, so as to obtain the optimized code of the target neural network model.
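The two stages of claim 6, pattern recognition into primitives and then translation into hardware instructions, can be sketched as a pair of lookup tables with greedy longest-match lowering. Both tables, the primitive names and the instruction mnemonics are invented for illustration; the patent does not define concrete opcodes:

```python
# Hypothetical sketch of claim 6: recognize operator-node sequences as
# unified operation primitives, then translate each primitive into a
# hardware instruction mnemonic via a lookup table.
PRIMITIVE_OF = {          # operator sequence -> unified operation primitive
    ("conv", "relu"): "COMPUTE_CONV_RELU",
    ("load",): "MEM_LOAD",
}
INSTRUCTION_OF = {        # primitive -> hardware instruction (invented)
    "COMPUTE_CONV_RELU": "DFCONV r1, r2",
    "MEM_LOAD": "DFLD r1, [mem]",
}

def lower(op_sequence):
    """Greedy longest-match lowering of an operator sequence."""
    instructions, i = [], 0
    while i < len(op_sequence):
        for length in (2, 1):  # try longer patterns first
            pat = tuple(op_sequence[i:i + length])
            if pat in PRIMITIVE_OF:
                instructions.append(INSTRUCTION_OF[PRIMITIVE_OF[pat]])
                i += length
                break
        else:
            raise ValueError(f"no primitive matches at {op_sequence[i:]}")
    return instructions

print(lower(["load", "conv", "relu"]))
# ['DFLD r1, [mem]', 'DFCONV r1, r2']
```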
- 7. The method according to claim 1 or 6, wherein the unified operation primitives include memory management class primitives, computation class primitives, scheduling class primitives, and/or layout conversion class primitives.
- 8. The method of claim 1, wherein identifying consecutive operator nodes that conform to a fusion rule based on the directed acyclic graph and fusing them into new operator nodes to obtain an operator fusion result comprises: selecting a seed operator node from the operators in the directed acyclic graph; and, using a greedy algorithm, fusing downward from the seed operator node the consecutive operator nodes that meet the fusion rule into new operator nodes, to obtain the operator fusion result.
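The greedy seed-based fusion of claim 8 can be sketched on a linear operator chain: start at the seed and absorb successors while the fusion rule holds. The rule used here, a whitelist of elementwise operators, is an assumption; the patent leaves the concrete fusion rule open:

```python
# Hypothetical sketch of claim 8: from a seed operator node, greedily
# fuse downward along the chain while the fusion rule holds. The rule
# (elementwise ops fuse into the seed) is an assumed placeholder.
FUSIBLE = {"relu", "add", "mul"}  # assumed elementwise fusion whitelist

def greedy_fuse(chain, seed_index):
    """Fuse consecutive fusible operators following the seed into one node."""
    fused = [chain[seed_index]]
    i = seed_index + 1
    while i < len(chain) and chain[i] in FUSIBLE:
        fused.append(chain[i])
        i += 1
    new_node = "+".join(fused)          # the fused operator node
    return chain[:seed_index] + [new_node] + chain[i:]

print(greedy_fuse(["load", "conv", "relu", "add", "pool"], 1))
# ['load', 'conv+relu+add', 'pool']
```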
- 9. A model processing apparatus, comprising: a computation graph unit, configured to generate a directed acyclic graph from a target neural network model based on a programming paradigm, wherein the directed acyclic graph comprises operator nodes and tensor edges; an operator fusion unit, configured to identify consecutive operator nodes that conform to a fusion rule based on the directed acyclic graph and fuse them into new operator nodes, so as to obtain an operator fusion result; a sorting and dividing unit, configured to sort the operator nodes in the operator fusion result and perform sub-graph division based on the sorting result, to obtain a sub-graph division result; a node multiplexing unit, configured to multiplex operator nodes requiring repeated calculation based on the sub-graph division result, to obtain a graph optimization result; an on-chip fusion unit, configured to extract multiple-input operator nodes based on the graph optimization result and perform on-chip operator fusion labeling on the graph optimization result according to the inputs of the multiple-input operator nodes; an instruction rearrangement unit, configured to perform instruction side-by-side labeling on operator node pairs which are located in the same sub-graph and have no data dependence, in the graph optimization result after on-chip operator fusion labeling; and an instruction mapping unit, configured to map the graph optimization result after instruction side-by-side labeling into hardware instructions according to the predefined unified operation primitives, so as to obtain optimized code for the target neural network model.
- 10. An electronic device, comprising: one or more processors; and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-8.
- 11. A computer readable storage medium having stored thereon a computer program or instructions which, when executed by a processor, implement the method of any of claims 1-8.
Description
Model processing method and device

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a model processing method and device.

Background

Currently, with the rapid expansion of applications, many scenarios, such as natural language processing and recommendation engines, employ data stream processing, and the performance and efficiency limitations of conventional instruction set architectures have become apparent. To solve this problem and enable the next generation of scientific and machine learning applications, the data flow architecture has become a mainstream technical route for leading international artificial intelligence chip enterprises: it can provide several times higher performance with the same number of transistors, or similar performance with fewer transistors, so that the cost-performance ratio is greatly improved. Programming model systems developed on data flow architecture technology provide high-performance, low-latency, high-cost-performance computing software support for computer vision, speech semantics and large model applications. However, on the software side, models are complex and frameworks are numerous, and practitioners in the application fields are unfamiliar with data flow architecture hardware and lack specialized training in computational development for it. The result is inefficiency and an inability to translate the computational potential of the data flow architecture into practical productivity.

Disclosure of Invention

On this basis, the application provides a model processing method and a model processing apparatus, realizing programming of the model with high efficiency and high hardware utilization.
According to one aspect of the application, a processing method of a model is provided, comprising the steps of: generating a directed acyclic graph from a target neural network model based on a programming paradigm, wherein the directed acyclic graph comprises operator nodes and tensor edges; identifying consecutive operator nodes conforming to a fusion rule based on the directed acyclic graph, and fusing them into new operator nodes to obtain an operator fusion result; sorting the operator nodes in the operator fusion result, and performing sub-graph division based on the sorting result to obtain a sub-graph division result; multiplexing operator nodes requiring repeated calculation based on the sub-graph division result, to obtain a graph optimization result; extracting multiple-input operator nodes based on the graph optimization result, and performing on-chip operator fusion labeling on the graph optimization result according to the inputs of the multiple-input operator nodes; performing instruction side-by-side labeling on operator node pairs which are located in the same sub-graph and have no data dependence in the graph optimization result; and mapping the graph optimization result after instruction side-by-side labeling into hardware instructions according to predefined unified operation primitives, so that optimized code for the target neural network model is obtained. According to some embodiments, the directed acyclic graph is generated from the target neural network model based on the programming paradigm by identifying the computation operations of the target neural network model based on the programming paradigm, obtaining operators and taking the operators as operator nodes, representing the input and output data between layers of the target neural network model as tensors, and adding tensor edges between the operator nodes according to the tensors.
According to some embodiments, sorting the operator nodes in the operator fusion result and performing sub-graph division based on the sorting result to obtain a sub-graph division result comprises: S131, classifying the operator nodes to obtain a classification result; S132, grouping linearly connected operator nodes based on the classification result, to obtain a plurality of operator node groups; S133, sorting the operator nodes according to a depth-first traversal strategy, to obtain a depth ordering result; S134, obtaining the dependency relationships of the operator nodes according to the depth ordering result; S135, sorting the current operator nodes according to a breadth-first traversal strategy, to obtain a breadth ordering result; S136, based on the breadth ordering result, extracting a reference operator node and an operator node to be inserted from the operator nodes of the same breadth, inserting the operator node to be inserted, together with the operator nodes it depends on, at the position corresponding to a target operator node in the depth ordering result, and inserting sub-graph segmentation marks,