US-20260127225-A1 - OPERATOR PROCESSING METHOD AND RELATED APPARATUS
Abstract
This application provides an operator processing method, including: obtaining topology information of a computation graph, where the topology information records a plurality of operators included in the computation graph, graph input data, graph output data, graph intermediate data, and a dependency relationship between the plurality of operators; and constructing an operator execution queue based on the topology information of the computation graph. Each element in the operator execution queue corresponds to one operator and input data and output data of the operator, the input data of the operator is at least one of the graph input data or the graph intermediate data, the output data of the operator is at least one of the graph intermediate data or the graph output data, and an execution order of the plurality of operators is determined based on the dependency relationship between the plurality of operators.
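The queue construction described in the abstract amounts to a topological ordering of the operators, where an operator that consumes graph intermediate data is placed after the operator that produces it. The sketch below is a minimal illustration of that idea, not the patented implementation: the dict-based operator encoding and the `build_execution_queue` helper are assumptions made for illustration only.

```python
from collections import deque

def build_execution_queue(ops):
    """Order operators so that each one runs after the producers of its
    inputs (Kahn's topological sort). Each op is a dict with "name",
    "inputs", and "outputs"; data items are string labels. An input that
    no operator produces is treated as graph input data."""
    producer = {d: op["name"] for op in ops for d in op["outputs"]}
    by_name = {op["name"]: op for op in ops}
    indegree = {op["name"]: 0 for op in ops}
    consumers = {op["name"]: [] for op in ops}
    for op in ops:
        for d in op["inputs"]:
            if d in producer:  # graph intermediate data: adds a dependency
                indegree[op["name"]] += 1
                consumers[producer[d]].append(op["name"])
    ready = deque(n for n, k in indegree.items() if k == 0)
    queue = []
    while ready:
        n = ready.popleft()
        op = by_name[n]
        # Each queue element records the operator plus its input and
        # output data, as in the abstract's description.
        queue.append((n, op["inputs"], op["outputs"]))
        for c in consumers[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return queue
```

Running this on a two-operator graph where "mul" consumes the intermediate "t0" produced by "add" yields the queue ["add", "mul"], regardless of the order the operators were listed in.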
Inventors
- Xiong Xiao
- Lian Deng
- Xiaohui Guo
Assignees
- HUAWEI TECHNOLOGIES CO., LTD.
Dates
- Publication Date
- 20260507
- Application Date
- 20251229
- Priority Date
- 20230629
Claims (20)
- 1 . An operator processing method, comprising: obtaining topology information of a computation graph, wherein the topology information records a plurality of operators comprised in the computation graph, graph input data, graph output data, graph intermediate data, and a dependency relationship between the plurality of operators; and constructing an operator execution queue based on the topology information of the computation graph, wherein each element in the operator execution queue corresponds to one operator in the computation graph and input data and output data of the operator, the input data of the operator is at least one of the graph input data or the graph intermediate data, the output data of the operator is at least one of the graph intermediate data or the graph output data, and an execution order of the plurality of operators in the operator execution queue is determined based on the dependency relationship between the plurality of operators.
- 2 . The method according to claim 1 , further comprising: numbering the graph input data, the graph output data, and the graph intermediate data by partition, wherein each element in the operator execution queue records one operator in the computation graph, a number of input data of the operator, and a number of output data of the operator.
- 3 . The method according to claim 1 , further comprising: determining a size of a graph workspace based on shape information of the graph input data and a correspondence between the operator corresponding to each element in the operator execution queue, the input data of the operator, and the output data of the operator; and applying for a memory as the graph workspace based on the size of the graph workspace.
- 4 . The method according to claim 3 , further comprising: obtaining shape information of the graph intermediate data and dependency information of the graph intermediate data, wherein the dependency information indicates an operator depending on the graph intermediate data; and determining an offset address of the graph intermediate data in the graph workspace and a size of the graph intermediate data based on the shape information of the graph intermediate data and the dependency information of the graph intermediate data.
- 5 . The method according to claim 4 , wherein the graph intermediate data comprises first graph intermediate data and second graph intermediate data, and when an operator depending on the first graph intermediate data is executed before an operator outputting the second graph intermediate data, an offset address of the first graph intermediate data is less than or equal to an offset address of the second graph intermediate data.
- 6 . The method according to claim 3 , further comprising: determining offset addresses of tiling parameters of the plurality of operators in the graph workspace and sizes of the tiling parameters of the plurality of operators; and loading the tiling parameters of the plurality of operators to the graph workspace based on the offset addresses of the tiling parameters of the plurality of operators in the graph workspace and the sizes of the tiling parameters of the plurality of operators.
- 7 . The method according to claim 3 , wherein the graph workspace is further provided with a scratch memory, and wherein the method further comprises: determining an offset address and a size of the scratch memory in the graph workspace based on a data amount of temporary data generated by execution of the plurality of operators.
- 8 . The method according to claim 7 , wherein scratch memories of the plurality of operators have a same offset address.
- 9 . The method according to claim 1 , wherein the computation graph comprises a first computation graph and a second computation graph, operators comprised in the first computation graph are the same as operators comprised in the second computation graph, a dependency relationship between the operators in the first computation graph is the same as a dependency relationship between the operators in the second computation graph, and shape information of data in the first computation graph is the same as shape information of data in the second computation graph, and wherein the method further comprises: when the first computation graph is executed, adjusting an address of the graph input data to an address of graph input data of the second computation graph.
- 10 . An operator processing apparatus, comprising: at least one storage storing computer-readable instructions; and at least one processor configured to execute the computer-readable instructions to cause the operator processing apparatus to perform steps of: obtaining topology information of a computation graph, wherein the topology information records a plurality of operators comprised in the computation graph, graph input data, graph output data, graph intermediate data, and a dependency relationship between the plurality of operators; and constructing an operator execution queue based on the topology information of the computation graph, wherein each element in the operator execution queue corresponds to one operator in the computation graph and input data and output data of the operator, the input data of the operator is at least one of the graph input data or the graph intermediate data, the output data of the operator is at least one of the graph intermediate data or the graph output data, and an execution order of the plurality of operators in the operator execution queue is determined based on the dependency relationship between the plurality of operators.
- 11 . The apparatus according to claim 10 , wherein the at least one processor is further configured to execute the computer-readable instructions to cause the operator processing apparatus to perform steps of: numbering the graph input data, the graph output data, and the graph intermediate data by partition, wherein each element in the operator execution queue records one operator in the computation graph, a number of input data of the operator, and a number of output data of the operator.
- 12 . The apparatus according to claim 10 , wherein the at least one processor is further configured to execute the computer-readable instructions to cause the operator processing apparatus to perform steps of: determining a size of a graph workspace based on shape information of the graph input data and a correspondence between the operator corresponding to each element in the operator execution queue, the input data of the operator, and the output data of the operator; and applying for a memory as the graph workspace based on the size of the graph workspace.
- 13 . The apparatus according to claim 12 , wherein the at least one processor is further configured to execute the computer-readable instructions to cause the operator processing apparatus to perform steps of: obtaining shape information of the graph intermediate data and dependency information of the graph intermediate data, wherein the dependency information indicates an operator depending on the graph intermediate data; and determining an offset address of the graph intermediate data in the graph workspace and a size of the graph intermediate data based on the shape information of the graph intermediate data and the dependency information of the graph intermediate data.
- 14 . The apparatus according to claim 13 , wherein the graph intermediate data comprises first graph intermediate data and second graph intermediate data, and when an operator depending on the first graph intermediate data is executed before an operator outputting the second graph intermediate data, an offset address of the first graph intermediate data is less than or equal to an offset address of the second graph intermediate data.
- 15 . The apparatus according to claim 10 , wherein the at least one processor is further configured to execute the computer-readable instructions to cause the operator processing apparatus to perform steps of: determining offset addresses of tiling parameters of the plurality of operators in the graph workspace and sizes of the tiling parameters of the plurality of operators; and loading the tiling parameters of the plurality of operators to the graph workspace based on the offset addresses of the tiling parameters of the plurality of operators in the graph workspace and the sizes of the tiling parameters of the plurality of operators.
- 16 . The apparatus according to claim 12 , wherein the graph workspace is further provided with a scratch memory, and wherein the at least one processor is further configured to execute the computer-readable instructions to cause the operator processing apparatus to perform steps of: determining an offset address and a size of the scratch memory in the graph workspace based on a data amount of temporary data generated by execution of the plurality of operators.
- 17 . The apparatus according to claim 16 , wherein scratch memories of the plurality of operators have a same offset address.
- 18 . The apparatus according to claim 10 , wherein the computation graph comprises a first computation graph and a second computation graph, operators comprised in the first computation graph are the same as operators comprised in the second computation graph, a dependency relationship between the operators in the first computation graph is the same as a dependency relationship between the operators in the second computation graph, and shape information of data in the first computation graph is the same as shape information of data in the second computation graph, and wherein the at least one processor is further configured to execute the computer-readable instructions to cause the operator processing apparatus to perform steps of: when the first computation graph is executed, adjusting an address of the graph input data to an address of graph input data of the second computation graph.
- 19 . A non-transitory computer-readable medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform steps of: obtaining topology information of a computation graph, wherein the topology information records a plurality of operators comprised in the computation graph, graph input data, graph output data, graph intermediate data, and a dependency relationship between the plurality of operators; and constructing an operator execution queue based on the topology information of the computation graph, wherein each element in the operator execution queue corresponds to one operator in the computation graph and input data and output data of the operator, the input data of the operator is at least one of the graph input data or the graph intermediate data, the output data of the operator is at least one of the graph intermediate data or the graph output data, and an execution order of the plurality of operators in the operator execution queue is determined based on the dependency relationship between the plurality of operators.
- 20 . The non-transitory computer-readable medium according to claim 19 , wherein the one or more processors further execute the computer instructions to cause the one or more processors to perform steps of: numbering the graph input data, the graph output data, and the graph intermediate data by partition, wherein each element in the operator execution queue records one operator in the computation graph, a number of input data of the operator, and a number of output data of the operator.
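Claims 3 through 8 describe planning a single graph workspace that holds graph intermediate data, the operators' tiling parameters, and a scratch region that all operators share at the same offset (claim 8). The following sketch illustrates one way such offset planning could look; the function name, the alignment value, and the simple sequential placement policy are illustrative assumptions, not the claimed method (in particular, it does not implement the lifetime-based memory reuse of claims 5 and 14).

```python
def plan_workspace(intermediates, tiling_sizes, scratch_sizes, align=512):
    """Assign offsets in one contiguous graph workspace: intermediate
    tensors first, then per-operator tiling parameters, then a single
    shared scratch region sized for the most demanding operator.

    intermediates: {data_name: size_in_bytes}
    tiling_sizes:  {op_name: size_in_bytes}
    scratch_sizes: {op_name: size_in_bytes}
    Returns (offsets, scratch_offset, total_workspace_size)."""
    def aligned(n):
        return (n + align - 1) // align * align

    offsets, cursor = {}, 0
    for name, size in intermediates.items():   # graph intermediate data
        offsets[name] = cursor
        cursor += aligned(size)
    for op, size in tiling_sizes.items():      # tiling parameters (claim 6)
        offsets[("tiling", op)] = cursor
        cursor += aligned(size)
    # One scratch region reused by every operator (claims 7 and 8), so
    # its size is the maximum any single operator needs.
    scratch_off = cursor
    cursor += aligned(max(scratch_sizes.values(), default=0))
    return offsets, scratch_off, cursor
```

The total returned here plays the role of the "size of the graph workspace" in claim 3: the host can request one device allocation of that size up front, and each operator then addresses its data by offset.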
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of International Application No. PCT/CN2024/083231, filed on March 22, 2024, which claims priority to Chinese Patent Application No. 202311281657.5, filed on September 27, 2023, and Chinese Patent Application No. 202310792403.3, filed on June 29, 2023. All of the aforementioned patent applications are hereby incorporated by reference in their entireties.

TECHNICAL FIELD

This application relates to the field of computer technologies, and in particular, to an operator processing method and apparatus, a computer-readable storage medium, and a computer program product.

BACKGROUND

An operator is a computing unit. For example, a binary operator is usually a computing unit that exists in a binary encoding form. An operator usually includes host logic and device logic. A host includes a central processing unit (CPU) and a host memory, and a device includes an external processing unit and a device memory. The external processing unit may be a neural network processing unit (NPU) or a graphics processing unit (GPU). Different types of processors may implement different types of instruction set architectures.

Operator execution usually involves interaction between a host side and a device side. If the input data of an operator is already in the device memory, the main time consumption includes the following: the host side derives an output shape based on an input shape, applies for memory on the device side, calculates a tiling parameter of the operator, transfers the tiling parameter to the device side, and delivers a computing task to the device side; the device side then executes the computing task. In many service scenarios, for example, training a neural network or performing inference based on a trained neural network, a plurality of operators are usually executed.
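The per-operator host-side overhead described above (shape derivation, device memory request, tiling computation, host-to-device transfer, task delivery) can be sketched as follows. `ToyDevice`, `run_elementwise_op`, and all sizes here are illustrative stand-ins, not any real device API or the method of this application.

```python
import math

class ToyDevice:
    """Records the sequence of host->device interactions for one operator."""
    def __init__(self):
        self.log = []
    def malloc(self, nbytes):
        self.log.append(("malloc", nbytes))
        return bytearray(nbytes)
    def copy_from_host(self, blob):
        self.log.append(("h2d", len(blob)))
    def launch(self, kernel_name):
        self.log.append(("launch", kernel_name))

def run_elementwise_op(name, input_shape, device, elem_bytes=4):
    out_shape = input_shape                                  # 1. derive output shape
    out = device.malloc(math.prod(out_shape) * elem_bytes)   # 2. apply for device memory
    tiling = bytes(16)                                       # 3. compute tiling parameters (stubbed)
    device.copy_from_host(tiling)                            # 4. transfer tiling to the device
    device.launch(name)                                      # 5. deliver the computing task
    return out
```

When many operators run in sequence, this round trip repeats once per operator, which is the per-operator cost that motivates batching their invocation.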
When a plurality of operators are executed, a computation graph including the plurality of operators may be compiled into an intermediate representation (IR) by using a compilation framework, for example, a tensor virtual machine (TVM) or accelerated linear algebra (XLA), where the IR describes the execution process; an execution graph is then generated, and operators are invoked in batches by using the execution graph. However, this method requires a complex compilation optimization process, and the execution graph needs to be generated offline. It is a heavyweight operator scheduling technology and is inapplicable to scenarios in which lightweight batch operator invoking is required, for example, to a scenario of implementing an acceleration library application programming interface (API).

SUMMARY

This application provides an operator processing method. In the method, an operator execution queue is constructed based on topology information of a computation graph, so that an execution graph is constructed in the form of an operator execution queue. The method can be flexibly applied to scenarios in which operators need to be scheduled in batches, so that service requirements are met. This application further provides an operator processing apparatus, a computer-readable storage medium, and a computer program product that correspond to the operator processing method.

According to a first aspect, this application provides an operator processing method. The method may be performed by an operator processing apparatus. The operator processing apparatus may be a software apparatus.
The software apparatus may be an independent software package or a cloud service such as a cloud service provided by using an interface (for example, a graph construction interface), and supports a user in constructing a graph in a user-defined manner based on operators, to represent a topology graph in a form of a queue. The software apparatus may be deployed in a computing device cluster (for example, a computing device executing an operator or a third-party computing device cluster), and the computing device cluster executes program code of the software apparatus to perform the operator processing method in this application. The operator processing apparatus may alternatively be a hardware apparatus, for example, a computing device or a computing device cluster that has an operator processing function. When the operator processing apparatus runs, the operator processing method in this application is performed. Specifically, the operator processing apparatus obtains topology information of a computation graph, where the topology information records a plurality of operators included in the computation graph, graph input data, graph output data, graph intermediate data, and a dependency relationship between the plurality of operators; and then constructs an operator execution queue based on the topology information of the computation graph. Each element in the operator execution queue corresponds to one o