CN-121233319-B - Task allocation method, device, equipment, medium and product of multi-core processor
Abstract
The invention provides a task allocation method, device, equipment, medium and product for a multi-core processor. The method comprises: establishing the computation constraints of each target operator in multi-core computation and setting the initial network state; dynamically propagating the layout labels of tensors according to the execution order of the target operators, and dynamically adjusting the layout labels in combination with the computation constraints of the current operator, so as to generate a layout path combination and a set of potentially optimal primitives that satisfy the multi-core computation requirements; constructing a network graph of the multi-core computation from the layout path combination, and searching the network graph for a target path with the PathFinder algorithm; and mapping the layout labels of the target path onto the corresponding tensors, so that the corresponding primitives are activated according to the layout labels when each target operator is executed. In this way, the framework-level parallelization algorithm is deeply coordinated with the Guangyu ("light feather") chip L3 architecture: the cross-layer/cross-chip data movement paths in 3D-stacked memory (e.g., HBM/3D-IC) are optimized, L3 data-access power consumption can be markedly reduced, and running time can be shortened.
Inventors
- Request for anonymity
- Request for anonymity
Assignees
- 上海光羽芯辰科技有限公司
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2025-09-26
Claims (10)
- 1. A task allocation method for a multi-core processor, the method comprising: establishing the computation constraints of each target operator in multi-core computation, and setting an initial network state; dynamically propagating the layout labels of tensors according to the execution order of the target operators, and dynamically adjusting the layout labels in combination with the computation constraints of the current operator, so as to generate a layout path combination and a set of potentially optimal primitives that satisfy the multi-core computation requirements, wherein each layout path represents the cooperative mapping between a tensor layout-conversion sequence and a target-operator execution sequence; constructing a network graph of the multi-core computation from the layout path combination, and searching the network graph for a target path with the PathFinder algorithm, wherein the target path is a candidate path of relatively low cost found in the network graph by the PathFinder algorithm; and mapping the layout labels of the target path onto the corresponding tensors, so that the corresponding primitives are activated according to the layout labels when each target operator is executed.
- 2. The method of claim 1, wherein establishing the computation constraints of each target operator in the multi-core computation comprises: constructing a three-level segmentation constraint for binary operators in the multi-core computation, wherein the three-level segmentation constraint means that the first input matrix, the second input matrix and the output matrix of a binary operator are partitioned along the same data dimension with the same block size and then distributed directly to the multi-core processor for parallel computation.
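The three-level segmentation constraint of claim 2 can be illustrated with a minimal sketch (not from the patent; the function names, the choice of the row dimension, the elementwise-add operator, and the assumption that the row count divides evenly among cores are all illustrative assumptions):

```python
def split_rows(mat, n_cores):
    # Partition along one data dimension (rows here) with the same block size
    # for every core; assumes the row count divides evenly by n_cores.
    block = len(mat) // n_cores
    return [mat[c * block:(c + 1) * block] for c in range(n_cores)]

def parallel_elementwise_add(a, b, n_cores):
    # Both inputs and the output are split along the same dimension with the
    # same block size, so each core works on aligned blocks independently.
    a_parts = split_rows(a, n_cores)
    b_parts = split_rows(b, n_cores)
    out_parts = [
        [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(pa, pb)]
        for pa, pb in zip(a_parts, b_parts)
    ]
    # The output inherits the same partitioning; concatenate the blocks.
    return [row for part in out_parts for row in part]
```

Because all three matrices share one partitioning, no inter-core communication is needed during the operation itself.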
- 3. The method of claim 1, wherein establishing the computation constraints of each target operator in the multi-core computation comprises: constructing a first computation constraint or a second computation constraint for matrix multiplication operators in the multi-core computation, wherein the first computation constraint keeps a complete copy of the first input matrix of the matrix multiplication operator on every core, partitions the second input matrix and the output matrix along their column dimension with the same block size, and distributes them to the multi-core processor for parallel computation; and the second computation constraint keeps a complete copy of the second input matrix on every core, partitions the first input matrix and the output matrix along their row dimension with the same block size, and distributes them to the multi-core processor for parallel computation.
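The first computation constraint of claim 3 can be sketched as follows (an illustrative stand-in, not the patent's implementation; names and the even-divisibility assumption are hypothetical):

```python
def matmul(a, b):
    # Plain reference matrix multiplication.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matmul_replicate_first(a, b, n_cores):
    # First computation constraint: every core holds a full copy of `a`;
    # `b` (and hence the output) is split along the column dimension with
    # the same block size.  Assumes the column count divides evenly.
    block = len(b[0]) // n_cores
    col_blocks = []
    for c in range(n_cores):
        b_part = [row[c * block:(c + 1) * block] for row in b]
        col_blocks.append(matmul(a, b_part))  # per-core partial output
    # Stitch the per-core column blocks back into the full output.
    return [sum((col_blocks[c][i] for c in range(n_cores)), [])
            for i in range(len(a))]
```

The second constraint is symmetric: replicate `b` on every core and split `a` (and the output) along the row dimension instead.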
- 4. The method of claim 1, wherein dynamically propagating the layout labels of tensors according to the execution order of the target operators comprises: according to the execution order of the target operators in the network, propagating the output-tensor layout label of a preceding target operator along the data-dependence chain to the subsequent target operators that depend on that output tensor.
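The label propagation of claim 4 can be illustrated with a minimal sketch (the data structures, names, and the simple inheritance rule are assumptions; the patent's actual rule would consult each operator's computation constraints):

```python
def propagate_labels(ops, initial_labels):
    # ops: (operator name, input tensor names, output tensor name), listed in
    # execution order so producers precede consumers on each dependence chain.
    labels = dict(initial_labels)
    for _name, inputs, output in ops:
        # Hypothetical propagation rule: the output tensor inherits the
        # layout label of the operator's first input, so the label flows
        # forward along the data-dependence chain.
        labels[output] = labels[inputs[0]]
    return labels
```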
- 5. The method of claim 1 or 4, wherein dynamically adjusting the layout labels in combination with the computation constraints to generate a layout path combination and a set of potentially optimal primitives that satisfy the multi-core computation requirements comprises: generating all feasible layout combinations of each target operator according to the computation constraints of that operator and the layout labels of the corresponding tensors; traversing the feasible layout combinations of all target operators to generate a plurality of cross-operator layout paths, and screening out the layout path combinations that satisfy the multi-core computation requirements; and performing multi-dimensional performance analysis on the layout path combinations, and retaining a set of potentially optimal primitives.
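The enumerate-screen-retain steps of claim 5 can be sketched as follows (illustrative only; the patent's multi-dimensional performance analysis is collapsed here into a single scalar cost, and all names are assumptions):

```python
from itertools import product

def screen_layout_paths(per_op_layouts, is_feasible, cost, keep=3):
    # per_op_layouts[i]: feasible layout labels of the i-th operator.
    # Enumerate cross-operator paths, screen out infeasible combinations,
    # and retain the cheapest few as the potential-optimal set.
    paths = [p for p in product(*per_op_layouts) if is_feasible(p)]
    return sorted(paths, key=cost)[:keep]
```

For example, requiring adjacent operators to share a layout (no conversion) and penalizing a hypothetical "col" layout keeps only the matching paths, cheapest first.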
- 6. The method of claim 1, wherein mapping the layout labels of the target path onto the corresponding tensors so that each target operator, when executed, activates a corresponding primitive according to the layout label comprises: binding the expected layout labels of the target path to the compile-time metadata of the corresponding tensors; checking, when a target operator is executed, whether the current layout label of a tensor is consistent with the expected layout label; and if so, activating the corresponding primitive according to the expected layout label, and invoking the core computation logic of the primitive to generate the final result.
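The check-then-dispatch behavior of claim 6 can be sketched as follows (a minimal illustration; the primitive table, metadata dictionary, and error handling are all hypothetical):

```python
# Hypothetical primitive table keyed by layout label; real primitives would
# run layout-specific multi-core kernels.
PRIMITIVES = {
    "row_split": lambda a, b: [x + y for x, y in zip(a, b)],
    "col_split": lambda a, b: [x + y for x, y in zip(a, b)],
}

def run_operator(expected_label, tensor_meta, a, b):
    # At compile time the expected label is bound into the tensor metadata;
    # at run time the current label is checked against it before dispatch.
    if tensor_meta["layout"] != expected_label:
        raise ValueError("layout mismatch: tensor must be re-laid-out first")
    # Consistent: activate the primitive for this label and invoke its
    # core computation logic.
    return PRIMITIVES[expected_label](a, b)
```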
- 7. A task allocation apparatus for a multi-core processor, the apparatus comprising: a building module, configured to establish the computation constraints of each target operator in multi-core computation and to set an initial network state; a generating module, configured to dynamically propagate the layout labels of tensors according to the execution order of the target operators and to dynamically adjust the layout labels in combination with the computation constraints of the current operator, so as to generate a layout path combination and a set of potentially optimal primitives that satisfy the multi-core computation requirements, wherein each layout path represents the cooperative mapping between a tensor layout-conversion sequence and a target-operator execution sequence; a searching module, configured to construct a network graph of the multi-core computation from the layout path combination and to search the network graph for a target path with the PathFinder algorithm, wherein the target path is a candidate path of relatively low cost found in the network graph by the PathFinder algorithm; and an execution module, configured to map the layout labels of the target path onto the corresponding tensors, so that the corresponding primitives are activated according to the layout labels when each target operator executes.
- 8. An electronic device comprising a memory for storing a computer program and a processor for executing the computer program stored in the memory to cause the processor to perform the steps of the method according to any one of claims 1 to 6.
- 9. A computer readable storage medium, characterized in that it has stored thereon a program which, when executed by a processor, is adapted to carry out the steps of the method according to any of claims 1 to 6.
- 10. A computer program product comprising computer program code means which, when run on a computer, causes the computer to carry out the steps of the method according to any one of claims 1 to 6.
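The lowest-cost path search over the layout network graph in claims 1 and 7 can be illustrated with a layered shortest-path sketch (a Dijkstra-style stand-in, not the actual PathFinder negotiated-congestion algorithm; the cost model and all names are assumptions):

```python
import heapq

def min_cost_layout_path(layers, edge_cost):
    # layers[i]: candidate layout labels for the i-th operator in execution
    # order; edge_cost(u, v): cost of moving from label u to label v
    # (layout conversion plus compute).  Returns (total cost, label path).
    heap = [(0, [label]) for label in layers[0]]
    heapq.heapify(heap)
    seen = set()
    while heap:
        cost, path = heapq.heappop(heap)
        state = (len(path) - 1, path[-1])
        if state in seen:
            continue  # already expanded this (operator, label) node cheaper
        seen.add(state)
        if len(path) == len(layers):
            return cost, tuple(path)  # reached the last operator
        for nxt in layers[len(path)]:
            heapq.heappush(heap, (cost + edge_cost(path[-1], nxt), path + [nxt]))
    return None
```

With a cost function that penalizes layout conversions, the search naturally prefers paths that keep adjacent operators on the same layout.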
Description
Task allocation method, device, equipment, medium and product of multi-core processor

Technical Field

The present invention relates to the field of computer multi-core processors, and in particular to a task allocation method, apparatus, device, medium and product for a multi-core processor.

Background

Although automatic data-flow generation technology has made important progress in multi-core task division, scheduling strategies, load balancing, fusion with intelligent algorithms, energy-efficiency optimization and related areas, the current technology still faces the following core problems: the task model is over-simplified while algorithm complexity is too high; load balancing exhibits obvious uncertainty; hardware support for energy-efficiency optimization is insufficient and the energy-consumption model is not comprehensive enough; the serialization problem is difficult to solve; and challenges remain in hardware-environment adaptation and software-application compatibility. Accordingly, there is a need for an efficient task allocation method for multi-core processors that addresses the above issues.

Disclosure of Invention

In view of the foregoing, the present invention has been developed to provide a task allocation method, apparatus, device, medium, and product for a multi-core processor that overcome, or at least partially solve, the foregoing problems.
To achieve the above and other related objects, the present invention provides a task allocation method for a multi-core processor, the method including: establishing the computation constraints of each target operator in multi-core computation, and setting an initial network state; dynamically propagating the layout labels of tensors according to the execution order of the target operators, and dynamically adjusting the layout labels in combination with the computation constraints of the current operator, so as to generate a layout path combination and a set of potentially optimal primitives that satisfy the multi-core computation requirements, wherein each layout path represents the cooperative mapping between a tensor layout-conversion sequence and a target-operator execution sequence; constructing a network graph of the multi-core computation from the layout path combination, and searching the network graph for a target path with the PathFinder algorithm, wherein the target path is a candidate path of relatively low cost found in the network graph by the PathFinder algorithm; and mapping the layout labels of the target path onto the corresponding tensors, so that the corresponding primitives are activated according to the layout labels when each target operator is executed. Optionally, establishing the computation constraints of each target operator in the multi-core computation includes: constructing a three-level segmentation constraint for binary operators in the multi-core computation, wherein the three-level segmentation constraint means that the first input matrix, the second input matrix and the output matrix of a binary operator are partitioned along the same data dimension with the same block size and then distributed directly to the multi-core processor for parallel computation.
Optionally, establishing the computation constraints of each target operator in the multi-core computation includes: constructing a first computation constraint or a second computation constraint for matrix multiplication operators in the multi-core computation, wherein the first computation constraint keeps a complete copy of the first input matrix of the matrix multiplication operator on every core, partitions the second input matrix and the output matrix along their column dimension with the same block size, and distributes them to the multi-core processor for parallel computation; and the second computation constraint keeps a complete copy of the second input matrix on every core, partitions the first input matrix and the output matrix along their row dimension with the same block size, and distributes them to the multi-core processor for parallel computation. Optionally, dynamically propagating the layout labels of tensors according to the execution order of the target operators includes: according to the execution order of the target operators in the network, propagating the output-tensor layout label of a preceding target operator along the data-dependence chain to the subsequent target operators that depend on that output tensor. Optionally, dynamically adjusting the layout labels in combination with the computation constraints to generate a layout path combination and a set of potential optimal primitives that meet the multi