CN-121981202-A - Parallel processing method and device, electronic equipment and storage medium
Abstract
Embodiments of this application disclose a parallel processing method and apparatus, an electronic device, and a storage medium, used to improve model execution efficiency and processor hardware utilization. The parallel processing method includes: performing segmentation processing on input data in a target model to obtain a first slice data set and a second slice data set; processing the first slice data set and the second slice data set through a parallel network module respectively to obtain a first branch sequence and a second branch sequence; determining, by dynamic programming, an inter-branch parallel operator set from the first branch sequence and the second branch sequence, where the inter-branch parallel operator set includes a first operator from the first branch sequence and a second operator from the second branch sequence; and executing the first operator and the second operator in parallel to obtain an operator execution result.
Inventors
- Wang Kaisheng
- Qin Tiancheng
- Su Teng
- Yu Fan
Assignees
- Huawei Technologies Co., Ltd. (华为技术有限公司)
Dates
- Publication Date: 2026-05-05
- Application Date: 2024-10-31
Claims (14)
- 1. A method of parallel processing, the method comprising: performing segmentation processing on input data in a target model to obtain a first slice data set and a second slice data set; processing the first slice data set and the second slice data set through a parallel network module respectively to obtain a first branch sequence and a second branch sequence, wherein the first branch sequence and the second branch sequence each comprise a plurality of operators; determining an inter-branch parallel operator set from the first branch sequence and the second branch sequence by dynamic programming, wherein the inter-branch parallel operator set comprises a first operator from the first branch sequence and a second operator from the second branch sequence; and executing the first operator and the second operator in parallel to obtain an operator execution result.
- 2. The method of claim 1, further comprising: culling commonly accessed operators from the first branch sequence and the second branch sequence to obtain a first branch sequence after operator culling and a second branch sequence after operator culling; wherein the determining, by dynamic programming, an inter-branch parallel operator set from the first branch sequence and the second branch sequence comprises: determining an inter-branch parallel operator set, by dynamic programming, from the first branch sequence after operator culling and the second branch sequence after operator culling.
- 3. The method of claim 2, wherein the culling commonly accessed operators from the first branch sequence and the second branch sequence comprises: starting from a merge operator inserted into the target model, performing branch label propagation on the first branch sequence and the second branch sequence, and stopping the branch label propagation at a split operator inserted into the target model.
- 4. The method according to any one of claims 1 to 3, further comprising: merging consecutive operators of the same type within each of the first branch sequence and the second branch sequence to obtain a first branch sequence after operator merging and a second branch sequence after operator merging; wherein the determining, by dynamic programming, an inter-branch parallel operator set from the first branch sequence and the second branch sequence comprises: determining an inter-branch parallel operator set, by dynamic programming, from the first branch sequence after operator merging and the second branch sequence after operator merging.
- 5. The method of claim 4, wherein the merging consecutive operators of the same type within each of the first branch sequence and the second branch sequence comprises: acquiring the running duration of the operators in each of the first branch sequence and the second branch sequence; acquiring the sum of the running durations of consecutive operators of the same type in each of the first branch sequence and the second branch sequence; and merging the consecutive operators of the same type in each of the first branch sequence and the second branch sequence when the sum of the running durations is smaller than a first duration threshold.
- 6. The method of any of claims 1 to 5, wherein the determining an inter-branch parallel operator set from the first branch sequence and the second branch sequence by dynamic programming comprises: acquiring a first hiding duration of communication and computation between the first (i-1) operators in the first branch sequence and the first j operators in the second branch sequence, wherein i and j are positive integers; acquiring a second hiding duration of communication and computation between the first i operators in the first branch sequence and the first (j-1) operators in the second branch sequence; acquiring a third hiding duration of communication and computation between the first (i-1) operators in the first branch sequence and the first (j-1) operators in the second branch sequence, and a fourth hiding duration of communication and computation between the i-th operator in the first branch sequence and the j-th operator in the second branch sequence; obtaining a fifth hiding duration according to the first hiding duration, the second hiding duration, the third hiding duration and the fourth hiding duration, wherein the fifth hiding duration is the hiding duration of communication and computation between the first i operators in the first branch sequence and the first j operators in the second branch sequence; taking the choice that maximizes the fifth hiding duration as the optimal path of the dynamic programming, and determining the i-th operator in the first branch sequence as the first operator and the j-th operator in the second branch sequence as the second operator; and determining that the first operator and the second operator belong to the inter-branch parallel operator set.
- 7. The method of claim 6, wherein the fourth hiding duration is 0 if the operator types of the i-th operator and the j-th operator are the same.
- 8. The method of any of claims 1 to 7, wherein the executing the first operator and the second operator in parallel comprises: adding a first control edge between the first operator and the second operator according to the inter-branch parallel operator set; and scheduling the first operator and the second operator, according to the first control edge, to complete parallel execution.
- 9. The method according to any one of claims 1 to 8, further comprising: merging the operator execution results through a merge operator inserted into the target model to obtain a merged result.
- 10. The method according to any one of claims 1 to 9, wherein the performing segmentation processing on the input data in the target model comprises: performing segmentation processing on the input data in the target model through a split operator inserted into the target model.
- 11. A parallel processing apparatus, the apparatus comprising: a segmentation module, configured to perform segmentation processing on input data in a target model to obtain a first slice data set and a second slice data set; a data processing module, configured to process the first slice data set and the second slice data set through a parallel network module respectively to obtain a first branch sequence and a second branch sequence, wherein the first branch sequence and the second branch sequence each comprise a plurality of operators; an inter-branch operator merging module, configured to determine an inter-branch parallel operator set from the first branch sequence and the second branch sequence by dynamic programming, wherein the inter-branch parallel operator set comprises a first operator from the first branch sequence and a second operator from the second branch sequence; and a parallel execution module, configured to execute the first operator and the second operator in parallel to obtain an operator execution result.
- 12. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 10.
- 13. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of claims 1 to 10.
- 14. A chip comprising one or more interface circuits and one or more processors, the interface circuits to receive signals from a memory of an electronic device and to send the signals to the processor, the signals comprising computer instructions stored in the memory, which when executed by the processor, cause the electronic device to perform the method of any one of claims 1 to 10.
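The dynamic programming of claims 6 and 7 can be sketched as a standard two-sequence DP. The sketch below is illustrative, not the patent's implementation: operators are modeled as hypothetical `(kind, duration)` tuples, and for a mixed communication/computation pair the hidden duration is assumed to be the smaller of the two durations (the patent does not fix this cost model; per claim 7 a same-kind pair hides nothing).

```python
def inter_branch_dp(branch_a, branch_b):
    """Select cross-branch operator pairs maximizing the total hiding
    duration (claim 6). Each operator is a (kind, duration) tuple with
    kind 'comm' or 'comp'. Assumption: a mixed pair hides the smaller of
    the two durations; a same-kind pair hides 0 (claim 7)."""
    def pair_hide(a, b):
        return 0.0 if a[0] == b[0] else min(a[1], b[1])

    m, n = len(branch_a), len(branch_b)
    # hide[i][j]: max hiding duration over the first i ops of branch A and
    # the first j ops of branch B (the "fifth hiding duration").
    hide = [[0.0] * (n + 1) for _ in range(m + 1)]
    take = [[False] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            inc = pair_hide(branch_a[i - 1], branch_b[j - 1])  # fourth duration
            paired = hide[i - 1][j - 1] + inc                  # third + fourth
            hide[i][j] = max(hide[i - 1][j],                   # first duration
                             hide[i][j - 1],                   # second duration
                             paired)
            take[i][j] = inc > 0 and paired >= hide[i][j]
    # Backtrack the optimal path to recover the inter-branch parallel pairs.
    pairs, i, j = set(), m, n
    while i > 0 and j > 0:
        if take[i][j]:
            pairs.add((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif hide[i - 1][j] >= hide[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs, hide[m][n]
```

For example, pairing a 4-unit A2A communication against a 5-unit matmul, then a 3-unit matmul against a 2-unit communication, hides 4 + 2 = 6 units in total.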
Description
Parallel processing method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular to a parallel processing method and device, an electronic device, and a storage medium.

Background

Large-scale models in deep learning have reached the trillion-parameter scale. Current artificial intelligence algorithms rely on neural networks such as GPT-3 and GPT-4, whose parameter counts run to the trillion scale. However, large-scale models have become extremely expensive because of their enormous computational cost. To address this problem, deep learning research has turned to the Mixture of Experts (MoE) architecture, which provides a way to increase the number of model parameters without increasing the computational cost. Because the overall parameter count of a MoE-architecture model is large, the model must be partitioned through expert parallelism. However, the expert-parallel strategy introduces a time-consuming All-to-All (A2A) communication operator, and as the cluster size grows, the proportion of time spent in A2A communication keeps increasing and becomes a bottleneck of model execution. Adopting more efficient communication protocols and communication networks alleviates the heavy computational load of large models to some extent, but model execution efficiency and processor hardware utilization remain low.

Disclosure of Invention

Embodiments of this application provide a parallel processing method and apparatus, an electronic device, and a storage medium, used to improve model execution efficiency and processor hardware utilization.
In order to solve the above technical problems, embodiments of this application provide the following technical solutions. In a first aspect, an embodiment of this application provides a parallel processing method, which includes: performing segmentation processing on input data in a target model to obtain a first slice data set and a second slice data set; processing the first slice data set and the second slice data set through a parallel network module to obtain a first branch sequence and a second branch sequence, where the first branch sequence and the second branch sequence each include a plurality of operators; determining an inter-branch parallel operator set from the first branch sequence and the second branch sequence by dynamic programming, where the inter-branch parallel operator set includes a first operator from the first branch sequence and a second operator from the second branch sequence; and executing the first operator and the second operator in parallel to obtain an operator execution result. In this solution, the input data in the target model is segmented to obtain the first slice data set and the second slice data set, and the two slice data sets are processed separately through the parallel network module to form two branch sequences that can be executed in parallel. Because most operators across the branch sequences have no dependency relationship, the first operator and the second operator, which have no dependency on each other, are scheduled in parallel, which improves the computation utilization of the processor and reduces the time consumed by training and inference of the target model.
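The overall flow of the first aspect can be sketched as follows. This is a minimal, framework-agnostic illustration using Python threads to stand in for parallel scheduling on a real accelerator; the `split` and `merge` callables and the position-wise pairing of branch operators are assumptions for the sketch, not the patent's implementation (which selects pairs by dynamic programming).

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_process(input_data, split, branch_a_ops, branch_b_ops, merge):
    """Sketch of the first-aspect flow: split the input, run the two
    branch sequences with paired operators executing in parallel, then
    merge the per-branch results."""
    # Step 1: segment the input into two slice data sets
    # (claim 10: via a split operator inserted into the model).
    data_a, data_b = split(input_data)
    # Steps 2-4: execute paired operators from the two branches in
    # parallel (here paired position-wise for brevity).
    with ThreadPoolExecutor(max_workers=2) as pool:
        for op_a, op_b in zip(branch_a_ops, branch_b_ops):
            fut_a = pool.submit(op_a, data_a)
            fut_b = pool.submit(op_b, data_b)
            data_a, data_b = fut_a.result(), fut_b.result()
    # Final step: merge the operator execution results
    # (claim 9: via a merge operator inserted into the model).
    return merge(data_a, data_b)
```

For example, splitting `[1, 2, 3, 4]` in half, doubling each slice on its own branch, and concatenating the results yields `[2, 4, 6, 8]`.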
In a possible implementation of the first aspect, the method further includes culling commonly accessed operators from the first branch sequence and the second branch sequence to obtain a first branch sequence after operator culling and a second branch sequence after operator culling; and the determining, by dynamic programming, an inter-branch parallel operator set from the first branch sequence and the second branch sequence includes determining an inter-branch parallel operator set, by dynamic programming, from the first branch sequence after operator culling and the second branch sequence after operator culling. In this solution, the commonly accessed operators are removed from the two branch sequences, so that operators with dependency relationships are excluded; this ensures that no dependency exists when operators from the two branch sequences are executed in parallel, and avoids errors in model execution. In a possible implementation of the first aspect, the culling commonly accessed operators from the first branch sequence and the second branch sequence includes: starting from a merge operator inserted into the target model, performing branch label propagation on the first branch sequence and the second branch sequence, and stopping the branch label propagation at a split operator inserted into the target model.
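The culling step above can be sketched as backward label propagation over the operator graph. The `preds` adjacency map and the string labels are hypothetical; the patent does not specify a graph representation.

```python
def cull_common_ops(preds, merge_op, split_op):
    """Propagate a branch label backwards from each input of the inserted
    merge operator, stopping at the inserted split operator. Operators
    reached from both branches are 'commonly accessed' and are culled
    from inter-branch pairing."""
    labels = {}  # operator -> set of branch labels that reached it

    def propagate(op, label):
        if op == split_op or label in labels.get(op, set()):
            return
        labels.setdefault(op, set()).add(label)
        for pred in preds.get(op, []):
            propagate(pred, label)

    # The merge operator's two inputs are the tails of the two branches.
    root_a, root_b = preds[merge_op]
    propagate(root_a, 'A')
    propagate(root_b, 'B')
    common = {op for op, ls in labels.items() if len(ls) == 2}
    only_a = {op for op, ls in labels.items() if ls == {'A'}}
    only_b = {op for op, ls in labels.items() if ls == {'B'}}
    return common, only_a, only_b
```

An operator reachable from both merge inputs (for example, a shared embedding lookup feeding both branches) ends up with both labels and is excluded, so the remaining per-branch operators are guaranteed dependency-free across branches.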