
CN-121996397-A - Data processing method and device

CN121996397A

Abstract

A data processing method comprises: acquiring a plurality of different substructures, wherein each substructure indicates a data processing strategy of each computing unit and a relation among the data processing strategies of different computing units when a plurality of computing units execute a machine learning model through pipeline parallelism, and each substructure is a segment of the complete pipeline-parallel execution process of the machine learning model; determining an optimal target substructure from the plurality of substructures through linear programming; constructing a global processing strategy according to the target substructure, wherein the global processing strategy is the data processing strategy of the complete execution process; and executing the machine learning model through the plurality of computing units according to the global processing strategy. Because the solving process is performed on a segment of the global processing strategy, the solving complexity can be reduced while the solving precision is ensured.
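The select-and-tile idea summarized in the abstract can be sketched in a few lines of Python. Everything below is an illustrative assumption: the `Substructure` layout, the idle-time score, and the replacement of the linear-programming solve with exhaustive scoring are simplifications for exposition, not the patent's actual algorithm.

```python
# Hypothetical sketch: each candidate substructure is one repeatable segment
# of a pipeline-parallel schedule; the best segment is chosen and tiled
# (looped) to form the global processing strategy. The linear-programming
# step is replaced here by exhaustive scoring for brevity.
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Substructure:
    name: str
    # per computing unit: time slots, each a micro-batch id or None (idle)
    slots: Dict[str, List[Optional[int]]]

def idle_time(sub: Substructure) -> int:
    """Total number of idle slots across all computing units."""
    return sum(slot is None for row in sub.slots.values() for slot in row)

def pick_target(candidates: List[Substructure]) -> Substructure:
    """Stand-in for the linear-programming step: minimise idle time."""
    return min(candidates, key=idle_time)

def tile(sub: Substructure, repeats: int) -> Dict[str, List[Optional[int]]]:
    """Build the global strategy by executing the chosen substructure in a loop."""
    return {unit: row * repeats for unit, row in sub.slots.items()}

# Two toy candidates on two computing units.
a = Substructure("interleaved", {"unit0": [0, 1], "unit1": [None, 0]})
b = Substructure("bubbly", {"unit0": [0, None, 1], "unit1": [None, 0, None]})

target = pick_target([a, b])          # "interleaved" (1 idle slot vs 3)
global_plan = tile(target, repeats=3)  # global strategy = looped segment
```

The key point the patent makes is that the solve runs over one small segment (here, two or three time slots) rather than over the full schedule, which is what keeps the complexity down.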

Inventors

  • ZHOU HUOZHI
  • GUO KAIYANG

Assignees

  • Huawei Technologies Co., Ltd. (华为技术有限公司)

Dates

Publication Date
2026-05-08
Application Date
2024-11-06

Claims (18)

  1. A data processing method, comprising: acquiring a plurality of different substructures, wherein each substructure indicates a data processing strategy of each computing unit and a relation among the data processing strategies of different computing units when a plurality of computing units execute a machine learning model through pipeline parallelism, and each substructure is a segment of the complete pipeline-parallel execution process of the machine learning model; determining an optimal target substructure from the plurality of substructures through linear programming, and constructing a global processing strategy according to the target substructure, wherein the global processing strategy is the data processing strategy of the complete execution process; and executing the machine learning model through the plurality of computing units according to the global processing strategy.
  2. The method according to claim 1, wherein the segment is a loop unit of the complete pipeline-parallel execution process of the machine learning model, the complete execution process being implemented by executing the loop unit in a loop.
  3. The method according to claim 1 or 2, wherein the determining an optimal target substructure from the plurality of substructures through linear programming comprises: determining some of the plurality of substructures as candidates for the optimal target substructure; and determining, through linear programming, the optimal target substructure from the candidates.
  4. The method according to any one of claims 1 to 3, wherein the determining some of the plurality of substructures as candidates for the optimal target substructure comprises: determining the candidates from the plurality of substructures according to the performance of each substructure.
  5. The method of claim 4, wherein the plurality of computing units includes a first computing unit deployed with a first network layer of the machine learning model and a second computing unit deployed with a second network layer of the machine learning model, the second network layer being connected after the first network layer, and the performance comprises: in the relation among the data processing strategies of different computing units indicated by the substructure, whether data of the same batch that has been processed by the first network layer on the first computing unit is subsequently processed by the second network layer on the second computing unit.
  6. The method of claim 4, wherein the performance comprises: the amount of time that each computing unit is idle under the data processing strategy indicated for that computing unit by the substructure.
  7. The method according to any one of claims 1 to 6, wherein the determining an optimal target substructure from the plurality of substructures through linear programming comprises: determining the optimal target substructure from the plurality of substructures through mixed integer linear programming.
  8. A data processing apparatus, comprising: an acquisition module configured to acquire a plurality of different substructures, wherein each substructure indicates a data processing strategy of each computing unit and a relation among the data processing strategies of different computing units when a plurality of computing units execute a machine learning model through pipeline parallelism, and each substructure is a segment of the complete pipeline-parallel execution process of the machine learning model; an optimization module configured to determine an optimal target substructure from the plurality of substructures through linear programming, and construct a global processing strategy according to the target substructure, wherein the global processing strategy is the data processing strategy of the complete execution process; and a model execution module configured to execute the machine learning model through the plurality of computing units according to the global processing strategy.
  9. The apparatus according to claim 8, wherein the segment is a loop unit of the complete pipeline-parallel execution process of the machine learning model, the complete execution process being implemented by executing the loop unit in a loop.
  10. The apparatus according to claim 8 or 9, wherein the optimization module is specifically configured to: determine some of the plurality of substructures as candidates for the optimal target substructure; and determine, through linear programming, the optimal target substructure from the candidates.
  11. The apparatus according to any one of claims 8 to 10, wherein the optimization module is specifically configured to determine the candidates from the plurality of substructures according to the performance of each substructure.
  12. The apparatus of claim 11, wherein the plurality of computing units comprises a first computing unit deployed with a first network layer of the machine learning model and a second computing unit deployed with a second network layer of the machine learning model, the second network layer being connected after the first network layer, and the performance comprises: in the relation among the data processing strategies of different computing units indicated by the substructure, whether data of the same batch that has been processed by the first network layer on the first computing unit is subsequently processed by the second network layer on the second computing unit.
  13. The apparatus of claim 11, wherein the performance comprises: the amount of time that each computing unit is idle under the data processing strategy indicated for that computing unit by the substructure.
  14. The apparatus according to any one of claims 8 to 13, wherein the optimization module is specifically configured to determine the optimal target substructure from the plurality of substructures through mixed integer linear programming.
  15. A computer storage medium storing one or more instructions which, when executed by one or more computers, cause the one or more computers to perform the operations of the method of any one of claims 1 to 7.
  16. A computer program product comprising computer-readable instructions which, when run on a computer device, cause the computer device to perform the method of any one of claims 1 to 7.
  17. A system comprising at least one processor and at least one memory, wherein the processor and the memory are connected through a communication bus and communicate with each other; the at least one memory is configured to store code; and the at least one processor is configured to execute the code to perform the method of any one of claims 1 to 7.
  18. A chip comprising at least one processing unit and interface circuitry for providing program instructions or data to the at least one processing unit, the at least one processing unit being configured to execute the program instructions to implement the method of any one of claims 1 to 7.
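The performance indicator in claims 5 and 12 — whether data of a batch passes from the first network layer on the first computing unit to the second network layer on the second computing unit — can be checked mechanically for a candidate substructure. The function below is an illustrative assumption, not the patent's actual check; it assumes each unit's schedule is a time-indexed list in which each micro-batch id appears at most once.

```python
# Hedged sketch of the claim-5/12 dependency indicator: a substructure is
# only consistent if every micro-batch finishes the first network layer on
# the first unit before the second unit runs the second layer on it.
from typing import List, Optional

def respects_dependency(first_unit: List[Optional[int]],
                        second_unit: List[Optional[int]]) -> bool:
    """first_unit / second_unit: time-indexed lists of batch ids or None (idle)."""
    for batch in {b for b in second_unit if b is not None}:
        t_first = first_unit.index(batch)    # layer-1 time slot for this batch
        t_second = second_unit.index(batch)  # layer-2 time slot for this batch
        if t_second <= t_first:
            return False                     # layer 2 would run before layer 1 finished
    return True
```

For example, `respects_dependency([0, 1, None], [None, 0, 1])` holds, while a schedule that runs both layers of batch 0 in the same time slot does not.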

Description

Data processing method and device

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a communication system, a data processing method, and an apparatus thereof.

Background

Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate and extend human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Research in artificial intelligence covers the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. With the growth of the computing power of computing equipment, the bandwidth improvement of communication networks, and the accumulation of large-scale data, large neural network models are being researched and deployed on distributed networks for computation and processing, so as to realize applications such as large-scale parallel data processing, distributed storage, elastic topology, high redundancy, and nonlinear operation in cloud computing. How to optimally distribute the training tasks of large models to different computing cards while ensuring the stability of training is a great challenge. To address the various challenges of distributed training, many algorithms and system solutions have been proposed in industry and academia, the most important questions being in which dimensions tasks are allocated and how to achieve optimal allocation within fixed dimensions.
The mainstream distribution dimensions in industry are data parallelism, tensor parallelism, pipeline parallelism, and the like. Pipeline parallelism distributes the parameters of different layers to different computing units to address insufficient memory; moreover, compared with tensor parallelism, pipeline parallelism requires only point-to-point communication, so its communication overhead is small. When pipeline parallelism is performed, the data processing strategy of each computing unit needs to be determined in advance. The data processing strategy may include the order in which each computing unit processes each batch of data, and the order in which different computing units process the same batch of data. This strategy needs to be solved through linear optimization. In the prior art, to ensure solution accuracy, the global data processing strategy is usually solved directly, that is, the data processing strategy over the complete execution process of the machine learning model, which results in high computational complexity; as the scale of the problem increases, commercial solvers on the market may be unable to complete the solution.
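The per-unit processing order and the idle time that a data processing strategy tries to minimise can be made concrete with a toy forward-only pipeline schedule. The construction below is an illustrative assumption (a GPipe-style fill pattern), not taken from the patent text.

```python
# Illustrative toy schedule: stage s processes micro-batch m at time step
# s + m, so each unit's row gives its batch processing order, and the None
# entries are the "bubble" (idle) slots while the pipeline fills and drains.
from typing import List, Optional

def pipeline_schedule(stages: int, microbatches: int) -> List[List[Optional[int]]]:
    total_steps = stages + microbatches - 1
    schedule = []
    for s in range(stages):
        row: List[Optional[int]] = [None] * total_steps
        for m in range(microbatches):
            row[s + m] = m
        schedule.append(row)
    return schedule

def bubble_slots(schedule: List[List[Optional[int]]]) -> int:
    """Total idle slots across all stages (the cost a strategy minimises)."""
    return sum(slot is None for row in schedule for slot in row)

sched = pipeline_schedule(stages=3, microbatches=4)
# Each of the 3 stages is idle for (stages - 1) = 2 slots, giving 6 bubbles.
```

With 3 stages and 4 micro-batches, each unit spends 2 of its 6 time slots idle; the strategies the patent optimises trade off exactly this kind of bubble against the cross-unit ordering constraints for each batch.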
Disclosure of Invention

In a first aspect, the present application provides a data processing method, comprising: acquiring a plurality of different substructures, wherein each substructure indicates a data processing strategy of each computing unit and a relation among the data processing strategies of different computing units when a plurality of computing units execute a machine learning model through pipeline parallelism, and each substructure is a segment of the complete pipeline-parallel execution process of the machine learning model; determining an optimal target substructure from the plurality of substructures through linear programming, and constructing a global processing strategy according to the target substructure, wherein the global processing strategy is the data processing strategy of the complete execution process; and executing the machine learning model through the plurality of computing units according to the global processing strategy. In the embodiment of the present application, when determining the data processing strategy for executing a machine learning model through pipeline parallelism, the global strategy is not solved optimally as a whole. Considering that the global data processing strategy is often formed by stacking (for example, cycling) a plurality of units (that is, substructures), the embodiment of the present application starts from the substructures: a better substructure (the target substructure) is selected from a plurality of different substructures through optimal solving, and the machine learning model is then executed according to the global processing strategy formed from the target substructure. Because the solving process is performed on a segment of the global processing strategy, the solving complexity can be reduced under the condition of ensuring the solving precision.
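The candidate-pruning refinement described in the claims — discarding poorly performing substructures before the (mixed-integer) linear-programming solve so that the solver only sees a small shortlist — can be sketched as a simple filter. The indicator (idle-slot count) and the threshold are illustrative assumptions.

```python
# Hedged sketch of candidate pruning: before the expensive solve, keep only
# substructures whose performance indicator (here, total idle slots) is
# within a budget. Names and data are illustrative, not from the patent.
from typing import Dict, List, Optional

def prune_candidates(substructures: Dict[str, List[List[Optional[int]]]],
                     max_idle: int) -> List[str]:
    """Keep only substructures whose total idle slots are within budget."""
    kept = []
    for name, rows in substructures.items():
        idle = sum(slot is None for row in rows for slot in row)
        if idle <= max_idle:
            kept.append(name)
    return kept

candidates = {
    "tight": [[0, 1], [None, 0]],               # 1 idle slot
    "loose": [[0, None, 1], [None, 0, None]],   # 3 idle slots
}
shortlist = prune_candidates(candidates, max_idle=1)  # -> ["tight"]
```

Only the shortlist would then be handed to the mixed-integer linear program of claims 7 and 14, which is what keeps the solve tractable as the problem scale grows.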