
US-12619408-B2 - Method and apparatus for pipeline parallelism compiling

US 12619408 B2

Abstract

A method for pipeline parallelism compiling is provided, which is executed by one or more processors, and includes receiving a source program associated with training of a machine learning model, determining, based on the source program, a plurality of operation groups including operations executed on training data of the machine learning model, generating a plurality of micro-batches from the training data, and determining, for each of the plurality of micro-batches, a plurality of operation sets corresponding to the plurality of operation groups.
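
For illustration only (this is not part of the patent text), the following minimal Python sketch mirrors the flow described in the abstract: the training data is split into micro-batches, and one operation set is produced per operation group for each micro-batch. All identifiers (OperationGroup, split_micro_batches, build_operation_sets) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class OperationGroup:
    name: str
    ops: List[str]  # symbolic operations derived from the source program

def split_micro_batches(training_data: List[int], num_micro_batches: int) -> List[List[int]]:
    """Partition the training data into roughly equal micro-batches."""
    size = (len(training_data) + num_micro_batches - 1) // num_micro_batches
    return [training_data[i:i + size] for i in range(0, len(training_data), size)]

def build_operation_sets(groups: List[OperationGroup],
                         micro_batches: List[List[int]]) -> List[Dict[str, dict]]:
    """For each micro-batch, create one operation set per operation group."""
    return [
        {g.name: {"ops": g.ops, "micro_batch": mb} for g in groups}
        for mb in micro_batches
    ]

groups = [OperationGroup("fwd_stage0", ["matmul", "relu"]),
          OperationGroup("fwd_stage1", ["matmul", "softmax"])]
micro_batches = split_micro_batches(list(range(32)), num_micro_batches=4)
operation_sets = build_operation_sets(groups, micro_batches)
print(len(operation_sets), "micro-batches x", len(groups), "operation groups each")
```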

Inventors

  • Gangwon Jo
  • Jungho Park

Assignees

  • MOREH CORP.

Dates

Publication Date
2026-05-05
Application Date
2024-01-30
Priority Date
2023-03-06

Claims (20)

  1. A method executed by one or more processors, the method comprising: receiving a source program associated with training of a machine learning model; receiving, based on the source program, a plurality of operation groups comprising operations executed on training data of the machine learning model, wherein the plurality of operation groups comprises: one or more forward propagation operation groups associated with a forward propagation process of the training data of the machine learning model, and one or more backward propagation operation groups associated with a backward propagation process of the training data of the machine learning model; generating a plurality of batches from the training data of the machine learning model, wherein each batch of the plurality of batches corresponds to a portion of the training data of the machine learning model; outputting, for each batch of the plurality of batches, a plurality of operation sets corresponding to the plurality of operation groups comprising: one or more forward propagation operation sets associated with the one or more forward propagation operation groups, and one or more backward propagation operation sets associated with the one or more backward propagation operation groups; determining an accelerator of a plurality of accelerators allocated to a corresponding operation set of the plurality of operation sets; determining a processing sequence of the corresponding operation set of the plurality of operation sets such that each operation set of the plurality of operation sets is processed in one accelerator of the plurality of accelerators; and executing a sequential process of each forward propagation operation set of a plurality of forward propagation operation sets associated with a specific batch of the plurality of batches in different accelerators of the plurality of accelerators according to a first order.
  2. The method according to claim 1, further comprising: executing a sequential process of each backward propagation operation set of a plurality of backward propagation operation sets associated with the specific batch of the plurality of batches in the different accelerators of the plurality of accelerators according to a reverse order of the first order.
  3. The method according to claim 1, further comprising: determining a processing time of the one or more backward propagation operation sets associated with the specific batch of the plurality of batches that starts after completion of processing of each forward propagation operation set of the one or more forward propagation operation sets associated with the specific batch of the plurality of batches is completed.
  4. The method according to claim 1, further comprising: determining that the one or more forward propagation operation sets and the one or more backward propagation operation sets associated with the plurality of batches are to be processed to cross each other a maximum number of times in the plurality of accelerators.
  5. The method according to claim 1, wherein the determining the plurality of operation groups comprises: determining a plurality of processing times for the plurality of operation groups for same training data of the machine learning model; and determining the plurality of operation groups such that a difference between each processing time of the plurality of processing times is less than a predetermined threshold value.
  6. The method according to claim 1, wherein the generating the plurality of batches comprises: determining a plurality of processing times for the plurality of operation sets associated with the plurality of batches; and generating the plurality of batches such that a difference between each processing time of the plurality of processing times is less than a predetermined threshold value.
  7. The method according to claim 1, wherein a size of each batch of the plurality of batches is based on a quantity of accelerators of the plurality of accelerators.
  8. A computing device comprising: one or more processors; and a memory storing instructions that, when executed by one or more processors, cause the computing device to: receive a source program associated with training of a machine learning model; receive, based on the source program, a plurality of operation groups comprising operations executed on training data of the machine learning model, wherein the plurality of operation groups comprises: one or more forward propagation operation groups associated with a forward propagation process of the training data of the machine learning model, and one or more backward propagation operation groups associated with a backward propagation process of the training data of the machine learning model; generate a plurality of batches from the training data of the machine learning model, wherein each batch of the plurality of batches corresponds to a portion of the training data of the machine learning model; output, for each batch of the plurality of batches, a plurality of operation sets corresponding to the plurality of operation groups comprising: one or more forward propagation operation sets associated with the one or more forward propagation operation groups, and one or more backward propagation operation sets associated with the one or more backward propagation operation groups; determine an accelerator of a plurality of accelerators allocated to a corresponding operation set of the plurality of operation sets; determine a processing sequence of the corresponding operation set of the plurality of operation sets such that each operation set of the plurality of operation sets is processed in one accelerator of the plurality of accelerators; and execute a sequential process of each forward propagation operation set of a plurality of forward propagation operation sets associated with a specific batch of the plurality of batches in different accelerators of the plurality of accelerators according to a first order.
  9. The computing device according to claim 8, wherein the instructions, when executed by the one or more processors, further cause the computing device to: execute a sequential process of each backward propagation operation set of a plurality of backward propagation operation sets associated with the specific batch of the plurality of batches in the different accelerators of the plurality of accelerators according to a reverse order of the first order.
  10. The computing device according to claim 8, wherein the instructions, when executed by the one or more processors, further cause the computing device to: determine a processing time of the one or more backward propagation operation sets associated with the specific batch of the plurality of batches that starts after completion of processing of each forward propagation operation set of the one or more forward propagation operation sets associated with the specific batch of the plurality of batches is completed.
  11. The computing device according to claim 8, wherein the instructions, when executed by the one or more processors, further cause the computing device to: determine that the one or more forward propagation operation sets and the one or more backward propagation operation sets associated with the plurality of batches are to be processed to cross each other a maximum number of times in the plurality of accelerators.
  12. The computing device according to claim 8, wherein the instructions, when executed by the one or more processors, further cause the computing device to determine the plurality of operation groups by causing the computing device to: determine a plurality of processing times for the plurality of operation groups for same training data of the machine learning model; and determine the plurality of operation groups such that a difference between each processing time of the plurality of processing times is less than a predetermined threshold value.
  13. The computing device according to claim 8, wherein the instructions, when executed by the one or more processors, further cause the computing device to generate the plurality of batches by causing the computing device to: determine a plurality of processing times for the plurality of operation sets associated with the plurality of batches; and generate the plurality of batches such that a difference between each processing time of the plurality of processing times is less than a predetermined threshold value.
  14. The computing device according to claim 8, wherein a size of each batch of the plurality of batches is based on a quantity of accelerators of the plurality of accelerators.
  15. One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors of a computing device, cause the computing device to: receive a source program associated with training of a machine learning model; receive, based on the source program, a plurality of operation groups comprising operations executed on training data of the machine learning model, wherein the plurality of operation groups comprises: one or more forward propagation operation groups associated with a forward propagation process of the training data of the machine learning model, and one or more backward propagation operation groups associated with a backward propagation process of the training data of the machine learning model; generate a plurality of batches from the training data of the machine learning model, wherein each batch of the plurality of batches corresponds to a portion of the training data of the machine learning model; output, for each batch of the plurality of batches, a plurality of operation sets corresponding to the plurality of operation groups comprising: one or more forward propagation operation sets associated with the one or more forward propagation operation groups, and one or more backward propagation operation sets associated with the one or more backward propagation operation groups; determine an accelerator of a plurality of accelerators allocated to a corresponding operation set of the plurality of operation sets; determine a processing sequence of the corresponding operation set of the plurality of operation sets such that each operation set of the plurality of operation sets is processed in one accelerator of the plurality of accelerators; and execute a sequential process of each forward propagation operation set of a plurality of forward propagation operation sets associated with a specific batch of the plurality of batches in different accelerators of the plurality of accelerators according to a first order.
  16. The one or more non-transitory computer-readable media according to claim 15, wherein the instructions, when executed by the one or more processors of the computing device, further cause the computing device to: execute a sequential process of each backward propagation operation set of a plurality of backward propagation operation sets associated with the specific batch of the plurality of batches in the different accelerators of the plurality of accelerators according to a reverse order of the first order.
  17. The one or more non-transitory computer-readable media according to claim 15, wherein the instructions, when executed by the one or more processors of the computing device, further cause the computing device to: determine a processing time of the one or more backward propagation operation sets associated with the specific batch of the plurality of batches that starts after completion of processing of each forward propagation operation set of the one or more forward propagation operation sets associated with the specific batch of the plurality of batches is completed.
  18. The one or more non-transitory computer-readable media according to claim 15, wherein the instructions, when executed by the one or more processors of the computing device, further cause the computing device to: determine that the one or more forward propagation operation sets and the one or more backward propagation operation sets associated with the plurality of batches are to be processed to cross each other a maximum number of times in the plurality of accelerators.
  19. The one or more non-transitory computer-readable media according to claim 15, wherein the instructions, when executed by the one or more processors of the computing device, further cause the computing device to determine the plurality of operation groups by causing the computing device to: determine a plurality of processing times for the plurality of operation groups for same training data of the machine learning model; and determine the plurality of operation groups such that a difference between each processing time of the plurality of processing times is less than a predetermined threshold value.
  20. The one or more non-transitory computer-readable media according to claim 15, wherein the instructions, when executed by the one or more processors of the computing device, further cause the computing device to generate the plurality of batches by causing the computing device to: determine a plurality of processing times for the plurality of operation sets associated with the plurality of batches; and generate the plurality of batches such that a difference between each processing time of the plurality of processing times is less than a predetermined threshold value.
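
Claims 5 and 6 above recite choosing the operation groups and batches so that the differences between their processing times stay below a predetermined threshold. The Python sketch below is one hedged illustration of such balancing, not the patented algorithm: it greedily splits per-operation cost estimates into contiguous groups whose total costs are roughly equal (the names partition_into_groups and op_costs are assumptions).

```python
from typing import Dict, List

def partition_into_groups(op_costs: Dict[str, float], num_groups: int) -> List[List[str]]:
    """Split operations (kept in program order) into num_groups contiguous groups
    whose summed costs approximate total_cost / num_groups."""
    total = sum(op_costs.values())
    target = total / num_groups
    groups: List[List[str]] = []
    current: List[str] = []
    acc = 0.0
    for op, cost in op_costs.items():
        current.append(op)
        acc += cost
        # Close the group once it reaches the per-group target,
        # keeping at least one group available for the remaining operations.
        if acc >= target and len(groups) < num_groups - 1:
            groups.append(current)
            current, acc = [], 0.0
    groups.append(current)
    return groups

costs = {"embed": 1.0, "layer0": 3.0, "layer1": 3.0, "layer2": 3.0, "head": 2.0}
print(partition_into_groups(costs, num_groups=3))
# [['embed', 'layer0'], ['layer1', 'layer2'], ['head']]
```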

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority under 35 U.S.C. § 119 to Korean Patent Application Nos. 10-2023-0029483 and 10-2023-0087162, filed in the Korean Intellectual Property Office on Mar. 6, 2023 and Jul. 5, 2023, respectively, the entire contents of which are hereby incorporated by reference.

BACKGROUND

Technical Field

The present disclosure relates to a method and apparatus for pipeline parallelism compiling, and more specifically, to a method and apparatus for parallel processing of a plurality of operation sets corresponding to a plurality of operation groups, the operation groups including operations executed on training data of a machine learning model based on a source program, for each of a plurality of micro-batches generated from the training data.

Description of the Related Art

A compiler is a language translation program that converts code written in a specific programming language into another language (e.g., machine language) that can be read by a computer processor. A typical compiler performs this conversion by sequentially analyzing the vocabulary, syntax, and semantics of a source program, generating an intermediate representation such as intermediate code, optimizing the code, and then generating object code. In the field of compiler technology, advances have been made to improve the speed and efficiency of target programs by optimizing this conversion process.

Meanwhile, training deep learning models requires considerable computing resources. Parallel computing is widely used to overcome the limitations in data processing speed and available memory when training a model on a single device. Parallel computing is a computing method in which multiple processing units work on a problem at the same time to quickly complete a given task. It is widely used in fields that require high-performance computing, complex problem solving, and large amounts of data processing (e.g., machine learning, image processing, etc.), and it is regarded as one of the most powerful paradigms in computer architecture.

However, existing techniques for training models with parallel computing, such as those used with PyTorch, require additional effort during programming: the user must directly determine each stage or micro-batch of the pipeline and explicitly insert communication between devices. A library may be used to reduce this effort, but the user still has the inconvenience of performing additional settings that account for the current system.

SUMMARY

In order to solve the problems described above, the present disclosure provides a method, recording medium, and system (apparatus) for pipeline parallelism compiling. The present disclosure may be implemented in a variety of ways, including a method, a system (device), or a computer program stored in a readable storage medium.
A method for pipeline parallelism compiling is provided, which may be executed by one or more processors and may include receiving a source program associated with training of a machine learning model, determining, based on the source program, a plurality of operation groups including operations executed on training data of the machine learning model, generating a plurality of micro-batches from the training data, and determining, for each of the plurality of micro-batches, a plurality of operation sets corresponding to the plurality of operation groups.

The method may further include determining an accelerator allocated with each of the plurality of operation sets and a processing sequence thereof such that each of the plurality of operation sets is processed in one of a plurality of accelerators.

The plurality of operation groups may include one or more forward propagation operation groups associated with a forward propagation process of the training data, and one or more backward propagation operation groups associated with a backward propagation process, and the plurality of operation sets may include one or more forward propagation operation sets associated with the one or more forward propagation operation groups, and one or more backward propagation operation sets associated with the one or more backward propagation operation groups.

The determining the accelerator allocated with each of the plurality of operation sets and the processing sequence thereof may include determining that each of a plurality of forward propagation operation sets associated with a specific micro-batch be sequentially processed in different accelerators from each other.

The determining the accelerator allocated with each of the plurality of operation sets and the processing sequence thereof may further include determining that each of a plurality of backward propagation operation sets associated with the specific micro-batch be sequentially processed in the different accelerators according to a reverse order of the sequence in which the forward propagation operation sets are processed.
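
The scheduling described above, in which a micro-batch's forward propagation operation sets are processed sequentially on different accelerators in a first order and its backward propagation operation sets are processed in the reverse order, can be pictured with the minimal Python sketch below. The sketch is an illustrative assumption only: it uses a simplified GPipe-style ordering, assumes each operation set takes one time step, and ignores communication; the name pipeline_schedule is hypothetical and does not represent the claimed compiler.

```python
from typing import List, Tuple

def pipeline_schedule(num_accelerators: int,
                      num_micro_batches: int) -> List[Tuple[int, str, int, int]]:
    """Return (time_step, phase, micro_batch, accelerator) events for a simplified
    pipeline: forward sets flow through accelerators 0..N-1, backward sets flow back."""
    events = []
    # Forward: micro-batch m reaches accelerator a at step m + a (the first order).
    for m in range(num_micro_batches):
        for a in range(num_accelerators):
            events.append((m + a, "F", m, a))
    # Backward sets start only after the forward pass has drained the pipeline.
    fwd_end = num_micro_batches + num_accelerators - 1
    # Backward: the same accelerators are visited in reverse of the first order.
    for m in range(num_micro_batches):
        for i, a in enumerate(reversed(range(num_accelerators))):
            events.append((fwd_end + m + i, "B", m, a))
    return sorted(events)

for step, phase, mb, acc in pipeline_schedule(num_accelerators=3, num_micro_batches=2):
    print(f"t={step}: {phase}{mb} on accelerator {acc}")
```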