
CN-122002164-A - Cross-data center computing and optical network joint scheduling method and device for hybrid parallel training

CN 122002164 A

Abstract

The application provides a cross-data center computing and optical network joint scheduling method and device for hybrid parallel training. According to the model structure, hybrid parallel configuration and pipeline stage division of a training task, the method uses a time model formula to determine the data parallel gradient synchronization traffic, the computation time of each pipeline stage on a single micro-batch, the tensor parallel traffic within each stage, and the pipeline parallel traffic between adjacent stages. With the objective of minimizing the total time of a single iteration, and under computing and optical network constraints, the method jointly determines the data center deployment position of each pipeline stage, the computing resources allocated to it, and the optical network path and bandwidth for cross-data center communication, and then performs deployment and optical channel establishment accordingly. The method and device address the low training efficiency caused by computing scheduling and optical network resource allocation being mutually independent and uncoordinated in cross-data center training, realize cooperative optimization of computing and optical network resources, effectively reduce training delay, and improve resource utilization efficiency.

Inventors

  • YIN SHAN
  • CAI MENGRU
  • HUANG SHANGUO
  • LIU XIAODONG
  • LI YIAN
  • HE ZIXIN

Assignees

  • Beijing University of Posts and Telecommunications (北京邮电大学)

Dates

Publication Date
2026-05-08
Application Date
2026-03-05

Claims (10)

  1. A cross-data center computing and optical network joint scheduling method for hybrid parallel training, characterized by comprising the following steps: according to the model structure, the hybrid parallel configuration and the pipeline stage division scheme of the hybrid parallel training task to be scheduled, adopting a preset time model formula to determine the data parallel gradient synchronization communication data volume corresponding to the hybrid parallel training task, the computation time of each pipeline stage on a single micro-batch, the tensor parallel communication data volume inside each pipeline stage, and the pipeline parallel communication data volume between adjacent pipeline stages; based on the computation time, the tensor parallel communication data volume, the pipeline parallel communication data volume and the data parallel gradient synchronization communication data volume, with the aim of minimizing the single-iteration total time of the hybrid parallel training task, and under the condition of meeting pre-acquired computing resource constraints and optical network resource constraints, determining a joint optimization decision result corresponding to the hybrid parallel training task, wherein the joint optimization decision result comprises the data center deployment position of each pipeline stage, the computing resources allocated to each pipeline stage, and the optical network paths and bandwidth resources allocated for cross-data center communication; and based on the joint optimization decision result, deploying the corresponding pipeline stages in each data center, allocating the computing resources, and establishing corresponding optical channels in the optical network for cross-data center communication.
  2. The cross-data center computing and optical network joint scheduling method for hybrid parallel training according to claim 1, wherein the hybrid parallel training task comprises a plurality of concurrently executed training task replicas; correspondingly, the determining a joint optimization decision result corresponding to the hybrid parallel training task based on the computation time, the tensor parallel communication data volume, the pipeline parallel communication data volume and the data parallel gradient synchronization communication data volume, with the aim of minimizing the single-iteration total time of the hybrid parallel training task and under the condition of meeting the pre-acquired computing resource constraints and optical network resource constraints, comprises: constructing respective objective functions of the training task replicas based on a preset single-iteration total time function form, according to the computation time, the tensor parallel communication data volume, the pipeline parallel communication data volume and the data parallel gradient synchronization communication data volume respectively corresponding to the training task replicas, wherein each objective function expresses the single-iteration total time of one training task replica as a function of the data center deployment positions of its pipeline stages, the computing resources allocated to each of its pipeline stages, and the optical network paths and bandwidth resources allocated for its cross-data center communication; taking as the objective the minimization of a weighted sum of the single-iteration total times of the training task replicas, wherein the weights of the weighted sum are determined according to preset priorities or preset resource requirements of the training task replicas; and, under the computing resource constraints and the optical network resource constraints, uniformly optimizing the data center deployment positions of the pipeline stages of each training task replica, the allocation of computing resources, and the optical network paths and bandwidth resources for cross-data center communication, so as to determine the joint optimization decision result corresponding to each training task replica.
  3. The cross-data center computing and optical network joint scheduling method for hybrid parallel training according to claim 2, wherein the constructing respective objective functions of each training task replica based on a preset single-iteration total time function form, according to the computation time, the tensor parallel communication data volume, the pipeline parallel communication data volume and the data parallel gradient synchronization communication data volume corresponding to each training task replica, comprises: for each pipeline stage in each training task replica, establishing a first functional relationship between the computation execution time of the pipeline stage and the computing resources allocated to the pipeline stage, according to the computation time corresponding to the pipeline stage; for each pipeline stage in each training task replica, determining the tensor parallel communication time of the pipeline stage according to the tensor parallel communication data volume corresponding to the pipeline stage; for each pair of adjacent pipeline stages in each training task replica, establishing a second functional relationship between the pipeline parallel communication time between the adjacent pipeline stages, the data center deployment positions of the adjacent pipeline stages, and the optical network paths and bandwidth resources allocated for cross-data center communication of the training task replica, according to the pipeline parallel communication data volume between the adjacent pipeline stages; for each training task replica, establishing a third functional relationship between the data parallel gradient synchronization communication time of the training task replica and the optical network paths and bandwidth resources allocated for its cross-data center communication, according to the data parallel gradient synchronization communication data volume corresponding to the training task replica; and, for each training task replica, constructing the objective function of the training task replica according to the first functional relationship, the second functional relationship, the third functional relationship and the tensor parallel communication time, in combination with the preset pipeline parallelism of the training task replica.
  4. The cross-data center computing and optical network joint scheduling method for hybrid parallel training according to claim 1, further comprising: during training of the model corresponding to the hybrid parallel training task, monitoring in real time the actual execution time of each pipeline stage and the cross-data center communication state; if the deviation between the currently monitored actual execution time and the computation time exceeds a preset threshold, and/or if the currently monitored cross-data center communication state meets a preset trigger condition, correcting the computation time according to the actual execution time, updating the optical network resource availability information according to the cross-data center communication state, and, based on the corrected computation time and the updated optical network resource availability information, redetermining a corresponding joint optimization decision result; and, based on the redetermined joint optimization decision result, adjusting at least one of the data center deployment positions of the pipeline stages of the hybrid parallel training task currently being executed, the computing resources allocated to each pipeline stage, and the optical network paths and bandwidth resources allocated for cross-data center communication, and continuing the training of the model corresponding to the hybrid parallel training task.
  5. The cross-data center computing and optical network joint scheduling method for hybrid parallel training according to claim 1, further comprising, before the determining of the data parallel gradient synchronization communication data volume corresponding to the hybrid parallel training task, the computation time of each pipeline stage on a single micro-batch, the tensor parallel communication data volume inside each pipeline stage and the pipeline parallel communication data volume between adjacent pipeline stages: acquiring a training request of the hybrid parallel training task for the model to be scheduled, wherein the training request comprises the model structure and the hybrid parallel configuration of the hybrid parallel training task; and dividing the model into a plurality of consecutive pipeline stages according to the pipeline parallelism in the hybrid parallel configuration, with the aim of minimizing the maximum computation time of any pipeline stage on a single micro-batch, so as to obtain the pipeline stage division scheme.
  6. The cross-data center computing and optical network joint scheduling method for hybrid parallel training according to claim 1, further comprising, before the determining of the joint optimization decision result corresponding to the hybrid parallel training task: acquiring the current computing resource state and optical network resource state of the cross-data center environment, wherein the computing resource state comprises the number of available GPUs and the available storage capacity of each data center; determining the computing resource constraints according to the computing resource state, wherein the computing resource constraints comprise that, in each data center, the total number of GPUs allocated to all hybrid parallel training tasks to be scheduled does not exceed the number of available GPUs of the data center, and the storage resources allocated to each pipeline stage do not exceed the available storage capacity of the data center; and determining the optical network resource constraints according to the optical network resource state, wherein the optical network resource constraints comprise that, on each link, the frequency slots allocated for cross-data center communication do not exceed the total frequency slot capacity of the link, and the frequency slots occupied by the cross-data center communication of the same hybrid parallel training task satisfy preset spectrum continuity, spectrum consistency and guard-band isolation constraints.
  7. The cross-data center computing and optical network joint scheduling method for hybrid parallel training according to claim 1, wherein the time model formula comprises a computation total formula, a computation time formula, a data parallel traffic formula, a tensor parallel traffic formula and a pipeline parallel traffic formula; correspondingly, the determining, according to the model structure, the hybrid parallel configuration and the pipeline stage division scheme of the hybrid parallel training task to be scheduled and by adopting the preset time model formula, of the data parallel gradient synchronization communication data volume corresponding to the hybrid parallel training task, the computation time of each pipeline stage on a single micro-batch, the tensor parallel communication data volume inside each pipeline stage and the pipeline parallel communication data volume between adjacent pipeline stages, comprises: determining the computation total of each pipeline stage based on the computation total formula, according to the computation amount of each computation layer in the model structure and the pipeline stage division scheme, and substituting the computation total, the tensor parallelism in the hybrid parallel configuration and a preset single-GPU computing power into the computation time formula to determine the computation time of each pipeline stage on a single micro-batch; determining the total parameter size of the hybrid parallel training task according to the parameter size of each computation layer in the model structure, and determining the data parallel gradient synchronization communication data volume based on the data parallel traffic formula according to the data parallelism in the hybrid parallel configuration; determining the tensor parallel communication data volume inside each pipeline stage based on the tensor parallel traffic formula, according to the tensor parallel communication data volume of each computation layer in the model structure and the pipeline stage division scheme; and determining the pipeline parallel communication data volume between adjacent pipeline stages based on the pipeline parallel traffic formula, according to the activation size of each computation layer in the model structure and the pipeline stage division scheme.
  8. The cross-data center computing and optical network joint scheduling method for hybrid parallel training according to claim 4, wherein the cross-data center communication state meeting a preset trigger condition comprises at least one of: the actual bandwidth of the cross-data center communication falling below a preset bandwidth threshold, the communication delay exceeding a preset delay threshold, and optical network link congestion being detected.
  9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the cross-data center computing and optical network joint scheduling method for hybrid parallel training according to any one of claims 1 to 8.
  10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the cross-data center computing and optical network joint scheduling method for hybrid parallel training according to any one of claims 1 to 8.
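As an illustration of the time model referenced in claims 1 and 7, the following Python sketch computes a per-stage computation time and the data parallel and pipeline parallel traffic volumes. The concrete formula forms are common assumptions (a FLOPs-over-throughput compute model and a ring all-reduce factor for gradient synchronization), not formulas taken from the patent text:

```python
def stage_compute_time(layer_flops, tp_degree, gpu_flops):
    """Computation time of one pipeline stage on a single micro-batch:
    the stage's total FLOPs (computation total formula) divided across
    tensor-parallel GPUs (computation time formula). Both forms are
    illustrative assumptions."""
    total = sum(layer_flops)
    return total / (tp_degree * gpu_flops)

def dp_sync_traffic(param_sizes, dp_degree):
    """Data-parallel gradient synchronization volume per replica, assuming
    a ring all-reduce: 2 * (d - 1) / d of the total parameter size."""
    total_params = sum(param_sizes)
    return 2.0 * (dp_degree - 1) / dp_degree * total_params

def pp_traffic(activation_size_at_cut):
    """Pipeline-parallel traffic between adjacent stages: the activations
    crossing the stage boundary, per micro-batch."""
    return activation_size_at_cut
```

With two 2 TFLOP layers, tensor parallelism 4 and 1 TFLOP/s per GPU, the stage time is 1 s; with data parallelism 4, each replica synchronizes 1.5x the parameter size.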
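The single-iteration total time objective of claims 2 and 3 can be approximated with a standard GPipe-style pipeline formula; the (m + p - 1) bottleneck form below is an assumed common approximation, not the patent's preset function form, and the weighted-sum combiner follows claim 2:

```python
def single_iteration_time(stage_times, num_microbatches, dp_sync_time):
    """Assumed pipeline approximation: with m micro-batches and p stages,
    the slowest stage (compute + tensor/pipeline communication) is traversed
    (m + p - 1) times, then data-parallel gradient sync is paid once."""
    p = len(stage_times)
    bottleneck = max(stage_times)
    return (num_microbatches + p - 1) * bottleneck + dp_sync_time

def weighted_objective(iteration_times, weights):
    """Claim 2 objective: weighted sum of per-replica single-iteration times,
    with weights set by replica priority or resource demand."""
    return sum(w * t for w, t in zip(weights, iteration_times))
```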
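Claim 5's pipeline stage division (consecutive layers, minimizing the maximum per-stage computation time) is the classic balanced contiguous-partition problem. A minimal sketch using binary search over the bottleneck time, an assumed algorithmic choice since the patent does not specify the solver:

```python
def partition_stages(layer_times, num_stages):
    """Divide consecutive layers into at most `num_stages` contiguous pipeline
    stages so the maximum per-stage computation time is minimized: binary
    search over the bottleneck plus a greedy feasibility check. May return
    fewer than `num_stages` stages when that already achieves the optimum."""
    def stages_needed(limit):
        count, acc = 1, 0.0
        for t in layer_times:
            if t > limit:
                return float('inf')  # a single layer exceeds the limit
            if acc + t > limit:
                count, acc = count + 1, t
            else:
                acc += t
        return count

    lo, hi = max(layer_times), sum(layer_times)
    while hi - lo > 1e-9:
        mid = (lo + hi) / 2
        if stages_needed(mid) <= num_stages:
            hi = mid
        else:
            lo = mid

    # Recover the stage boundaries greedily at the found bottleneck.
    stages, current, acc = [], [], 0.0
    for t in layer_times:
        if current and acc + t > hi + 1e-9:
            stages.append(current)
            current, acc = [t], t
        else:
            current.append(t)
            acc += t
    stages.append(current)
    return stages
```

For layer times [1, 2, 3, 4, 5] and two stages, the optimal cut is [1, 2, 3] | [4, 5] with a bottleneck of 9.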
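The spectrum continuity, spectrum consistency and guard-band isolation constraints of claim 6 can be checked with a first-fit scan of the frequency-slot maps along a lightpath, in the style of routing-and-spectrum-assignment in elastic optical networks. The first-fit policy and boolean slot maps below are illustrative assumptions:

```python
def allocation_feasible(link_free_slots, path, demand_slots, guard_band=1):
    """Find the lowest block of contiguous frequency slots that is free on
    every link of `path` (spectrum continuity + consistency), leaving
    `guard_band` free slots on each side (guard-band isolation).
    `link_free_slots` maps link id -> list of booleans (True = free).
    Returns the index of the first traffic-carrying slot, or None."""
    num_slots = len(next(iter(link_free_slots.values())))
    need = demand_slots + 2 * guard_band  # demand plus both guard bands
    for start in range(num_slots - need + 1):
        if all(all(link_free_slots[l][start:start + need]) for l in path):
            return start + guard_band
    return None
```

Capacity (claim 6's "does not exceed the total frequency slot capacity") is enforced implicitly: if no window of the required width is free on every link, the function returns None.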
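The rescheduling triggers of claims 4 and 8 amount to a simple predicate over monitored values. A minimal sketch; all threshold parameters are hypothetical names, not values from the patent:

```python
def should_reschedule(actual_time, predicted_time, deviation_threshold,
                      bandwidth, bandwidth_threshold,
                      delay, delay_threshold, link_congested):
    """Claims 4 and 8: re-run the joint optimization if the measured stage
    time drifts too far from the time model, or if any network-side trigger
    (low bandwidth, high delay, detected congestion) fires."""
    time_drift = abs(actual_time - predicted_time) / predicted_time > deviation_threshold
    network_trigger = (bandwidth < bandwidth_threshold
                       or delay > delay_threshold
                       or link_congested)
    return time_drift or network_trigger
```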

Description

Cross-data center computing and optical network joint scheduling method and device for hybrid parallel training

Technical Field

The application relates to the technical field of artificial intelligence and deep learning, in particular to a cross-data center computing and optical network joint scheduling method and device for hybrid parallel training.

Background

With the development of artificial intelligence and deep learning, the parameter scale and training data scale of large models continue to grow, and model training is gradually extending from a single server or single data center (DC) to cooperative completion across multiple geographically dispersed data centers. Cross-data center training can fully utilize computing resources distributed in different regions, relieves to some extent the limitations of a single data center in terms of computing power, energy consumption and physical space, and has become an important trend in large model training. In a cross-data center training scenario, a training task usually combines multiple parallel modes such as data parallelism, pipeline parallelism and tensor parallelism. While improving computing efficiency, these parallel modes introduce complex communication requirements such as parameter gradient synchronization and inter-stage activation transfer, which must be completed over a wide area network. Meanwhile, the interconnection network between data centers is gradually evolving toward optical networks with large bandwidth, low delay and strong reconfigurability, providing a high-performance communication basis for cross-data center training. At present, scheduling schemes for large-model hybrid parallel training are mainly designed for the interior of a single data center.
In such schemes, training tasks are deployed within the same data center, and scheduling decisions are made primarily based on the state of computing resources within the data center (e.g., the number of available GPUs and node computational loads). Data exchange during training is completed through the high-speed interconnection network inside the data center; communication resources are treated as fixed or sufficient by default and are rarely modeled and controlled independently during scheduling. Moreover, the training scheduling system and the underlying network resource management system are mutually independent, and scheduling decisions do not dynamically sense or jointly optimize network link state, bandwidth allocation or communication delay. The prior art therefore has an obvious defect: because computing scheduling and network resource allocation are mutually independent and lack a unified cooperative mechanism, computing resource scheduling and optical network resource allocation are difficult to coordinate effectively in the cross-data center training scenario. Scheduling decisions are based only on the state of local computing resources, communication dependencies among training subtasks are not explicitly modeled, underlying network resources are treated as uncontrollable or static, and optical network bandwidth, paths and delay are not included in the scheduling decision process. Under concurrent multi-task scenarios, it is easy for the computing resources of some data centers to sit idle while cross-data center links become congested, resulting in low overall resource utilization efficiency.
Therefore, a technical scheme is needed that, in a cross-data center large model training scenario, comprehensively considers the computation and communication characteristics of the training task and schedules in cooperation with the optical network resource state.

Disclosure of Invention

In view of this, embodiments of the present application provide a cross-data center computing and optical network joint scheduling method and apparatus for hybrid parallel training, to obviate or mitigate one or more disadvantages in the prior art. The application provides a cross-data center computing and optical network joint scheduling method for hybrid parallel training, comprising the following steps: according to the model structure, the hybrid parallel configuration and the pipeline stage division scheme of the hybrid parallel training task to be scheduled, a preset time model formula is adopted to determine the data parallel gradient synchronization communication data volume corresponding to the hybrid parallel training task, the computation time of each pipeline stage on a single micro-batch, the tensor parallel communication data volume inside each pipeline stage a