CN-121996413-A - Model parallel method and device for multi-machine multi-card scene and electronic equipment

CN121996413A

Abstract

The application relates to the technical field of artificial intelligence, and discloses a model parallel method and device for a multi-machine multi-card scene, and electronic equipment. The method comprises: obtaining basic performance data of a pipeline operation for model training, the basic performance data comprising performance load information of a plurality of pipeline stages; obtaining a scheduling sequence of each pipeline stage according to the basic performance data of the pipeline operation; and sequentially simulating the execution process of the pipeline operation according to the scheduling sequence of each pipeline stage to obtain a simulation result of model parallel training. The method enables a user to obtain, in the strategy selection stage, a performance evaluation highly close to real execution, effectively solves the problems of high computing-power requirements and high cost of traditional experiments, and avoids the problem that small-scale deduction cannot accurately predict performance in large-scale scenes.

Inventors

  • LI LUN
  • ZHAO JUNFANG
  • WANG YAHAN
  • CUI JIARUI
  • CHEN WENGUANG

Assignees

  • China Academy of Information and Communications Technology (中国信息通信研究院)
  • Tsinghua University (清华大学)

Dates

Publication Date
2026-05-08
Application Date
2025-12-30

Claims (10)

  1. A model parallel method for a multi-machine multi-card scene, comprising: acquiring basic performance data of a pipeline operation for model training, wherein the basic performance data comprises performance load information of a plurality of pipeline stages; acquiring a scheduling sequence of each pipeline stage according to the basic performance data of the pipeline operation; and sequentially simulating the execution process of the pipeline operation according to the scheduling sequence of each pipeline stage to obtain a simulation result of model parallel training.
  2. The method of claim 1, wherein acquiring the basic performance data of the pipeline operation for model training comprises: dividing a model to be trained into a plurality of pipeline stages according to the pipeline parallelism; acquiring performance load information of a typical subgraph constituting each pipeline stage; and mapping each of the plurality of pipeline stages to its typical subgraph to obtain the performance load information of each pipeline stage as the basic performance data of the pipeline operation.
  3. The method of claim 1, wherein acquiring the scheduling sequence of each pipeline stage according to the basic performance data of the pipeline operation comprises: generating a space-time diagram according to a pipeline arrangement algorithm and the basic performance data; and acquiring the corresponding scheduling sequence according to the space-time diagram.
  4. The method of claim 3, wherein generating the space-time diagram according to the pipeline arrangement algorithm and the basic performance data comprises: generating the space-time diagram with the pipeline arrangement algorithm according to the micro-batch number and the pipeline parallelism obtained from the basic performance data.
  5. The method of claim 3, wherein acquiring the corresponding scheduling sequence according to the space-time diagram comprises: setting communication timing in the calculation process of the space-time diagram; and generating a complete scheduling scheme according to the space-time diagram and the communication timing as the scheduling sequence of each pipeline stage.
  6. The method of claim 5, wherein the principle for setting the communication timing in the calculation process of the space-time diagram comprises: sending and receiving tensors in the gaps after calculation execution is finished.
  7. The method of claim 5, wherein the principle for setting the communication timing in the calculation process of the space-time diagram further comprises: during data communication, transmitting data from the previous stage to the next stage first, and then performing backward transmission.
  8. The method according to any one of claims 1 to 7, wherein sequentially simulating the execution process of the pipeline operation according to the scheduling sequence of each pipeline stage to obtain the simulation result of model parallel training comprises: placing initial events into a global queue to initialize the event queue; taking the event with the earliest timestamp out of the queue for processing, and checking whether new dependent operations are unlocked after the processing is completed; wrapping each newly unlocked dependent operation as a new event and inserting it into the event queue for processing; and obtaining the simulation result of model parallel training after all events in the event queue have been processed.
  9. A model parallel apparatus for a multi-machine multi-card scene, comprising a processor and a memory storing program instructions, wherein the processor is configured to execute the model parallel method for a multi-machine multi-card scene according to any one of claims 1 to 8 when the program instructions are run.
  10. An electronic device, comprising: an electronic device body; and the model parallel apparatus for a multi-machine multi-card scene of claim 9, mounted on the electronic device body.
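The event-queue loop in claim 8 can be sketched as a standard discrete-event simulation. The following is a minimal, illustrative sketch under assumed names and data shapes (the `ops` schema, the operation IDs, and the durations are all inventions for illustration, not the patent's data model); per-stage resource contention and the communication timing rules of claims 5 to 7 are omitted for brevity.

```python
import heapq
from collections import defaultdict

def simulate_pipeline(ops):
    """Minimal discrete-event sketch of the loop described in claim 8.
    ops: op_id -> {"dur": duration, "after": prerequisite op_ids}.
    Returns (finish time per op, overall makespan)."""
    succ = defaultdict(list)   # op -> operations that depend on it
    pending = {}               # op -> number of unfinished prerequisites
    for op, spec in ops.items():
        pending[op] = len(spec["after"])
        for dep in spec["after"]:
            succ[dep].append(op)

    # event-queue initialization: seed the global queue with operations
    # that have no prerequisites; each event is (finish timestamp, op)
    queue = [(ops[op]["dur"], op) for op, n in pending.items() if n == 0]
    heapq.heapify(queue)
    finish = {}

    while queue:
        # take out the event with the earliest timestamp and process it
        t, op = heapq.heappop(queue)
        finish[op] = t
        # check whether completing this op unlocks new dependent operations
        for nxt in succ[op]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                # wrap the unlocked operation as a new event and insert it
                heapq.heappush(queue, (t + ops[nxt]["dur"], nxt))

    return finish, (max(finish.values()) if finish else 0.0)

# two micro-batches flowing through two stages (forward passes only)
ops = {
    "F0_s0": {"dur": 1.0, "after": []},
    "F0_s1": {"dur": 1.0, "after": ["F0_s0"]},
    "F1_s0": {"dur": 1.0, "after": ["F0_s0"]},
    "F1_s1": {"dur": 1.0, "after": ["F1_s0", "F0_s1"]},
}
finish, makespan = simulate_pipeline(ops)   # makespan == 3.0
```

When the last event drains from the queue, `finish` is the simulation result: a per-operation timeline from which stage utilisation and total training time can be read off.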

Description

Model parallel method and device for multi-machine multi-card scene and electronic equipment

Technical Field

The application relates to the technical field of artificial intelligence, in particular to a model parallel method and device for a multi-machine multi-card scene and electronic equipment.

Background

With the rapid development of artificial intelligence technology, deep learning models, particularly large-scale language models based on the Transformer architecture, have grown exponentially in parameter count, advancing rapidly from billions to trillions of parameters. This abrupt expansion in scale creates enormous computing power and GPU memory requirements, so that single-machine single-card training can no longer handle the training tasks of current large models. To address this computing-power bottleneck, the related art discloses parallel training techniques, including data parallelism, tensor parallelism and pipeline parallelism. Pipeline parallelism splits a huge model's network layers into multiple stages and distributes the stages to different computing devices (e.g., GPUs), with intermediate activation values and gradients communicated between the devices via point-to-point communication. In the process of implementing the embodiments of the present disclosure, at least the following problems were found in the related art: in a hybrid parallel architecture, different model structures and hardware environments (such as communication bandwidth and computing power) correspond to different optimal hybrid parallel strategies, and there is currently a lack of efficient, low-cost mechanisms to evaluate and select such strategies. Researchers and engineers often rely on rules of thumb or small-scale deductions, making it difficult to accurately predict true performance on large-scale clusters.
It should be noted that the information disclosed in the above Background section is only for enhancing understanding of the background of the application and thus may include information that does not constitute prior art already known to those of ordinary skill in the art.

Disclosure of Invention

The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview, and is intended neither to identify key/critical elements nor to delineate the scope of such embodiments, but serves as a prelude to the more detailed description that follows. The embodiments of the disclosure provide a model parallel method and device for a multi-machine multi-card scene, and electronic equipment, which can improve the fidelity of the performance evaluation obtained in the strategy selection stage. In some embodiments, the model parallel method for the multi-machine multi-card scene comprises: obtaining basic performance data of a pipeline operation for model training, wherein the basic performance data comprises performance load information of a plurality of pipeline stages; obtaining a scheduling sequence of each pipeline stage according to the basic performance data of the pipeline operation; and sequentially simulating the execution process of the pipeline operation according to the scheduling sequence of each pipeline stage to obtain a simulation result of the model parallel training.
Optionally, obtaining the basic performance data of the pipeline operation for model training comprises: dividing a model to be trained into a plurality of pipeline stages according to the pipeline parallelism; obtaining performance load information of the typical subgraph constituting each pipeline stage; and mapping each pipeline stage to its typical subgraph to obtain the performance load information of each pipeline stage as the basic performance data of the pipeline operation. Optionally, obtaining the scheduling sequence of each pipeline stage according to the basic performance data of the pipeline operation comprises: generating a space-time diagram according to a pipeline arrangement algorithm and the basic performance data, and obtaining the corresponding scheduling sequence according to the space-time diagram. Optionally, generating the space-time diagram according to the pipeline arrangement algorithm and the basic performance data comprises: generating the space-time diagram with the pipeline arrangement algorithm according to the micro-batch number and the pipeline parallelism obtained from the basic performance data. Optionally, obtaining the corresponding scheduling sequence according to the space-time diagram comprises: setting communication timing in the calculation process of the space-time diagram, and generating a complete scheduling scheme according to the space-time diagram and the communication timing to serve as the scheduling sequence of each pipeline stage.
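To illustrate how a space-time diagram can be generated from the micro-batch number and the pipeline parallelism, the sketch below implements one possible pipeline arrangement algorithm, a GPipe-style schedule (all forward passes, then all backward passes). The patent does not disclose which arrangement algorithm it uses, and the fixed `fwd`/`bwd` durations are stand-ins for the per-stage performance load information; all names here are illustrative assumptions.

```python
def gpipe_schedule(num_microbatches, num_stages, fwd=1.0, bwd=2.0):
    """Builds a space-time diagram for a GPipe-style arrangement
    (an assumed example, not the patent's algorithm).
    Returns (diagram, makespan) where diagram maps
    stage -> ordered list of (op, microbatch, start, end)."""
    diagram = {s: [] for s in range(num_stages)}

    def busy_until(s):
        # a stage can run only one block at a time
        return diagram[s][-1][3] if diagram[s] else 0.0

    f_end, b_end = {}, {}
    # forward wave: stage s starts micro-batch m once stage s-1 has
    # produced its activations and stage s itself is free
    for m in range(num_microbatches):
        for s in range(num_stages):
            start = max(f_end.get((m, s - 1), 0.0), busy_until(s))
            f_end[(m, s)] = start + fwd
            diagram[s].append(("F", m, start, start + fwd))
    # backward wave: gradients flow from the last stage back to the first
    for m in range(num_microbatches):
        for s in reversed(range(num_stages)):
            ready = f_end[(m, s)] if s == num_stages - 1 else b_end[(m, s + 1)]
            start = max(ready, busy_until(s))
            b_end[(m, s)] = start + bwd
            diagram[s].append(("B", m, start, start + bwd))

    makespan = max(end for rows in diagram.values() for *_, end in rows)
    return diagram, makespan

diagram, total = gpipe_schedule(num_microbatches=2, num_stages=2)
# with fwd=1.0 and bwd=2.0 this yields a makespan of 9.0
```

Reading each stage's entries in start-time order gives exactly the per-stage scheduling sequence the method feeds into the discrete-event simulation; communication timing would then be inserted into the gaps between blocks per claims 5 to 7.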