
CN-122021969-A - Online switching method of parallel training strategies and deep learning model training system

CN122021969A

Abstract

The disclosure provides an online switching method of parallel training strategies and a deep learning model training system. The method comprises the steps of determining a slicing mode of new communication group information and model parameters according to new parallel dimension division information and new working node topology information, further generating a mapping scheme for model parameter slicing migration according to old parallel dimension division information and old working node topology information, and migrating the model parameters to the new working nodes based on the slicing mode and the mapping scheme. The new communication group information may be used to enable the communication group, the enabled communication group and model parameters migrated to the new work node for the new work node to perform parallel model training under the new parallel dimension division. Thus, the present disclosure enables online fast switching of parallel training strategies by enabling the corresponding communication groups, reconstructing parameter mapping, and performing minimal cost repartitioning and migration at runtime.
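The following minimal Python sketch illustrates the three-step flow summarized above: deriving new communication group information, deriving the per-parameter sharding mode together with a migration mapping from the old layout, and migrating the shards so that training resumes under the new division. All names (ParallelPlan, switch_strategy, the helper functions), the even 1-D split, and the assumption that the new TP degree is a multiple of the old one are illustrative choices, not the disclosure's implementation.

```python
# Toy sketch of the switching flow: (1) communication group info,
# (2) sharding mode + migration mapping, (3) migration plan.
from dataclasses import dataclass, field

@dataclass
class ParallelPlan:
    """One parallel training strategy: dimension division + node topology."""
    tp: int                                      # tensor-parallel degree
    ranks: list = field(default_factory=list)    # participating working nodes

def comm_group_info(plan):
    # Step 1: TP sub-groups of `tp` consecutive ranks (one possible layout).
    return [plan.ranks[i:i + plan.tp] for i in range(0, len(plan.ranks), plan.tp)]

def sharding_and_mapping(param_len, old, new):
    # Step 2: even 1-D shards under the new plan, and for each new rank the
    # old rank holding its data (assumes new.tp is a multiple of old.tp).
    new_slices = [(r * param_len // new.tp, (r + 1) * param_len // new.tp)
                  for r in range(new.tp)]
    mapping = {r: s[0] * old.tp // param_len for r, s in enumerate(new_slices)}
    return new_slices, mapping   # new rank -> slice, new rank -> source old rank

def switch_strategy(param_len, old, new):
    groups = comm_group_info(new)                         # enable comm groups
    slices, mapping = sharding_and_mapping(param_len, old, new)
    # Step 3 (migration) would move each slice from mapping[r] to new rank r.
    return groups, slices, mapping

if __name__ == "__main__":
    old = ParallelPlan(tp=2, ranks=[0, 1])
    new = ParallelPlan(tp=4, ranks=[0, 1, 2, 3])
    print(switch_strategy(param_len=16, old=old, new=new))
```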

Inventors

  • Request for anonymity
  • Request for anonymity
  • Request for anonymity

Assignees

  • 北京无问芯穹科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-23

Claims (11)

  1. An online switching method of parallel training strategies, the method being for deep learning model training and comprising: determining new communication group information according to new parallel dimension division information and new working node topology information; determining a sharding mode of each model parameter of the deep learning model according to the new parallel dimension division information and the new working node topology information, and further generating a mapping scheme for model parameter shard migration according to old parallel dimension division information and old working node topology information; and migrating the model parameters to new working nodes based on the sharding mode and the mapping scheme; wherein the new communication group information is used for enabling a communication group, and the enabled communication group and the model parameters migrated to the new working nodes are used by the new working nodes to execute parallel model training under the new parallel dimension division.
  2. The method of claim 1, wherein the new parallel dimension division information includes at least one of: tensor parallel (TP) segmentation information, pipeline parallel (PP) segmentation information, and expert parallel (EP) allocation information; and wherein determining the new communication group information according to the new parallel dimension division information and the new working node topology information comprises: determining a new global communication group according to the new working node topology information, and determining at least one of a TP sub-communication group, a PP sub-communication group, and an EP sub-communication group according to the new parallel dimension division information and the new working node topology information (a toy sketch of such sub-group construction follows the claims).
  3. The method of claim 1 or 2, wherein determining the sharding mode of each model parameter of the deep learning model according to the new parallel dimension division information and the new working node topology information, and further generating the mapping scheme for model parameter shard migration according to the old parallel dimension division information and the old working node topology information, comprises: extracting parallel metadata of each virtual parameter in a global virtual parameter space according to the new parallel dimension division information, wherein the global virtual parameter space is constructed as a unified logical view of the model parameters of the deep learning model; binding the physical shards of the model parameters held by the old working nodes, which execute parallel model training under the old parallel dimension division, to the corresponding virtual parameter entries in the global virtual parameter space to obtain binding information; and traversing the global virtual parameter space according to the parallel metadata and the binding information to determine the sharding mode and the mapping scheme (a toy sketch of this traversal follows the claims).
  4. The method of claim 3, wherein extracting the parallel metadata of each virtual parameter in the global virtual parameter space according to the new parallel dimension division information comprises at least one of: determining a TP parameter distribution state of the virtual parameter according to the TP segmentation information and taking it as TP parallel metadata; determining the new working node where the virtual parameter is located according to the PP segmentation information and taking it as PP parallel metadata; and determining the new working node where the virtual parameter is located according to the EP allocation information and taking it as EP parallel metadata; and wherein traversing the global virtual parameter space according to the parallel metadata to determine the sharding mode and the mapping scheme of each model parameter comprises at least one of: performing dimension reconstruction on all virtual parameters according to the TP segmentation information and their respective TP parallel metadata to obtain virtual parameter distribution states; aggregating the PP parallel metadata of the virtual parameters by layer according to the PP segmentation information to obtain the virtual parameter combination on each new working node participating in layer computation; and dividing the virtual parameters serving as non-shared expert parameters according to the EP allocation information and the EP parallel metadata to obtain the expert parameter combination on each new working node participating in expert model computation.
  5. The method of claim 1 or 2, further comprising: in response to enabling local distribution of optimizer parameters, determining an optimizer state parameter distribution scheme according to data parallel (DP) segmentation information, the sharding mode, and the mapping scheme; and migrating the optimizer state parameters to the corresponding new working nodes based on the optimizer state parameter distribution scheme.
  6. The method of claim 1 or 2, wherein migrating the model parameters to the new working nodes based on the sharding mode and the mapping scheme comprises: in response to determining that the new working nodes include an edge node, performing model-parameter-granularity transmission in the process of migrating the model parameters to the new working nodes, wherein the model-parameter-granularity transmission comprises: loading the parallel metadata of the current model parameter under the new parallel dimension division; performing a point-to-point transfer to complete the migration of the current model parameter from the old working node to the new working node; and releasing the buffer of the current model parameter in the old working node; and wherein an edge node is a working node that participates in parallel model training both before and after the parallel training strategy switch (a toy sketch of such a point-to-point transfer follows the claims).
  7. The method of claim 1 or 2, further comprising: in response to determining that the number of new working nodes is greater than the number of old working nodes and that the new working nodes include a newly added node, performing the following on the newly added node: while the old working nodes are executing parallel training, creating a new process and executing an initialization operation; and initializing a local communication group and establishing communication connections according to the new communication group information; wherein a newly added node is a working node that does not participate in parallel model training before the parallel training strategy switch and has no ongoing training process.
  8. The method of claim 1 or 2, further comprising: based on a cached unified data set, reconstructing a data loader and a data iterator according to the new parallel dimension division information and the new working node topology information to provide input data for parallel model training under the new parallel dimension division, wherein the unified data set is created and cached for the data format and structure of the non-parameter data involved in the deep learning model training process (a toy sketch of such loader reconstruction follows the claims).
  9. A deep learning model training system for online switching of parallel training strategies, comprising: a communication management module for determining new communication group information according to new parallel dimension division information and new working node topology information; and a re-sharding and migration module for determining a sharding mode of each model parameter of the deep learning model according to the new parallel dimension division information and the new working node topology information, further generating a mapping scheme for model parameter shard migration according to old parallel dimension division information and old working node topology information, and migrating the model parameters to the corresponding new working nodes based on the sharding mode and the mapping scheme; wherein the new communication group information is used for enabling a communication group, and the enabled communication group and the model parameters migrated to the new working nodes are used by the new working nodes to execute parallel model training under the new parallel dimension division.
  10. The system of claim 9, further comprising: a state management module for caching a unified data set, reconstructing a data loader and a data iterator according to the new parallel dimension division information and the new working node topology information based on the unified data set, and providing input data for parallel model training under the new parallel dimension division, wherein the unified data set is created and cached for the data format and structure of the non-parameter data involved in the deep learning model training process.
  11. The system of claim 9 or 10, further comprising: a state management module for releasing the GPU memory allocated by the old working nodes for executing training and loading the training state information required by the new working nodes for executing training before the re-sharding and migration module migrates the model parameters, and for allocating the GPU memory required by the new working nodes for executing training after the re-sharding and migration module migrates the model parameters.
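As referenced in claim 2, the following self-contained toy shows one way the TP and PP sub-communication group rank lists can be derived from the parallel dimension division information and the working node topology. The particular layout (TP ranks contiguous, PP ranks strided), the helper name build_sub_groups, and the 8-node example are assumptions in the style of common Megatron-like groupings, not the layout required by the disclosure.

```python
# Derive TP and PP sub-communication groups from a node topology and the
# new parallel dimension division (toy layout, one of many possibilities).

def build_sub_groups(world_ranks, tp, pp):
    """Return (tp_groups, pp_groups) as lists of rank lists."""
    world = len(world_ranks)
    assert world % (tp * pp) == 0, "world size must be divisible by tp * pp"
    # TP groups: every `tp` consecutive ranks share one tensor-parallel group.
    tp_groups = [world_ranks[i:i + tp] for i in range(0, world, tp)]
    # PP groups: ranks separated by a stride of world // pp form one pipeline.
    stride = world // pp
    pp_groups = [world_ranks[i::stride] for i in range(stride)]
    return tp_groups, pp_groups

if __name__ == "__main__":
    # New topology: 8 working nodes, switching to TP=2, PP=2 (so DP=2).
    tp_groups, pp_groups = build_sub_groups(list(range(8)), tp=2, pp=2)
    print("TP groups:", tp_groups)   # [[0, 1], [2, 3], [4, 5], [6, 7]]
    print("PP groups:", pp_groups)   # [[0, 4], [1, 5], [2, 6], [3, 7]]
```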
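As referenced in claim 3, the following toy illustrates the global virtual parameter space idea: each parameter is treated as one logical tensor, the old physical shards are bound to slices of that view, and traversing the view under the new TP split yields a per-rank migration mapping. The even 1-D slicing and the function names are illustrative assumptions, not the disclosure's traversal algorithm.

```python
# Toy "global virtual parameter space": old shards are bound to slices of a
# logical 1-D parameter; overlapping the old and new slices gives the
# migration mapping for each new rank.

def shard_ranges(length, parts):
    """Even 1-D slicing of a virtual axis of `length` into `parts` ranges."""
    step = length // parts
    return [(i * step, (i + 1) * step) for i in range(parts)]

def migration_mapping(length, tp_old, tp_new):
    """For each new rank, list the (old_rank, (start, end)) pieces to fetch."""
    old = shard_ranges(length, tp_old)   # binding info: old rank -> slice
    new = shard_ranges(length, tp_new)   # sharding mode under the new division
    mapping = {}
    for new_rank, (ns, ne) in enumerate(new):
        pieces = []
        for old_rank, (os_, oe) in enumerate(old):
            lo, hi = max(ns, os_), min(ne, oe)   # overlap of old and new slices
            if lo < hi:
                pieces.append((old_rank, (lo, hi)))
        mapping[new_rank] = pieces
    return mapping

if __name__ == "__main__":
    # One virtual parameter of length 16, re-sharded from TP=2 to TP=4.
    print(migration_mapping(16, tp_old=2, tp_new=4))
    # {0: [(0, (0, 4))], 1: [(0, (4, 8))], 2: [(1, (8, 12))], 3: [(1, (12, 16))]}
```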
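As referenced in claim 6, the following runnable sketch (CPU, gloo backend, two local processes) illustrates model-parameter-granularity point-to-point transmission: the old working node sends each parameter tensor individually and releases its buffer, while the new working node receives into freshly allocated storage. The two-rank setup, parameter shapes, and port number are illustrative assumptions, not the disclosure's runtime.

```python
# Per-parameter point-to-point transfer between an "old" and a "new" worker.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

PARAM_SHAPES = {"layer0.weight": (4, 4), "layer0.bias": (4,)}

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29511"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 0:  # old working node holding the current shards
        params = {n: torch.randn(s) for n, s in PARAM_SHAPES.items()}
        for name in PARAM_SHAPES:
            dist.send(params[name], dst=1)   # point-to-point transfer
            del params[name]                 # release the old buffer
    else:          # new working node under the new parallel division
        for name, shape in PARAM_SHAPES.items():
            buf = torch.empty(shape)
            dist.recv(buf, src=0)
            print(f"rank {rank} received {name} with shape {tuple(buf.shape)}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```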
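As referenced in claim 8, the following toy shows rebuilding a data loader and iterator from a cached unified data set after the data-parallel division changes. The round-robin index split, batch size, and class names are illustrative assumptions.

```python
# Rebuild a per-rank data iterator from a cached, division-agnostic data set.

class UnifiedDataset:
    """Cached, format-normalized non-parameter training data."""
    def __init__(self, samples):
        self.samples = list(samples)

def rebuild_loader(dataset, dp_rank, dp_size, batch_size=2):
    """Return a fresh iterator over this rank's share of the cached data."""
    shard = dataset.samples[dp_rank::dp_size]     # this rank's slice
    for i in range(0, len(shard), batch_size):
        yield shard[i:i + batch_size]

if __name__ == "__main__":
    cached = UnifiedDataset(range(12))
    # After switching to a new division with DP=3, rank 1 rebuilds its loader:
    for batch in rebuild_loader(cached, dp_rank=1, dp_size=3):
        print(batch)   # [1, 4] then [7, 10]
```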

Description

Online switching method of parallel training strategies and deep learning model training system

Technical Field

The disclosure relates to the field of artificial intelligence, and in particular to an online switching method of parallel training strategies and a deep learning model training system.

Background

In large-scale distributed deep learning, training tasks typically rely on a specific parallel strategy to allocate computing resources (e.g., GPUs) and train the model efficiently on a GPU cluster. Prevailing training frameworks (e.g., Megatron) rely on static parallel strategies: once a training task is started, its parallel strategy and GPU resource configuration cannot be altered. If a failure occurs during training, or if it becomes necessary to adjust the parallel strategy or to increase or decrease GPU resources, existing solutions typically rely on a checkpoint mechanism, which saves the current model and training state to storage when training is interrupted and loads the latest checkpoint when the task is restarted. However, as the training scale increases, the overhead of such checkpointing schemes becomes extremely high, potentially requiring tens of minutes to load and restore state, which greatly affects training efficiency. In addition, in dynamic workload scenarios such as reinforcement learning, the long-tail effect of the inference component often reduces GPU utilization. This inefficiency not only affects the training process but also prevents computing resources from being fully used. Dynamic allocation of computing resources has therefore become an important direction for improving training efficiency and resource utilization, especially on training platforms (such as cloud platforms) built on large-scale GPU clusters.

Disclosure of Invention

The present disclosure therefore proposes a deep learning model training scheme, which may be implemented as an online switching method of parallel training strategies and a deep learning model training system. The scheme achieves online fast switching of the parallel training strategy by enabling the required communication groups, reconstructing the parameter mapping, and performing minimal-cost re-partitioning and migration at runtime. It can adjust the parallel strategy dynamically at very low cost, so that a parallel strategy suited to the current stage of training can be selected flexibly, and interruptions of training tasks caused by faults can be reduced, thereby improving the continuity and efficiency of the training process.
According to a first aspect of the disclosure, an online switching method of parallel training strategies is provided. The method is used for deep learning model training and comprises: determining new communication group information according to new parallel dimension division information and new working node topology information; determining a sharding mode of each model parameter of the deep learning model according to the new parallel dimension division information and the new working node topology information, and further generating a mapping scheme for model parameter shard migration according to old parallel dimension division information and old working node topology information; and migrating the model parameters to the new working nodes based on the sharding mode and the mapping scheme, wherein the new communication group information is used for enabling a communication group, and the enabled communication group and the model parameters migrated to the new working nodes are used by the new working nodes to execute parallel model training under the new parallel dimension division.

Optionally, the new parallel dimension division information comprises at least one of tensor parallel (TP) segmentation information, pipeline parallel (PP) segmentation information, and expert parallel (EP) allocation information, and determining the new communication group information according to the new parallel dimension division information and the new working node topology information comprises: determining a new global communication group according to the new working node topology information, and determining at least one of a TP sub-communication group, a PP sub-communication group, and an EP sub-communication group according to the new parallel dimension division information and the new working node topology information.

Optionally, determining the sharding mode of each model parameter of the deep learning model according to the new parallel dimension division information and the new working node topology information, and further generating the mapping scheme for model parameter shard migration according to the old parallel dimension division information and the old working node topology information, wherein the global virtual parameter