CN-121998036-A - Distributed training method and device for large model, electronic equipment and storage medium
Abstract
The invention provides a distributed training method and apparatus for a large model, an electronic device and a storage medium, and relates to the technical field of model training. The invention provides a cross-domain distributed training method in which gradient compression and an expert network cooperate in parallel. During training, gradient synchronization and expert parameter routing between target expert sub-networks are controlled through the proposed collaborative compression and dynamic scheduling mechanism, so that cross-domain traffic can be significantly reduced and the loss of training efficiency caused by insufficient network bandwidth is avoided. Meanwhile, the invention can adaptively adjust the expert routing strategy and the gradient compression mode while keeping the model accuracy stable, and improves the distributed training performance of the large model to be trained under multi-cluster and heterogeneous computing power conditions. The invention can effectively support large-scale model training tasks across data centers, under cloud-edge coordination and on computing power networks, and can serve as an underlying capability to be invoked by training platforms, scheduling systems and computing power service systems.
Inventors
- SUI CHAO
- FANG YUYAN
- CUI CHAO
- SHEN LINJIANG
Assignees
- 浪潮通信信息系统有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-19
Claims (10)
- 1. A distributed training method for a large model, comprising: determining a distributed training scheme according to basic information of a large model to be trained that is input by a user, wherein the distributed training scheme comprises a gradient compression mode and an expert routing scheme; training the model main body and at least one target expert sub-network in parallel based on the expert routing scheme; during the parallel training, compressing, transmitting and aggregating model gradient parameters of each target expert sub-network based on the gradient compression mode to obtain compressed gradient information; and iteratively updating parameters of the large model to be trained based on the compressed gradient information until training of the large model to be trained is completed. (An illustrative sketch of this flow is given after the claims.)
- 2. The distributed training method for a large model according to claim 1, wherein the compressing, transmitting and aggregating model gradient parameters of each target expert sub-network based on the gradient compression mode to obtain compressed gradient information comprises: dynamically selecting differential coding or quantization coding as the gradient compression mode based on a gradient magnitude, sparsity or variation amplitude of the model gradient parameters; determining a gradient compression ratio of the gradient compression mode based on a cross-domain link state, a training stage and a load of each target expert sub-network; compressing the model gradient parameters of each target expert sub-network based on the gradient compression ratio and the gradient compression mode to obtain gradient parameters of each target expert sub-network; and transmitting and aggregating the gradient parameters of each target expert sub-network to obtain the compressed gradient information. (A sketch of this mode and ratio selection follows the claims.)
- 3. The distributed training method for a large model according to claim 1, wherein determining the distributed training scheme according to the basic information of the large model to be trained input by the user comprises: extracting a model training classification dimension set from the basic information, wherein the model training classification dimension set comprises a parameter scale dimension, a task complexity dimension and a data characteristic dimension; determining cluster information based on node distribution, computing power level and network bandwidth of a wide-area distributed cluster; and determining the distributed training scheme based on the cluster information and the model training classification dimension set.
- 4. The distributed training method for a large model according to claim 1, wherein training the at least one target expert sub-network in parallel based on the expert routing scheme comprises: determining a sub-training scheme and a training resource allocation amount of each target expert sub-network based on the expert routing scheme; adjusting each training resource allocation amount based on an adjustment coefficient to obtain each adjusted training resource allocation amount; and training each target expert sub-network in parallel based on each adjusted training resource allocation amount and each sub-training scheme. (A sketch of this allocation adjustment follows the claims.)
- 5. The distributed training method for a large model according to claim 1, wherein after iteratively updating the parameters of the large model to be trained based on the compressed gradient information until training of the large model to be trained is completed, the method further comprises: performing a performance test on the trained large model based on a standard test data set to obtain a performance indicator of the trained large model; and determining a performance level of the trained large model based on the performance indicator and a resource consumption indicator of the large model to be trained during training.
- 6. The distributed training method for a large model according to claim 1, wherein iteratively updating the parameters of the large model to be trained based on the compressed gradient information until training of the large model to be trained is completed comprises: slicing and encrypting the compressed gradient information to obtain encrypted slices; selecting an adaptive transmission strategy according to link bandwidth, delay and load conditions of a wide-area distributed cluster; and transmitting the encrypted slices to each target expert sub-network based on the adaptive transmission strategy, so as to iteratively update parameters of each target expert sub-network and parameters of the model main body based on the encrypted slices until training of the large model to be trained is completed. (A sketch of this slicing, encryption and adaptive transmission follows the claims.)
- 7. The distributed training method for a large model according to claim 1, wherein after iteratively updating the parameters of the large model to be trained based on the compressed gradient information until training of the large model to be trained is completed, the method further comprises: determining a deployment node and a resource allocation scheme of the trained large model based on a performance level of the trained large model, an application scenario requirement and a computing power resource distribution; and releasing and deploying the trained large model based on the deployment node and the resource allocation scheme.
- 8. A distributed training apparatus for a large model, comprising: a determining module, configured to determine a distributed training scheme according to basic information of a large model to be trained that is input by a user, wherein the distributed training scheme comprises a gradient compression mode and an expert routing scheme; a training module, configured to train the model main body and at least one target expert sub-network in parallel based on the expert routing scheme; a compression module, configured to compress, transmit and aggregate model gradient parameters of each target expert sub-network based on the gradient compression mode during the parallel training to obtain compressed gradient information; and an updating module, configured to iteratively update parameters of the large model to be trained based on the compressed gradient information until training of the large model to be trained is completed.
- 9. An electronic device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the distributed training method for a large model according to any one of claims 1 to 7.
- 10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the distributed training method for a large model according to any one of claims 1 to 7.
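To make the claimed flow easier to follow, the following is a minimal Python sketch of the claim 1 steps under stated assumptions: the function names (`determine_scheme`, `compress`, `decompress`, `training_step`), the bandwidth threshold and the toy differential/quantization codings are hypothetical illustrations, not details specified by the claims.

```python
# Hypothetical illustration of the claim 1 flow; not the patented implementation.
import numpy as np

def determine_scheme(model_info, cluster_info):
    # Claims 1 and 3: derive the gradient compression mode and expert routing
    # scheme from model basic information and cluster information.
    mode = "quantization" if cluster_info["bandwidth_mbps"] < 1000 else "differential"
    return {"compression_mode": mode, "top_k_experts": 2}

def compress(grad, mode):
    # Toy stand-ins for the differential / quantization coding of claim 2.
    if mode == "quantization":
        scale = float(np.abs(grad).max()) or 1.0
        return np.round(grad / scale * 127).astype(np.int8), scale
    return np.diff(grad, prepend=0.0), None

def decompress(data, scale, mode):
    if mode == "quantization":
        return data.astype(np.float32) / 127.0 * scale
    return np.cumsum(data)

def training_step(expert_grads, scheme):
    # Claim 1: compress, "transmit" (here just a local dict) and aggregate the
    # per-expert gradients, then return the update applied to the model.
    mode = scheme["compression_mode"]
    received = {name: decompress(*compress(g, mode), mode)
                for name, g in expert_grads.items()}
    return np.mean(list(received.values()), axis=0)

if __name__ == "__main__":
    scheme = determine_scheme({"params": 7e9}, {"bandwidth_mbps": 400})
    grads = {"expert_0": np.random.randn(8).astype(np.float32),
             "expert_1": np.random.randn(8).astype(np.float32)}
    print(scheme, training_step(grads, scheme))
```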
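Next, a hedged sketch of the mode and ratio selection of claim 2; the sparsity and variation thresholds and the weighting of link bandwidth, training progress and expert load are assumptions chosen only for illustration.

```python
# Hypothetical thresholds and weights; claim 2 does not fix concrete values.
import numpy as np

def select_compression_mode(grad, prev_grad, sparsity_threshold=0.5):
    # Claim 2: choose differential or quantization coding from the gradient
    # magnitude, sparsity or variation amplitude.
    sparsity = float(np.mean(np.abs(grad) < 1e-6))
    variation = float(np.linalg.norm(grad - prev_grad) / (np.linalg.norm(prev_grad) + 1e-12))
    if sparsity > sparsity_threshold or variation < 0.1:
        return "differential"   # sparse or slowly varying gradients: cheap deltas
    return "quantization"       # dense, fast-changing gradients: low-bit codes

def select_compression_ratio(link_bandwidth_mbps, training_progress, expert_load):
    # Claim 2: derive the compression ratio from the cross-domain link state,
    # the training stage and the load of each target expert sub-network.
    ratio = 0.5
    if link_bandwidth_mbps < 500:
        ratio *= 0.5            # congested cross-domain link: compress harder
    if training_progress < 0.2:
        ratio *= 0.5            # early training tolerates more compression loss
    if expert_load > 0.8:
        ratio *= 0.8            # heavily loaded experts send less per step
    return max(ratio, 0.05)     # always keep at least 5% of the payload

# Example: a sparse gradient over a slow link early in training.
g, g_prev = np.zeros(100, dtype=np.float32), np.ones(100, dtype=np.float32)
print(select_compression_mode(g, g_prev), select_compression_ratio(300, 0.1, 0.9))
```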
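The resource adjustment of claim 4 can be illustrated in the same spirit; the adjustment rule and the example coefficients below are assumptions, since the claim does not specify how the coefficient is computed.

```python
# Hypothetical sketch of claim 4; the adjustment rule is an assumption.
def adjust_allocations(base_allocations, adjustment_coefficients):
    # Claim 4: scale each target expert sub-network's training resource
    # allocation amount by an adjustment coefficient before parallel training.
    return {expert: amount * adjustment_coefficients.get(expert, 1.0)
            for expert, amount in base_allocations.items()}

base = {"expert_0": 4.0, "expert_1": 4.0}     # e.g. GPUs per expert sub-network
coeff = {"expert_0": 1.5, "expert_1": 0.5}    # favour the more heavily routed expert
print(adjust_allocations(base, coeff))        # {'expert_0': 6.0, 'expert_1': 2.0}
```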
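Finally, a sketch of the slicing, encryption and adaptive transmission of claim 6; the XOR stream is a toy stand-in for whatever encryption scheme an implementation would actually use, and the strategy names and thresholds are assumed.

```python
# Toy sketch of claim 6; the XOR "cipher" only stands in for real encryption.
import secrets

def slice_and_encrypt(payload: bytes, slice_size: int = 1024):
    # Claim 6: split the compressed gradient information into slices and
    # encrypt each slice before it crosses the wide-area link.
    key = secrets.token_bytes(32)
    slices = [payload[i:i + slice_size] for i in range(0, len(payload), slice_size)]
    encrypted = [bytes(b ^ key[j % len(key)] for j, b in enumerate(s)) for s in slices]
    return key, encrypted

def choose_transmission_strategy(bandwidth_mbps, delay_ms, load):
    # Claim 6: pick an adaptive transmission strategy from the link bandwidth,
    # delay and load conditions of the wide-area distributed cluster.
    if bandwidth_mbps > 1000 and delay_ms < 20 and load < 0.7:
        return "bulk"       # healthy link: ship every slice at once
    if delay_ms > 100 or load > 0.9:
        return "trickle"    # stressed link: pace slices one by one
    return "batched"        # otherwise: moderate batches

def transmit(encrypted_slices, strategy, send_fn):
    # send_fn is whatever point-to-point primitive the cluster exposes.
    batch = {"bulk": len(encrypted_slices) or 1, "batched": 8, "trickle": 1}[strategy]
    for i in range(0, len(encrypted_slices), batch):
        for s in encrypted_slices[i:i + batch]:
            send_fn(s)

key, parts = slice_and_encrypt(b"compressed-gradient-bytes" * 100)
transmit(parts, choose_transmission_strategy(400, 60, 0.5), send_fn=lambda s: None)
```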
Description
Distributed training method and device for a large model, electronic device and storage medium

Technical Field

The present invention relates to the field of model training technologies, and in particular to a distributed training method and apparatus for a large model, an electronic device, and a storage medium.

Background

With the rapid development of artificial intelligence technology, large-scale deep learning models represented by Large Language Models (LLMs) exhibit explosive growth in parameter counts, which have expanded from billions to the trillion scale. The computing power and memory resources of a single machine or a single computing cluster are far from sufficient to support the training requirements of such models. Cross-regional, multi-cluster distributed training has therefore become an indispensable key technology for current large-model training. This training paradigm aims to aggregate the computing power resources of data centers in different geographic locations and with different hardware architectures to jointly complete the training task of a huge model. However, network links in a cross-domain environment typically exhibit narrow bandwidth, high latency, large jitter and dynamic changes, which makes frequent data exchange between nodes the core bottleneck limiting overall training efficiency. Meanwhile, in order to further expand model capacity without significantly increasing computational cost, the mixture-of-experts (MoE) model structure is widely used. MoE achieves an efficient balance between parameter scale and computational efficiency by introducing a plurality of parallel expert sub-networks and having a gating network dynamically select and activate a subset of experts for each input. However, in distributed training, the dynamic routing characteristic of the MoE model introduces additional difficulties in expert parameter exchange and load balancing, which further aggravates the already severe cross-domain communication pressure. Therefore, how to efficiently and collaboratively train a large-scale MoE model in such a complex heterogeneous environment has become a technical problem to be solved in the art.

Existing large-model distributed training methods perform unified management and scheduling of parameter synchronization during training through a parameter server architecture or collective communication operations, so as to shield the heterogeneity of the underlying hardware and network. These methods mainly focus on how to reasonably partition and deploy a huge number of experts on different computing devices, and design efficient routing algorithms and load balancing mechanisms to keep the computing load of each expert as uniform as possible, avoiding situations in which some devices sit idle while others are overloaded, thereby improving the utilization of computing resources. However, such distributed training methods for large models still incur high communication overhead and low training efficiency.

Disclosure of Invention

The invention provides a distributed training method and apparatus for a large model, an electronic device and a storage medium, which are used to solve the defect of low efficiency of distributed training of large models in the prior art and to improve the efficiency of distributed training of large models.
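Since the method builds on the MoE expert routing described in the Background, a minimal top-k gating sketch is included here for orientation; the dimensions and the softmax gate reflect standard MoE practice and are assumptions, not details taken from this patent.

```python
# Standard top-k MoE gating, shown only to illustrate the expert routing idea.
import numpy as np

def top_k_gating(token, gate_weights, k=2):
    # gate_weights: (hidden_dim, num_experts); one routing decision per token.
    logits = token @ gate_weights
    top_idx = np.argsort(logits)[-k:]              # activate only k experts
    probs = np.exp(logits[top_idx] - logits[top_idx].max())
    return top_idx, probs / probs.sum()            # expert ids and mixing weights

rng = np.random.default_rng(0)
token = rng.standard_normal(16)                    # one token representation
gate = rng.standard_normal((16, 8))                # gating weights for 8 experts
experts, weights = top_k_gating(token, gate)
print(experts, weights)                            # two experts chosen, weights sum to 1
```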
The invention provides a distributed training method for a large model, which comprises the following steps: determining a distributed training scheme according to basic information of a large model to be trained that is input by a user, wherein the distributed training scheme comprises a gradient compression mode and an expert routing scheme; training the model main body and at least one target expert sub-network in parallel based on the expert routing scheme; during the parallel training, compressing, transmitting and aggregating model gradient parameters of each target expert sub-network based on the gradient compression mode to obtain compressed gradient information; and iteratively updating parameters of the large model to be trained based on the compressed gradient information until training of the large model to be trained is completed. According to the distributed training method for a large model provided by the invention, compressing, transmitting and aggregating the model gradient parameters of each target expert sub-network based on the gradient compression mode to obtain the compressed gradient information comprises the following steps: dynamically selecting differential coding or quantization coding as the gradient compression mode based on a gradient magnitude, sparsity or variation amplitude of the model gradient parameters; determining a gradient compression ratio of the gradient compression mode based on a cross-domain link state, a training stage and a load of each target expert sub-network; compressing the model gradient parameters of each target expert sub-network