CN-122019186-A - Model distributed training method and device
Abstract
The disclosure provides a model distributed training method and device, relating to the technical field of artificial intelligence, and in particular to the technical fields of computer vision, natural language processing, multimodal models, deep learning, and distributed parallel computing. The method comprises: performing performance analysis based on multiple iterations of a model and constructing a sample computation load model; estimating the computation load of each sample in a single iteration using the sample computation load model to obtain a computation load estimation result for each sample; rearranging how the samples are distributed among data parallel nodes within the single iteration based on the computation load estimation results to obtain rearranged samples; and inputting the rearranged samples to the data parallel nodes for forward propagation and backward propagation, and performing gradient synchronization.
Inventors
- ZHANG QIU
- SHEN JUN
- SHEN KUN
- ZHANG HENGHUA
- SHEN DOU
- LI SHIYONG
- WANG YANPENG
- XIAO ZHIWEN
Assignees
- 北京百度网讯科技有限公司 (Beijing Baidu Netcom Science and Technology Co., Ltd.)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-13
Claims (13)
- 1. A model distributed training method, comprising: performing performance analysis based on multiple iterations of a model, and constructing a sample computation load model; estimating the computation load of each sample in a single iteration using the sample computation load model to obtain a computation load estimation result for each sample; rearranging how the samples are distributed among data parallel nodes within the single iteration based on the computation load estimation result of each sample to obtain rearranged samples; and inputting the rearranged samples to the data parallel nodes for forward propagation and backward propagation, and performing gradient synchronization.
- 2. The method of claim 1, wherein the performing performance analysis based on multiple iterations of the model and constructing a sample computation load model comprises: representing the sample computation load as a quadratic function of the sample sequence length; collecting the sample sequence length, the number of samples, the forward propagation time and the backward propagation time within each iteration; substituting the sample sequence length, the number of samples, the forward propagation time and the backward propagation time within each iteration into the quadratic function, and solving for the coefficients of the quadratic function; and substituting the coefficients back into the quadratic function to obtain the sample computation load model.
- 3. The method of claim 2, wherein the quadratic function includes a quadratic term representing the quadratic-complexity overhead of the attention computation, a linear term representing the linear-complexity computation overhead of at least one of the embedding layer, the multi-layer perceptron, and communication, and a constant term representing a fixed system overhead independent of the sample sequence length.
- 4. The method according to claim 2 or 3, wherein said substituting the sample sequence length, the number of samples, the forward propagation time and the backward propagation time within each iteration into the quadratic function and solving for the coefficients of the quadratic function comprises: applying a continuously differentiable softmax approximation to the maximum operation in the quadratic function, constructing a continuously differentiable least-squares optimization objective, and solving for the coefficients of the quadratic function by a numerical optimization method under a non-negativity constraint.
- 5. The method of claim 4, wherein the applying a continuously differentiable softmax approximation to the maximum operation in the quadratic function, constructing a continuously differentiable least-squares optimization objective, and solving for the coefficients of the quadratic function by a numerical optimization method under a non-negativity constraint comprises: defining scaling factors for the squared sample sequence length term, the sample sequence length term and the sample number term; constructing a softmax_max function to provide a continuously differentiable approximation of the maximum operation; constructing a loss function that measures the error between the load predicted by the quadratic function and the actual total time consumed; and invoking a numerical minimization method to solve for the coefficients that minimize the loss function under the non-negativity constraint on the coefficients. (An illustrative sketch of this fitting procedure is given after the claims.)
- 6. The method according to claim 1, wherein the rearranging how the samples are distributed among the data parallel nodes within the single iteration based on the computation load estimation result of each sample to obtain rearranged samples comprises: within a single iteration, distributing samples with different computation load estimation results across a plurality of data parallel nodes.
- 7. The method of claim 6, wherein the distributing samples with different computation load estimation results across the plurality of data parallel nodes within a single iteration comprises: sorting the samples by computation load estimation result from high to low; and sequentially assigning the sorted samples to the data parallel node with the smallest current total load.
- 8. The method of claim 7, wherein the sum of the sequence lengths of the samples on each data parallel node does not exceed a preset threshold.
- 9. The method of any of claims 1-8, wherein the model comprises at least one of a large language model, a multimodal model, and a Transformer-class model.
- 10. A model distributed training apparatus, comprising: a construction module configured to perform performance analysis based on multiple iterations of a model and construct a sample computation load model; an estimation module configured to estimate the computation load of each sample in a single iteration using the sample computation load model to obtain a computation load estimation result for each sample; a rearrangement module configured to rearrange how the samples are distributed among data parallel nodes within the single iteration based on the computation load estimation result of each sample to obtain rearranged samples; and a propagation module configured to input the rearranged samples to the data parallel nodes for forward propagation and backward propagation and to perform gradient synchronization.
- 11. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
- 12. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-9.
- 13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-9.
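The fitting procedure of claims 2-5 can be illustrated with a short sketch. The patent does not specify an implementation, so the following is a minimal illustration under stated assumptions: the per-rank load is modeled as a*sum(L^2) + b*sum(L) + c*n, the measured iteration time (forward plus backward) is approximated by a softmax-smoothed maximum over data parallel ranks, and scipy is used for the non-negatively constrained least-squares fit. All function and variable names here are illustrative, not taken from the patent.

```python
# Sketch (assumptions noted above): fit the quadratic sample-load model from profiled iterations.
import numpy as np
from scipy.optimize import minimize

def softmax_max(x, tau=10.0):
    """Continuously differentiable approximation of max(x) via softmax weighting."""
    w = np.exp(tau * (x - x.max()))
    return float((w * x).sum() / w.sum())

def rank_load(coeffs, seq_lens):
    """Predicted load of one rank: a*sum(L^2) + b*sum(L) + c*num_samples."""
    a, b, c = coeffs
    L = np.asarray(seq_lens, dtype=float)
    return a * (L ** 2).sum() + b * L.sum() + c * len(L)

def loss(coeffs, iterations):
    """Squared error between the smoothed max per-rank load and the measured
    forward+backward time of each profiled iteration."""
    err = 0.0
    for ranks_seq_lens, measured_time in iterations:
        per_rank = np.array([rank_load(coeffs, s) for s in ranks_seq_lens])
        err += (softmax_max(per_rank) - measured_time) ** 2
    return err

def fit_load_model(iterations):
    """iterations: list of (per-rank sequence-length lists, measured seconds)."""
    x0 = np.array([1e-8, 1e-5, 1e-3])
    res = minimize(loss, x0, args=(iterations,),
                   method="L-BFGS-B", bounds=[(0.0, None)] * 3)
    return res.x  # coefficients (a, b, c), constrained to be non-negative
```

The smooth maximum keeps the objective differentiable, so a gradient-based optimizer such as L-BFGS-B can be used, and the bounds enforce the non-negativity constraint on the coefficients. Given profiling data from several iterations, the returned coefficients (a, b, c) can then estimate each sample's load as a*L^2 + b*L + c.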
Description
Model distributed training method and device

Technical Field

The present disclosure relates to the field of artificial intelligence, and more particularly to the fields of computer vision, natural language processing, multimodal models, deep learning, and distributed parallel computing.

Background

In large-model distributed training, the DP (Data Parallel) mechanism is a mainstream implementation: training data is divided across multiple DP ranks (data parallel nodes) that execute in parallel to improve training efficiency. In engineering practice, a fixed-length packing strategy is widely applied, in which multiple samples are combined into a pack (packing block) with a fixed number of tokens, so that the computation of each DP rank on linear-complexity operators and its GPU memory footprint remain consistent, meeting the engineering requirements of large-scale training. To mitigate load fluctuation caused by differences in sample length, related techniques introduce a sample-length bucketing mechanism in the data loading stage and, together with a staged training strategy, gradually incorporate samples from different length intervals so as to reduce the average load difference among DP ranks during training. For the attention computation, some systems adopt optimizations such as FlashAttention or variable-length attention interfaces, which focus on improving computation efficiency within a single DP rank and thereby help improve the performance of large-model distributed training.

Disclosure of Invention

The embodiments of the disclosure provide a model distributed training method, apparatus, device, storage medium and program product.

In a first aspect, an embodiment of the disclosure provides a model distributed training method, comprising: performing performance analysis based on multiple iterations of a model and constructing a sample computation load model; estimating the computation load of each sample in a single iteration using the sample computation load model to obtain a computation load estimation result for each sample; rearranging how the samples are distributed among data parallel nodes within the single iteration based on the computation load estimation results to obtain rearranged samples; and inputting the rearranged samples to the data parallel nodes for forward propagation and backward propagation, and performing gradient synchronization.

In a second aspect, an embodiment of the disclosure provides a model distributed training apparatus, comprising: a construction module configured to perform performance analysis based on multiple iterations of a model and construct a sample computation load model; an estimation module configured to estimate the computation load of each sample in a single iteration using the sample computation load model to obtain a computation load estimation result for each sample; a rearrangement module configured to rearrange how the samples are distributed among data parallel nodes within the single iteration based on the computation load estimation results to obtain rearranged samples; and a propagation module configured to input the rearranged samples to the data parallel nodes for forward propagation and backward propagation and to perform gradient synchronization.
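The rearrangement step of the first aspect (detailed in claims 6-8) amounts to a greedy balancing of estimated loads across data parallel ranks. The sketch below is one possible illustration, not the patent's implementation: samples are sorted by estimated load in descending order and placed on the least-loaded rank whose total sequence length stays under a cap; the estimate_load callable and the cap value are assumed parameters.

```python
# Sketch of load-balanced sample rearrangement across DP ranks (claims 6-8); names are illustrative.
import heapq

def rearrange_samples(samples, num_ranks, estimate_load, max_tokens_per_rank):
    """samples: list of (sample_id, seq_len); returns one list of sample_ids per rank."""
    order = sorted(samples, key=lambda s: estimate_load(s[1]), reverse=True)
    # Min-heap of (accumulated_load, rank_index): the least-loaded rank is popped first.
    heap = [(0.0, r) for r in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    token_totals = [0] * num_ranks
    for sample_id, seq_len in order:
        placed = False
        skipped = []
        while heap:
            load, rank = heapq.heappop(heap)
            if token_totals[rank] + seq_len <= max_tokens_per_rank:
                assignment[rank].append(sample_id)
                token_totals[rank] += seq_len
                heapq.heappush(heap, (load + estimate_load(seq_len), rank))
                placed = True
                break
            skipped.append((load, rank))
        for item in skipped:
            heapq.heappush(heap, item)
        if not placed:
            raise ValueError(f"sample {sample_id} (len {seq_len}) exceeds every rank's token cap")
    return assignment
```

For example, `rearrange_samples([("s0", 4096), ("s1", 512), ("s2", 2048)], num_ranks=2, estimate_load=lambda L: a*L*L + b*L + c, max_tokens_per_rank=8192)` yields a per-rank assignment. This follows the longest-processing-time greedy heuristic while respecting the per-rank sequence-length threshold of claim 8.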
In a third aspect, an embodiment of the present disclosure provides an electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect.

In a fourth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as described in the first aspect.

In a fifth aspect, embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, implements a method as described in the first aspect.

It should be understood that the content of this section is not intended to identify key or critical features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

Other features, objects and advantages of the present disclosure will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a flow chart of one embodiment of a model distributed training method according to the present disclosure.
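To tie the pieces together, the following is a minimal per-iteration driver in the spirit of the flow of FIG. 1 (estimate loads, rearrange, forward and backward propagation, gradient synchronization). It is a sketch under assumptions the patent does not make: PyTorch with an already-initialized torch.distributed process group (e.g. launched via torchrun), the `rearrange_samples` function from the earlier sketch, a hypothetical `build_local_batch` data-loading helper, and a Hugging Face-style model interface returning `.loss`.

```python
# Sketch of one training iteration; framework choice and helper names are assumptions.
import torch
import torch.distributed as dist

def train_iteration(model, optimizer, batch_samples, coeffs, rank, world_size, max_tokens):
    a, b, c = coeffs
    estimate = lambda L: a * L * L + b * L + c  # per-sample load from the fitted quadratic model
    # Rearrange the global batch so every data-parallel rank gets a similar total load.
    assignment = rearrange_samples(batch_samples, world_size, estimate, max_tokens)
    local_ids = assignment[rank]
    inputs, targets = build_local_batch(local_ids)   # hypothetical data-loading helper
    optimizer.zero_grad()
    loss = model(inputs, labels=targets).loss        # forward propagation (HF-style interface assumed)
    loss.backward()                                   # backward propagation
    # Gradient synchronization: average gradients across all data-parallel ranks.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    optimizer.step()
    return loss.item()
```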