
CN-122019058-A - Training method of hybrid expert model, and processing method and device of long-sequence task

CN 122019058 A

Abstract

The disclosure provides a training method for a hybrid expert model, together with a processing method and apparatus for long-sequence tasks, and relates to artificial intelligence technologies such as hybrid expert models, model training, long-sequence training, expert parallelism, and collective communication operations. The method trains the hybrid expert model to be trained on a received long-sequence training task as follows: the combining collective communication operation of the forward propagation phase, originally executed only on the communication timeline, is split into a recovery operation for each calculation result and an aggregation operation that fuses all the recovered calculation results; the recovery operation remains on the communication timeline while the aggregation operation is moved onto the computation timeline, yielding an execution arrangement; an interleaved forward/backward scheduling mode is adopted under this arrangement to obtain an execution plan; and each batch of data constituting the long-sequence training task is trained according to the execution plan to obtain the trained hybrid expert model. The scheme improves the training efficiency of hybrid expert models on long-sequence tasks.
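
Read operationally, the split described in the abstract corresponds to launching the combine collective asynchronously on the communication timeline and performing the fusion of the recovered results as an ordinary compute-stream step that directly follows it. Below is a minimal sketch of that idea, assuming a PyTorch-style expert-parallel setup; the names `expert_out`, `gate_probs`, and `run_other_compute` are illustrative assumptions, not taken from the patent.

```python
# Minimal sketch (not the patent's implementation) of splitting the MoE
# "combine" collective into (a) an asynchronous recovery step kept on the
# communication timeline and (b) a local aggregation step moved onto the
# computation timeline.
import torch
import torch.distributed as dist

def combine_split(expert_out: torch.Tensor,
                  gate_probs: torch.Tensor,
                  run_other_compute):
    # expert_out: [tokens, k, hidden] results returned by remote experts
    # gate_probs: [tokens, k] routing weights (illustrative shapes)
    recv_buf = torch.empty_like(expert_out)

    # (a) Recovery: launch the all-to-all asynchronously; it occupies only
    # the communication timeline and does not block the compute stream.
    work = dist.all_to_all_single(recv_buf, expert_out, async_op=True)

    # While tokens are in flight, the compute stream stays busy with work
    # belonging to another batch (forward or backward).
    other_result = run_other_compute()

    # (b) Aggregation: once recovery completes, fuse the recovered
    # per-expert results on the computation timeline, immediately after
    # the recovery of the same batch of data.
    work.wait()
    combined = (recv_buf * gate_probs.unsqueeze(-1)).sum(dim=1)
    return combined, other_result
```

Because the aggregation is now an ordinary compute op, the scheduler can slot it behind the compute of a different batch, which is what makes the interleaved schedule in the claims possible.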

Inventors

  • ZHANG LINGYUN
  • GU SHILEI
  • WEI ZHIHAO
  • ZHANG HENGHUA
  • SHEN DOU
  • LI SHIYONG
  • WANG YANPENG
  • XIAO ZHIWEN

Assignees

  • Beijing Baidu Netcom Science and Technology Co., Ltd. (北京百度网讯科技有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-02-13

Claims (15)

  1. A method of training a hybrid expert model, comprising: training the hybrid expert model to be trained, according to a received long-sequence training task, in the following manner: splitting the combining collective communication operation of the forward propagation phase, originally executed only on the communication timeline, into a recovery operation for each calculation result and an aggregation operation that fuses the recovered calculation results; keeping the recovery operation on the communication timeline and executing the aggregation operation on the computation timeline to obtain an execution arrangement, wherein on the time axis the aggregation operation directly follows the recovery operation of the same batch of data; adopting an interleaved forward/backward scheduling mode under the execution arrangement to obtain an execution plan; and training each batch of data constituting the long-sequence training task according to the execution plan to obtain a trained hybrid expert model.
  2. The method of claim 1, wherein, for the different expert sub-models constituting the hybrid expert model, the execution plan adjusts the number of data batches processed in the warm-up phase according to the size of each expert sub-model, such that after entering the steady-state training phase the forward propagation phases and backward propagation phases of different data batches are interleaved on the time axis with the collective communication operations required by expert parallelism.
  3. The method of claim 1, wherein training each batch of data constituting the long-sequence training task according to the execution plan comprises: after each batch of data sequentially executes, according to the execution plan, the forward propagation phases of the multi-head self-attention layer, the attention post-processing layer, and the multi-layer perceptron to obtain a first intermediate activation value, discarding the first intermediate activation value and offloading the input activation used to calculate it from video memory to host memory; and before each batch of data executes, according to the execution plan, the backward propagation phases of the multi-head self-attention layer, the attention post-processing layer, and the multi-layer perceptron, loading the input activation from host memory back into video memory so that the first intermediate activation value can be recomputed from the input activation for use in the backward propagation phase (see the offload/recompute sketch following the claims).
  4. The method of claim 3, wherein discarding the first intermediate activation value and offloading the input activation used to calculate it comprises: in the time slice following the one in which each batch of data sequentially executes, according to the execution plan, the forward propagation phases of the multi-head self-attention layer, the attention post-processing layer, and the multi-layer perceptron to obtain the first intermediate activation value, deleting the first intermediate activation value and offloading the input activation used to calculate it from video memory to host memory in an asynchronous manner.
  5. The method of claim 3, wherein loading the input activation from host memory to video memory before each batch of data executes, according to the execution plan, the backward propagation phases of the multi-head self-attention layer, the attention post-processing layer, and the multi-layer perceptron comprises: loading the input activation from host memory to video memory in an asynchronous manner in the time slice preceding the backward propagation phase of each of those layers to be executed according to the execution plan.
  6. The method of claim 3, further comprising: splitting off the weight update operation originally contained in the backward propagation phase of the multi-layer perceptron executed on the computation timeline, to obtain a new backward propagation operation and a new weight update operation of the multi-layer perceptron, both still arranged on the computation timeline and executed in sequence, wherein on the time axis the new weight update operation is executed after the forward propagation phase of the dispatch collective communication operation executed on the communication timeline, and before the backward propagation phase of the attention post-processing layer executed on the computation timeline.
  7. The method of claim 3, wherein on the time axis the recovery operation is further followed by the forward propagation phase of the attention post-processing layer of the same batch of data executed on the computation timeline, and the aggregation operation is further followed by the forward propagation phase of the multi-layer perceptron of the same batch of data executed on the computation timeline.
  8. The method of claim 3, further comprising: after each batch of data executes, according to the execution plan, the forward propagation phases of the functional layers other than the multi-head self-attention layer, the attention post-processing layer, and the multi-layer perceptron to obtain a second intermediate activation value, offloading the second intermediate activation value from video memory to host memory; and before each batch of data executes, according to the execution plan, the backward propagation phases of those other functional layers, loading the second intermediate activation value from host memory to video memory for use by the corresponding functional layer in the backward propagation phase.
  9. The method of any one of claims 1-8, wherein the training mode is controlled by a preset scheduler configured to orchestrate the forward propagation phase, the backward propagation phase, the collective communication operations required by expert parallelism, and the data transfer operations between host memory and video memory, such that the execution period of the data transfer operations on the time axis overlaps with the execution period of at least one of the forward propagation phase, the backward propagation phase, and the collective communication operations.
  10. A method of processing a long-sequence task, comprising: acquiring a task to be processed whose length exceeds a preset length threshold; and invoking a preset target hybrid expert model to process the task to be processed to obtain an output processing result, wherein the target hybrid expert model is a trained hybrid expert model obtained by the training method of a hybrid expert model according to any one of claims 1-9.
  11. A training apparatus for a hybrid expert model, comprising: a long-sequence training task receiving and training processing unit configured to train the hybrid expert model to be trained, according to the received long-sequence training task, in the following manner: splitting the combining collective communication operation of the forward propagation phase, originally executed only on the communication timeline, into a recovery operation for each calculation result and an aggregation operation that fuses the recovered calculation results; keeping the recovery operation on the communication timeline and executing the aggregation operation on the computation timeline to obtain an execution arrangement, wherein on the time axis the aggregation operation directly follows the recovery operation of the same batch of data; adopting an interleaved forward/backward scheduling mode under the execution arrangement to obtain an execution plan; and training each batch of data constituting the long-sequence training task according to the execution plan to obtain the trained hybrid expert model.
  12. A processing apparatus for long-sequence tasks, comprising: a to-be-processed task acquisition unit configured to acquire a task to be processed whose length exceeds a preset length threshold; and a hybrid expert model invocation processing unit configured to invoke a preset target hybrid expert model to process the task to be processed to obtain an output processing result, wherein the target hybrid expert model is a trained hybrid expert model provided by the training apparatus for a hybrid expert model according to claim 11.
  13. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of a hybrid expert model according to any one of claims 1-9 and/or the processing method of a long-sequence task according to claim 10.
  14. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the training method of a hybrid expert model according to any one of claims 1-9 and/or the processing method of long-sequence tasks according to claim 10.
  15. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the training method of a hybrid expert model according to any one of claims 1-9 and/or the steps of the processing method of long-sequence tasks according to claim 10.
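
The offload/recompute pattern of claims 3-5 (discard the first intermediate activation, asynchronously offload its input activation, prefetch it back one time slice before the backward phase, then recompute) can be sketched as below. This is a minimal sketch assuming PyTorch CUDA streams; `layer_forward` and the buffer names are illustrative assumptions, not the patent's implementation, and host buffers would need to be pinned for the copies to be truly asynchronous.

```python
# Sketch of activation offloading (claims 3-5): forward discards the
# intermediate activation and unloads the input activation to host memory;
# backward prefetches the input and recomputes the intermediate activation.
import torch

offload_stream = torch.cuda.Stream()  # side stream for D2H/H2D transfers

def forward_with_offload(layer_forward, input_act: torch.Tensor):
    out = layer_forward(input_act)  # first intermediate activation consumed here
    # Asynchronous unload (claim 4): overlap the device-to-host copy with
    # later computation instead of blocking the compute stream.
    with torch.cuda.stream(offload_stream):
        cpu_copy = input_act.to("cpu", non_blocking=True)
    return out, cpu_copy  # the intermediate activation itself is not kept

def backward_prefetch(cpu_copy: torch.Tensor) -> torch.Tensor:
    # Asynchronous load (claim 5): issue the host-to-device copy in the
    # time slice preceding the corresponding backward phase.
    with torch.cuda.stream(offload_stream):
        return cpu_copy.to("cuda", non_blocking=True)

def backward_with_recompute(layer_forward, gpu_copy, grad_out):
    # Wait for the prefetch, recompute the intermediate activation from the
    # reloaded input activation, then run the backward phase (claim 3).
    torch.cuda.current_stream().wait_stream(offload_stream)
    gpu_copy.requires_grad_(True)
    out = layer_forward(gpu_copy)
    out.backward(grad_out)
    return gpu_copy.grad
```

The same unload-before-backward / load-before-backward pattern, without recomputation, covers the second intermediate activation of claim 8.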

Description

Training method of hybrid expert model, and processing method and device of long-sequence task

Technical Field

The present disclosure relates to the field of data processing, in particular to artificial intelligence technologies such as hybrid expert models, model training, long-sequence training, expert parallelism, and collective communication operations, and more particularly to a training method for a hybrid expert model, a processing method for long-sequence tasks, and corresponding apparatuses, electronic devices, computer-readable storage media, and computer program products.

Background

In the current large-model training field, hybrid expert models attract attention because they can significantly increase model capacity. When handling long-sequence training tasks, however, the training process faces a serious challenge: hybrid expert models typically rely on expert-parallel strategies that require frequent collective communication among different computing devices to dispatch and aggregate data, and the combining collective communication operation of the forward propagation phase acts as a monolithic, blocking communication step whose execution time grows dramatically with sequence length and is strictly serialized with the computation on each device. This leaves computational resources idle during communication, severely limiting training efficiency and making long-sequence training under limited hardware resources exceptionally difficult and inefficient.

Disclosure of Invention

The embodiments of the present disclosure provide a training method for a hybrid expert model, a processing method for long-sequence tasks, and corresponding apparatuses, electronic devices, computer-readable storage media, and computer program products.

According to a first aspect, an embodiment of the present disclosure provides a training method for a hybrid expert model, which trains the hybrid expert model to be trained, according to a received long-sequence training task, in the following manner: the combining collective communication operation of the forward propagation phase, originally executed only on the communication timeline, is split into recovery operations for the calculation results and an aggregation operation that fuses the recovered calculation results; the recovery operations remain on the communication timeline while the aggregation operation is executed on the computation timeline, yielding an execution arrangement in which, on the time axis, the aggregation operation directly follows the recovery operations of the same batch of data; an interleaved forward/backward scheduling mode is adopted under this arrangement to obtain an execution plan; and each batch of data constituting the long-sequence training task is trained according to the execution plan to obtain the trained hybrid expert model.
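
The interleaved forward/backward scheduling mode of the first aspect resembles a one-forward-one-backward ("1F1B") pipeline schedule: a warm-up phase runs forwards only, after which forwards and backwards alternate so that the collective communication of one batch overlaps the computation of another. A minimal sketch under that assumption follows; `forward_step`, `backward_step`, and `num_warmup` (playing the role of the per-expert warm-up batch count in claim 2) are illustrative names.

```python
# Sketch of an interleaved ("1F1B"-style) execution plan over data batches.
from collections import deque

def interleaved_schedule(batches, forward_step, backward_step, num_warmup):
    pending = deque()  # forward results awaiting their backward phase
    it = iter(batches)

    # Warm-up phase: forwards only, so that in steady state a backward is
    # always available to overlap another batch's communication.
    for _ in range(num_warmup):
        try:
            pending.append(forward_step(next(it)))
        except StopIteration:
            break

    # Steady state: one forward then one backward per step; the collective
    # communication of one batch overlaps the computation of the other.
    for batch in it:
        pending.append(forward_step(batch))
        backward_step(pending.popleft())

    # Cool-down: drain the remaining backward phases.
    while pending:
        backward_step(pending.popleft())
```
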
According to a second aspect, an embodiment of the present disclosure provides a training apparatus for a hybrid expert model, comprising a long-sequence training task receiving and training processing unit configured to train the hybrid expert model to be trained, according to the received long-sequence training task, in the manner of the first aspect: the combining collective communication operation of the forward propagation phase, originally executed only on the communication timeline, is split into recovery operations for the calculation results and an aggregation operation that fuses the recovered calculation results; the recovery operations remain on the communication timeline while the aggregation operation is executed on the computation timeline, yielding an execution arrangement in which, on the time axis, the aggregation operation directly follows the recovery operations of the same batch of data; an interleaved forward/backward scheduling mode is adopted under this arrangement to obtain an execution plan; and each batch of data constituting the long-sequence training task is trained according to the execution plan to obtain the trained hybrid expert model.

According to a third aspect, an embodiment of the present disclosure provides a method for processing a long-sequence task, comprising: acquiring a task to be processed whose length exceeds a preset length threshold; and invoking a preset target hybrid expert model to process the task to be processed to obtain an output processing result, where the target hybrid expert model is a trained hybrid expert model obtained by the training method described in the first aspect.

According to a fourth aspect, an embodiment of the present disclosure provides a processing apparatus for long-sequence tasks, comprising a to-be-processed task acquisition unit configured to acquire a task to be processed whose length exceeds a preset length threshold, and a hybrid expert model invocation processing unit configured to invoke a preset target hybrid expert model to process the task to be processed to obtain an output processing result, where the target hybrid expert model is the trained hybrid expert model provided by the training apparatus described in the second aspect.
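
As a usage-level illustration of the third aspect, the processing method reduces to a length check followed by invocation of the trained model. The sketch below is illustrative only; the threshold value and the `moe_model` object are assumptions, not values from the patent.

```python
# Sketch of the long-sequence processing method (third aspect).
LENGTH_THRESHOLD = 32_768  # preset length threshold in tokens (assumed value)

def process_long_sequence_task(task_tokens, moe_model):
    # A task qualifies as long-sequence only if it exceeds the threshold.
    if len(task_tokens) <= LENGTH_THRESHOLD:
        raise ValueError("task does not exceed the preset length threshold")
    # Invoke the preset target hybrid expert model to obtain the output.
    return moe_model(task_tokens)
```
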