CN-121981210-A - Method and device for training large model
Abstract
Embodiments of this specification provide a method and a device for training a large model, where the large model is deployed on a plurality of computing units for distributed training. The method comprises: acquiring training state information of the large model while the large model is trained under a first training strategy, the training state information comprising at least the memory usage rate of each computing unit; if the memory usage rates of a plurality of first computing units are determined to exceed a first memory threshold, determining a second training strategy based on the training state information, where the second training strategy is used to reduce the memory usage rates of the plurality of first computing units and does not change the distributed topology of the large model; and, after a specified strategy switching condition is reached, switching the first training strategy to the second training strategy and continuing to train the large model based on it. The training strategy can thus be dynamically adjusted and optimized based on the training state of the large model without interrupting the training flow.
Inventors
- LIU JUN
- ZHANG HAITAO
Assignees
- 支付宝(杭州)数字服务技术有限公司 (Alipay (Hangzhou) Digital Service Technology Co., Ltd.)
Dates
- Publication Date
- 20260505
- Application Date
- 20260211
Claims (14)
- 1. A method of training a large model deployed on a plurality of computing units for distributed training, comprising: acquiring training state information of the large model while the large model is trained under a first training strategy, the training state information comprising at least the memory usage rate of each computing unit; if the memory usage rates of a plurality of first computing units are determined to exceed a first memory threshold, determining a second training strategy based on the training state information, wherein the second training strategy is used to reduce the memory usage rates of the plurality of first computing units and does not change the distributed topology of the large model; and, after a specified strategy switching condition is reached, switching the first training strategy to the second training strategy so as to continue training the large model based on the second training strategy.
- 2. The method of claim 1, wherein the computing unit is a graphics processing unit (GPU).
- 3. The method of claim 1, wherein the specified strategy switching condition comprises the interval after the end of a current training step and before the start of the next training step.
- 4. The method of claim 1, wherein the training state information further comprises at least one of: model structure information of the large model; the compute usage and communication load metrics of each computing unit; the expert routing state corresponding to each computing unit; the change of the batch size of the next training step relative to that of the current training step; and the change of the input sequence length of the next training step relative to that of the current training step.
- 5. The method of claim 1, wherein, in the case that the first training strategy indicates that the model portions deployed on the plurality of first computing units perform forward computation in a full activation saving mode, the second training strategy indicates that at least some layers of the model portions deployed on the plurality of first computing units perform forward computation in an activation recomputation mode and/or enable a data offloading mechanism.
- 6. The method of claim 1, wherein the large model is based on a mixture-of-experts (MoE) network architecture and includes at least a first expert network layer, the N first expert networks of the first expert network layer being deployed on M computing units in an expert-parallel manner.
- 7. The method of claim 6, wherein the training state information further comprises the expert routing state corresponding to each first computing unit, each expert routing state comprising at least the expert network load of each first expert network deployed on the corresponding first computing unit; and wherein determining the second training strategy based on the training state information if the memory usage rates of the plurality of first computing units are determined to exceed the first memory threshold comprises: if the memory usage rates of the plurality of first computing units are determined to exceed the first memory threshold, and the expert network load of at least one first expert network deployed on the plurality of first computing units exceeds a first load threshold, determining the second training strategy based on the training state information, wherein, in the case that the first training strategy indicates that the at least one first expert network whose expert network load exceeds the first load threshold performs forward computation in a full activation saving mode, the second training strategy indicates that this at least one first expert network performs forward computation in an activation recomputation mode and/or enables a data offloading mechanism.
- 8. The method of claim 1, wherein switching the first training strategy to the second training strategy comprises: broadcasting the second training strategy to all computing units so that all computing units switch from the first training strategy to the second training strategy.
- 9. The method of claim 1, further comprising: if the compute usage rates of a plurality of second computing units are determined to exceed a first compute threshold while their memory usage rates do not exceed a second memory threshold, determining a third training strategy based on the training state information, wherein the third training strategy is used to reduce the compute usage rates of the plurality of second computing units, the third training strategy does not change the distributed topology of the large model, and the second memory threshold does not exceed the first memory threshold; and, after the specified strategy switching condition is reached, switching the first training strategy to the third training strategy so as to continue training the large model based on the third training strategy.
- 10. The method of claim 9, wherein, in the case that the first training strategy indicates that the model portions deployed on the plurality of second computing units perform forward computation in an activation recomputation mode, the third training strategy indicates that at least some layers of the model portions deployed on the second computing units perform forward computation in a full activation saving mode.
- 11. The method of claim 1, further comprising: if a fourth training strategy specified by a user is detected, checking whether the fourth training strategy is feasible based on a specified detection mode; and, after the fourth training strategy is determined to be feasible and the specified strategy switching condition is reached, switching the first training strategy to the fourth training strategy so as to train the large model based on the fourth training strategy, wherein the fourth training strategy does not change the distributed topology of the large model.
- 12. The method of claim 1, further comprising: during training of the large model under the second training strategy, rolling back to the first training strategy if the training state information of T consecutive training steps indicates that the loss-value oscillation of the large model exceeds a preset oscillation threshold, and/or the training state information of at least one training step indicates that a plurality of third computing units are running out of memory.
- 13. An apparatus for training a large model deployed on a plurality of computing units for distributed training, the apparatus comprising: an acquisition module configured to acquire training state information of the large model while the large model is trained under a first training strategy, the training state information comprising at least the memory usage rate of each computing unit; a first determining module configured to determine a second training strategy based on the training state information if the memory usage rates of a plurality of first computing units are determined to exceed a first memory threshold, wherein the second training strategy is used to reduce the memory usage rates of the plurality of first computing units and does not change the distributed parallel topology of the large model; and a first switching module configured to switch the first training strategy to the second training strategy after the specified strategy switching condition is reached, so as to continue training the large model based on the second training strategy.
- 14. A computing device comprising a memory and a processor, wherein the memory stores executable code which, when executed by the processor, implements the method of any one of claims 1-12.
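Claims 1, 3, 5, and 8 together describe a monitor-decide-switch loop: collect per-unit memory usage, pick a memory-reducing strategy that preserves the distributed topology, and apply it only at a training-step boundary by broadcasting it to all units. As a rough illustration only (every name, threshold, and data structure below is hypothetical and not taken from the patent), that control flow could be sketched as:

```python
# Hypothetical sketch of the claimed strategy controller; all names
# and the 0.90 threshold are illustrative assumptions.
FIRST_MEM_THRESHOLD = 0.90  # memory usage ratio that triggers adaptation

def determine_second_strategy(state):
    """Pick a memory-reducing strategy that keeps the distributed topology."""
    overloaded = [u for u, mem in sorted(state["memory_usage"].items())
                  if mem > FIRST_MEM_THRESHOLD]
    if not overloaded:
        return None
    # Per claim 5: move overloaded units from full activation saving to
    # activation recomputation and/or enable a data offloading mechanism.
    return {"units": overloaded,
            "forward_mode": "activation_recomputation",
            "data_offloading": True}

def training_loop(num_steps, get_state, broadcast):
    strategy = {"forward_mode": "full_activation_saving"}
    for step in range(num_steps):
        # ... one training step of the large model runs here ...
        state = get_state(step)  # claim 1: acquire training state info
        new_strategy = determine_second_strategy(state)
        if new_strategy is not None:
            # Claim 3: switch only between training steps.
            # Claim 8: broadcast the new strategy to all computing units.
            strategy = new_strategy
            broadcast(strategy)
    return strategy
```

In a real system `get_state` would read device memory counters (e.g., a CUDA memory API) and `broadcast` would use the collective-communication layer of the training framework; here they are left as injected callables so the switching logic stays framework-agnostic.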
Description
Method and device for training large model

Technical Field

The present disclosure relates to the field of artificial intelligence, and in particular to a method and apparatus for training a large model.

Background

Training of deep learning large models (e.g., large language models) is typically accelerated with high-performance computing hardware, such as GPUs (graphics processing units) or TPUs (tensor processing units). Such hardware has fixed compute capacity (e.g., FLOPS) and memory capacity, while large-model training demands high memory and compute and is therefore more likely to hit these hardware limits. To train a model efficiently under limited hardware resources, a user usually presets a static training strategy before training starts, according to the model structure (such as parameter count, layer count, and whether a mixture-of-experts architecture is used) and the hardware configuration (such as the number of computing devices, their compute capacity, and memory size), and deploys it uniformly on all computing hardware. A training strategy generally comprises a memory optimization sub-strategy (such as activation recomputation, full activation saving, or optimizer state offloading), an operator optimization sub-strategy (such as operator fusion), and a distributed parallelism sub-strategy (such as data parallelism, tensor parallelism, and expert parallelism). The aim is to maximize the utilization of the computing hardware and the training efficiency of the model without exceeding the memory limit of the hardware.
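The trade-off between the two memory sub-strategies named above can be made concrete with a toy cost model (my own illustrative numbers, not from the patent): full activation saving stores every layer's activations and runs the forward pass once, while activation recomputation keeps only a subset of checkpoints and re-runs most of the forward pass during backward. Assuming the classic square-root checkpointing scheme:

```python
import math

def forward_cost(n_layers, mode):
    """Relative (activation_memory, forward_compute) in units of one layer.

    Toy model: sqrt(n) checkpointing is assumed for the recomputation
    mode, which roughly doubles forward compute while shrinking stored
    activations from n to about 2*sqrt(n).
    """
    if mode == "full_activation_saving":
        # Store every layer's activations; forward runs once.
        return n_layers, n_layers
    if mode == "activation_recomputation":
        # Keep ~sqrt(n) checkpoints plus one active segment; the forward
        # pass is re-executed during backward, roughly doubling compute.
        return 2 * math.isqrt(n_layers), 2 * n_layers
    raise ValueError(f"unknown mode: {mode}")
```

For a 64-layer model this gives activation memory of 64 versus 16 units, at the price of forward compute rising from 64 to 128 units, which is exactly the kind of memory-for-compute exchange the strategies in this disclosure toggle between.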
However, the related art generally adopts a static training strategy configuration, that is, the same set of training strategies is fixed over the whole training run, and such a configuration cannot adapt to the dynamically changing load characteristics of model training. For example, in a mixture-of-experts (MoE) model, the gating routing mechanism dynamically adjusts the proportion of tokens distributed to each expert network as training progresses, so that some computing hardware (e.g., a GPU) can carry a token load far above average late in training, causing local out-of-memory (OOM) errors or compute bottlenecks. As another example, late in the training of a Transformer model, the input sequence length is often extended to improve performance, which also increases the memory and compute pressure on the GPU and renders the original training strategy unsuitable. Some current systems, after a training failure (e.g., an OOM exception on part of the computing hardware), automatically restart and recover from a checkpoint, but they still use the original training strategy after restarting, so recurrences of the same problem cannot be avoided. Accordingly, there is a need for an improved training method that can dynamically perceive the model's training state during training and adjust the training strategy.

Disclosure of Invention

One or more embodiments of the present disclosure provide a method and an apparatus for training a large model, so as to dynamically adjust and optimize the training strategy of the large model based on its training state without interrupting the training flow of the large model.
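The MoE routing imbalance described in the background above can be detected by aggregating routed-token counts per computing unit and flagging units whose load is far above the mean. The following is a minimal sketch under assumed inputs (the function name, the token-count dictionaries, and the 1.5x ratio are all illustrative, not from the patent; a real system would read these counts from the gating router):

```python
def imbalanced_units(tokens_per_expert, expert_to_unit, ratio=1.5):
    """Return units whose routed-token load exceeds `ratio` x the mean.

    tokens_per_expert: expert name -> number of tokens routed to it
    expert_to_unit:    expert name -> computing unit hosting it
    """
    load = {}
    for expert, tokens in tokens_per_expert.items():
        unit = expert_to_unit[expert]
        load[unit] = load.get(unit, 0) + tokens
    mean = sum(load.values()) / len(load)
    return sorted(u for u, t in load.items() if t > ratio * mean)
```

A controller in the spirit of claim 7 could combine such a load check with the per-unit memory threshold before deciding to move only the hot experts to activation recomputation.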
According to a first aspect, there is provided a method of training a large model deployed on a plurality of computing units for distributed training, comprising: acquiring training state information of the large model while the large model is trained under a first training strategy, the training state information comprising at least the memory usage rate of each computing unit; if the memory usage rates of a plurality of first computing units are determined to exceed a first memory threshold, determining a second training strategy based on the training state information, wherein the second training strategy is used to reduce the memory usage rates of the plurality of first computing units and does not change the distributed topology of the large model; and, after a specified strategy switching condition is reached, switching the first training strategy to the second training strategy so as to continue training the large model based on the second training strategy. According to a second aspect, there is provided an apparatus for training a large model deployed on a plurality of computing units for distributed training, the apparatus comprising: an acquisition module configured to acquire training state information of the large model in the process of training the large