CN-121981164-A - MoE model training method, device, equipment and storage medium

CN121981164A

Abstract

The application provides a method, an apparatus, a device, and a storage medium for training a MoE model, belonging to the field of neural networks. The method comprises: performing network splitting and network group construction on the feedforward network layer of a Dense model to obtain a target network group; copying the target network group to obtain a plurality of target network groups; constructing a MoE model from the plurality of target network groups and the Dense model to obtain a candidate MoE model; and training the candidate MoE model on a preset sample data set until a converged target MoE model is obtained. The method accurately obtains the converged target MoE model, greatly improves the accuracy and efficiency of MoE model training, and reduces the cost of model training.

Inventors

  • XUAN WENFENG
  • LI SONGRU

Assignees

  • 深圳元智信息技术开发有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-15

Claims (10)

  1. A method of training a MoE model, comprising: obtaining a Dense model, wherein the Dense model is a trained and converged model; performing network splitting and network group construction on the feedforward network layer of the Dense model to obtain a target network group, and copying the target network group to obtain a plurality of target network groups; constructing a MoE model from the plurality of target network groups and the Dense model to obtain candidate MoE models; and training the candidate MoE models on a preset sample data set until a converged target MoE model is obtained.
  2. The MoE model training method of claim 1, wherein performing network splitting and network group construction on the feedforward network layer of the Dense model to obtain a target network group comprises: acquiring a network splitting granularity, and evenly splitting the feedforward network layer according to the network splitting granularity to obtain a plurality of sub-networks; and combining the plurality of sub-networks to obtain the target network group.
  3. The MoE model training method of claim 2, wherein acquiring the network splitting granularity comprises: acquiring preset condition information, wherein the preset condition information comprises at least one of device hardware resource information and task requirement information; acquiring a preset mapping table between network splitting granularities and condition information; and querying the mapping table for the network splitting granularity matching the preset condition information.
  4. The MoE model training method of claim 1, wherein copying the target network group to obtain a plurality of target network groups comprises: acquiring preset condition information, wherein the preset condition information comprises device hardware resource information and/or task requirement information; determining a number of copies of the target network group according to the preset condition information; and copying the target network group that number of times to obtain the plurality of target network groups.
  5. The MoE model training method of claim 1, wherein constructing a MoE model from the plurality of target network groups and the Dense model to obtain candidate MoE models comprises: replacing the feedforward network layer of the Dense model with the plurality of target network groups and adding a router network to obtain the candidate MoE models, wherein each expert network in the candidate MoE models is a sub-network of the feedforward network layer of the Dense model.
  6. The method of claim 1, wherein the preset sample data set comprises a preset first sample data set on which the Dense model was trained to convergence, and wherein training the candidate MoE models on the preset sample data set until a converged target MoE model is obtained comprises: training the candidate MoE models on the preset first sample data set until the converged target MoE model is obtained.
  7. The method of claim 1, wherein the preset sample data set comprises a preset first sample data set and a preset second sample data set, the preset second sample data set and the preset first sample data set being used for training models on different domain data, and wherein training the candidate MoE models on the preset sample data set until a converged target MoE model is obtained comprises: training the candidate MoE models on the preset first sample data set until converged candidate MoE models are obtained; and training the converged candidate MoE models on the preset second sample data set until the converged target MoE model is obtained.
  8. A MoE model training apparatus, comprising an acquisition module, a generation module, a model construction module, and a training module, wherein: the acquisition module is configured to acquire a Dense model, the Dense model being a trained and converged model; the generation module is configured to perform network splitting and network group construction on the feedforward network layer of the Dense model to obtain a target network group, and to copy the target network group to obtain a plurality of target network groups; the model construction module is configured to construct a MoE model from the plurality of target network groups and the Dense model to obtain candidate MoE models; and the training module is configured to train the candidate MoE models on a preset sample data set until a converged target MoE model is obtained.
  9. A computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the MoE model training method of any one of claims 1 to 7.
  10. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the MoE model training method of any one of claims 1 to 7.
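As an illustrative sketch of claims 1 to 5 (all function names, tensor layouts, and the column/row split convention are invented here; the claims do not specify them), a Dense FFN's weight matrices can be evenly split into sub-networks and the resulting target network group duplicated to form the expert pool of the candidate MoE layer:

```python
import numpy as np

def split_ffn(W1, W2, granularity):
    """Evenly split a Dense FFN (W1: hidden x d_model, W2: d_model x hidden)
    into `granularity` sub-networks, each keeping 1/granularity of the
    hidden units (W1 split by rows, W2 by the matching columns)."""
    h = W1.shape[0]
    assert h % granularity == 0, "hidden size must divide evenly"
    step = h // granularity
    return [(W1[i * step:(i + 1) * step, :], W2[:, i * step:(i + 1) * step])
            for i in range(granularity)]

def build_expert_pool(W1, W2, granularity, n_copies):
    """Claims 2 and 4 combined: split the FFN into a target network group,
    then copy that group n_copies times to form the expert pool."""
    group = split_ffn(W1, W2, granularity)
    return [(w1.copy(), w2.copy())
            for _ in range(n_copies)
            for (w1, w2) in group]
```

Because a ReLU FFN is additive over hidden units, the sub-networks of one group sum back to the original FFN's output, which is one plausible reason the split preserves the Dense model's behavior at initialization.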

Description

MoE model training method, device, equipment and storage medium

Technical Field

The present application relates to the field of neural networks, and in particular to a method, apparatus, device, and storage medium for training a MoE model.

Background

With the popularization and commercialization of large-scale neural network models, Mixture of Experts (MoE) models have further improved model scale and accuracy, and MoE models are used more and more frequently in real scenarios. A MoE model is a sparse-activation architecture: the FFN layer of the model framework is composed of a plurality of parallel sub-networks, each called an expert, and a Router module selects the expert sub-networks that participate in computation. That is, in each forward or backward pass of the network, only a selected subset of parameters (specific experts) participates in computation, while unselected experts do not, so that efficient computation is achieved through division of labor.

At present, in pursuit of MoE model prediction quality, a large amount of sample data is required for model pre-training. However, as MoE models grow, the cost of pre-training rises: the training period increases exponentially, and the training cost increases accordingly. Therefore, how to train a MoE model accurately and at low cost is an urgent problem to be solved.

Disclosure of Invention

The main purpose of the present application is to provide a method, apparatus, device, and storage medium for training a MoE model, aiming to improve the accuracy and efficiency of MoE model training and to reduce the cost of model training.
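The sparse-activation mechanism described in the background can be sketched as follows (a minimal illustration with invented names and shapes, not the application's implementation): the router scores every expert, but only the top-k selected experts are ever evaluated.

```python
import numpy as np

def moe_forward(x, experts, router_w, top_k=2):
    """Sparse MoE layer: route one token to its top-k experts only.

    x        : (d_model,) input token representation
    experts  : list of (W1, W2) weight pairs, one ReLU FFN per expert
    router_w : (n_experts, d_model) router weight matrix
    """
    logits = router_w @ x                  # one score per expert
    top = np.argsort(logits)[-top_k:]      # indices of the selected experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over selected experts only
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        W1, W2 = experts[idx]              # unselected experts never run
        out += w * (W2 @ np.maximum(W1 @ x, 0.0))
    return out
```

The loop body is the only place expert parameters are touched, which is the source of the compute savings: with top_k=2 out of, say, 64 experts, roughly 2/64 of the FFN parameters participate in each token's pass.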
In a first aspect, the present application provides a MoE model training method, comprising the steps of: obtaining a Dense model, wherein the Dense model is a trained and converged model; performing network splitting and network group construction on the feedforward network layer of the Dense model to obtain a target network group, and copying the target network group to obtain a plurality of target network groups; constructing a MoE model from the plurality of target network groups and the Dense model to obtain candidate MoE models; and training the candidate MoE models on a preset sample data set until a converged target MoE model is obtained.

In a second aspect, the present application further provides a MoE model training apparatus, comprising an acquisition module, a generation module, a model construction module, and a training module, wherein: the acquisition module is configured to acquire a Dense model, the Dense model being a trained and converged model; the generation module is configured to perform network splitting and network group construction on the feedforward network layer of the Dense model to obtain a target network group, and to copy the target network group to obtain a plurality of target network groups; the model construction module is configured to construct a MoE model from the plurality of target network groups and the Dense model to obtain candidate MoE models; and the training module is configured to train the candidate MoE models on a preset sample data set until a converged target MoE model is obtained.

In a third aspect, the present application also provides a computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the MoE model training method described above.
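The "train until a converged model is obtained" step recurs throughout the application without a definition of convergence. One plausible reading, with all names, the loss-delta tolerance, and the two-stage schedule of the claims invented here for illustration, is:

```python
def train_until_converged(model_step, dataset, tol=1e-4, max_epochs=100):
    """Train on a sample data set until the epoch-to-epoch loss change
    falls below `tol` (one plausible convergence criterion).
    `model_step(batch)` performs one update and returns its loss."""
    prev = float("inf")
    loss = prev
    for _ in range(max_epochs):
        loss = sum(model_step(b) for b in dataset) / len(dataset)
        if abs(prev - loss) < tol:
            break
        prev = loss
    return loss

def two_stage_training(model_step, first_set, second_set):
    """The two-stage schedule of claim 7: converge on the first sample
    data set, then continue on the second (different-domain) data set."""
    train_until_converged(model_step, first_set)
    return train_until_converged(model_step, second_set)
```

The point of the overall scheme is that the candidate MoE model inherits the converged Dense model's weights, so this loop starts from a good initialization rather than from scratch, which is where the claimed cost reduction comes from.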
In a fourth aspect, the present application also provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the MoE model training method described above.

The present application provides a method, apparatus, device, and storage medium for training a MoE model: a Dense model is obtained, wherein the Dense model is a trained and converged model; a target network group is obtained by performing network splitting and network group construction on the feedforward network layer of the Dense model; the target network group is copied to obtain a plurality of target network groups; a MoE model is then constructed from the plurality of target network groups and the Dense model to obtain a candidate MoE model; and the candidate MoE model is trained on a preset sample data set until a converged target MoE model is obtained. According to this method, the feedforward network layer of the trained and converged Dense model is subjected to network splitting and network group construction to obtain a plurality of target network groups for MoE