CN-121683921-B - MoE model layer self-adaptive training method, medium and device for maintaining general capability
Abstract
The application discloses a layer-adaptive training method, medium and device for MoE models that maintains general capability, belonging to the technical field of artificial intelligence. The method constructs a joint loss function that integrates a supervised fine-tuning loss based on trust-region constraints, a layer-adaptive load-balancing loss, and a GRPO reinforcement-learning loss. The supervised fine-tuning loss is monitored while the KL divergence constrains the output distribution of the current model against that of a reference model; the layer-adaptive load-balancing loss dynamically measures the load imbalance of each MoE layer and assigns each layer an adaptive penalty weight according to its imbalance, so that layers with higher imbalance receive a stronger constraint; the GRPO loss performs policy optimization using the relative advantages of candidate responses within a group. Finally, the model parameters are updated synchronously based on the joint loss function. The method effectively addresses the degradation of general capability, expert load imbalance, and fragmented reinforcement-learning objectives that arise when customizing MoE models for private-domain use.
Inventors
- FANG JIN
- YANG KAI
- DING XIAOLU
- LIU DONGDONG
Assignees
- 福建博思软件股份有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-02-04
Claims (10)
- 1. A layer-adaptive training method for a MoE model that maintains general capability, the method comprising: loading a pre-trained MoE model and its initial parameters θ_0, and fixing a copy of the model as a trust-region reference model π_ref; preparing a supervised training data set D_SFT and an input prompt set D_RL for reinforcement learning; in the same training batch, constructing a joint loss function L_total(θ) that integrates the following three optimization objectives, the calculation formula being L_total(θ) = L_SFT(θ) + λ_1·L_balance(θ) + λ_2·L_GRPO(θ), wherein λ_1 and λ_2 are adjustable loss-weight hyperparameters, L_SFT(θ) denotes the supervised fine-tuning loss based on the trust-region constraint, L_balance(θ) denotes the overall layer-adaptive load-balancing loss, L_GRPO(θ) denotes the GRPO reinforcement-learning loss based on group-relative advantage, and θ denotes the parameters of the MoE model to be optimized; L_SFT(θ) is constructed as follows: for sample data pairs (x, y) sampled from the supervised training data set D_SFT, calculating the standard cross-entropy loss L_CE(θ) = −E_{(x,y)}[log π_θ(y|x)], wherein E_{(x,y)} denotes the average over all samples in the current batch; calculating the KL divergence between the output probability distribution of the current model and the output probability distribution of the trust-region reference model, taking the KL divergence as a trust-region constraint term, and computing L_SFT(θ) from the standard cross-entropy loss and the trust-region constraint term as L_SFT(θ) = L_CE(θ) + β·E_x[D_KL(π_ref(·|x) ‖ π_θ(·|x))], wherein β is the trust-region constraint strength coefficient, D_KL(P‖Q) denotes the KL divergence of probability distribution P relative to probability distribution Q, π_ref(·|x) denotes the complete output probability distribution of the trust-region reference model under the input prompt x, and π_θ(·|x) denotes the complete output probability distribution of the MoE model to be optimized under the input prompt x; L_balance(θ) is constructed as follows: for the l-th MoE layer of the MoE model, the layer containing M_l expert networks, counting the actual load proportion of the i-th expert network within a training batch as f_{l,i} = (1/B)·Σ_{b=1}^{B} g_{l,i}^{(b)}, wherein B is the training batch size and g_{l,i}^{(b)} ∈ {0, 1} is a routing indicator variable indicating whether the b-th input sample is routed to the i-th expert network of the l-th layer; based on the actual load proportions f_{l,i}, calculating the load-imbalance loss L_balance^{(l)} of the layer; calculating the load-imbalance index δ_l of the layer and, according to δ_l, dynamically calculating the adaptive weight w_l of the layer from a base weight w_base, an amplification exponent α and a weight upper bound w_max, such that w_l increases monotonically with δ_l and is capped at w_max, thereby imposing a stronger load-balancing constraint on layers with higher imbalance; based on the per-layer load-imbalance losses and the corresponding adaptive weights, obtaining the total layer-adaptive load-balancing loss by weighted summation as L_balance(θ) = Σ_{l=1}^{L} w_l·L_balance^{(l)}, wherein L denotes the total number of MoE layers in the MoE model; L_GRPO(θ) is constructed as follows: for each input prompt x in the reinforcement-learning input prompt set D_RL, sampling from the policy π_θ_old before the most recent parameter update to generate K candidate responses {y_1, …, y_K} and obtaining the corresponding rewards {r_1, …, r_K}; calculating the average reward of the group as r̄ = (1/K)·Σ_{k=1}^{K} r_k; constructing the relative advantage A_k = r_k − r̄ of each candidate response, and standardizing or clipping the relative advantage to obtain Â_k; calculating the probability ratio of the current policy to the pre-update policy as ρ_k = π_θ(y_k|x) / π_θ_old(y_k|x); based on the processed relative advantage Â_k and the probability ratio ρ_k, calculating the GRPO reinforcement-learning loss in the clipped form of proximal policy optimization as L_GRPO(θ) = −E_x[(1/K)·Σ_{k=1}^{K} min(ρ_k·Â_k, clip(ρ_k, 1−ε, 1+ε)·Â_k)], wherein k denotes the candidate-response index and ε denotes the clipping-threshold hyperparameter; and, based on the joint loss function L_total(θ), synchronously updating all parameters of the MoE model using a gradient optimization algorithm (illustrative code sketches of these loss terms are provided after the claims).
- 2. The layer-adaptive training method for a MoE model maintaining general capability of claim 1, wherein the relative advantage of each candidate response is standardized or clipped to obtain Â_k according to Â_k = clip((r_k − r̄)/(σ_r + ε_0), −a_max, a_max); wherein clip(·, −a_max, a_max) denotes clipping the value to the range [−a_max, a_max], σ_r is the standard deviation of the rewards r_k of all response samples in the current training batch, ε_0 is a small constant that prevents division by zero, and a_max is a preset clipping upper-limit value.
- 3. The layer-adaptive training method for a MoE model maintaining general capability according to claim 1, wherein the trust-region constraint strength coefficient β is an adaptive variable that depends on the current input prompt x, denoted β(x), and is calculated as β(x) = β_base + σ·H(π_θ(·|x)); wherein β_base is a base strength value, σ is a scaling factor, and H(π_θ(·|x)) is the Shannon entropy of the output probability distribution of the MoE model to be optimized under the input prompt x; the Shannon entropy quantifies the uncertainty of the model's prediction, a higher entropy value representing greater uncertainty, so that β(x) is dynamically adjusted according to the uncertainty of the model's current prediction, thereby applying an adaptive trust-region constraint.
- 4. The layer-adaptive training method for a MoE model maintaining general capability of claim 1, further comprising the steps of: periodically sampling a batch of data, and calculating, by an integrated-gradients or Shapley-value method, the attribution score of the activation state of each layer's expert networks in the MoE model to be optimized with respect to the final reward value; identifying expert activation patterns whose attribution scores are consistently positive and above a first preset threshold and marking them as a positive beneficial pattern set P+, and identifying expert activation patterns whose attribution scores are consistently negative and below a second preset threshold and marking them as a negative harmful pattern set P−; calculating a causal guidance loss L_causal(θ) from the occurrence frequency, in the current training batch, of expert activation patterns belonging to the positive beneficial pattern set P+ and the occurrence frequency of expert activation patterns belonging to the negative harmful pattern set P−, using two positive coefficients that respectively control the encouragement strength of the positive patterns and the suppression strength of the negative patterns; updating the calculation formula of the joint loss function to L_total(θ) = L_SFT(θ) + λ_1·L_balance(θ) + λ_2·L_GRPO(θ) + λ_3·L_causal(θ), wherein λ_3 denotes the loss-weight hyperparameter corresponding to the causal guidance loss L_causal(θ); minimizing the joint loss function thereby simultaneously encourages the positive beneficial patterns and suppresses the negative harmful patterns.
- 5. The layer-adaptive training method for a MoE model maintaining general capability of claim 1, further comprising the steps of: for any two different MoE layers l and p in the MoE model to be optimized, with l ≠ p, calculating a cross-layer expert activation correlation coefficient matrix C^(l,p) based on the current training batch data, wherein C^(l,p)_{i,j} denotes the Pearson correlation coefficient, over the current batch, between the activation states of the i-th expert of layer l and the j-th expert of layer p, and quantifies the strength of the linear correlation between the usage patterns of the experts of the two layers; based on the cross-layer expert activation correlation coefficient matrix C^(l,p), constructing a cross-layer correlation penalty loss L_cross(θ) by accumulating the terms max(0, C^(l,p)_{i,j} − δ) over the expert pairs of the correlated layer pairs; wherein δ is a preset positive correlation threshold with a value range of [0, 0.5], the max(0, ·) function ensures that only strong positive correlations exceeding the threshold δ are penalized, and M_p denotes the total number of expert networks at layer p; combining the cross-layer correlation penalty loss L_cross(θ) with the layer-adaptive load-balancing loss L_balance(θ) to obtain an enhanced load-balancing loss L'_balance(θ) = L_balance(θ) + γ·L_cross(θ), wherein γ is a cross-layer penalty weight coefficient; and, when calculating the joint loss function, using the enhanced load-balancing loss L'_balance(θ) in place of the original L_balance(θ), so that the calculation formula of the joint loss function is updated to L_total(θ) = L_SFT(θ) + λ_1·L'_balance(θ) + λ_2·L_GRPO(θ).
- 6. The layer-adaptive training method for a MoE model maintaining general capability of claim 1, further comprising the steps of: for each candidate response y_k among the K candidate responses, dividing y_k into S consecutive semantic segments according to its semantic structure; changing the routing indicator variable to the granularity of semantic segments, denoted g_{l,i}^{(b,s)}, which indicates whether the s-th semantic segment of the b-th training sample is routed to the i-th expert network of the l-th layer; updating the formula of the actual load proportion to f_{l,i} = (1/(B·S))·Σ_{b=1}^{B} Σ_{s=1}^{S} g_{l,i}^{(b,s)}, wherein S is the number of semantic segments into which each sample is divided; and, based on the updated f_{l,i}, calculating the per-layer load-imbalance loss L_balance^{(l)} and the overall layer-adaptive load-balancing loss L_balance(θ).
- 7. The layer-adaptive training method for a MoE model maintaining general capability of claim 6, further comprising the steps of: for each candidate response y_k, calculating the probability ratio and the relative advantage with semantic segments as the basic unit, specifically comprising: calculating, for each semantic segment y_{k,s}, a segment-level probability ratio ρ_{k,s} = π_θ(y_{k,s} | x, y_{k,<s}) / π_θ_old(y_{k,s} | x, y_{k,<s}) and a segment-level relative advantage Â_{k,s} computed from the segment reward r_{k,s} and the reward values r_{k,m} of the other segments; wherein y_{k,<s} denotes all semantic segments of y_k preceding y_{k,s}, r_{k,s} is the reward assessed for the semantic segment y_{k,s}, and r_{k,m} denotes the reward value of the m-th semantic segment; aggregating the segment-level probability ratios ρ_{k,s} into a response-level probability ratio ρ_k and the segment-level relative advantages Â_{k,s} into a response-level relative advantage Â_k by a preset aggregation function Φ; and, based on the aggregated response-level probability ratio ρ_k and response-level relative advantage Â_k, calculating the GRPO reinforcement-learning loss L_GRPO(θ) in the clipped form of proximal policy optimization.
- 8. The layer-adaptive training method for a MoE model maintaining general capability of claim 1, further comprising the steps of: annotating each sample data pair (x, y) in the supervised training data set D_SFT with its associated task type t, wherein t belongs to a predefined task-type set T, and the task-type set T comprises general knowledge tasks and private-domain expertise tasks; during training, performing task-aware adjustment of the layer-adaptive weight w_l according to the task type t of the input prompt x, specifically comprising: defining a task-aware layer-adaptive weight w_l(t) computed from the layer-adaptive weight w_l, a task-layer importance mapping function ψ(t, l) and a positive scalar exponent κ, wherein ψ(t, l) adjusts the importance of the load-balancing loss of the l-th layer according to the task type; the function ψ(t, l) is configured to execute the following policy: for a general knowledge task, a first importance value is assigned to the shallow MoE layers and a second importance value is assigned to the deep MoE layers, the first importance value being larger than the second importance value; for a private-domain expertise task, a third importance value is assigned to the deep MoE layers and a fourth importance value is assigned to the shallow MoE layers, the third importance value being larger than the fourth importance value; and, when calculating the overall layer-adaptive load-balancing loss L_balance(θ), using, for each sample in the batch, the corresponding task-aware weight w_l(t) according to its task type t.
- 9. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the MoE model layer-adaptive training method for maintaining general capability according to any one of claims 1 to 8.
- 10. An electronic device comprising a processor and a storage medium having a computer program stored thereon, wherein the computer program, when executed by the processor, implements the MoE model layer-adaptive training method for maintaining general capability according to any one of claims 1 to 8.
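Illustrative code sketches
The sketches below are non-limiting editorial illustrations of how the loss terms recited in the claims could be realized; they are not the patent's reference implementation, and all function names, tensor shapes and default hyperparameter values are assumptions. The first sketch shows, in PyTorch, the trust-region-constrained supervised fine-tuning loss and the joint objective of claim 1, assuming token-level logits from the trainable model and the frozen reference model and the KL direction D_KL(π_ref ‖ π_θ).
```python
import torch
import torch.nn.functional as F

def sft_trust_region_loss(logits, ref_logits, labels, beta, ignore_index=-100):
    """Cross-entropy plus a KL trust-region term D_KL(pi_ref || pi_theta).

    logits, ref_logits: (batch, seq_len, vocab) outputs of the trainable model
    and the frozen reference model; labels: (batch, seq_len) target token ids.
    """
    ce = F.cross_entropy(
        logits.flatten(0, 1), labels.flatten(), ignore_index=ignore_index
    )
    log_p_theta = F.log_softmax(logits, dim=-1)
    log_p_ref = F.log_softmax(ref_logits, dim=-1)
    # KL(P_ref || P_theta) per token: sum_v p_ref * (log p_ref - log p_theta)
    kl = (log_p_ref.exp() * (log_p_ref - log_p_theta)).sum(-1)
    mask = (labels != ignore_index).float()
    kl = (kl * mask).sum() / mask.sum().clamp(min=1.0)
    return ce + beta * kl

def joint_loss(l_sft, l_balance, l_grpo, lambda_1=0.01, lambda_2=1.0):
    """L_total = L_SFT + lambda_1 * L_balance + lambda_2 * L_GRPO (claim 1)."""
    return l_sft + lambda_1 * l_balance + lambda_2 * l_grpo
```
The KL term is taken over the full vocabulary distribution at every supervised token, matching the claim's reference to the complete output probability distribution under the prompt.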
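A minimal sketch of the layer-adaptive load-balancing loss of claim 1. The published text does not spell out the per-layer imbalance loss, the imbalance index δ_l or the exact form of w_l, so this sketch assumes a squared-load imbalance loss, the total-variation distance from the uniform load as δ_l, and w_l = min(w_base·(1 + δ_l)^α, w_max); only the monotonic, capped behaviour is taken from the claim.
```python
import torch

def layer_adaptive_balance_loss(routing_masks, w_base=1.0, alpha=2.0, w_max=10.0):
    """Layer-adaptive load-balancing loss L_balance (claim 1, assumed forms).

    routing_masks: list over MoE layers; element l is a (B, M_l) 0/1 tensor
    whose entry (b, i) indicates whether sample b was routed to expert i of layer l.
    """
    total = 0.0
    for mask in routing_masks:
        B, M = mask.shape
        f = mask.float().mean(dim=0)              # actual load ratio f_{l,i}
        # per-layer imbalance loss (assumed squared-load form, minimal when uniform)
        l_layer = M * (f ** 2).sum()
        # imbalance index: deviation of the load vector from the uniform distribution
        uniform = torch.full_like(f, 1.0 / M)
        delta = (f - uniform).abs().sum() / 2.0   # in [0, 1 - 1/M]
        # adaptive weight, monotonically increasing in delta and capped at w_max
        w = min(w_base * (1.0 + delta.item()) ** alpha, w_max)
        total = total + w * l_layer
    return total
```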
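A sketch of the GRPO loss of claim 1 with the advantage standardization and clipping of claim 2, for a single prompt's group of K candidate responses; the use of summed response log-probabilities as inputs and the default ε, ε_0 and a_max values are assumptions.
```python
import torch

def grpo_loss(logp_new, logp_old, rewards, eps_clip=0.2, eps_std=1e-6, a_max=5.0):
    """Group-relative policy optimization loss for one prompt.

    logp_new, logp_old: (K,) summed log-probabilities of the K candidate
    responses under the current and the pre-update policy; rewards: (K,).
    """
    mean_r = rewards.mean()
    adv = rewards - mean_r                                   # relative advantage A_k
    adv = adv / (rewards.std() + eps_std)                    # standardize (claim 2)
    adv = adv.clamp(-a_max, a_max)                           # clip to [-a_max, a_max]
    ratio = torch.exp(logp_new - logp_old.detach())          # pi_theta / pi_theta_old
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * adv
    return -torch.min(unclipped, clipped).mean()             # PPO-style clipped form
```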
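A sketch of the entropy-adaptive trust-region coefficient β(x) = β_base + σ·H(π_θ(·|x)) of claim 3; averaging the per-position Shannon entropy over the sequence, and the default β_base and σ values, are assumptions.
```python
import torch
import torch.nn.functional as F

def adaptive_beta(logits, beta_base=0.05, sigma=0.01):
    """beta(x) = beta_base + sigma * H(pi_theta(.|x))  (claim 3).

    logits: (seq_len, vocab) outputs of the model to be optimized for prompt x;
    the Shannon entropy is averaged over output positions (an assumption here).
    """
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()   # Shannon entropy H
    return beta_base + sigma * entropy.item()
```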
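A sketch of the sign convention behind the causal guidance loss of claim 4: minimizing the loss rewards the occurrence frequency of patterns in P+ and penalizes that of patterns in P−. The frequency counting shown here is not differentiable; in practice a differentiable surrogate (for example, accumulated gating probabilities of the marked patterns) would be needed, and the coefficient values are assumptions.
```python
def causal_guidance_loss(batch_patterns, positive_set, negative_set,
                         lam_pos=0.1, lam_neg=0.1):
    """Causal guidance loss of claim 4 (sign convention and coefficients assumed).

    batch_patterns: list of hashable expert-activation patterns observed in the
    current batch; positive_set / negative_set: patterns whose attribution scores
    (integrated gradients or Shapley values) marked them as P+ / P-.
    """
    n = max(len(batch_patterns), 1)
    freq_pos = sum(p in positive_set for p in batch_patterns) / n
    freq_neg = sum(p in negative_set for p in batch_patterns) / n
    # minimizing this encourages P+ patterns and suppresses P- patterns
    return -lam_pos * freq_pos + lam_neg * freq_neg
```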
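A sketch of the cross-layer expert-correlation penalty of claim 5 and the enhanced load-balancing loss L'_balance = L_balance + γ·L_cross; summing the thresholded Pearson coefficients over all expert pairs of every ordered layer pair, and the default δ and γ values, are assumptions.
```python
import torch

def cross_layer_penalty(acts, delta=0.3):
    """Cross-layer expert-correlation penalty L_cross (claim 5, assumed form).

    acts: list over MoE layers; element l is a (B, M_l) tensor of expert
    activation indicators (or gate values) for the current batch.
    """
    penalty = 0.0
    L = len(acts)
    for l in range(L):
        for p in range(L):
            if l == p:
                continue
            a = acts[l].float()
            b = acts[p].float()
            a = (a - a.mean(0)) / (a.std(0, unbiased=False) + 1e-6)
            b = (b - b.mean(0)) / (b.std(0, unbiased=False) + 1e-6)
            corr = a.t() @ b / a.shape[0]          # (M_l, M_p) Pearson coefficients
            penalty = penalty + torch.relu(corr - delta).sum()
    return penalty

def enhanced_balance_loss(l_balance, acts, gamma=0.01, delta=0.3):
    """L'_balance = L_balance + gamma * L_cross (claim 5)."""
    return l_balance + gamma * cross_layer_penalty(acts, delta)
```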
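A sketch of the semantic-segment-level statistics of claims 6 and 7: the segment-level actual load proportion f_{l,i}, and an aggregation of segment-level probability ratios and advantages to the response level. The claims only call the aggregation function "preset", so the geometric mean of ratios and the arithmetic mean of advantages are assumptions.
```python
import torch

def segment_level_load_ratio(routing):
    """Segment-level actual load ratio of claim 6.

    routing: (B, S, M) 0/1 tensor; entry (b, s, i) indicates whether the s-th
    semantic segment of sample b was routed to expert i of this layer.
    """
    B, S, M = routing.shape
    return routing.float().sum(dim=(0, 1)) / (B * S)   # f_{l,i} over B*S segments

def aggregate_segment_grpo(ratios, advantages):
    """Aggregate segment-level quantities to response level (claim 7).

    ratios, advantages: (S,) per-segment probability ratios and relative
    advantages; the aggregation function Phi is assumed here to be the
    geometric mean for ratios and the arithmetic mean for advantages.
    """
    resp_ratio = torch.exp(torch.log(ratios.clamp(min=1e-8)).mean())
    resp_adv = advantages.mean()
    return resp_ratio, resp_adv
```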
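A sketch of the task-aware layer-adaptive weight of claim 8; the concrete importance values, the shallow/deep split at the middle layer and the exponent κ are assumptions, with only the ordering (shallow layers emphasized for general tasks, deep layers emphasized for private-domain tasks) taken from the claim.
```python
def task_aware_weight(w_l, layer_idx, num_layers, task_type, kappa=1.0):
    """Task-aware layer-adaptive weight w_l(t) of claim 8 (values assumed).

    General-knowledge samples emphasize shallow layers; private-domain samples
    emphasize deep layers.  The importance values 1.5 / 0.5 and the split at
    num_layers // 2 are illustrative assumptions.
    """
    is_shallow = layer_idx < num_layers // 2
    if task_type == "general":
        importance = 1.5 if is_shallow else 0.5
    else:  # private-domain expertise task
        importance = 1.5 if not is_shallow else 0.5
    return w_l * importance ** kappa
```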
Description
MoE model layer-adaptive training method, medium and device for maintaining general capability
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a layer-adaptive training method, medium and device for MoE models that maintains general capability.
Background
In recent years, large-scale pre-trained language models (LLMs) have made significant progress on natural language processing tasks. As model parameter scales continue to grow, the training and inference cost of traditional dense architectures rises sharply, which has driven the industry to widely adopt the mixture-of-experts (MoE) architecture. A MoE model introduces multiple expert sub-networks in some layers and uses a gating network to dynamically select and activate a subset of experts according to the input features, so that model capacity is significantly increased at a controllable computational cost.
When customizing private-domain capability for large MoE models, the mainstream scheme typically employs a multi-stage pipeline: supervised fine-tuning (SFT) is first performed on top of a general-purpose model, then reinforcement learning (e.g., PPO or GRPO-like methods) is performed in the private-domain scenario, while a load-balancing auxiliary loss is superimposed. However, the prior-art solutions have the following drawbacks.
First, general capability is susceptible to degradation. During private-domain fine-tuning, the model parameters tend to drift substantially, degrading performance on the original general tasks. Traditional KL-constraint or parameter-regularization methods are not co-designed with the MoE gating and expert structure, so it is difficult to strike an effective balance between maintaining general capability and improving private-domain capability.
Second, expert load imbalance is a prominent issue. During training, MoE models readily suffer from overloaded "hot" experts and idle "cold" experts. Existing load-balancing losses usually act on all MoE layers in a fixed form with uniform weight and cannot be regulated dynamically and adaptively according to the imbalance of each layer, leading to insufficient resource utilization and affecting model efficiency and performance.
Furthermore, the training objectives are fragmented. Supervised fine-tuning, reinforcement learning and load-balancing optimization are usually carried out as independent stages or objectives, their loss functions compete with one another, and a unified collaborative optimization framework is lacking. The training process is complex, hyperparameter tuning is difficult, and it is hard to jointly optimize the model's general capability, private-domain performance and architectural efficiency.
Therefore, there is a need for an integrated training method that can collaboratively optimize the above objectives, improving private-domain capability while effectively maintaining general capability and optimizing the efficiency of the MoE architecture.
Disclosure of Invention
In view of the above problems, the application provides a technical scheme for layer-adaptive training of a MoE model that maintains general capability, which is used to solve the problems of easy degradation of general capability, unbalanced expert load and fragmented reinforcement-learning objectives in the process of customizing private-domain capability. To achieve the above object, in a first aspect, the application provides a MoE model layer-adaptive training method for maintaining general capability, the method comprising: loading a pre-trained MoE model and its initial parameters θ_0, and fixing a copy of the model as a trust-region reference model π_ref; preparing a supervised training data set D_SFT and an input prompt set D_RL for reinforcement learning; in the same training batch, constructing a joint loss function L_total(θ) that integrates the following three optimization objectives, the calculation formula being L_total(θ) = L_SFT(θ) + λ_1·L_balance(θ) + λ_2·L_GRPO(θ); wherein λ_1 and λ_2 are adjustable loss-weight hyperparameters, L_SFT(θ) denotes the supervised fine-tuning loss based on the trust-region constraint, L_balance(θ) denotes the overall layer-adaptive load-balancing loss, L_GRPO(θ) denotes the GRPO reinforcement-learning loss based on group-relative advantage, and θ denotes the parameters of the MoE model to be optimized; L_SFT(θ) is constructed as follows: for sample data pairs (x, y) sampled from the supervised training data set D_SFT, calculating the standard cross-entropy loss L_CE(θ) = −E_{(x,y)}[log π_θ(y|x)], where E_{(x,y)} denotes taking the average over all samples in the current batch; calculating the KL divergence between the current model output probability distribution and the output probability distribution of the trust-region reference model, taking the KL divergence as a