CN-122021804-A - Method for training a hybrid expert model, associated device and computer program product
Abstract
The disclosure provides a method, a related device, and a computer program product for training a hybrid expert model, and relates to artificial intelligence technologies such as deep learning, gating mechanisms, and token assignment. The method comprises: determining a basic affinity of each expert sub-model in the hybrid expert model for the tokens to be assigned; determining affinity adjustment parameters corresponding to the expert sub-models based on the current model capacity values of the expert sub-models; assigning the tokens to be assigned to the corresponding expert sub-models based on the basic affinities and the affinity adjustment parameters; and training each expert sub-model with the tokens assigned to it to obtain a trained hybrid expert model, so that target scene data can be processed through the trained model. In this way, the tokens used for training are assigned to the expert sub-models in the hybrid expert model more efficiently and reasonably, so that the hybrid expert model can be trained more efficiently and with higher quality.
Inventors
- Request for anonymity
Assignees
- 摩尔线程智能科技(北京)股份有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260122
Claims (17)
- 1. A method of training a hybrid expert model, comprising: determining a basic affinity of each expert sub-model in the hybrid expert model for the tokens to be assigned; determining affinity adjustment parameters corresponding to the expert sub-models based on current model capacity values of the expert sub-models; assigning the tokens to be assigned to the corresponding expert sub-models based on the basic affinity and the affinity adjustment parameters; and training each expert sub-model with the tokens assigned to it to obtain a trained hybrid expert model, so as to process target scene data through the trained hybrid expert model.
- 2. The method of claim 1, wherein determining affinity adjustment parameters corresponding to each of the expert sub-models based on current model capacity values of the expert sub-models comprises: for each expert sub-model, determining the basic affinity of the expert sub-model for the tokens to be assigned as the current model capacity value of the expert sub-model; adding the basic affinities to obtain a total basic affinity; calculating an average model capacity value based on the total basic affinity and the number of expert sub-models; for each expert sub-model, calculating the difference of the average model capacity value minus the current model capacity value to obtain a capacity difference; and determining the affinity adjustment parameter corresponding to the expert sub-model based on the capacity difference.
- 3. The method of claim 2, wherein determining the affinity adjustment parameter corresponding to the expert sub-model based on the capacity difference comprises: in response to the capacity difference being greater than or equal to 0, determining the affinity adjustment parameter corresponding to the expert sub-model based on the product of the capacity difference and a preset first coefficient, wherein the first coefficient is not 0.
- 4. The method of claim 2, wherein determining the affinity adjustment parameter corresponding to the expert sub-model based on the capacity difference comprises: in response to the capacity difference being less than 0, determining that the affinity adjustment parameter corresponding to the expert sub-model is 0.
- 5. The method according to claim 3, wherein determining the affinity adjustment parameter corresponding to the expert sub-model based on the product of the capacity difference and a preset first coefficient comprises: in response to the training of the hybrid expert model comprising at least two training rounds, determining the affinity adjustment parameter of the expert sub-model for the current training round based on the product of the capacity difference and the preset first coefficient, and on the historical affinity adjustment parameter of the expert sub-model from the previous training round, wherein the first coefficient is determined based on the dimensional difference between the capacity difference and the historical affinity adjustment parameter, and at least one historical training round exists before the current training round.
- 6. The method of claim 2, wherein determining the affinity adjustment parameter corresponding to the expert sub-model based on the capacity difference comprises: in response to the training of the hybrid expert model comprising at least two training rounds, determining a round affinity adjustment parameter corresponding to the expert sub-model in the current training round based on the capacity difference of the expert sub-model in the current training round and the historical token load value of the expert sub-model in the previous training round, wherein at least one historical training round exists before the current training round.
- 7. The method of claim 6, further comprising: adding the historical token load values of each expert sub-model in the previous training round to obtain a total historical token load value; and calculating an average historical token load value based on the total historical token load value and the number of expert sub-models; correspondingly, determining the round affinity adjustment parameter corresponding to the expert sub-model in the current training round based on the capacity difference of the expert sub-model in the current training round and the historical token load value of the previous training round comprises: in response to the historical token load value of the expert sub-model in the previous training round being greater than the average historical token load value, determining the round affinity adjustment parameter corresponding to the expert sub-model in the current training round based on the product of the capacity difference of the current training round and a preset second coefficient, wherein the second coefficient is not 0.
- 8. The method of claim 6, further comprising: adding the historical token load values of each expert sub-model in the previous training round to obtain a total historical token load value; and calculating an average historical token load value based on the total historical token load value and the number of expert sub-models; correspondingly, determining the round affinity adjustment parameter corresponding to the expert sub-model in the current training round based on the capacity difference of the expert sub-model in the current training round and the historical token load value of the previous training round comprises: in response to the historical token load value of the expert sub-model in the previous training round being less than or equal to the average historical token load value, determining the round affinity adjustment parameter corresponding to the expert sub-model in the current training round based on the product of the capacity difference of the current training round and a preset third coefficient, wherein the third coefficient is the opposite of the second coefficient.
- 9. The method of claim 2, wherein determining the affinity adjustment parameter corresponding to the expert sub-model based on the capacity difference comprises: in response to the training of the hybrid expert model comprising at least two training rounds, determining a round affinity adjustment parameter of the expert sub-model for the current training round based on the capacity difference of the expert sub-model in the current training round and the historical affinity adjustment parameter of the expert sub-model corresponding to the previous training round, wherein at least one historical training round exists before the current training round.
- 10. The method of claim 9, wherein determining the round affinity adjustment parameter for the expert sub-model for the current training round based on the capacity difference of the expert sub-model in the current training round and the historical affinity adjustment parameter of the previous training round comprises: calculating the product of the historical affinity adjustment parameter of the expert sub-model corresponding to the previous training round and a preset fourth coefficient to obtain a product result, wherein the fourth coefficient is not 0; and determining the round affinity adjustment parameter corresponding to the expert sub-model in the current training round based on the capacity difference of the expert sub-model in the current training round and the product result.
- 11. The method of claim 2, wherein determining the affinity adjustment parameter corresponding to the expert sub-model based on the capacity difference comprises: in response to the training of the hybrid expert model comprising at least two training rounds, determining the affinity adjustment parameter of the expert sub-model for the current training round based on the capacity difference of the expert sub-model in the current training round, the historical affinity adjustment parameter corresponding to the previous training round, and the historical token load value of the previous training round, wherein at least one historical training round exists before the current training round.
- 12. The method according to any one of claims 5-11, wherein assigning the tokens to be assigned to the corresponding expert sub-models based on the basic affinities and the affinity adjustment parameters comprises: assigning the tokens to be assigned to the corresponding expert sub-models based on the basic affinity and the round affinity adjustment parameters.
- 13. An apparatus for training a hybrid expert model, comprising: a basic affinity determining unit configured to determine a basic affinity of each expert sub-model in the hybrid expert model for the tokens to be assigned; an adjustment parameter determining unit configured to determine affinity adjustment parameters corresponding to the expert sub-models based on current model capacity values of the expert sub-models; a token assignment unit configured to assign the tokens to be assigned to the corresponding expert sub-models based on the basic affinity and the affinity adjustment parameters; and a model training unit configured to train each expert sub-model with the tokens assigned to it to obtain a trained hybrid expert model, so as to process target scene data through the trained hybrid expert model.
- 14. A hybrid expert model comprising a plurality of expert sub-models, wherein the plurality of expert sub-models are trained via the method of any of claims 1-12.
- 15. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of training a hybrid expert model of any of claims 1-12.
- 16. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of training a hybrid expert model of any of claims 1-12.
- 17. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of training a hybrid expert model according to any of claims 1-12.
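The load-aware assignment of claims 1-4 can be sketched as follows. This is an illustrative reading, not the patent's implementation: the function and variable names are assumptions, and the final top-1 routing step is one plausible way to "assign tokens based on" the adjusted affinities.

```python
import numpy as np

def assign_tokens(base_affinity, alpha=0.1):
    """Illustrative sketch of the assignment in claims 1-4 (names assumed).

    base_affinity: array of shape (num_tokens, num_experts) holding each
        expert sub-model's basic affinity for each token to be assigned.
    alpha: the preset "first coefficient" of claim 3; must be non-zero.
    """
    num_tokens, num_experts = base_affinity.shape
    # Claim 2: an expert's current model capacity value is its total
    # basic affinity over the tokens to be assigned.
    capacity = base_affinity.sum(axis=0)
    # Average model capacity value from the total affinity and the
    # number of expert sub-models.
    avg_capacity = capacity.sum() / num_experts
    # Capacity difference: average minus current, per expert.
    diff = avg_capacity - capacity
    # Claims 3-4: under-loaded experts (diff >= 0) get a boost of
    # alpha * diff; over-loaded experts (diff < 0) get no adjustment.
    adjustment = np.where(diff >= 0, alpha * diff, 0.0)
    # Route each token to the expert with the highest adjusted affinity.
    return np.argmax(base_affinity + adjustment, axis=1)
```

With a positive coefficient, tokens are nudged toward experts whose accumulated affinity falls below the average, which is consistent with the stated goal of assigning training tokens more evenly across the expert sub-models.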
Description
Method for training a hybrid expert model, associated device and computer program product

Technical Field

The present disclosure relates to the field of computer technology, in particular to artificial intelligence technologies such as deep learning, gating mechanisms, and token assignment, and more particularly to a method, an apparatus, an electronic device, a computer readable storage medium, and a computer program product for training a hybrid expert model.

Background

With the rapid development of artificial intelligence and deep learning technologies, the neural-network-based hybrid expert model (Mixture of Experts, abbreviated MOE) is becoming an important choice for processing complex tasks. Because an MOE is itself a model made up of multiple distributions, often used to model different categories, populations, or patterns in data, it can accomplish tasks through the separate expert sub-models (Expert Submodel) partitioned within it. An expert sub-model in an MOE can be regarded as a kind of "expert" responsible for processing specific patterns or regions in the data, so as to obtain a representation of the corresponding population or component in the data set. Owing to this processing mode, an MOE offers higher flexibility and accuracy across different tasks and data modes than a traditional model. It is therefore both interesting and urgent to train MOEs, and in particular the expert sub-models within them, more efficiently and with higher quality, for the purpose of making more effective use of MOEs.

Disclosure of Invention

Embodiments of the present disclosure provide a method, apparatus, electronic device, computer-readable storage medium, and computer program product for training a hybrid expert model.
According to a first aspect, an embodiment of the disclosure provides a method for training a hybrid expert model, comprising: determining a basic affinity of each expert sub-model in the hybrid expert model for the tokens to be assigned; determining affinity adjustment parameters corresponding to the expert sub-models based on current model capacity values of the expert sub-models; assigning the tokens to be assigned to the corresponding expert sub-models based on the basic affinities and the affinity adjustment parameters; and training each expert sub-model with the tokens assigned to it to obtain a trained hybrid expert model, so as to process target scene data through the trained hybrid expert model.

In a second aspect, an embodiment of the disclosure provides an apparatus for training a hybrid expert model, comprising: a basic affinity determining unit configured to determine a basic affinity of each expert sub-model in the hybrid expert model for the tokens to be assigned; an adjustment parameter determining unit configured to determine affinity adjustment parameters corresponding to the expert sub-models based on current model capacity values of the expert sub-models; a token assignment unit configured to assign the tokens to be assigned to the corresponding expert sub-models based on the basic affinity and the affinity adjustment parameters; and a model training unit configured to train each expert sub-model with the tokens assigned to it to obtain a trained hybrid expert model, so as to process target scene data through the trained hybrid expert model.

In a third aspect, embodiments of the present disclosure propose a hybrid expert model comprising a plurality of expert sub-models, the plurality of expert sub-models being trained via the method of training a hybrid expert model described in any of the implementations of the first aspect.
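For the multi-round variants described in the claims, where the adjustment for the current training round also depends on the previous round's adjustment, one plausible per-round update can be sketched as follows. All names are assumptions, and combining the two quantities by addition is an interpretive choice: the patent only says the round adjustment is determined "based on" the capacity difference and the product result.

```python
def round_adjustment(capacity_diff, prev_adjustment, fourth_coeff=0.9):
    """Hypothetical per-round update in the spirit of claims 9-10.

    capacity_diff: per-expert capacity difference in the current round
        (average model capacity value minus current model capacity value).
    prev_adjustment: the historical affinity adjustment parameters of
        the experts from the previous training round.
    fourth_coeff: the preset non-zero "fourth coefficient" of claim 10.
    """
    # Claim 10: multiply the historical adjustment by the fourth
    # coefficient to obtain a product result, then combine it with the
    # current capacity difference (addition is an assumption here).
    return [d + fourth_coeff * p
            for d, p in zip(capacity_diff, prev_adjustment)]
```

Read this way, the update behaves like a momentum term: an expert that was boosted in the previous round retains part of that boost, decayed by the fourth coefficient, on top of its current capacity difference.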
In a fourth aspect, an embodiment of the present disclosure provides an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of training a hybrid expert model described in any of the implementations of the first aspect.

In a fifth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions for enabling a computer, when the instructions are executed, to perform the method of training a hybrid expert model described in any of the implementations of the first aspect.

In a sixth aspect, embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, implements the method of training a hybrid expert model described in any of the implementations of the first aspect.

Embodiments of the disclosure provide a method, an apparatus, an electronic device, a computer readable storage medium, and a computer program product for training a hybrid expert model.