CN-122021730-A - Model inference method and computing device of mixture-of-experts model
Abstract
Embodiments of this specification provide a model inference method and computing device of a mixture-of-experts (MoE) model. The MoE model comprises a first MoE layer and a second MoE layer; the first MoE layer comprises a first joint router and a plurality of first expert networks, and the second MoE layer comprises a plurality of second expert networks. In an inference process of the MoE model on an inference request, a first joint routing result is generated by the first joint router based on an input vector of the first MoE layer, where the input vector of the first MoE layer includes a first feature vector corresponding to target data input into the MoE model, and the first joint routing result includes a first-layer expert set and a second-layer expert set corresponding to the target data. The first feature vector of the target data is processed based on the plurality of first expert networks indicated by the first-layer expert set, and target expert parameters are loaded into a memory, the target expert parameters including the parameters of the second expert networks indicated by the second-layer expert set that are not stored in the memory.
Inventors
- ZHAO SHANWEI
- ZHU SHIAI
Assignees
- 支付宝(杭州)数字服务技术有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-14
Claims (14)
- 1. A model inference method of a mixture-of-experts (MoE) model, the MoE model comprising a first MoE layer and a second MoE layer, the second MoE layer being connected after the first MoE layer, the first MoE layer comprising a first joint router and a plurality of first expert networks, the second MoE layer comprising a plurality of second expert networks, the method comprising: in an inference process of the MoE model on an inference request, generating a first joint routing result by the first joint router based on an input vector of the first MoE layer, wherein the input vector of the first MoE layer includes a first feature vector corresponding to target data input into the MoE model, the first joint routing result includes a first-layer expert set and a second-layer expert set corresponding to the target data, the first-layer expert set indicates a plurality of first expert networks in the first MoE layer for processing the target data, and the second-layer expert set indicates a plurality of second expert networks in the second MoE layer to be loaded for processing the target data; and processing the first feature vector of the target data based on the plurality of first expert networks indicated by the first-layer expert set, and loading target expert parameters into a memory, the target expert parameters including the expert parameters of the second expert networks indicated by the second-layer expert set that are not stored in the memory.
- 2. The method of claim 1, wherein the second MoE layer further comprises a second joint router, the method further comprising: in the inference process of the MoE model on the inference request, generating a second joint routing result by the second joint router based on an input vector of the second MoE layer, wherein the input vector of the second MoE layer includes a second feature vector corresponding to the target data input into the MoE model, the second joint routing result includes a target expert set corresponding to the target data, and the target expert set indicates a plurality of second expert networks in the second MoE layer for processing the target data; and in response to the plurality of second expert networks indicated by the target expert set being stored in the memory, processing the second feature vector of the target data based on the plurality of second expert networks.
- 3. The method of claim 1, wherein the first joint router is trained by: in an inference process of a teacher model on a training request, obtaining a second routing result of a second router of a second teacher MoE layer, wherein the teacher model is a model of a MoE architecture and comprises a first teacher MoE layer and the second teacher MoE layer, the second teacher MoE layer is connected after the first teacher MoE layer, the first teacher MoE layer comprises a first router and a plurality of first expert networks, the second teacher MoE layer comprises a second router and a plurality of second expert networks, the second routing result comprises a second teacher-layer expert set corresponding to training target data, and the second teacher-layer expert set indicates a plurality of second expert networks in the second teacher MoE layer for processing the training target data; constructing a training label corresponding to the first MoE layer according to the second routing result; in an inference process of the MoE model on the training request, obtaining an expert prediction result of a cross-layer prediction branch network in the first joint router of the first MoE layer, wherein the expert prediction result comprises a second-layer expert set corresponding to the training target data, and the second-layer expert set corresponding to the training target data indicates a plurality of second expert networks in the second MoE layer for processing the training target data; and adjusting parameters of the cross-layer prediction branch network in the first joint router based on a difference between the training label corresponding to the first MoE layer and the expert prediction result.
- 4. The method of claim 1, wherein the first joint routing result further includes a quantization precision corresponding to the expert parameters of each second expert network in the second-layer expert set, the quantization precision representing a number of bits required for each expert parameter, and the loading the target expert parameters into the memory comprises: loading the target expert parameters, after quantization processing with the quantization precision, into the memory.
- 5. The method of claim 4, wherein the second MoE layer further comprises a second joint router, the method further comprising: in the inference process of the MoE model on the inference request, generating a second joint routing result by the second joint router based on an input vector of the second MoE layer, wherein the input vector of the second MoE layer includes a second feature vector corresponding to the target data input into the MoE model, the second joint routing result includes a target expert set corresponding to the target data, and the target expert set indicates a plurality of second expert networks in the second MoE layer for processing the target data; and in response to the plurality of second expert networks indicated by the target expert set being stored in the memory, processing the second feature vector of the target data based on the plurality of second expert networks under the corresponding quantization precision.
- 6. The method of claim 4, wherein the first joint router is trained by: in an inference process of a teacher model on a training request, obtaining a second routing result of a second router of a second teacher MoE layer, wherein the teacher model is a model of a MoE architecture and comprises a first teacher MoE layer and the second teacher MoE layer, the second teacher MoE layer is connected after the first teacher MoE layer, the first teacher MoE layer comprises a first router and a plurality of first expert networks, the second teacher MoE layer comprises a second router and a plurality of second expert networks, the second routing result comprises a second teacher-layer expert set corresponding to training target data, the second teacher-layer expert set indicates a plurality of second expert networks in the second teacher MoE layer for processing the training target data, and the expert parameters of the expert networks in the teacher model adopt the number of bits before quantization processing; processing, by the plurality of second expert networks indicated by the second teacher-layer expert set corresponding to the training target data, the feature vector corresponding to the training target data to obtain a teacher output distribution; performing inference on the training request by using the MoE model to obtain a precision prediction result of a quantization precision branch in the first joint router of the first MoE layer, wherein the precision prediction result comprises a quantization precision of a second-layer expert set corresponding to the training target data, and the quantization precision of the second-layer expert set indicates the quantization precision of the expert parameters of a plurality of second expert networks in the second MoE layer for processing the training target data; processing the feature vector corresponding to the training target data by the plurality of second expert networks under the quantization precision of the second-layer expert set corresponding to the training target data to obtain a quantized output distribution; and adjusting parameters of the quantization precision branch network in the first joint router based on a difference between the teacher output distribution and the quantized output distribution.
- 7. The method of claim 6, wherein the adjusting parameters of the quantization precision branch network in the first joint router based on the difference between the teacher output distribution and the quantized output distribution comprises: calculating a classification loss function based on the difference between the teacher output distribution and the quantized output distribution; calculating a regularization penalty term according to the number of bits of the quantization precision in the precision prediction result, wherein the size of the regularization penalty term is positively correlated with the number of bits; and adjusting the parameters of the quantization precision branch network in the first joint router based on the classification loss function and the regularization penalty term.
- 8. The method of claim 4, wherein the first joint router is trained by: acquiring a quantization precision label of a second-layer expert set in the MoE model, wherein the second-layer expert set indicates a plurality of second expert networks in the second MoE layer for processing training target data; performing inference on the training request by using the MoE model to obtain a quantization prediction result of a quantization precision branch in the first joint router of the first MoE layer, wherein the quantization prediction result comprises a quantization precision of the second-layer expert set corresponding to the training target data, and the quantization precision of the second-layer expert set indicates the quantization precision of the expert parameters of the plurality of second expert networks in the second MoE layer for processing the training target data; and adjusting parameters of the quantization precision branch network in the first joint router based on a difference between the quantization precision label and the quantization prediction result.
- 9. The method of claim 8, wherein the acquiring the quantization precision label of the second-layer expert set in the MoE model comprises: in an inference process of a teacher model on the training request, acquiring a second routing result of a second router of a second teacher MoE layer and acquiring an expert output result of the second teacher MoE layer, wherein the teacher model is a model of a MoE architecture and comprises a first teacher MoE layer and the second teacher MoE layer, the second teacher MoE layer is connected after the first teacher MoE layer, the first teacher MoE layer comprises a first router and a plurality of first expert networks, the second teacher MoE layer comprises a second router and a plurality of second expert networks, the second routing result comprises a second teacher-layer expert set corresponding to the training target data, the second teacher-layer expert set indicates a plurality of second expert networks in the second teacher MoE layer for processing the training target data, the expert output result is an output result after the second teacher-layer expert set performs inference, and the expert parameters in the second teacher-layer expert set adopt the number of bits before quantization processing; loading, for a target expert network in the second-layer expert set, expert parameters after quantization processing with a candidate quantization precision, and keeping, for the expert networks other than the target expert network in the second-layer expert set, the expert parameters before quantization processing, wherein the target expert network is any one of the plurality of second expert networks indicated by the second teacher-layer expert set; performing inference based on the loaded second-layer expert set to obtain a quantized output result; determining, from a plurality of candidate quantization precisions, the lowest quantization precision which corresponds to the target expert network and meets a preset quality requirement, based on a difference in output quality between the quantized output result and the expert output result; and determining the quantization precision label of the second-layer expert set in the MoE model according to the lowest quantization precision.
- 10. The method of claim 9, wherein the acquiring the second routing result of the second router of the second teacher MoE layer and acquiring the expert output result of the second teacher MoE layer during the inference process of the teacher model on the training request further comprises: obtaining a predicted routing result of a router of a teacher MoE layer after the second teacher MoE layer; after the performing inference based on the loaded second-layer expert set to obtain the quantized output result, the method further comprises: obtaining a predicted quantized routing result of the router of the teacher MoE layer after the loaded second teacher MoE layer, and determining, from the plurality of candidate quantization precisions, the lowest quantization precision which corresponds to the target expert network and meets a preset routing requirement, based on a difference between the predicted routing result and the predicted quantized routing result; and the determining the quantization precision label of the second-layer expert set in the MoE model according to the lowest quantization precision comprises: determining the quantization precision label of the second-layer expert set in the MoE model according to the lowest quantization precision meeting the preset quality requirement and the lowest quantization precision meeting the preset routing requirement.
- 11. The method of claim 1, wherein the loading the target expert parameters into the memory comprises: in a case where the expert parameters of a second expert network indicated by the second-layer expert set are not stored in the memory, loading the expert parameters of the second expert network into the memory.
- 12. The method of claim 1, wherein the method further comprises: for the expert networks stored in the memory, acquiring a priority of each expert network, wherein the priority is positively correlated with a preloading hit probability of the expert network; and retaining, in the memory, the expert networks whose priority meets a preset requirement, and removing the expert networks whose priority does not meet the preset requirement.
- 13. The method of claim 1, wherein the MoE model comprises a plurality of transformer layers each including an attention layer and a MoE layer, and the input vector of the first MoE layer is the output of the attention layer of the transformer layer in which the first MoE layer is located.
- 14. A computing device, comprising a memory and a processor, the memory having executable code stored therein, and the processor, when executing the executable code, implementing the method of any one of claims 1 to 13.
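The quantization-label construction of claims 8 and 9 searches candidate bit widths and keeps the lowest one whose quantized output stays close enough to the full-precision (teacher) output. The sketch below is a minimal illustration of that idea, not the patent's actual procedure: `quantize`, the candidate set `(2, 4, 8)`, the relative-error quality metric, and the 5% budget are all assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(2)

def quantize(w, bits):
    """Uniform symmetric quantization of a weight tensor to `bits` bits (assumed scheme)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def lowest_adequate_precision(w, x, candidates=(2, 4, 8), max_rel_err=0.05):
    """Pick the lowest candidate bit width whose output meets the preset quality requirement."""
    ref = w @ x                                   # full-precision (teacher) expert output
    for bits in sorted(candidates):               # try from lowest precision upward
        out = quantize(w, bits) @ x               # quantized expert output
        rel_err = np.linalg.norm(out - ref) / np.linalg.norm(ref)
        if rel_err <= max_rel_err:
            return bits                           # lowest precision meeting the budget
    return max(candidates)                        # fall back to the widest candidate

w = rng.normal(size=(8, 8))                       # toy expert weights
x = rng.normal(size=8)                            # toy feature vector
bits = lowest_adequate_precision(w, x)
```

The selected `bits` then serves as the quantization precision label that supervises the quantization precision branch of the joint router.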
Description
Model inference method and computing device of mixture-of-experts model
Technical Field
The embodiments of this specification relate to the technical field of artificial intelligence, and in particular to a model inference method and computing device of a mixture-of-experts model.
Background
The mixture-of-experts (Mixture of Experts, MoE) model is an important architecture for large language models (Large Language Model, LLM). It enables efficient scaling of model size by replacing the feed-forward network (FFN) in a neural network layer with a collection of multiple expert networks. In the MoE architecture, each token activates only a small number of expert networks (typically the Top-K expert networks selected by the router), and the remaining expert networks stay dormant, thereby greatly increasing model capacity without significantly increasing the amount of computation. This "sparse activation" property allows MoE models to be scaled to billions of parameters while maintaining acceptable computational overhead. However, when deploying an inference engine for a large-scale MoE model in scenarios with limited computing resources, such as on-device, edge devices, or edge servers (e.g., a single card or a small GPU), it is generally impossible to keep all expert parameters of the model resident in high-speed executable memory at the same time, due to the available memory capacity, memory bandwidth, and storage hierarchy of the terminal device. For example, the total parameter size of the Mixtral-8×7B model is about 87 GB, of which the weight parameters of the expert networks are about 84 GB, accounting for over 96%, while the available memory or unified memory of a typical on-device or edge device is usually only 8 GB to 32 GB, making it difficult to keep the full set of expert networks resident.
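The Top-K sparse activation described above can be sketched as follows. This is a generic toy MoE forward pass, assuming one linear map per expert in place of a full FFN; the sizes and the softmax-over-selected-logits weighting are illustrative assumptions, not the patent's model.

```python
import numpy as np

rng = np.random.default_rng(1)
HIDDEN, N_EXPERTS, TOP_K = 8, 16, 2

# Toy experts: a single linear map stands in for each expert FFN (assumption).
experts = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(N_EXPERTS)]
W_router = rng.normal(size=(HIDDEN, N_EXPERTS))   # router gating weights

def moe_forward(x):
    logits = W_router.T @ x
    top = np.argsort(logits)[-TOP_K:]             # sparse activation: Top-K experts only
    weights = np.exp(logits[top])
    weights /= weights.sum()                      # softmax over the selected experts
    # Only TOP_K of the N_EXPERTS expert networks run; the rest stay dormant.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

y = moe_forward(rng.normal(size=HIDDEN))
```

Per token, only `TOP_K / N_EXPERTS` of the expert compute (and, under offloading, of the expert parameter traffic) is needed, which is exactly why predicting the activated set ahead of time pays off.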
Therefore, online inference of on-device MoE usually adopts an expert offloading mode of operation: expert parameters not used by the current inference step are stored in a slow storage tier (such as system memory or external storage) and loaded into executable memory on demand. Under expert offloading, expert loading latency often becomes the dominant factor in MoE inference performance. To reduce the impact of loading on inference latency, existing on-device MoE systems typically introduce an expert prefetching (Expert Prefetching) mechanism: based on historical routing statistics (i.e., routing results over a period of time), experts that may be activated in the future are loaded in advance, while the current inference has not yet completed, so as to hide part of the I/O (Input/Output) overhead by overlapping loading with computation. However, with schemes based on historical routing statistics, it is difficult to accurately predict the experts that will be activated in the future inference process. Therefore, a more accurate and efficient expert prefetching strategy is needed to further reduce the loading overhead and overall latency of on-device MoE inference.
Disclosure of Invention
The embodiments of this specification provide a model inference scheme for a mixture-of-experts model, which can effectively improve the inference speed of the MoE model by predicting the expert networks to be activated by future MoE layers and loading the parameters of those expert networks in advance.
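Under offloading, the resident set of expert parameters must be managed; claim 12 describes evicting experts by a priority positively correlated with their preload hit probability. The following is a minimal sketch of such a cache, assuming a scalar per-expert priority and evicting the lowest-priority expert first; the class name, string return values, and capacity unit are hypothetical.

```python
# Hypothetical expert parameter cache with on-demand loading and priority eviction.
class ExpertCache:
    def __init__(self, capacity):
        self.capacity = capacity      # number of expert networks that fit in fast memory
        self.resident = {}            # expert id -> priority (preload hit probability)

    def ensure_loaded(self, expert_id, priority):
        """Load an expert's parameters on demand, evicting low-priority residents."""
        if expert_id in self.resident:
            self.resident[expert_id] = max(self.resident[expert_id], priority)
            return "hit"              # parameters already in executable memory
        while len(self.resident) >= self.capacity:
            victim = min(self.resident, key=self.resident.get)  # lowest priority out
            del self.resident[victim]                           # back to slow storage
        self.resident[expert_id] = priority                     # simulate the I/O load
        return "miss"

cache = ExpertCache(capacity=2)
cache.ensure_loaded("E0", 0.9)
cache.ensure_loaded("E1", 0.1)
cache.ensure_loaded("E2", 0.5)        # cache full: evicts E1, the lowest priority
```

A cache hit means the prefetch prediction paid off and no loading latency is incurred; each miss stalls on the slow storage tier, which is the overhead the joint-routing scheme aims to hide.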
In a first aspect, an embodiment of the present disclosure provides a model inference method of a mixture-of-experts model, where the MoE model includes a first MoE layer and a second MoE layer, the second MoE layer is connected after the first MoE layer, the first MoE layer includes a first joint router and a plurality of first expert networks, and the second MoE layer includes a plurality of second expert networks. In an inference process of the MoE model on an inference request, a first joint routing result is generated by the first joint router based on an input vector of the first MoE layer, where the input vector of the first MoE layer includes a first feature vector corresponding to target data input into the MoE model, the first joint routing result includes a first-layer expert set and a second-layer expert set corresponding to the target data, the first-layer expert set indicates a plurality of first expert networks in the first MoE layer for processing the target data, and the second-layer expert set indicates a plurality of second expert networks in the second MoE layer to be loaded for processing the target data. The first feature vector of the target data is processed based on the plurality of first expert networks indicated by the first-layer expert set, and target expert parameters are loaded into a memory, the target expert parameters including the expert parameters of the second expert networks indicated by the second-layer expert set that are not stored in the memory. In some embodiments, the second MoE layer further comprises a second joint router, and the method further comprises generating a second joint routing result by the second joint router based on an input vector of the second MoE layer.
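The joint routing and prefetching flow of the first aspect can be sketched as follows. This is a toy illustration under stated assumptions: both the current-layer gate `W_gate` and the cross-layer prediction branch `W_pred` are modeled as plain linear maps with Top-K selection, and "loading" is simulated with a Python set; none of these names come from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, N_EXPERTS, TOP_K = 16, 8, 2

# Hypothetical joint router: one gate for the current (first) MoE layer plus a
# cross-layer prediction branch for the next (second) MoE layer.
W_gate = rng.normal(size=(HIDDEN, N_EXPERTS))    # first-layer gating weights
W_pred = rng.normal(size=(HIDDEN, N_EXPERTS))    # cross-layer prediction branch

def joint_route(x, top_k=TOP_K):
    """Return (first-layer expert set, predicted second-layer expert set)."""
    cur = np.argsort(W_gate.T @ x)[-top_k:]      # experts that process x now
    nxt = np.argsort(W_pred.T @ x)[-top_k:]      # experts predicted for the next layer
    return set(cur.tolist()), set(nxt.tolist())

def prefetch(next_experts, in_memory):
    """Load only the predicted experts whose parameters are not yet resident."""
    to_load = next_experts - in_memory           # target expert parameters
    in_memory |= to_load                         # overlaps with first-layer compute
    return to_load

x = rng.normal(size=HIDDEN)                      # first feature vector of the target data
layer1, layer2 = joint_route(x)
memory = {0, 1}                                  # experts already resident (assumed)
loaded = prefetch(layer2, memory)
```

While the first-layer experts in `layer1` process the token, the parameters in `loaded` are fetched, so the second MoE layer ideally finds its predicted experts already in memory.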