
CN-122021951-A - Adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference

CN122021951A

Abstract

The invention discloses an adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference. The gating score of each expert is obtained, and the sum of the top K gating scores after sorting is taken as the token's confidence. If the confidence is above a threshold, the token is directly assigned to the K highest-scoring experts; otherwise, a candidate expert set comprising the experts ranked highest by gating score is constructed, the candidate experts are ranked by jointly considering the gating weight, the network communication bandwidth resources, and the GPU computing power resources, and the top K are selected to complete the token's assignment. By fusing adaptive branching with multi-dimensional ranking over communication and computation, the method alleviates load imbalance and communication congestion, reduces inference latency, improves throughput, and supports an accuracy-performance trade-off.
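As a rough sketch (not the patent's implementation), the two-branch decision described in the abstract might look as follows; `gate_scores`, `bandwidth`, `load`, and the threshold values are all illustrative assumptions:

```python
import numpy as np

def adaptive_route(gate_scores, k, threshold, m, bandwidth, load):
    """Illustrative two-branch router. gate_scores: per-expert gating scores
    for one token; bandwidth[e]: assumed available bandwidth on the path to
    expert e's node; load[e]: tokens already assigned to that node."""
    order = np.argsort(gate_scores)[::-1]             # experts by descending score
    confidence = float(gate_scores[order[:k]].sum())  # confidence = top-K sum
    if confidence > threshold:
        return [int(e) for e in order[:k]]            # high confidence: plain Top-K
    candidates = order[:m]                            # top-m candidate set
    # rank by bandwidth descending, tie-break by assigned-token count ascending
    ranked = sorted(candidates, key=lambda e: (-bandwidth[e], load[e]))
    return [int(e) for e in ranked[:k]]
```

With a high threshold the fallback branch can pick an expert whose gating score is lower but whose node is better connected and less loaded, which is the trade-off the abstract describes.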

Inventors

  • SHEN GANGXIANG
  • LIN JINXIANG

Assignees

  • Soochow University (苏州大学)

Dates

Publication Date
2026-05-12
Application Date
2026-04-14

Claims (10)

  1. An adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference, characterized by comprising the following steps: obtaining the gating scores of all experts computed for an input token by a gating network, and determining the token's confidence from the gating scores, wherein the confidence measures the certainty of the match between the token and the experts; and performing routing and scheduling based on the token's confidence, wherein the routing and scheduling comprises: if the confidence is higher than a preset threshold, assigning the token to the K experts with the highest gating scores, wherein K is the preset number of experts activated per token; if the confidence is not higher than the preset threshold, constructing a candidate expert set comprising the experts ranked highest by gating score, obtaining the communication resource state and the computing power resource state of the computing nodes hosting the experts in the candidate expert set, ranking the candidate experts based on the communication resource state and the computing power resource state, selecting the top K experts from the ranking result, and dispatching the token to the computing nodes corresponding to those top K experts.
  2. The adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference of claim 1, wherein the confidence of a token is determined by sorting the token's gating scores over all experts in descending order, computing the cumulative sum of the top K gating scores to obtain the token's Top-K quality score M_k, and using the Top-K quality score M_k as the confidence measuring the certainty of the match between the token and the experts.
  3. The adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference, characterized in that the candidate expert set is constructed by selecting, after the gating scores are sorted in descending order, the first m experts from the expert list, wherein m is the preset candidate expert set size, m > K, and m does not exceed the total number of experts, and the candidate expert set records each candidate expert's gating score, GPU node identifier, and number of assigned tokens.
  4. The adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference of claim 1, wherein the communication resource state comprises the physical network topology information of the intelligent computing center and the available communication bandwidth between the GPU node where the token resides and the GPU node where a candidate expert resides; the communication resource state is obtained by establishing a mapping from expert identifiers to GPU node identifiers according to the deployment relationship between each expert and its GPU node, establishing a physical network topology mapping according to the server identifiers, rack identifiers, and interconnection link types of the GPU nodes, determining the communication path between the token's GPU node and the candidate expert's GPU node in the physical network topology mapping according to their node identifiers, and determining the path's available bandwidth from the total bandwidth of each link on the path and its currently occupied bandwidth, wherein the available bandwidth is the minimum residual bandwidth over the links of the communication path.
  5. The adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference, characterized in that the computing power resource state is the number of tokens assigned to the GPU node where a candidate expert resides; the computing power resource state is obtained by maintaining an assigned-token count for each GPU node and incrementing that count by one whenever a token is assigned to the corresponding GPU node.
  6. The adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference, characterized in that, based on the communication resource state and the computing power resource state, the experts in the candidate expert set are ranked as follows: the candidate experts are ordered by the available communication bandwidth between the node where the token resides and the node where each candidate expert resides, from largest to smallest, a larger available bandwidth indicating a stronger data transmission capacity of the current communication path and a higher ranking; candidate experts with the same available communication bandwidth are ordered by the number of tokens already assigned to their GPU node, from smallest to largest, fewer assigned tokens ranking higher; after the ranking is completed, the top K experts are selected as target experts and the token is dispatched to the computing nodes corresponding to those top K experts.
  7. The adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference, characterized in that the preset threshold is a high-quality threshold S with 0 < S < 1, the candidate expert set size m satisfies m > K and does not exceed the total number of experts, the criterion separating high-confidence tokens from low-confidence tokens is tuned by adjusting the high-quality threshold S, and the expert selection range for low-confidence tokens is tuned by adjusting the candidate expert set size m.
  8. The adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference of claim 1, wherein, after a token is dispatched to its corresponding computing node, the assigned-token count of that computing node is updated synchronously.
  9. An electronic device, characterized by comprising a processor, a memory, and a bus system, wherein the processor and the memory are connected through the bus system, the memory is configured to store instructions, and the processor is configured to execute the instructions stored in the memory so as to implement the adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference according to any one of claims 1 to 8.
  10. A computer storage medium storing a computer software product comprising instructions for causing a computer device to perform the adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference according to any one of claims 1 to 8.
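Read together, claims 4-6 and 8 describe per-node bookkeeping: a bottleneck-bandwidth path metric, a bandwidth-then-load candidate ordering, and a synchronously updated assignment counter. A minimal sketch under assumed data structures (`path_available_bw`, `Scheduler`, `expert_node`, and `path_bw` are hypothetical names, not from the claims):

```python
from collections import defaultdict

def path_available_bw(links, total_bw, used_bw):
    # Claim 4: a path's available bandwidth is the minimum residual
    # bandwidth over its constituent links.
    return min(total_bw[l] - used_bw[l] for l in links)

class Scheduler:
    """Hypothetical bookkeeping for claims 4-6 and 8; names are illustrative."""

    def __init__(self, expert_node, path_bw):
        self.expert_node = expert_node      # expert id -> GPU node id (claim 4)
        self.path_bw = path_bw              # (src node, dst node) -> available bw
        self.assigned = defaultdict(int)    # node id -> assigned tokens (claim 5)

    def dispatch(self, src_node, candidates, k):
        # Claim 6: bandwidth descending, then assigned-token count ascending.
        ranked = sorted(
            candidates,
            key=lambda e: (-self.path_bw[(src_node, self.expert_node[e])],
                           self.assigned[self.expert_node[e]]),
        )
        chosen = ranked[:k]
        for e in chosen:                    # claim 8: synchronous counter update
            self.assigned[self.expert_node[e]] += 1
        return chosen
```

Because `dispatch` increments the counters it later sorts by, two equally well-connected nodes alternate in the ranking across successive tokens, which is the load-balancing behavior the claims aim for.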

Description

Adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference

Technical Field

The invention relates to the technical field of distributed inference for large artificial intelligence models, and in particular to an adaptive communication-computation-fusion Top-k routing and scheduling method for MoE inference.

Background

With the rapid development of large-scale pre-trained models and generative artificial intelligence, the parameter scale of neural network models has grown to the billions and even trillions. To continue expanding model capacity and expressive power under limited computing resources, the Mixture-of-Experts (MoE) architecture has emerged. Through a gating network, the MoE architecture dynamically selects a small number of experts to participate in the computation for each input token, realizing conditional computation and sparse activation and effectively controlling the computing resources required for a single inference pass while preserving the model's overall capacity. Sparse MoE models have become one of the core architectures for deploying large models at scale; representative works include the Sparsely-Gated MoE, GShard, and the Switch Transformer, which systematically proposed sparse structures combining gating with multiple experts and introduced mechanisms such as automatic sharding and expert parallelism, enabling MoE models to be trained and served on distributed clusters. Engineering frameworks such as DeepSpeed-MoE further enable the efficient deployment of a large number of experts across the multiple GPU nodes of an intelligent computing center by means of expert parallelism and data parallelism, where a GPU node is a compute server equipped with graphics processors that serves as the basic compute unit of the distributed cluster.
In engineering practice, the inference performance of an MoE model depends heavily on the design of the expert routing algorithm. In a distributed cluster, the expert modules are deployed on different GPU nodes according to an expert-parallel strategy, and the expert routing decision for each token directly determines the communication load between nodes, the balance of GPU computing-power allocation, and the final inference response speed. Existing routing algorithms generally adopt a Top-K gating strategy: for example, the classical fixed Top-K gated routing algorithm outputs a weight for each expert through the gating network and directly selects the K experts with the largest weights to execute the computation. Another improvement introduces a capacity factor, setting a static threshold on the maximum number of tokens each expert can receive, and dropping or reassigning surplus tokens when the candidate tokens exceed that threshold. However, these algorithms focus solely on model-level weight optimization and fail to account for the underlying hardware characteristics and real-time resource status of the intelligent computing center, which leads to a series of technical problems. Specifically, when making routing decisions, existing algorithms cannot sense information such as the real-time computational load of the GPU node hosting an expert or the length of its assigned-token queue. As a result, the GPU nodes hosting high-weight experts easily suffer severe task queuing, token dropping, and even overload crashes, while the GPU nodes hosting low-weight experts remain idle for long periods, causing serious waste of the intelligent computing center's GPU resources.
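The classical fixed Top-K gating baseline referenced above can be sketched as follows (softmax weights over expert logits, then the K largest are selected with no awareness of node load or bandwidth; purely illustrative):

```python
import numpy as np

def topk_gate(logits, k):
    # Numerically stabilized softmax over expert logits, then the K
    # highest-weight experts are chosen regardless of node state.
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return [int(e) for e in np.argsort(w)[::-1][:k]]
```

Because the selection depends only on the logits, every token with a similar gating profile lands on the same experts, which is the hot-spot behavior the following paragraph criticizes.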
Even when a capacity factor is introduced, its static threshold cannot adapt to dynamic changes in node load, so the load-imbalance problem is difficult to solve fundamentally. In addition, conventional algorithms generally ignore the communication latency and bandwidth effects caused by differences in the experts' physical network locations. Consequently, large numbers of tokens are routed over low-bandwidth links to expert nodes at long network distances, greatly increasing long-haul communication traffic, significantly raising the average response time and long-tail latency of MoE model inference, causing network congestion in the intelligent computing center, and interfering with the normal operation of other services in the cluster.

Disclosure of the Invention

The technical problem the invention therefore aims to solve is that, in MoE model inference scenarios in the distributed cluster environment of an intelligent computing center, existing expert routing algorithms cannot cooperatively sense the computing-power resource state of the underlying GPUs and the network communication bandwidth characteristics, leading to problems of unbalanced computing-power load, excessive inference latency, and network bandwidth