CN-122019149-A - Request batch scheduling method based on global optimization and two-dimensional load sensing
Abstract
The invention discloses a request batch scheduling method based on global optimization and two-dimensional load sensing. First, a two-dimensional request feature vector is constructed; next, a two-dimensional joint delay model is defined; then the batches are optimized globally across multiple P (Prefill) instances; finally, a scheduling-routing cooperative feedback mechanism is established. The method realizes global batch optimization across P instances for the first time, breaks through the traditional per-P greedy strategy, and minimizes the average latency at the system level.
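The scheduling-routing cooperative feedback mentioned above can be illustrated with a minimal sketch. The update rule below is an assumption for illustration, not the patent's actual mechanism: if the observed expert load is more skewed than predicted, the MoE weight λ is nudged up so the next round penalizes skew harder.

```python
def update_lambda(lam, predicted_load, actual_load, step=0.05):
    """Closed-loop feedback sketch (assumed rule, not from the patent):
    compare predicted vs. actual per-expert load skew and adjust the
    MoE weight lambda for the next scheduling round."""
    def skew(load):
        # Max-to-mean ratio as a simple skew measure; 1.0 means balanced.
        mean = sum(load) / len(load)
        return max(load) / mean if mean > 0 else 0.0

    if skew(actual_load) > skew(predicted_load):
        lam = min(1.0, lam + step)   # routing was worse than expected
    else:
        lam = max(0.0, lam - step)   # routing was as good or better
    return lam
```

For example, if the simulator predicted a balanced load `[1, 1, 1, 1]` but one expert absorbed everything (`[4, 0, 0, 0]`), λ rises from 0.5 to 0.55 for the next round.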
Inventors
- LI JING
- BAI YANDONG
- LI YANAN
Assignees
- Northwestern Polytechnical University (西北工业大学)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-12
Claims (7)
- 1. A request batch scheduling method based on global optimization and two-dimensional load sensing, characterized by comprising the following steps: Step 1, constructing a two-dimensional request feature vector: for each pending Prefill request, extract an Attention-dimension feature (the sequence length) and MoE-dimension features (a semantic-complexity score and a pre-estimated expert activation distribution over the total number of routed experts). Step 2, defining a two-dimensional joint delay model: for any batch of requests, the estimated execution delay on a P instance combines an Attention term and an MoE term, where e_{i,e} is the activation intensity of request i for expert e, p ≥ 2, and λ ∈ [0,1] is a tunable weight. Step 3, performing global batch optimization across multiple P instances: the set R of all pending Prefill requests is distributed to K P instances to form K batches B_1, ..., B_K, and the resulting optimization problem is solved with a heuristic clustering algorithm to minimize the global average delay. Step 4, establishing a scheduling-routing cooperative feedback mechanism: after actual execution, the actual load of each expert is collected and used to dynamically adjust the prediction model and the λ weight in the next round, forming a closed-loop optimization.
- 2. The request batch scheduling method based on global optimization and two-dimensional load sensing according to claim 1, wherein the pre-estimated expert activation distribution is obtained by predicting the Top-K expert IDs and their weights with a lightweight routing simulator.
- 3. The request batch scheduling method based on global optimization and two-dimensional load sensing according to claim 1, wherein the heuristic clustering algorithm is a modified K-means++ or an integer-programming approximation.
- 4. An electronic device comprising a processor and a memory, the memory being configured to store a computer program, and the processor being configured to execute the computer program stored in the memory to cause the electronic device to perform the method of any one of claims 1 to 3.
- 5. A computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the method according to any one of claims 1 to 3.
- 6. A chip comprising a processor for calling and running a computer program from a memory, causing a device on which the chip is mounted to perform the method of any one of claims 1 to 3.
- 7. A computer program product comprising a computer storage medium storing a computer program comprising instructions executable by at least one processor, the instructions when executed by the at least one processor implementing the method of any one of claims 1 to 3.
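The joint delay model of claim 1 can be sketched in code. The exact formula is not reproduced in this text, so the form below is an assumption consistent with the surrounding description: an Attention term that scales as O(L²) per request, an MoE term taken as an L^p norm over per-expert activation loads (p ≥ 2 penalizes skew), and λ ∈ [0,1] mixing the two.

```python
def batch_delay(batch, lam=0.5, p=2.0, num_experts=8):
    """Estimated execution delay of one Prefill batch (assumed form,
    not the patent's exact formula).

    batch: list of requests, each a dict with
      'length'  -- sequence length L_i (Attention dimension)
      'experts' -- {expert_id: activation intensity e_{i,e}} (MoE dimension)
    """
    # Attention term: Prefill compute scales as O(L^2) per request.
    attn = sum(r['length'] ** 2 for r in batch)

    # MoE term: aggregate per-expert load, then take an L^p norm;
    # p >= 2 penalizes skewed expert loads more than balanced ones.
    load = [0.0] * num_experts
    for r in batch:
        for e, w in r['experts'].items():
            load[e] += w
    moe = sum(l ** p for l in load) ** (1.0 / p)

    # lambda in [0, 1] trades off the two dimensions.
    return (1.0 - lam) * attn + lam * moe
```

With this form, two batches of identical token counts differ in delay only through expert skew: a batch whose requests all hit the same expert scores higher than one whose activations are spread out, which is exactly the hot-spot effect the claims target.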
Description
Request batch scheduling method based on global optimization and two-dimensional load sensing

Technical Field

The invention belongs to the technical field of large models, and particularly relates to a request batch scheduling method based on global optimization and two-dimensional load sensing.

Background

In the inference deployment of large-scale Mixture-of-Experts (MoE) models, the industry commonly adopts an EP (Expert-Parallel) architecture, in which expert sub-networks are distributed across devices, combined with Prefill/Decode (P/D) separated scheduling to improve throughput. The operator-level performance bottlenecks are concentrated in two modules: the Attention module, whose O(L²) computation and KV-cache footprint dominate the Prefill stage; and the MoE expert module, whose dynamic routing activates only a subset of experts, so its computational cost is strongly correlated with the expert load distribution.

Current mainstream inference engines (such as vLLM and SGLang) employ the following scheduling policies in a multi-P multi-D scenario: 1. Per-P greedy batching: the scheduler constructs a batch for each Prefill instance (P) independently in turn, usually ordering requests by sequence length or arrival time, with no global request-allocation optimization across the P instances. 2. Single-dimension load balancing: batch assembly only minimizes the variance of Attention latency within a batch (e.g., balancing the total number of tokens), completely ignoring the expert load skew that the batched requests may cause after MoE routing. 3. Static routing decoupled from scheduling: MoE routing decisions are made inside the model, so the scheduler cannot predict which experts a given batch will activate, and heavily loaded experts are repeatedly and intensively invoked.

These strategies cause the following problems: 1. The overall average latency is suboptimal: locally optimal batches cannot guarantee that the total execution time across all P instances is minimal. 2. Expert hot spots occur frequently: when the requests in a batch are semantically similar (such as code generation), their Top-K experts overlap heavily, causing GPU memory-bandwidth bottlenecks or compute-unit contention. 3. Resource utilization fluctuates sharply: some P instances stall because their experts are overloaded while the remaining P instances sit idle, reducing system throughput. Therefore, a two-dimensional (Attention + MoE) joint batching approach with cross-P global optimization is needed for the multi-P multi-D architecture.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a request batch scheduling method based on global optimization and two-dimensional load sensing: first, a two-dimensional request feature vector is constructed; next, a two-dimensional joint delay model is defined; then global batch optimization is performed across multiple P instances; finally, a scheduling-routing cooperative feedback mechanism is established. The method realizes global batch optimization across P instances for the first time, breaks through the traditional per-P greedy strategy, and minimizes the average latency at the system level.
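The cross-P global batch optimization described above can be sketched with a simple heuristic. The sketch below is an assumed stand-in for the patent's heuristic clustering (it is not the claimed modified K-means++): requests are assigned longest-first, each to the instance whose estimated joint delay grows the least. The `delay_fn` parameter stands in for any joint delay model.

```python
def global_batch(requests, k, delay_fn):
    """Distribute Prefill requests across k P instances, greedily
    minimizing the estimated delay of the batch each request joins
    (an illustrative heuristic, not the patent's algorithm)."""
    batches = [[] for _ in range(k)]
    # Longest-first assignment is a standard load-balancing heuristic:
    # big requests are placed while there is still room to balance.
    for r in sorted(requests, key=lambda r: r['length'], reverse=True):
        best, best_cost = 0, float('inf')
        for i in range(k):
            cost = delay_fn(batches[i] + [r])
            if cost < best_cost:
                best, best_cost = i, cost
        batches[best].append(r)
    return batches
```

With a toy quadratic delay `delay_fn = lambda b: sum(r['length'] ** 2 for r in b)` and lengths [8, 7, 2, 1] over two instances, the long request is isolated on one instance while the three shorter ones share the other, instead of two per instance regardless of cost as a round-robin per-P scheduler would produce.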
The technical scheme adopted to solve the technical problem comprises the following steps. Step 1, constructing a two-dimensional request feature vector: for each pending Prefill request, extract an Attention-dimension feature (the sequence length) and MoE-dimension features (a semantic-complexity score and a pre-estimated expert activation distribution over the total number of routed experts). Step 2, defining a two-dimensional joint delay model: for any batch of requests, the estimated execution delay on a P instance combines an Attention term and an MoE term, where e_{i,e} is the activation intensity of request i for expert e, p ≥ 2, and λ ∈ [0,1] is a tunable weight. Step 3, performing global batch optimization across multiple P instances: the set R of all pending Prefill requests is distributed to K P instances to form K batches B_1, ..., B_K, and the resulting optimization problem is solved with a heuristic clustering algorithm to minimize the global average delay. Step 4, establishing a scheduling-routing cooperative feedback mechanism: after actual execution, the actual load of each expert is collected and used to dynamically adjust the prediction model and the λ weight in the next round, forming a closed-loop optimization. Preferably, the pre-estimated expert activation distribution is obtained by predicting the Top-K expert IDs and their weights with a lightweight routing simulator. Preferably, the heuristic clustering algorithm is a modified K-means++ or an integer-programming approximation. An electronic device comprises a pr