CN-121979636-A - Mixed request-oriented large language model self-adaptive scheduling algorithm
Abstract
The invention discloses a mixed-request-oriented large language model adaptive scheduling algorithm, comprising: S1, initializing queues Q1, Q2 and Q3 and emptying a set CB; S2, when a new request arrives, calculating its priority and adding it to Q1 or Q2 according to its type, then entering S3; when no new request arrives, entering S3 directly; S3, judging whether any request awaits scheduling: if so, entering S4, otherwise terminating the algorithm and restarting it when a new request arrives; S4, updating the priorities of the requests in queues Q1, Q2 and Q3; S5, traversing the updated Q1, Q2 and Q3 in turn and selecting requests to add to CB; S6, inputting CB into the large language model to execute one inference iteration, removing the completed requests from CB and recording their completion times; S7, entering S2 when CB is empty; when CB is not empty, letting k = k + 1, returning to S6 when k < Y - 1, and otherwise entering S8; S8, for the unfinished requests in CB, returning the requests that came from Q1 to Q1 and the requests that came from Q2 and Q3 to Q3, then returning to S2.
Inventors
- LIU LING
- ZHOU SHASHA
- ZHOU PAN
- Seoul hit machine
- XU XIAOQIONG
- CHEN XI
- WANG QIANG
Assignees
- Southwest Minzu University (西南民族大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-23
Claims (9)
- 1. A mixed-request-oriented large language model adaptive scheduling algorithm, comprising the steps of: S1, initializing a queue Q1 for deadline requests and queues Q2 and Q3 for best-effort requests in the Prefill stage and the Decode stage respectively, and emptying the current batch-processing request set CB, wherein the requests in queues Q1, Q2 and Q3 are automatically sorted in descending order of priority; S2, when a new request arrives, calculating the priority of the new request, adding the new request to queue Q1 or queue Q2 according to its type, and then entering step S3; S3, judging whether any request awaits scheduling: if so, entering step S4, otherwise terminating the algorithm and restarting it when a new request arrives; S4, updating the priority of each request in queue Q1 according to its deadline, and updating the priority of each request in queues Q2 and Q3 according to its pre-fill time and the number of tokens it has generated, to obtain the updated queues Q1, Q2 and Q3; S5, traversing the updated queues Q1, Q2 and Q3 in turn according to the remaining GPU cache and the request deadlines, and selecting requests to add to the set CB; S6, inputting the set CB into the large language model to execute one inference iteration, removing the completed requests from the set CB, and recording the completion times of the completed requests; S7, when the set CB is empty, entering step S2; when the set CB is not empty, letting the inference count k = k + 1, returning to step S6 when k < Y - 1, and otherwise entering step S8, wherein Y is the number of batch-processing operations; S8, for the unfinished requests in the set CB, putting the requests that came from queue Q1 back into queue Q1 and the requests that came from queues Q2 and Q3 back into queue Q3, and then returning to step S2. (Illustrative sketches of the claimed steps follow the claims.)
- 2. The mixed-request-oriented large language model adaptive scheduling algorithm of claim 1, wherein step S4 further comprises: S41, traversing the requests in queue Q1, updating the priority of a request by Formula 1 when the request is in the Prefill stage and by Formula 2 when the request is in the Decode stage, wherein Formula 1 and Formula 2 are expressed in terms of the priority of request i in queue Q1, the moment at which the priority is calculated, and the arrival time, the deadline and the pre-fill time of request i; S42, when the priorities of all requests in queue Q1 have been updated, traversing the requests in queue Q2 and updating the priority of each request as a function of its pre-fill time; S43, when the priorities of all requests in queue Q2 have been updated, traversing the requests in queue Q3 and updating the priority of each request as a function of the number of tokens it has generated; S44, when the priorities of all requests in queue Q3 have been updated, obtaining the updated queues Q1, Q2 and Q3.
- 3. The mixed-request-oriented large language model adaptive scheduling algorithm of claim 1, wherein, when a new request is received, the pre-fill time of the new request is calculated from the length of its complete sequence by a fitted expression whose quantities are: the pre-fill time of the new request; the fitting coefficients; the maximum number of requests that can be accommodated in a batch; and the length of the complete sequence of the new request.
- 4. The mixed-request-oriented large language model adaptive scheduling algorithm of claim 2, wherein the method for calculating the priority of the new request in step S2 is: when the new request is a deadline request, calculating its priority by Formula 1; when the new request is a best-effort request, calculating its priority by the corresponding best-effort priority formula.
- 5. The mixed-request-oriented large language model adaptive scheduling algorithm of claim 1, wherein step S5 further comprises: S51, traversing queue Q1 and, while remaining GPU cache exists, adding the requests nearest their deadlines to the set CB; and S52, after queue Q1 has been traversed, when remaining GPU cache still exists, traversing the requests in queues Q2 and Q3 in turn and adding them to the set CB.
- 6. The mixed-request-oriented large language model adaptive scheduling algorithm of claim 5, wherein step S51 further comprises: S511, initializing i = 1 and initializing the KV cache occupied by all requests in the set CB; S512, judging whether i is within the total number of elements in queue Q1: if so, proceeding to step S513, otherwise proceeding to step S52; S513, calculating the remaining completion time and the remaining available time of request i in queue Q1 from the number of request i, the average token length of a request, the number of tokens generated for request i, the pre-fill time, the deadline and the arrival time of request i, and the time of one iteration of a batch whose size is B and in which every request has the average token length; S514, judging the condition relating these times to the shortest time still required by request i, wherein Y is the number of batch-processing operations: if the condition holds, letting i = i + 1 and returning to step S512, otherwise proceeding to step S515; S515, when the set CB is empty, taking the KV cache occupied by the set CB as zero; when the set CB is not empty, calculating the total KV cache occupied by the requests in the set CB and the remaining GPU cache, wherein M is the number of requests in the set CB and the calculation uses the number of input tokens and the number of generated tokens of each request, the KV cache occupied by one token, and the KV cache size available on the GPU; S516, judging whether the KV cache occupied by request i fits within the remaining GPU cache: if so, proceeding to step S517, otherwise proceeding to step S52; S517, adding request i to the set CB, updating the occupied KV cache and letting i = i + 1, and then returning to step S512.
- 7. The mixed-request-oriented large language model adaptive scheduling algorithm of claim 5, wherein the method for traversing the requests in queues Q2 and Q3 in turn and adding them to the set CB comprises: S521, judging whether the request sequence number for queue Q2 is within the total number of elements in queue Q2: if so, proceeding to step S522, otherwise proceeding to step S525; S522, calculating the KV cache occupied by all requests in the set CB and the remaining GPU cache; S523, judging whether the KV cache needed to store the cache of the current request in queue Q2, determined from its number of tokens and the KV cache occupied by one token, fits within the remaining GPU cache: if so, proceeding to step S524, otherwise incrementing the sequence number and returning to step S521; S524, adding the current request in queue Q2 to the set CB, updating the number of requests in the set CB and the sequence number, and then returning to step S521; S525, judging whether the request sequence number for queue Q3 is within the total number of elements in queue Q3: if so, proceeding to step S526, otherwise outputting the set CB; S526, calculating the KV cache occupied by all requests in the set CB and the remaining GPU cache; S527, judging whether the KV cache occupied by the current request in queue Q3 fits within the remaining GPU cache: if so, proceeding to step S528, otherwise incrementing the sequence number and returning to step S525; S528, adding the current request in queue Q3 to the set CB, updating the sequence number, and then returning to step S525.
- 8. The mixed-request-oriented large language model adaptive scheduling algorithm of claim 6, wherein the expression for calculating the time of one iteration uses: the optimal linear-regression coefficients; the batch size; and the number of tokens generated by all requests.
- 9. The mixed-request-oriented large language model adaptive scheduling algorithm of claim 6 or 7, wherein the expression for calculating the KV cache occupied by one token uses: the number of layers of the large language model; the number of attention heads per layer of the large language model; the dimension of each attention head; and the memory size corresponding to the data type.
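The sketches below are illustrative only and are not part of the claims; they restate the claimed procedures in Python under stated assumptions. This first sketch outlines the overall loop of claim 1 (steps S1 through S8). All identifiers (Request, poll_new_requests, llm_step, update_priorities, select_batch) are hypothetical, the priority formulas are left to a callable because the claim does not reproduce them, and the stopping condition checked in S3 is paraphrased.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    kind: str                 # "deadline" or "best_effort"
    arrival: float
    deadline: float = float("inf")
    prefill_time: float = 0.0
    input_tokens: int = 0
    tokens_generated: int = 0
    priority: float = 0.0
    origin: str = "Q1"        # queue the request was first placed in

def schedule(poll_new_requests, llm_step, update_priorities, select_batch, Y: int):
    # S1: Q1 holds deadline requests; Q2/Q3 hold best-effort requests in the
    # Prefill / Decode stage respectively; CB is the current batch.
    Q1: List[Request] = []
    Q2: List[Request] = []
    Q3: List[Request] = []
    CB: List[Request] = []
    while True:
        # S2: route newly arrived requests by type
        for r in poll_new_requests():
            if r.kind == "deadline":
                r.origin = "Q1"; Q1.append(r)
            else:
                r.origin = "Q2"; Q2.append(r)
        # S3: stop when nothing is waiting (paraphrased; restart on the next arrival)
        if not (Q1 or Q2 or Q3):
            break
        # S4: refresh priorities (the claimed formulas are not reproduced here)
        update_priorities(Q1, Q2, Q3)
        for q in (Q1, Q2, Q3):
            q.sort(key=lambda r: r.priority, reverse=True)
        # S5: admit requests into CB under the remaining GPU-cache budget
        select_batch(Q1, Q2, Q3, CB)
        # S6/S7: run at most Y iterations on CB, dropping finished requests
        k = 0
        while CB:
            finished = llm_step(CB)               # S6: one inference iteration
            done = {id(r) for r in finished}      # completion times would be recorded here
            CB = [r for r in CB if id(r) not in done]
            if not CB:
                break                             # S7: batch drained, back to S2
            k += 1
            if k >= Y - 1:
                break                             # S7: iteration budget reached, go to S8
        # S8: return unfinished requests to Q1 or Q3 according to their origin
        for r in CB:
            (Q1 if r.origin == "Q1" else Q3).append(r)
        CB = []
```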
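A minimal stand-in for the priority update of claim 2 (S41 to S44). Formulas 1 and 2 are not reproduced in the source text, so the slack-based and length-based expressions below are placeholders that merely use the quantities the claim names: the current time, the arrival time, the deadline and the pre-fill time for queue Q1, the pre-fill time for queue Q2, and the generated-token count for queue Q3.

```python
import time

def update_priorities(Q1, Q2, Q3):
    # S41-S44 of claim 2; the expressions are placeholders, not the claimed formulas.
    now = time.monotonic()
    # S41: deadline requests. Formula 1 (Prefill stage) and Formula 2 (Decode stage)
    # use the current time, arrival time, deadline and pre-fill time of request i;
    # a simple slack value stands in for them here.
    for r in Q1:
        if r.tokens_generated == 0:          # still in the Prefill stage
            slack = (r.deadline - now) - r.prefill_time
        else:                                # already in the Decode stage
            slack = r.deadline - now
        r.priority = -slack                  # less slack -> higher priority (placeholder)
    # S42: best-effort Prefill queue, priority as a function of pre-fill time (placeholder)
    for r in Q2:
        r.priority = -r.prefill_time
    # S43: best-effort Decode queue, priority as a function of generated tokens (placeholder)
    for r in Q3:
        r.priority = -r.tokens_generated
    # S44: the caller re-sorts Q1, Q2 and Q3 in descending priority order
```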
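A hypothetical pre-fill-time estimator in the spirit of claim 3. The fitted expression is not reproduced in the text; a quadratic in the full sequence length with fitting coefficients a, b, c is assumed, and the way the maximum batch capacity b_max enters the formula is likewise an assumption.

```python
def estimate_prefill_time(seq_len: int, a: float, b: float, c: float, b_max: int) -> float:
    # Assumed fitted form: quadratic in the full sequence length, with the
    # maximum batch capacity b_max entering as a divisor (assumption only).
    # Coefficients a, b, c would be obtained by fitting measured pre-fill latencies.
    return (a * seq_len ** 2 + b * seq_len + c) / b_max
```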
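An illustrative admission procedure for the batch set CB covering claims 5 to 7: deadline requests are admitted first, then best-effort Prefill and Decode requests, each only while its KV cache fits within the remaining GPU budget. The signature is fuller than the placeholder used in the first sketch, and the deadline-feasibility test of S514 is passed in as a callable because its exact inequality is not reproduced in the text.

```python
def select_batch(Q1, Q2, Q3, CB, gpu_kv_budget: int, kv_per_token: int,
                 deadline_feasible=None):
    # S51/S52 with the per-request checks of claims 6 and 7 (simplified).
    def kv_needed(r):
        # KV cache a request holds: (input + generated tokens) * per-token cache
        return (r.input_tokens + r.tokens_generated) * kv_per_token

    used = sum(kv_needed(r) for r in CB)           # S511/S515: cache already held by CB
    # S51 / S513-S517: deadline queue first, highest priority (nearest deadline) first
    for r in list(Q1):
        if deadline_feasible is not None and not deadline_feasible(r):
            continue                               # S514: skip requests judged infeasible
        if used + kv_needed(r) <= gpu_kv_budget:   # S516: fits the remaining GPU cache
            Q1.remove(r)
            CB.append(r)
            used += kv_needed(r)
    # S52 / S521-S528: then best-effort Prefill (Q2) and Decode (Q3) requests
    for q in (Q2, Q3):
        for r in list(q):
            if used + kv_needed(r) <= gpu_kv_budget:   # S523/S527
                q.remove(r)
                CB.append(r)
                used += kv_needed(r)
    return CB
```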
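A sketch of the one-iteration-time model of claim 8, assuming a linear form in the batch size and the total number of generated tokens with three fitted coefficients; the exact regression expression is not reproduced in the text.

```python
def iteration_time(batch_size: int, generated_tokens: int,
                   w_b: float, w_t: float, bias: float) -> float:
    # w_b, w_t and bias stand in for the "optimal linear regression coefficients"
    # of claim 8; the linear form itself is an assumption.
    return w_b * batch_size + w_t * generated_tokens + bias
```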
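A sketch of the per-token KV-cache size of claim 9. The claim lists the model's layer count, heads per layer, head dimension and the byte size of the data type; the factor of 2 (one key and one value tensor per layer) is a common convention and is an assumption here, since the claim's expression is not reproduced.

```python
def kv_cache_per_token(num_layers: int, heads_per_layer: int,
                       head_dim: int, bytes_per_element: int) -> int:
    # Bytes of KV cache one token occupies; the factor 2 (K and V per layer)
    # is an assumption, as the claimed expression is not given in the text.
    return 2 * num_layers * heads_per_layer * head_dim * bytes_per_element
```

For example, under this assumption a 32-layer model with 32 attention heads of dimension 128 stored in a 2-byte data type occupies 2 * 32 * 32 * 128 * 2 = 524,288 bytes (512 KiB) of KV cache per token.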
Description
Mixed request-oriented large language model self-adaptive scheduling algorithm

Technical Field

The invention relates to the technical field of large model scheduling, and in particular to a large language model adaptive scheduling algorithm for mixed requests.

Background

The rapid development of large language models (Large Language Models, LLMs) such as GPT, LLaMA and DeepSeek has prompted their widespread adoption across many fields. Some LLM-driven online applications, such as code generation and robotic requests, have stringent requirements on inference completion time (also called Job Completion Time, JCT). These LLM inference requests typically impose a deadline constraint on JCT, and such requests are therefore referred to as "deadline requests". In contrast, offline applications such as information extraction, data manipulation and document summarization are not sensitive to inference completion time and instead focus on throughput or average JCT; these are classified as "best-effort" requests.

Whereas traditional machine learning inference needs only one forward pass to obtain a result, LLM inference is autoregressive: the overall inference process consists of multiple iterations, each of which generates one token, and with the exception of the first token, every subsequent token is inferred from all previously generated tokens. An LLM inference request therefore comprises two distinct phases. The first is the pre-fill (prefill) phase, which computes the first output token from the user's input prompt; the prefill stage is computation-intensive and demands high computing power. The second is the decode phase, which generates the subsequent output tokens autoregressively. During inference, each new token must be computed against all previously generated tokens, which would cause a large amount of redundant computation. To accelerate inference and reduce this computation, a KV cache mechanism is used: the previously computed K and V results are cached so that they can be read directly when a new token is generated, avoiding recomputation. As a result, the decode stage is memory-intensive and demands high memory capacity.

The autoregressive nature of LLM inference also means that the output token length of a request is unknown in advance. These characteristics pose significant challenges for LLM inference scheduling. For example, Shortest Job First (SJF), the typical scheduling algorithm for optimizing JCT, is not applicable to LLM inference scheduling because it requires prior knowledge of output lengths. Consequently, most LLM inference systems currently use First-Come-First-Served (FCFS) scheduling, which processes requests in order of arrival. FCFS, however, suffers from head-of-line blocking: a long request at the head of the queue can stall subsequent requests and thus harm overall inference completion time. Moreover, LLM serving systems typically serve multiple users at once, so the scheduling order of requests directly affects their completion times and, in turn, the users' Service Level Objectives (SLOs).
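As a toy illustration of the prefill/decode split and KV-cache reuse described in the background above, the following sketch generates tokens autoregressively; the model interface (model.forward returning the last-position logits and an updated KV cache) is hypothetical.

```python
def generate(model, prompt_tokens, max_new_tokens: int, eos_id: int):
    # Prefill: one compute-heavy pass over the whole prompt yields the first token.
    logits, kv_cache = model.forward(prompt_tokens, kv_cache=None)
    next_token = int(logits.argmax())
    output = [next_token]
    # Decode: one token per iteration; the cached K/V of earlier tokens are reused,
    # so only the new token is processed (memory-heavy rather than compute-heavy).
    while len(output) < max_new_tokens and next_token != eos_id:
        logits, kv_cache = model.forward([next_token], kv_cache=kv_cache)
        next_token = int(logits.argmax())
        output.append(next_token)
    return output
```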
In addition, because LLM inference is typically performed by batching multiple requests in parallel, the batch size can also have a significant impact on the SLO. However, there is currently no scheduling optimization algorithm that jointly considers JCT-deadline and best-effort requests in a mixed workload. Existing large-model scheduling algorithms mostly focus on TTFT (Time to First Token) and TPOT (Time Per Output Token) deadlines, without considering the end-to-end JCT deadline.

Disclosure of Invention

Aiming at the above defects in the prior art, the mixed-request-oriented large language model adaptive scheduling algorithm solves the problem that, in the prior art, a long request placed at the head of the queue can delay subsequent requests, thereby delaying the overall inference completion time of the requests and degrading the users' service level. In order to achieve the aim of the invention, the invention adopts the following technical scheme. A large language model adaptive scheduling algorithm for mixed requests is provided, which comprises the following steps: S1, initializing a queue Q1 for deadline requests and queues Q2 and Q3 for best-effort requests in the Prefill stage and the Decode stage respectively, and emptying the current batch-processing request set CB, wherein the requests in queues Q1, Q2 and Q3 are automatically sorted in descending order of priority; S2, when a new request arrives, calculating the priority of the new request, adding the new