CN-121996437-A - Large language model reasoning effective throughput optimization method, system, equipment and medium
Abstract
The invention discloses a large language model reasoning effective throughput (goodput) optimization method, system, equipment and medium. In the scheme, an inference request is segmented at the token level and the optimal segmentation point is determined dynamically. During this process, the segmented micro-requests can be queued and scheduled in simulation, without actually performing inference computation, so as to verify the delay constraint and the load balancing constraint, and the segmentation point position is adjusted and searched step by step.
Inventors
- Li Cheng
- Tian Dongqi
- Chen Yinhe
- Shi Yandong
- Bai Youhui
Assignees
- University of Science and Technology of China (中国科学技术大学)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-04-10
Claims (10)
- 1. A large language model reasoning goodput optimization method, comprising: receiving an inference request, performing decoding length estimation to determine the number of predicted generated tokens, and thereby constructing a logic request; segmenting the logic request to obtain an initial segmentation scheme; judging through simulated scheduling whether the delay constraint and the load balancing constraint are satisfied; if not, adjusting the segmentation point position with the initial segmentation scheme as the starting point, continuing to judge through simulated scheduling whether the delay constraint and the load balancing constraint are satisfied, and iterating until an optimal segmentation scheme is obtained, wherein the optimal segmentation scheme comprises the two micro-requests into which the logic request is segmented; sending the micro-requests one by one, in order, to an execution module; and the execution module performing inference on the received micro-requests by invoking the corresponding execution resources.
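The overall flow of claim 1 can be sketched as a small driver. This is an illustrative outline only, not the patented implementation: `estimate_decode_len`, `split` and `execute` are hypothetical callbacks standing in for decoding length estimation, the segmentation search and the execution module respectively.

```python
def optimize_request(prompt_tokens, estimate_decode_len, split, execute):
    """Illustrative end-to-end flow of claim 1. All three callbacks are
    assumptions of this sketch, not part of the original disclosure."""
    # 1. build the logic request: prompt tokens + predicted decode length
    logic_request = {"prompt": prompt_tokens,
                     "n_decode": estimate_decode_len(prompt_tokens)}
    # 2. search for a segmentation satisfying the delay and
    #    load-balancing constraints (via simulated scheduling)
    front, back = split(logic_request)
    # 3. dispatch the two micro-requests to the execution module in order
    return [execute(m) for m in (front, back)]
```

In a real system, step 2 would iterate with simulated scheduling as described in the later claims; here it is collapsed into a single callback to keep the sketch short.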
- 2. The large language model reasoning goodput optimization method of claim 1, wherein receiving the inference request comprises: processing the received inference request, in text form, into an initial token sequence consisting of a plurality of tokens, and adding it to the queue to be scheduled.
- 3. The large language model reasoning goodput optimization method of claim 2, wherein performing decoding length estimation to determine the number of predicted generated tokens, thereby constructing the logic request, comprises: estimating the decoding length according to one or any combination of the following information to determine the number of predicted generated tokens, wherein the information comprises the service type of the inference request, prompt content characteristics, historical statistical information and preset generation parameters, and the historical statistical information comprises the decoding length distribution corresponding to similar prompt lengths, statistics of the number of predicted generated tokens under different service types, and mapping relations between the content characteristics of different prompts and the number of predicted generated tokens; and combining the prompt tokens and their number with the predicted number of generated tokens to construct the logic request, wherein the prompt tokens are all the tokens in the initial token sequence.
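One simple way to realize the estimation of claim 3 is to keep per-service-type statistics and fall back to a prompt-length heuristic when no history exists. The class below is a minimal sketch under those assumptions; the names, the mean-based estimator and the fallback rule are all illustrative, not the patent's method.

```python
from collections import defaultdict

class DecodeLengthEstimator:
    """Hypothetical sketch of claim 3's decoding length estimation:
    historical statistics per service type, with a prompt-length
    fallback. Every detail here is an assumption of this sketch."""

    def __init__(self, default_len=128):
        self.default_len = default_len
        self.by_service = defaultdict(list)  # service type -> observed decode lengths

    def record(self, service_type, decode_len):
        self.by_service[service_type].append(decode_len)

    def estimate(self, service_type, prompt_tokens):
        history = self.by_service.get(service_type)
        if history:
            # historical statistics: mean decode length for this service type
            return round(sum(history) / len(history))
        # fallback heuristic based on prompt length (illustrative)
        return max(self.default_len, len(prompt_tokens) // 2)

def build_logic_request(prompt_tokens, estimator, service_type="chat"):
    """Combine the prompt tokens and their count with the predicted
    number of generated tokens to form a logic request."""
    return {"prompt": prompt_tokens,
            "n_prompt": len(prompt_tokens),
            "n_decode": estimator.estimate(service_type, prompt_tokens)}
```

A production estimator could instead use decode-length distributions bucketed by prompt length or content features, as the claim enumerates; the mean is used here only to keep the sketch self-contained.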
- 4. The large language model reasoning goodput optimization method of claim 1, wherein judging through simulated scheduling whether the delay constraint and the load balancing constraint are satisfied comprises: maintaining a historical scheduling result and a performance profile (portrait), wherein the historical scheduling result represents the queue state of each execution resource and the scheduled batches and their execution order, and the performance profile is a data structure representing the mapping relation between different batch characteristics and execution time, the batch characteristics at least comprising the batch size, the number of prompt tokens in the batch and the number of decoding tokens in the batch, the decoding tokens being generated after the corresponding batch is executed; the initial segmentation scheme comprises a set segmentation point position and the two micro-requests into which the logic request is segmented based on that position; performing simulated scheduling of the two micro-requests based on the historical scheduling result, the two micro-requests being executed by the corresponding execution resources, each execution resource executing batch by batch, and each micro-request requiring several batches to execute; estimating the execution time of each batch based on the performance profile to obtain an end-to-end delay estimation result and the estimated computation time of the two micro-requests, wherein the estimated computation time of a micro-request is the sum of the execution times of its corresponding batches, and the end-to-end delay estimation result is the sum of the batch waiting time and the batch execution time of the two micro-requests on the corresponding execution resources; and judging, based on the end-to-end delay estimation result and the estimated computation time of the two micro-requests, whether the delay constraint and the load balancing constraint are satisfied.
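The data structure of claim 4 can be pictured as a lookup table from batch characteristics to measured execution time, consulted by a simulator that never runs real inference. The sketch below assumes a particular feature key and a nearest-neighbour fallback; both are illustrative choices, not the disclosed design.

```python
class PerformanceProfile:
    """Maps batch characteristics to observed execution time in
    milliseconds, standing in for the performance profile
    ('performance portrait') of claim 4. The feature key and
    nearest-neighbour fallback are assumptions of this sketch."""

    def __init__(self):
        # (batch_size, prompt_tokens_in_batch, decode_tokens_in_batch) -> ms
        self._table = {}

    def record(self, batch_size, n_prompt, n_decode, ms):
        self._table[(batch_size, n_prompt, n_decode)] = ms

    def estimate(self, batch_size, n_prompt, n_decode):
        key = (batch_size, n_prompt, n_decode)
        if key in self._table:
            return self._table[key]
        # fall back to the closest recorded feature vector (L1 distance)
        nearest = min(self._table,
                      key=lambda k: sum(abs(a - b) for a, b in zip(k, key)))
        return self._table[nearest]


def simulate_schedule(micro_request_batches, profile, wait_ms=0):
    """Estimate, without real inference, the per-micro-request compute
    time (sum of its batches' execution times) and the end-to-end
    latency (waiting time plus execution time of all batches)."""
    compute = [sum(profile.estimate(*b) for b in batches)
               for batches in micro_request_batches]
    return wait_ms + sum(compute), compute
```

A real profile would likely interpolate over the feature space rather than snap to the nearest recorded point; the simple fallback keeps the example short.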
- 5. The large language model reasoning goodput optimization method of claim 4, wherein judging, based on the end-to-end delay estimation result and the estimated computation time of the two micro-requests, whether the delay constraint and the load balancing constraint are satisfied comprises: judging whether the delay constraint is satisfied, wherein the delay constraint is a service level objective, by judging whether the end-to-end delay estimation result does not exceed the service level objective, and if so, the delay constraint is satisfied; and when the delay constraint is satisfied, further judging whether the load balancing constraint is satisfied, namely whether the difference between the estimated computation times of the two micro-requests does not exceed a preset threshold, and if so, the load balancing constraint is satisfied.
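The two checks of claim 5 reduce to a short predicate. The parameter names below are illustrative; the ordering (delay first, then load balance) follows the claim.

```python
def meets_constraints(e2e_latency_ms, compute_times_ms, slo_ms, balance_ms):
    """Claim 5's two checks, as a sketch: (1) the estimated end-to-end
    latency must not exceed the service level objective; (2) the two
    micro-requests' estimated compute times must differ by at most a
    preset threshold. Parameter names are assumptions."""
    if e2e_latency_ms > slo_ms:          # delay constraint violated
        return False
    t_front, t_back = compute_times_ms
    return abs(t_front - t_back) <= balance_ms  # load-balancing constraint
```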
- 6. The large language model reasoning goodput optimization method of claim 4, wherein adjusting the segmentation point position with the initial segmentation scheme as the starting point, continuing to judge through simulated scheduling whether the delay constraint and the load balancing constraint are satisfied, and iterating until the optimal segmentation scheme is obtained comprises: adjusting the segmentation point position p according to the result of judging whether the delay constraint and the load balancing constraint are satisfied, wherein the segmentation point position is used to segment the logic request into two micro-requests, namely a front micro-request and a back micro-request; if the estimated computation time of the front micro-request is longer than that of the back micro-request, moving the segmentation point position forward, otherwise moving it backward, by a preset moving step k, i.e. updating the segmentation point position as p ← p − k or p ← p + k respectively, wherein ← denotes assignment; judging through simulated scheduling whether the delay constraint and the load balancing constraint are satisfied; if the maximum number of iterations is reached without obtaining an optimal segmentation scheme satisfying the delay constraint and the load balancing constraint, setting a delay time based on the historical scheduling results and continuing iteration after the delay time has elapsed; and if no segmentation scheme satisfying the delay constraint and the load balancing constraint is obtained after the specified number of delays, taking the initial segmentation scheme as the optimal segmentation scheme.
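The iterative search of claim 6 can be sketched as follows. `cost(lo, hi)` is a hypothetical callback returning the estimated compute time for tokens in [lo, hi); a real system would obtain these numbers from simulated scheduling against the performance profile. The delayed re-iteration on failure is omitted here, keeping only the fallback to the initial scheme.

```python
def find_split_point(n_tokens, cost, slo_ms, balance_ms, k=8, max_iters=20):
    """Sketch of claim 6's split-point search under the stated
    assumptions. Moves the split point p back by k when the front
    micro-request is costlier (p <- p - k), forward otherwise
    (p <- p + k)."""
    p = initial = n_tokens // 2            # initial scheme: split in the middle
    for _ in range(max_iters):
        t_front, t_back = cost(0, p), cost(p, n_tokens)
        # delay constraint (vs. SLO) and load-balancing constraint
        if t_front + t_back <= slo_ms and abs(t_front - t_back) <= balance_ms:
            return p
        p = p - k if t_front > t_back else p + k
        p = max(1, min(n_tokens - 1, p))   # keep the split inside the request
    return initial                         # fall back to the initial scheme
```

With a uniform per-token cost the middle split already balances, so the search returns immediately; with a skewed cost the point walks toward the costlier side until the imbalance drops under the threshold.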
- 7. The large language model reasoning goodput optimization method of claim 1 or 6, wherein sending the micro-requests one by one, in order, to the execution module comprises: the optimal segmentation scheme comprises the two micro-requests into which the logic request is segmented, namely the front micro-request and the back micro-request obtained by segmenting the logic request at the segmentation point corresponding to the optimal segmentation scheme; and sending the front micro-request to the execution module, and sending the back micro-request to the execution module after receiving the execution result of the front micro-request returned by the execution module.
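The ordering dependency of claim 7 is small but essential: the back micro-request continues from the front one's state, so it must not be dispatched early. A minimal sketch, where `execute` is an assumed stand-in for the execution module:

```python
def dispatch_in_order(front, back, execute):
    """Claim 7's ordering sketch: the back micro-request is sent only
    after the execution result of the front micro-request has been
    returned. `execute` and its `context` parameter are assumptions."""
    front_result = execute(front)               # front micro-request first
    return execute(back, context=front_result)  # then its successor
```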
- 8. A large language model reasoning goodput optimization system for implementing the method of any one of claims 1-7, comprising: a scheduling module, configured to receive the inference request, perform decoding length estimation to determine the number of predicted generated tokens and construct a logic request, segment the logic request to obtain an initial segmentation scheme, judge through simulated scheduling whether the delay constraint and the load balancing constraint are satisfied, and if not, adjust the segmentation point position with the initial segmentation scheme as the starting point, continue to judge through simulated scheduling whether the delay constraint and the load balancing constraint are satisfied, and iterate until an optimal segmentation scheme is obtained, wherein the optimal segmentation scheme comprises the two micro-requests into which the logic request is segmented; a distribution module, configured to send the micro-requests one by one, in order, to the execution module; and the execution module, configured to perform inference on the received micro-requests by invoking the corresponding execution resources.
- 9. A processing device, comprising one or more processors and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
- 10. A readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-7.
Description
Large language model reasoning effective throughput optimization method, system, equipment and medium
Technical Field
The invention relates to the technical field of large language model reasoning optimization, and in particular to a large language model reasoning effective throughput optimization method, system, equipment and medium.
Background
With the wide application of large language models in scenarios such as dialogue, search-based question answering, code generation and content creation, online reasoning with large language models has become an important basic capability. In practical deployment, a large language model is computationally intensive and depends heavily on GPU memory and computing power, so it usually has to be deployed on high-performance hardware such as GPUs, making hardware purchase, operation and maintenance costly. The inference service therefore needs to raise throughput as much as possible to improve resource utilization and reduce the cost per request, while also meeting the Service Level Objective (SLO). The concept of effective throughput (goodput) is thus introduced, namely the number of requests or tokens generated by the system per unit time while meeting the SLO requirement. How to optimize the effective throughput of inference services has become critical.
Large language model reasoning may include two phases: prefill (pre-population) and decoding. The prefill phase mainly performs large-scale matrix operations, is compute-intensive and consumes a large amount of computing power, while the decoding phase generates subsequent tokens step by step in an autoregressive manner, requiring frequent memory access and cache maintenance, which places demands on GPU memory access and bandwidth.
Because the two phases have different resource demands, they easily contend for resources and interfere with each other when sharing hardware, causing throughput fluctuation and rising latency. Thus, a prefill/decode separation (PD separation) technique may be employed to schedule and execute the two phases separately on different computing resources or different execution instances, reducing resource contention and mutual interference. In addition, chunk prefill (block pre-filling) technology can be adopted, in which a long input in the prefill phase is divided into several chunks along the token sequence and executed in batches, mitigating the interference of a single long-sequence prefill with the decoding phase and improving the overall throughput and latency stability of the system. However, although chunk prefill alleviates the interference problem to some extent by dividing the long prompt into several prefill sub-chunks, its control over service delay remains coarse, making it difficult to meet a strict SLO. On the other hand, when facing bursty request arrivals, large differences in request length distribution and dynamic concurrency changes, the PD separation technique can still produce uneven resource matching, in which one side has idle computing power or bandwidth while the other becomes the system bottleneck due to concentrated load, limiting overall throughput. In view of this, the present invention has been made.
Disclosure of Invention
The invention aims to provide a large language model reasoning effective throughput optimization method, system, equipment and medium which can improve the utilization rate of computing resources (GPUs) and the effective throughput of large language model reasoning.
The aim of the invention is realized by the following technical scheme: a large language model reasoning effective throughput optimization method comprises the following steps: receiving an inference request, performing decoding length estimation to determine the number of predicted generated tokens, and thereby constructing a logic request; segmenting the logic request to obtain an initial segmentation scheme; judging through simulated scheduling whether the delay constraint and the load balancing constraint are satisfied; if not, adjusting the segmentation point position with the initial segmentation scheme as the starting point, continuing to judge through simulated scheduling whether the delay constraint and the load balancing constraint are satisfied, and iterating until an optimal segmentation scheme is obtained, wherein the optimal segmentation scheme comprises the two micro-requests into which the logic request is segmented; sending the micro-requests one by one, in order, to an execution module; and the execution module performing inference on the received micro-requests by invoking the corresponding execution resources. A large language model reasoning goodput optimization system