
CN-122021869-A - Large model inference method and system based on active load sensing

CN122021869A

Abstract

The invention discloses a large model inference method and system based on active load sensing, belonging to the technical field of large model inference services. To address the prior-art problem that inference services struggle to balance resource efficiency and quality of service under bursty, heterogeneous mixed loads, the invention provides a closed-loop method of cycle planning, online scheduling and adaptive tuning. In the cycle planning stage, long-term load prediction drives proactive instance pre-provisioning; in the online scheduling stage, response-length-aware and heterogeneity-aware policies route and schedule requests; and in the adaptive tuning stage, the instance scale is corrected at fine granularity based on peak-working-set prediction. By coordinating resource provisioning with request scheduling, the invention effectively suppresses latency spikes and improves GPU utilization while guaranteeing SLOs, and is suitable for enterprise-grade LMaaS platforms.

Inventors

  • WANG XIAOFEI
  • LI YUTING
  • QIU CHAO
  • HUANG SHAOYUAN
  • ZHANG TENGWEN
  • ZHAO YUNFENG

Assignees

  • Tianjin University (天津大学)

Dates

Publication Date
2026-05-12
Application Date
2025-12-05

Claims (10)

  1. A large model inference method based on active load sensing, characterized by comprising the following steps: based on historical load data, predicting the workload of the next cycle using a long-term load prediction model, determining the base instance scale of the next cycle according to the predicted workload and a preset average throughput, and performing instance pre-provisioning; when an inference request is received, estimating the response length of the inference request, selecting a target instance for routing the inference request based on the response length estimate, the current load state of each instance and prompt length heterogeneity, and performing intra-instance batch scheduling based on the urgency of the inference request within the target instance; and calculating the resource utilization of each instance according to its peak resource occupancy, and dynamically adjusting the instance scale based on a comparison of the utilization against preset thresholds.
  2. The method of claim 1, wherein the long-term load prediction model is trained using an asymmetric loss function that penalizes under-prediction more heavily than over-prediction.
  3. The method of claim 1, wherein estimating the response length of the inference request comprises maintaining a conditional probability distribution over the prompt lengths and response lengths of historical requests, and updating the conditional probability distribution using exponential decay and Dirichlet smoothing to estimate the response length of the current inference request.
  4. The method of claim 1, wherein selecting a target instance for routing an inference request comprises computing an estimated total delay for the inference request on each candidate instance by jointly considering a prefill-phase delay, a decode-phase delay, an instance congestion factor, and the mismatch between the request prompt length and the average prompt length of the instance queue, and selecting the instance with the minimum estimated total delay as the target instance.
  5. The method of claim 1, wherein the urgency-based intra-instance scheduling comprises calculating, for each inference request, a pressure score that quantifies how urgent the request is relative to its predicted completion time and service level objective deadline, and, in batch scheduling, prioritizing requests with higher pressure scores and, if necessary, preempting requests with lower pressure scores (a sketch of this mechanism follows the claims).
  6. The method of claim 1, wherein predicting peak resource occupancy during batch processing comprises determining the earliest-completing request in the current batch, and estimating the peak resource occupancy of the batch from the resources currently occupied by all requests in the batch plus the additional resources required before the earliest-completing request finishes (see the sketch following the claims).
  7. A large model inference system based on active load sensing, comprising: a long-term load prediction module (100) for predicting the workload of the next cycle based on historical load data; a cycle planning and provisioning module (200) for determining and pre-provisioning the base instance scale of the next cycle based on the predicted workload; an online scheduling module (300) for estimating the response length of an inference request when the request is received, routing the request to a target LLM instance (500) based on the response length, instance state and heterogeneity, and scheduling the request within the target instance based on urgency; and an adaptive tuning module (400) for predicting peak resource occupancy during batch processing while an instance is running and dynamically adjusting the instance scale according to the predicted resource utilization.
  8. The system of claim 7, wherein the long-term load prediction module (100) is configured to be trained using an asymmetric loss function that imposes a higher penalty on under-prediction.
  9. The system of claim 7, wherein the online scheduling module (300) further comprises: a response length sensor (310) for estimating response length based on historical statistics and Dirichlet smoothing; an instance router (320) for selecting a target instance based on a heterogeneity-aware latency model; and an intra-instance scheduler (330) for performing preemptive scheduling based on pressure scores.
  10. The system of claim 7, wherein the adaptive tuning module (400) further comprises: a peak working set predictor (410) for estimating the peak GPU memory of a batch by identifying the earliest-completing request; and an instance reconfigurator (430) for triggering a scale-out or resource reclamation action when the utilization computed from the peak GPU memory stays continuously above a high threshold or below a low threshold (a sketch of this hysteresis logic follows the claims).
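Claim 5 does not fix a formula for the pressure score. A minimal sketch, assuming the score is the ratio of estimated remaining work to the remaining SLO budget; the request fields (`deadline`, `predicted_remaining`) and the preemption rule are illustrative, not taken from the patent:

```python
import time

def pressure_score(now: float, deadline: float, predicted_remaining: float) -> float:
    """Estimated remaining work divided by remaining SLO budget.

    A value above 1.0 means the request is projected to miss its
    deadline unless prioritized (assumed formulation; the claim only
    requires that the score grow with urgency).
    """
    budget = max(deadline - now, 1e-6)  # avoid division by zero past the deadline
    return predicted_remaining / budget

def form_batch(queued: list, running: list, capacity: int):
    """Rebuild the batch with the highest-pressure requests first,
    preempting running requests that fall outside the top `capacity`."""
    now = time.time()
    candidates = queued + running
    candidates.sort(
        key=lambda r: pressure_score(now, r["deadline"], r["predicted_remaining"]),
        reverse=True,
    )
    batch = candidates[:capacity]
    chosen = {id(r) for r in batch}
    preempted = [r for r in running if id(r) not in chosen]
    return batch, preempted
```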
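Claim 6's estimate follows from one observation: in continuous batching, every active request appends roughly one token to the KV cache per decode step, and occupancy only falls when the earliest request completes and frees its memory. A sketch under that assumption, with hypothetical per-request fields `cached_tokens` and `remaining_tokens`:

```python
def predict_peak_kv_bytes(batch: list, bytes_per_token: int) -> int:
    """Peak KV-cache footprint of the current batch (claim 6).

    Peak = current occupancy of all requests + growth accumulated
    until the earliest-completing request exits and releases memory.
    """
    if not batch:
        return 0
    steps_to_first_exit = min(r["remaining_tokens"] for r in batch)
    current_tokens = sum(r["cached_tokens"] for r in batch)
    growth_tokens = steps_to_first_exit * len(batch)  # one token per request per step
    return (current_tokens + growth_tokens) * bytes_per_token
```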
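Claim 10's reconfigurator is a threshold check with persistence ("continuously higher ... or lower"). A sketch assuming persistence is measured as a run of consecutive observations; the thresholds and window length are illustrative values, not from the patent:

```python
class InstanceReconfigurator:
    """Hysteresis scaling: act only after `window` consecutive
    utilization readings beyond a threshold (illustrative constants)."""

    def __init__(self, high: float = 0.9, low: float = 0.4, window: int = 3):
        self.high, self.low, self.window = high, low, window
        self.above = 0  # consecutive readings above the high threshold
        self.below = 0  # consecutive readings below the low threshold

    def observe(self, utilization: float) -> str:
        self.above = self.above + 1 if utilization > self.high else 0
        self.below = self.below + 1 if utilization < self.low else 0
        if self.above >= self.window:
            self.above = 0
            return "scale_out"  # provision an additional instance
        if self.below >= self.window:
            self.below = 0
            return "scale_in"   # reclaim an idle instance
        return "hold"
```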

Description

Large model inference method and system based on active load sensing

Technical Field

The invention relates to the technical field of large model inference services, and in particular to a large model inference method and system based on active load sensing.

Background

In recent years, inference services in cloud computing environments (LMaaS) have become the mainstream form of serving Large Language Models (LLMs). Such services must continuously handle two core tasks in a multi-tenant environment: resource provisioning (i.e., dynamic scaling of instances) and request scheduling (i.e., cross-instance routing and intra-instance batching). However, the load of large model inference services is highly complex: traffic exhibits high burstiness on short time scales and periodic regularity on long time scales. At the same time, user requests are significantly heterogeneous in prompt length, expected response length, and Service Level Objective (SLO). When coping with such complex loads, existing inference systems find it difficult to balance resource efficiency (cost and utilization) against quality of service (latency and SLO). On the one hand, reactive scaling strategies that rely solely on runtime metrics cannot respond to sudden load peaks in time, because large model instances have long cold-start times; this easily causes severe latency spikes and SLO violations. On the other hand, static over-provisioning adopted to cover peak load leaves a large amount of GPU capacity idle during load troughs, yielding high running cost and low utilization. In addition, existing request schedulers tend to ignore the heterogeneity of request lengths and SLO priorities. This readily causes head-of-line blocking, i.e., long-running low-priority requests block short high-priority requests, degrading overall quality of service. Conversely, if the scheduling policy is too aggressive and tries to pack batches at high density, it easily triggers memory over-commitment, causing requests to be evicted or aborted. Therefore, a new technical solution is urgently needed to coordinate resource provisioning and request scheduling under highly bursty and heterogeneous mixed loads, so as to achieve high resource utilization while guaranteeing quality of service (low latency and SLOs).

Disclosure of Invention

The invention mainly aims to provide a large model inference method and system based on active load sensing so as to solve the above problems in the related art.
In order to achieve the above object, according to one aspect of the present invention, there is provided a large model inference method based on active load sensing, comprising: predicting the workload of the next cycle using a long-term load prediction model based on historical load data, determining the base instance scale of the next cycle according to the predicted workload and a preset average throughput, and performing instance pre-provisioning; estimating the response length of an inference request when the request is received, selecting a target instance for routing the request based on the response length estimate, the current load state of each instance and prompt length heterogeneity, and performing intra-instance batch scheduling based on the urgency of the request within the target instance; and predicting peak resource occupancy during instance operation, calculating the resource utilization of the instance according to the peak resource occupancy, and dynamically adjusting the instance scale based on a comparison of the utilization against preset thresholds.

As a preferred solution of the present invention, the long-term load prediction model is trained using an asymmetric loss function, where the penalty for under-prediction (predicted load lower than actual load) is higher than the penalty for over-prediction (predicted load higher than actual load).

As a preferred solution of the invention, estimating the response length of the inference request comprises maintaining a conditional probability distribution over the prompt lengths and response lengths of historical requests, and updating the conditional probability distribution using exponential decay and Dirichlet smoothing to estimate the response length of the current inference request.

As a preferred solution of the invention, selecting the target instance comprises jointly considering the prefill-phase delay, the decode-phase delay, an instance congestion factor, and the mismatch between the request prompt length and the average prompt length of the instance queue, computing the estimated total delay of the inference request on each candidate instance, and selecting the instance with the smallest estimated total delay as the target instance. Sketches of these three preferred mechanisms follow below.

As a preferred solution of the presen
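The asymmetric loss described above can be realized in several ways; a minimal sketch as a weighted squared error, where the 4x weight on under-prediction is an illustrative assumption (the patent fixes no value):

```python
import numpy as np

def asymmetric_mse(y_pred: np.ndarray, y_true: np.ndarray,
                   under_weight: float = 4.0) -> float:
    """Squared error that penalizes under-prediction (y_pred < y_true)
    more heavily, so the trained model errs toward over-provisioning."""
    err = y_true - y_pred
    weight = np.where(err > 0, under_weight, 1.0)  # err > 0 means under-prediction
    return float(np.mean(weight * err ** 2))
```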
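The response-length estimator maintains, per the preferred solution above, a decayed and Dirichlet-smoothed conditional distribution. A sketch that buckets lengths and returns the expected response length; the bucket granularity and the decay/prior constants are assumptions:

```python
import numpy as np

class ResponseLengthEstimator:
    """P(response bucket | prompt bucket) from historical requests,
    with exponential forgetting and Dirichlet smoothing."""

    def __init__(self, n_buckets: int = 16, bucket_size: int = 128,
                 decay: float = 0.99, alpha: float = 1.0):
        self.bucket_size = bucket_size
        self.decay = decay   # down-weights stale traffic
        self.alpha = alpha   # Dirichlet pseudo-count, keeps estimates defined
        self.counts = np.zeros((n_buckets, n_buckets))

    def _bucket(self, length: int) -> int:
        return min(length // self.bucket_size, self.counts.shape[0] - 1)

    def update(self, prompt_len: int, resp_len: int) -> None:
        self.counts *= self.decay  # decay all history, then record the sample
        self.counts[self._bucket(prompt_len), self._bucket(resp_len)] += 1.0

    def estimate(self, prompt_len: int) -> float:
        """Expected response length under the smoothed conditional."""
        probs = self.counts[self._bucket(prompt_len)] + self.alpha
        probs = probs / probs.sum()
        midpoints = (np.arange(self.counts.shape[1]) + 0.5) * self.bucket_size
        return float(probs @ midpoints)
```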
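The routing rule combines the four delay terms named above. A sketch with a linear cost model; the per-instance fields and the weights are assumptions (the patent names the components but not their functional form):

```python
def route(request: dict, instances: list,
          w_congestion: float = 1.0, w_mismatch: float = 0.01) -> dict:
    """Pick the candidate with the smallest estimated total delay:
    prefill + decode + congestion + prompt-length mismatch penalty."""
    def estimated_delay(inst: dict) -> float:
        prefill = request["prompt_len"] / inst["prefill_tps"]    # prefill phase
        decode = request["est_resp_len"] / inst["decode_tps"]    # decode phase
        congestion = w_congestion * inst["queue_len"] / max(inst["batch_size"], 1)
        mismatch = w_mismatch * abs(request["prompt_len"] - inst["avg_prompt_len"])
        return prefill + decode + congestion + mismatch
    return min(instances, key=estimated_delay)
```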