CN-122019072-A - Reasoning task scheduling method of large language model
Abstract
The invention discloses an inference task scheduling method for a large language model. The method monitors the running states of N computing nodes in a network in real time, extracts multidimensional resource attributes of each node, and constructs resource feature vectors; calculates the resource difference degree of each node from its resource feature vector; outputs an inference task scheduling scheme based on a Markov decision process, according to the resource difference degrees and real-time resource availability of the nodes and the metadata of the inference tasks to be processed; and dispatches the inference tasks to the corresponding computing nodes according to the scheduling scheme. By comprehensively considering node load, network conditions, and LLM inference characteristics, the invention achieves fine-grained scheduling.
Inventors
- WANG MU
- WEI SHENGHUI
- LIN YIYING
- ZHANG ZIRUI
- LIU YICHENG
- JIA SIMING
- GAO XIAOHAN
Assignees
- Beijing University of Posts and Telecommunications (北京邮电大学)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2025-12-12
Claims (10)
- 1. An inference task scheduling method for a large language model, characterized by comprising the following steps: monitoring the running states of N computing nodes in a network in real time, extracting multidimensional resource attributes of each computing node, and constructing resource feature vectors, wherein the computing nodes comprise cloud servers and edge devices; calculating the resource difference degree of each computing node according to its resource feature vector; outputting a scheduling scheme for the inference tasks based on a Markov decision process, according to the resource difference degree and real-time resource availability of each computing node and the metadata of the inference tasks to be processed; and dispatching the inference tasks to the corresponding computing nodes according to the scheduling scheme.
- 2. The method according to claim 1, wherein calculating the resource difference degree of each computing node according to its resource feature vector is specifically: calculating the resource difference degree $D_i$ of the i-th computing node according to the following Equation 1: $D_i = (\mathbf{v}_i - \boldsymbol{\mu}_i)^{T} \Sigma_i^{-1} (\mathbf{v}_i - \boldsymbol{\mu}_i)$ (Equation 1); wherein $\mathbf{v}_i$ is the resource feature vector of the i-th computing node, $\boldsymbol{\mu}_i$ is the resource mean vector over the i-th computing node's neighborhood, and $\Sigma_i$ is the local covariance matrix of the resources in the i-th computing node's neighborhood; $T$ denotes transposition and $-1$ denotes matrix inversion; the resource attributes comprise computing-unit compute power, remaining memory capacity, network bandwidth fluctuation rate, and configuration parameters of the currently loaded large language model (LLM).
- 3. The method according to claim 1, wherein outputting the scheduling scheme of the inference tasks based on the Markov decision process, according to the resource difference degree and real-time resource availability of each computing node and the metadata of the inference tasks to be processed, specifically comprises: given the input state $S_t$, maximizing the reward of the Markov decision process, outputting an action to obtain a scheduling scheme for the inference tasks, and adjusting the parameters of the Markov decision process according to the reward function; wherein the state space of the Markov decision process comprises $S_t = \{K_t, Q_t, A_t\}$, in which $K_t$ is the multidimensional resource knowledge base at time $t$, computed from the resource difference degrees of the computing nodes; $Q_t$ is the task queue state, which records the metadata of the inference tasks to be processed; and $A_t$ is the real-time resource availability of each computing node; wherein the action space of the Markov decision process comprises a decision vector output for each inference task, the decision vector for inference task $k$ comprising $x_{k,m} \in \{0,1\}$ and $c_{k,m}$, where $x_{k,m} = 1$ means inference task $k$ is scheduled to computing node $m$ for execution, $x_{k,m} = 0$ otherwise, and $c_{k,m}$ is the compute power allocated by computing node $m$ to inference task $k$; the reward function is constructed from penalty terms, comprising a delay penalty term $r_{\mathrm{delay}}$ and a resource efficiency penalty term $r_{\mathrm{res}}$, and bonus terms, comprising an inference quality bonus term $r_{\mathrm{acc}}$; wherein $r_{\mathrm{delay}}$ is the negative of the sum, over the inference tasks, of transmission delay, computation delay, and propagation delay, so as to penalize high-delay decisions; $r_{\mathrm{res}}$ is the negative of the sum of the resource difference degrees of the computing nodes to which the inference tasks are scheduled; and $r_{\mathrm{acc}} = \sum_k \big(\alpha\, p_{k,m}$ if $p_{k,m} \ge p_k^{\min}$, else $-\beta\big)$; wherein $p_{k,m}$ is the predicted inference accuracy of scheduling inference task $k$ to computing node $m$; $\alpha$ is a preset positive reward weight coefficient, granting a reward according to the accuracy when the accuracy constraint is met; $\beta$ is a preset penalty weight coefficient, applying a strict penalty when the accuracy constraint is not met; and $p_k^{\min}$ is the minimum inference accuracy requirement of the $k$-th inference task; $w_1$, $w_2$, $w_3$ are the weights set for $r_{\mathrm{delay}}$, $r_{\mathrm{res}}$, $r_{\mathrm{acc}}$, respectively.
- 4. The method of claim 3, wherein $K_t$ is specifically calculated according to the following Equation 2: $K_t = \lambda K_{t-1} + (1-\lambda)\, D_t$ (Equation 2); wherein $\lambda$ is the set attenuation coefficient, $K_{t-1}$ is the multidimensional resource knowledge base at time $t-1$, and $D_t$ collects the resource difference degrees of the computing nodes at time $t$.
- 5. The method of claim 3, wherein the metadata of an inference task comprises the amount of input data of the task, the estimated computational complexity of the task, and the minimum inference accuracy requirement of the task.
- 6. The method of claim 3, wherein $p_{k,m}$ is calculated according to the following Equation 3: $p_{k,m} = A_m \,(1 - e^{-\eta / C_k})$ (Equation 3); wherein $A_m$ is the baseline accuracy of the base LLM deployed on computing node $m$, $\eta$ is the set marginal benefit coefficient, and $C_k$ is the estimated computational complexity of inference task $k$.
- 7. The method according to claim 3, wherein outputting the scheduling scheme of the inference tasks based on the Markov decision process is specifically: outputting the scheduling scheme of the inference tasks based on the Markov decision process on the premise of guaranteeing the minimum inference accuracy requirements of the tasks.
- 8. The method of claim 7, wherein outputting the scheduling scheme of the inference tasks based on the Markov decision process on the premise of guaranteeing the minimum inference accuracy requirements of the tasks specifically comprises: calculating, from the optimization objective function constructed according to the following Equation 4, the final reward $J$ of the Markov-decision-process-based task offloading model: $J = \bar{R}_T - \beta_a \max\!\big(0,\; p_{\mathrm{th}} - V_\phi(S_t)\big)$ (Equation 4); wherein $\bar{R}_T$ is the mean of the reward $R$ over a period $T$, $\beta_a$ is the set adaptive penalty coefficient, $p_{\mathrm{th}}$ is the set minimum inference accuracy threshold, and $V_\phi(S_t)$ is the predicted expectation, output by an auxiliary network of the task offloading model based on the current state $S_t$, of the degree to which the accuracy constraint is satisfied.
- 9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the large language model inference task scheduling method according to any one of claims 1-8.
- 10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium and is executable by at least one processor, so that the at least one processor performs the steps of the inference task scheduling method of a large language model according to any one of claims 1 to 8.
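The resource difference degree of claims 1-2 is a squared Mahalanobis distance of a node's resource feature vector from its neighborhood statistics. A minimal NumPy sketch, assuming hypothetical feature vectors; the function name and feature layout are illustrative, not taken from the patent:

```python
import numpy as np

def resource_difference(v_i, neighbor_features):
    """Squared Mahalanobis distance of node i's resource feature
    vector from its neighborhood (cf. Equation 1 of claim 2).

    v_i: resource feature vector of node i, e.g.
         [compute power, remaining memory, bandwidth fluctuation rate]
    neighbor_features: 2-D array, one row per node in i's neighborhood.
    """
    mu = neighbor_features.mean(axis=0)              # neighborhood mean vector
    sigma = np.cov(neighbor_features, rowvar=False)  # local covariance matrix
    diff = v_i - mu
    # (v_i - mu)^T Sigma^{-1} (v_i - mu); pinv guards against a
    # singular covariance when the neighborhood is small
    return float(diff @ np.linalg.pinv(sigma) @ diff)
```

A node whose resources match the neighborhood mean scores 0; outlier nodes score higher, which is what the resource efficiency penalty term of claim 3 discourages.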
Description
Reasoning task scheduling method of large language model
Technical Field
The invention relates to the technical field of computers, in particular to a method for scheduling inference tasks of a large language model.
Background
In recent years, with the widespread adoption of Internet of Things (IoT) devices and the rapid development of edge computing technology, applying Large Language Models (LLMs) in IoT environments has become a trend. LLMs offer efficient data analysis, pattern recognition, and generation capabilities, and can provide unprecedented intelligent services for the IoT ecosystem. However, deploying and running large language models on IoT devices presents significant challenges.
First, terminal compute power and storage are limited. IoT devices generally have constrained computing and storage capacity, making it difficult to run complex deep learning models directly. While tasks can be offloaded entirely to the cloud, network delay then becomes a bottleneck, and frequent data transmission consumes large amounts of bandwidth and increases operating costs.
Second, edge resources are heterogeneous and models are resource-intensive. Although edge devices outperform IoT terminals in hardware, large models such as DeepSeek R1-14B require at least 24 GB of memory, which still exceeds the capacity of most edge devices. This forces a trade-off between model inference accuracy and propagation delay, i.e., choosing whether to offload tasks to the cloud or to the edge.
Third, inference tasks are iterative. Unlike the batch mode of traditional artificial intelligence tasks, the inference process of large language models has significant iterative features.
The processing time of a single request depends not only on its own task complexity but is also affected by other requests in the same batch. Existing task offloading strategies often fail to capture resource correlations accurately and tend to ignore inference accuracy constraints when pursuing efficiency, leading to increased delay or greatly reduced accuracy under high concurrency; a fine-grained scheduling method that comprehensively considers node load, network conditions, and LLM inference characteristics is therefore urgently needed.
Disclosure of Invention
In view of the above, the invention aims to provide an inference task scheduling method for a large language model that performs fine-grained scheduling by comprehensively considering node load, network conditions, and LLM inference characteristics, schedules an appropriate share of inference tasks to edge-side devices for execution, and avoids the network bandwidth bottleneck caused by uploading all data to the cloud, thereby significantly reducing the system's average response time and communication overhead when processing high-concurrency IoT tasks.
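The scheduling objective described in claim 3 combines a delay penalty, a resource efficiency penalty, and an accuracy bonus. A minimal sketch of that weighted reward; the weights, the piecewise accuracy term, and all numeric values below are illustrative assumptions, not taken from the patent:

```python
def scheduling_reward(delays, diff_degrees, acc_pairs,
                      w1=1.0, w2=1.0, w3=1.0, alpha=1.0, beta=10.0):
    """Illustrative weighted reward for one batch of scheduling decisions.

    delays: per-task total delay (transmission + computation + propagation)
    diff_degrees: resource difference degree of the node each task was sent to
    acc_pairs: (predicted_accuracy, min_required_accuracy) per task
    """
    r_delay = -sum(delays)       # penalize high-delay decisions
    r_res = -sum(diff_degrees)   # penalize scheduling onto outlier nodes
    r_acc = 0.0
    for p, p_min in acc_pairs:
        # reward by accuracy when the constraint holds, strict penalty otherwise
        r_acc += alpha * p if p >= p_min else -beta
    return w1 * r_delay + w2 * r_res + w3 * r_acc
```

With a large penalty weight, a schedule that violates a task's minimum accuracy requirement scores far below one that satisfies it, steering the policy toward feasible placements first and low delay second.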
Based on the above object, the present invention provides a method for scheduling inference tasks of a large language model, comprising: monitoring the running states of N computing nodes in a network in real time, extracting multidimensional resource attributes of each computing node, and constructing resource feature vectors, wherein the computing nodes comprise cloud servers and edge devices; calculating the resource difference degree of each computing node according to its resource feature vector; outputting a scheduling scheme for the inference tasks based on a Markov decision process, according to the resource difference degree and real-time resource availability of each computing node and the metadata of the inference tasks to be processed; and dispatching the inference tasks to the corresponding computing nodes according to the scheduling scheme.
Preferably, calculating the resource difference degree of each computing node according to its resource feature vector specifically comprises: calculating the resource difference degree $D_i$ of the i-th computing node according to the following Equation 1: $D_i = (\mathbf{v}_i - \boldsymbol{\mu}_i)^{T} \Sigma_i^{-1} (\mathbf{v}_i - \boldsymbol{\mu}_i)$ (Equation 1); wherein $\mathbf{v}_i$ is the resource feature vector of the i-th computing node, $\boldsymbol{\mu}_i$ is the resource mean vector over the i-th computing node's neighborhood, and $\Sigma_i$ is the local covariance matrix of the resources in the i-th computing node's neighborhood; $T$ denotes transposition and $-1$ denotes matrix inversion; the resource attributes comprise computing-unit compute power, remaining memory capacity, network bandwidth fluctuation rate, and configuration parameters of the currently loaded large language model (LLM).
Preferably, the outputting the scheduling scheme of the inference task based on the Markov decision process according to the resource difference degree and the real-time resource availability of each computing node