
CN-122021936-A - Kubernetes-based large model inference optimization method, system and device

CN122021936A

Abstract

The application provides a Kubernetes-based large model inference optimization method, system and device. The method constructs a resource scheduling component for the large model and monitors the real-time resource state at the current moment; constructs a batch-processing agent component for the large model, performs feature analysis on a received inference request queue, and determines the request features of each inference request in the queue; evaluates the real-time resource state according to the request features and adjusts the batch size based on the evaluation result to generate an optimal batch length; batches the inference requests in the queue according to the optimal batch length and packages each batch of inference requests into batch data; and uses the resource scheduling component to distribute the batch data to Pods in the K8s cluster for inference, obtaining the large model inference result.

Inventors

  • WANG XIAOHU
  • RAN XIAOLONG

Assignees

  • 广域铭岛数字科技有限公司
  • 浙江吉利控股集团有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-03-30

Claims (10)

  1. A Kubernetes-based large model inference optimization method, comprising: constructing a resource scheduling component for the large model and monitoring the real-time resource state at the current moment; constructing a batch-processing agent component for the large model, performing feature analysis on a received inference request queue, and determining the request features of each inference request in the queue; evaluating the real-time resource state according to the request features and adjusting the batch size based on the evaluation result to generate an optimal batch length; batching the inference requests in the queue according to the optimal batch length and packaging each batch of inference requests into batch data; and distributing the batch data to Pods in the K8s cluster for inference by means of the resource scheduling component, so as to obtain the large model inference result.
  2. The Kubernetes-based large model inference optimization method of claim 1, wherein the real-time resource state comprises available GPU memory and GPU compute utilization, and the request features comprise a sequence length and an estimated computation amount; and wherein evaluating the real-time resource state according to the request features and adjusting the batch size based on the evaluation result to generate an optimal batch length comprises: analyzing the sequence length in each request feature to determine the number of long requests and the number of short requests, and determining the request count and waiting time of the inference request queue from those numbers; and calculating the target throughput and target delay of the currently executing batch from the request count and waiting time, and dynamically adjusting the batch, with the target throughput reaching the maximum throughput as the objective and the target delay remaining below the maximum delay as the constraint, to determine the optimal batch size as the optimal batch length, wherein the maximum throughput is determined by the available GPU memory and the GPU compute utilization.
  3. The Kubernetes-based large model inference optimization method of claim 2, wherein dynamically adjusting the batch size to the optimal batch size, with the target throughput reaching the maximum throughput as the objective and the target delay remaining below the maximum delay as the constraint, comprises: collecting the target throughput and target delay of the large model at the current moment, and taking the target throughput reaching the maximum throughput as the objective and the target delay remaining below the maximum delay as the constraint on the adaptive adjustment of the batch size; if the number of short requests is greater than a first preset threshold and the available GPU memory is greater than a preset memory amount, adopting the batch length of a large batch as the optimal batch length; if the number of long requests is greater than a second preset threshold, adopting the batch length of a small batch as the optimal batch length, wherein the first preset threshold is greater than the second preset threshold; and if the number of short requests is not greater than the first preset threshold or the number of long requests is not greater than the second preset threshold, adopting the batch length of a medium batch as the optimal batch length, wherein the batch lengths of the large, medium and small batches decrease in that order.
  4. The Kubernetes-based large model inference optimization method of claim 2, further comprising, before packaging each batch of inference requests into batch data: if inference requests with a preset priority level are detected in any batch of data, reordering those requests within the batch by queue-jumping so that they are processed first, wherein the request features carry the priority of each inference request.
  5. The Kubernetes-based large model inference optimization method of claim 1, wherein distributing the batch data to Pods in the K8s cluster for inference by means of the resource scheduling component comprises: acquiring the resource requirements of each Pod, the requirements comprising at least one of CPU, GPU and memory; filtering nodes against the Pod resource requirements based on preset node filtering conditions to determine a candidate node list, the conditions comprising node resource sufficiency, GPU model matching and node label matching; calculating the node scores and the composite score of each node in the candidate node list; and comparing the composite scores of the nodes and selecting the node with the highest composite score to run the Pod for inference.
  6. The Kubernetes-based large model inference optimization method of claim 5, wherein calculating the node scores and the composite score of each node in the candidate node list comprises: calculating, for each candidate node, the node scores, including a GPU performance score, a historical performance score, a real-time load score and an affinity score; and weighting the GPU performance score, historical performance score, real-time load score and affinity score of each candidate node to obtain its composite score (see the scoring sketch after the claims).
  7. The Kubernetes-based large model inference optimization method of claim 5, further comprising: constructing a resource scheduling component that implements the preset node filtering conditions and the node composite scoring, compiling and linking it into a binary file, and generating a scheduling executable, so that the highest-scoring node runs the scheduling executable for inference.
  8. The Kubernetes-based large model inference optimization method of any one of claims 1 to 7, further comprising: determining a service level indicator based on the batch-processing agent component, the indicator comprising at least the maximum throughput and the maximum delay; querying the service level indicator to determine the matched target load; and, if the target load is greater than a preset load, generating a scale-out request and creating a new Pod.
  9. A Kubernetes-based large model inference optimization system, comprising: a scheduling component construction module, configured to construct a resource scheduling component for the large model and monitor the real-time resource state at the current moment; an agent component construction module, configured to construct a batch-processing agent component for the large model, perform feature analysis on a received inference request queue, and determine the request features of each inference request in the queue; a batch processing module, configured to evaluate the real-time resource state according to the request features and adjust the batch size based on the evaluation result to generate the optimal batch length; a batch packing module, configured to batch the inference requests in the queue according to the optimal batch length and package each batch of inference requests into batch data; and a model inference module, configured to distribute the batch data to Pods in the K8s cluster for inference by means of the resource scheduling component, so as to obtain the large model inference result.
  10. An electronic device comprising a processor, a memory, and a communication bus connecting the processor and the memory, the processor being configured to execute a computer program stored in the memory to implement the Kubernetes-based large model inference optimization method of any one of claims 1 to 8.
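Claims 5 and 6 name four node scores and a weighted composite but leave the score scales and weights open. The following is a minimal Go sketch of that filter-then-score flow, assuming scores already normalized to [0,1] and illustrative weights of 0.4/0.2/0.3/0.1; the node fields and all numeric values are placeholders, not details from the application.

```go
package main

import "fmt"

// node is a simplified view of a candidate K8s worker node. Its fields and
// all numeric values below are illustrative assumptions; the application
// only names the four scores and says they are weighted into a composite.
type node struct {
	name      string
	gpuPerf   float64 // GPU performance score, assumed normalized to [0,1]
	histPerf  float64 // historical performance score
	loadScore float64 // real-time load score (higher = more headroom)
	affinity  float64 // affinity score
	freeGPU   int     // free GPU memory, MiB
	gpuModel  string
	labels    map[string]string
}

// filterNodes applies the preset filtering conditions of claim 5: node
// resource sufficiency, GPU model matching and node label matching.
func filterNodes(nodes []node, needMiB int, model string, sel map[string]string) []node {
	var out []node
	for _, n := range nodes {
		if n.freeGPU < needMiB || n.gpuModel != model {
			continue
		}
		match := true
		for k, v := range sel {
			if n.labels[k] != v {
				match = false
				break
			}
		}
		if match {
			out = append(out, n)
		}
	}
	return out
}

// composite weights the four node scores into one value (claim 6); the
// 0.4/0.2/0.3/0.1 weights are assumptions chosen to make the example concrete.
func composite(n node) float64 {
	return 0.4*n.gpuPerf + 0.2*n.histPerf + 0.3*n.loadScore + 0.1*n.affinity
}

func main() {
	candidates := filterNodes([]node{
		{"gpu-a", 0.9, 0.8, 0.5, 1.0, 24576, "A100", map[string]string{"pool": "llm"}},
		{"gpu-b", 0.7, 0.9, 0.9, 1.0, 24576, "A100", map[string]string{"pool": "llm"}},
	}, 16384, "A100", map[string]string{"pool": "llm"})
	if len(candidates) == 0 {
		fmt.Println("no feasible node")
		return
	}
	// Select the node with the highest composite score (claim 5, last step).
	best := candidates[0]
	for _, n := range candidates[1:] {
		if composite(n) > composite(best) {
			best = n
		}
	}
	fmt.Printf("schedule Pod on %s (composite %.2f)\n", best.name, composite(best))
}
```

The sketch simply takes the maximum composite; the application additionally compiles the filtering and scoring logic into a scheduling executable (claim 7), which is omitted here.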

Description

Kubernetes-based large model inference optimization method, system and device

Technical Field

The application relates to the technical field of cloud computing and artificial intelligence, and in particular to a Kubernetes-based large model inference optimization method, system and device.

Background

Kubernetes, abbreviated K8s, is a portable container orchestration and management tool for containerized services. Many complex business application scenarios are built on the Kubernetes platform's workload capabilities, and multiple types of workloads, including batch processing and service deployment, usually need to be handled simultaneously. The main drawback of static batch processing is its lack of flexibility: a fixed batch size cannot be adjusted dynamically to the data characteristics or the hardware resources, so hardware utilization is low and memory is either wasted or exhausted; static batching also copes poorly with variable-length data or unbalanced data distributions, which can affect the stability and performance of model training. Furthermore, a fixed batch size cannot meet real-time data processing requirements, which may increase latency or leave resources idle, reducing the efficiency and responsiveness of the whole system. At present, the native K8s scheduler cannot effectively perceive the dynamic characteristics of large model inference, such as variable request lengths and uneven computation density, which easily leads to low GPU (Graphics Processing Unit) resource utilization and, in turn, high inference latency and unstable throughput of the inference service.

Disclosure of Invention

The application provides a Kubernetes-based large model inference optimization method, system and device, to solve the prior-art problems of low GPU resource utilization, high inference latency and unstable throughput caused by scheduling that cannot perceive the characteristics of large model inference.

In a first aspect, the Kubernetes-based large model inference optimization method comprises: constructing a resource scheduling component for the large model and monitoring the real-time resource state at the current moment; constructing a batch-processing agent component for the large model, performing feature analysis on the received inference request queue, and determining the request features of each inference request in the queue; evaluating the real-time resource state according to the request features and adjusting the batch size based on the evaluation result to generate the optimal batch length; batching the inference requests in the queue according to the optimal batch length and packaging each batch of inference requests into batch data; and distributing the batch data to Pods in the K8s cluster for inference by means of the resource scheduling component, to obtain the large model inference result.
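The first aspect chains three steps: feature analysis of the queued requests, batch-size selection, and packaging batches for the scheduler. Below is a minimal sketch of the analysis and packaging steps, assuming a hypothetical inferenceRequest type and an arbitrary 512-token long/short boundary, neither of which the application fixes.

```go
package main

import "fmt"

// inferenceRequest carries the request features named in the first aspect:
// a sequence length and an estimated computation amount. The field names
// and values here are assumptions for illustration only.
type inferenceRequest struct {
	id       string
	seqLen   int
	estFLOPs float64
}

// classify performs the feature-analysis step: it splits the queue into
// long and short requests by sequence length.
func classify(queue []inferenceRequest, boundary int) (long, short int) {
	for _, r := range queue {
		if r.seqLen > boundary {
			long++
		} else {
			short++
		}
	}
	return
}

// packBatches batches the queue at the chosen optimal batch length and
// packages each batch as one unit of batch data for the resource scheduler.
func packBatches(queue []inferenceRequest, batchLen int) [][]inferenceRequest {
	var batches [][]inferenceRequest
	for start := 0; start < len(queue); start += batchLen {
		end := start + batchLen
		if end > len(queue) {
			end = len(queue)
		}
		batches = append(batches, queue[start:end])
	}
	return batches
}

func main() {
	queue := []inferenceRequest{
		{"r1", 64, 1e9}, {"r2", 2048, 3e10}, {"r3", 128, 2e9}, {"r4", 96, 1.5e9},
	}
	long, short := classify(queue, 512)
	fmt.Printf("long=%d short=%d\n", long, short)
	for i, b := range packBatches(queue, 2) {
		fmt.Printf("batch %d: %d requests\n", i, len(b))
	}
}
```

The batch length passed to packBatches would come from the adaptive sizing step described in the embodiments below.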
In some possible embodiments of the first aspect, the real-time resource state includes available GPU memory and GPU compute utilization, and the request features include a sequence length and an estimated computation amount. Evaluating the real-time resource state according to the request features and adjusting the batch size based on the evaluation result to generate the optimal batch length includes: analyzing the sequence length in each request feature to determine the number of long requests and the number of short requests, and determining the request count and waiting time of the inference request queue from those numbers; and calculating the target throughput and target delay of the currently executing batch from the request count and waiting time, and dynamically adjusting the batch size to determine the optimal batch length, with the target throughput reaching the maximum throughput as the objective and the target delay remaining below the maximum delay as the constraint, the maximum throughput being determined by the available GPU memory and the GPU compute utilization. In some possible embodiments of the first aspect, dynamically adjusting the batch size, with the target throughput reaching the maximum throughput as the objective and the target delay remaining below the maximum delay as the constraint, includes: collecting the target throughput and target delay of the large model at the current moment and, under that objective and constraint, adaptively adjusting the batch size; if the number of short requests is greater than a first preset threshold and the available GPU memory is greater than a preset memory amount, using the batch length of a large batch as the optimal batch length; if the number of long requests is greater than a second preset threshold, using the batch length of a small batch as the optimal batch length, the first preset threshold being greater than the second; and otherwise using the batch length of a medium batch, the batch lengths of the large, medium and small batches decreasing in that order.
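A minimal Go sketch of this large/medium/small selection rule follows. Every numeric value is an assumed placeholder: the application leaves the two preset thresholds, the preset memory amount and the three batch lengths unspecified, requiring only that the first threshold exceed the second and that the batch lengths decrease from large to small.

```go
package main

import "fmt"

// chooseBatchLen applies the threshold rule of the embodiment above (and
// claim 3). All constants are illustrative assumptions, not values from
// the application.
func chooseBatchLen(shortReqs, longReqs, freeGPUMemMiB int) int {
	const (
		shortThreshold = 32   // "first preset threshold"
		longThreshold  = 8    // "second preset threshold" (< first)
		memThreshold   = 8192 // "preset video memory", MiB
		largeBatch     = 64
		mediumBatch    = 16
		smallBatch     = 4
	)
	switch {
	case shortReqs > shortThreshold && freeGPUMemMiB > memThreshold:
		return largeBatch // many short requests and spare memory: batch wide
	case longReqs > longThreshold:
		return smallBatch // long requests dominate memory: batch narrow
	default:
		return mediumBatch
	}
}

func main() {
	fmt.Println(chooseBatchLen(40, 2, 16384))  // 64: short-heavy, memory free
	fmt.Println(chooseBatchLen(10, 12, 16384)) // 4: long-heavy
	fmt.Println(chooseBatchLen(10, 2, 16384))  // 16: neither threshold hit
}
```

The returned length would feed directly into the batch packaging step sketched earlier.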