CN-121979676-A - GPU shared scheduling method and shared scheduling system based on computing power perception

CN121979676A

Abstract

The invention provides a GPU sharing scheduling method based on computing power awareness, comprising: step S1, receiving a task request and parsing its resource request, wherein the resource request represents the quantity of GPU (graphics processing unit) computing power resources required by the task request, the quantity of GPU video memory resources required by the task request, or both, and the resource request corresponds to a request identifier indicating whether one or more nodes or one or more GPU models are designated for the task request; step S2, screening from a GPU cluster all nodes meeting the resource request and the request identifier as candidate nodes; and step S3, selecting a scheduling strategy from a plurality of preset scheduling strategies to perform resource scheduling, so as to select a specific GPU from the candidate nodes to execute the task request.

Inventors

  • TANG HONGWEI
  • ZHONG ZHENYANG

Assignees

  • Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)

Dates

Publication Date
2026-05-05
Application Date
2026-01-13

Claims (10)

  1. A GPU sharing scheduling method based on computing power awareness, for implementing resource scheduling in a cloud computing environment to allocate task requests to specific GPUs in a GPU cluster, wherein the GPU cluster includes a plurality of nodes, each node includes a plurality of GPUs, and the GPUs in the same node are of the same model, of different models, or a mixture of the two, the method being characterized in that: step S1, receiving a task request and parsing its resource request, wherein the resource request represents the quantity of GPU computing power resources required by the task request, the quantity of GPU video memory resources required by the task request, or both, and the resource request corresponds to a request identifier indicating whether one or more nodes or one or more GPU models are designated for the task request; step S2, screening from the GPU cluster, based on the resource request and the corresponding request identifier, all nodes meeting the resource request and the request identifier as candidate nodes; and step S3, selecting a scheduling strategy from a plurality of preset scheduling strategies to perform resource scheduling, based on the resource request and the screened candidate nodes, so as to select a specific GPU from the candidate nodes to execute the task request.
  2. The method of claim 1, wherein each GPU includes a plurality of streaming multiprocessors, the method further comprising: taking a streaming multiprocessor as the minimum division unit of GPU computing power, i.e., as one GPU unit computing power resource, so that the GPU computing power is divided into a plurality of unit computing power resources for independent scheduling; dividing the GPU video memory into different storage spaces for independent scheduling; and configuring each GPU in one or more nodes as a scheduling whole, so that the computing power resources and video memory resources of the GPU are scheduled independently as a whole.
  3. The method of claim 2, wherein the preset plurality of scheduling strategies comprises a single-card shared scheduling strategy, a single-resource-dimension shared scheduling strategy, and a multi-resource-dimension shared scheduling strategy, wherein: when the resource request represents the number of whole GPUs required by the task request, the single-card shared scheduling strategy is adopted to select one candidate node from all candidate nodes as a target node and to select one or more GPUs from the target node to execute the task request, so as to minimize the number of idle GPUs remaining on the target node; when the resource request represents the quantity of GPU unit computing power resources or the quantity of GPU video memory resources required by the task request, the single-resource-dimension shared scheduling strategy is adopted to select, from all GPUs contained in all candidate nodes, the one GPU on which the computing power utilization rate or the video memory utilization rate would be highest to execute the task request; and when the resource request represents both the quantity of GPU unit computing power resources and the quantity of GPU video memory resources required by the task request, the multi-resource-dimension shared scheduling strategy is adopted to select, from all GPUs contained in all candidate nodes, the one GPU whose load would be highest to execute the task request.
  4. The method of claim 3, wherein the single-card shared scheduling strategy implements resource scheduling of a task request as follows: determining the number of GPUs required by the task request based on the resource request; based on the request identifier and the screened candidate nodes, computing the number of GPUs each candidate node would have remaining after allocating the number of GPUs required by the task request, and selecting the candidate node with the smallest number of remaining GPUs as the target node; and selecting from the target node one or more GPUs satisfying the resource request and the request identifier to execute the task request.
  5. The method of claim 3, wherein the single-resource-dimension shared scheduling strategy implements resource scheduling of a task request as follows: determining the quantity of GPU unit computing power resources or the quantity of GPU video memory resources required by the task request based on the resource request; and, based on the request identifier and the screened candidate nodes, computing for each GPU in all candidate nodes the computing power utilization rate or the video memory utilization rate it would have after allocating the required GPU unit computing power resources or GPU video memory resources, and selecting the GPU with the highest computing power utilization rate or video memory utilization rate to execute the task request.
  6. The method of claim 3, wherein the multi-resource-dimension shared scheduling strategy implements resource scheduling of a task request as follows: determining the quantity of GPU unit computing power resources and the quantity of GPU video memory resources required by the task request based on the resource request; and, based on the request identifier and the screened candidate nodes, computing for each GPU in all candidate nodes the computing power utilization rate and the video memory utilization rate it would have after allocating the required GPU unit computing power resources and GPU video memory resources, calculating the load of each GPU from its computing power utilization rate and video memory utilization rate, and selecting the GPU with the highest load to execute the task request.
  7. The method of any of claims 4-6, further comprising a rescheduling strategy, wherein the rescheduling strategy implements resource rescheduling of task requests as follows: detecting, at fixed time intervals, the task requests corresponding to each GPU in the GPU cluster; and selecting a scheduling strategy from among the single-card shared scheduling strategy, the single-resource-dimension shared scheduling strategy, and the multi-resource-dimension shared scheduling strategy to perform resource scheduling, based on the task requests corresponding to each GPU and their corresponding resource requests and request identifiers, so as to reallocate a GPU to each task request.
  8. A GPU sharing scheduling system based on the method of any of claims 1-7, for implementing resource scheduling in a cloud computing environment to distribute task requests to specific GPUs in a GPU cluster, the system comprising a heterogeneous task request layer, a cluster scheduling layer, a device binding layer, and an underlying resource control layer, wherein: the heterogeneous task request layer is used for storing task requests; the cluster scheduling layer is configured with a GPU scheduler, a scheduling extender, a dynamic scheduler, and a device controller, wherein the GPU scheduler implements resource scheduling of task requests, the scheduling extender is configured with the plurality of preset scheduling strategies to assist the GPU scheduler in resource scheduling, the dynamic scheduler is configured with the rescheduling strategy to implement resource rescheduling of task requests, and the device controller receives and executes the scheduling decisions for task requests sent by the GPU scheduler; the device binding layer is configured with a device resource management module, a device maintenance module, and a device binding module, wherein the device resource management module monitors the device information of each GPU in the GPU cluster, the device maintenance module manages the scheduling state of each GPU in the GPU cluster, and the device binding module binds task requests to GPUs so that the GPUs execute the corresponding task requests; and the underlying resource control layer is configured with the GPU cluster, wherein the GPU cluster includes a plurality of nodes, each node includes a plurality of GPUs, and the GPUs in the same node are of the same model, of different models, or a mixture of the two.
  9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of claims 1-7.
  10. An electronic device, comprising one or more processors and a memory, wherein the memory is configured to store executable instructions, and the one or more processors are configured to implement the steps of the method of any of claims 1-7 by executing the executable instructions.
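As an illustration only (not the patented implementation), claim 3's rule for choosing among the three strategies depends solely on which fields the resource request carries. A minimal sketch in Python, with all names (`ResourceRequest`, `choose_policy`) hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceRequest:
    gpu_count: Optional[int] = None  # number of whole GPUs requested
    sm_units: Optional[int] = None   # GPU unit computing power resources (SMs)
    mem_mib: Optional[int] = None    # GPU video memory


def choose_policy(req: ResourceRequest) -> str:
    # Whole-card request -> single-card shared scheduling (minimize idle GPUs).
    if req.gpu_count is not None:
        return "single-card"
    # Both dimensions requested -> multi-resource-dimension scheduling.
    if req.sm_units is not None and req.mem_mib is not None:
        return "multi-dimension"
    # Exactly one of computing power / video memory -> single-dimension scheduling.
    if req.sm_units is not None or req.mem_mib is not None:
        return "single-dimension"
    raise ValueError("resource request specifies no GPU resources")
```

The request identifier (designated nodes or GPU models) is handled separately in step S2, before the strategy is applied.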
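The selection rules of claims 4-6 can likewise be sketched as simple scoring functions. The following is a hedged illustration under stated assumptions, not the patented method: `Gpu`, `Node`, and the function names are hypothetical, utilization is computed as (used + requested) / total, and the multi-dimension "load" is taken as an equal-weight average of the two utilizations (the patent does not fix a weighting).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gpu:
    model: str
    total_sm: int   # streaming multiprocessors = unit computing power resources
    free_sm: int
    total_mem: int  # video memory (MiB)
    free_mem: int


@dataclass
class Node:
    name: str
    gpus: List[Gpu]


def pick_node_single_card(candidates: List[Node], gpus_needed: int) -> Optional[Node]:
    """Single-card policy (claim 4): pick the candidate node that would have
    the fewest idle GPUs left after the allocation (a best-fit rule)."""
    def idle(n: Node) -> int:
        return sum(1 for g in n.gpus if g.free_sm == g.total_sm)
    feasible = [n for n in candidates if idle(n) >= gpus_needed]
    if not feasible:
        return None
    return min(feasible, key=lambda n: idle(n) - gpus_needed)


def pick_gpu_single_dim(candidates: List[Node], sm_req: int = 0,
                        mem_req: int = 0) -> Optional[Gpu]:
    """Single-dimension policy (claim 5): among feasible GPUs, pick the one
    whose utilization in the requested dimension is highest after allocation."""
    best, best_util = None, -1.0
    for n in candidates:
        for g in n.gpus:
            if g.free_sm < sm_req or g.free_mem < mem_req:
                continue
            util = ((g.total_sm - g.free_sm + sm_req) / g.total_sm if sm_req
                    else (g.total_mem - g.free_mem + mem_req) / g.total_mem)
            if util > best_util:
                best, best_util = g, util
    return best


def pick_gpu_multi_dim(candidates: List[Node], sm_req: int,
                       mem_req: int) -> Optional[Gpu]:
    """Multi-dimension policy (claim 6): score feasible GPUs by a combined
    load of computing power and video memory utilization; pick the highest."""
    best, best_load = None, -1.0
    for n in candidates:
        for g in n.gpus:
            if g.free_sm < sm_req or g.free_mem < mem_req:
                continue
            load = ((g.total_sm - g.free_sm + sm_req) / g.total_sm
                    + (g.total_mem - g.free_mem + mem_req) / g.total_mem) / 2
            if load > best_load:
                best, best_load = g, load
    return best
```

Packing new work onto the most-utilized feasible GPU (rather than the least) concentrates fragments and keeps other GPUs whole, which is the anti-fragmentation intent stated in the background section.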

Description

GPU shared scheduling method and shared scheduling system based on computing power perception

Technical Field

The invention relates to the field of cloud computing heterogeneous resource scheduling, in particular to GPU scheduling technology in that field, and more particularly to a GPU sharing scheduling method and sharing scheduling system based on computing power awareness.

Background

At a time of rapid development of artificial intelligence technology, the Graphics Processing Unit (GPU) has become an indispensable computing carrier for tasks such as cloud model inference and training, scientific computing, and machine learning, by virtue of its multi-core, highly parallel computing characteristics. As an efficient heterogeneous computing core component, the GPU significantly overcomes the shortcomings of the traditional CPU in computing power and energy efficiency in specific scenarios, establishing its key position in heterogeneous computing architectures. With the continuous growth of computing power demands, heterogeneous computing builds a highly industrializable computing paradigm by integrating multiple processing units such as CPUs and GPUs, and has become an important path for coping with the current energy efficiency bottleneck. Although heterogeneous computing has significant conceptual advantages, in actual deployment, especially in cloud computing environments, the scheduling and management of GPU resources still face many challenges. Mainstream cloud computing platforms exhibit the following defects in GPU resource scheduling. First, in terms of the resource allocation mechanism, conventional GPU scheduling mostly adopts a whole-card exclusive policy.
This approach is simple to implement but difficult to adapt to the differentiated resource requirements of heterogeneous task requests. For example, deep learning inference tasks are typically more sensitive to GPU computing power, while large-scale training tasks rely more on video memory capacity. Most existing schedulers take only the number of GPU cards as the minimum allocation unit and lack fine-grained partitioning and joint planning over the two dimensions of computing power and video memory, so that fragmentation and idling of single-dimension resources (such as video memory or computing power) easily occur during allocation, seriously affecting overall resource utilization. Second, at the scheduling policy level, although several GPU sharing schemes (e.g., Alibaba Cloud GPU Share, Tencent Cloud qGPU, and KubeShare) have been proposed by academia and industry, these schemes still have significant limitations in their applicable scenarios and supported resource dimensions. The early scheme GPU Share takes only video memory as the scheduling basis and lacks consideration of computing power resources and isolation guarantees among tasks; qGPU supports whole-card scheduling optimization, but its framework is complex and its deployment threshold high, making it hard to apply in small- and medium-scale scenarios; KubeShare realizes GPU resource pooling but cannot effectively distinguish the two key resource dimensions of computing power and video memory, nor achieve cooperative optimization of multidimensional resources. These limitations make it difficult for existing scheduling methods to achieve efficient utilization of computing power and video memory while guaranteeing resource isolation. Third, current mainstream container orchestration platforms (e.g., Kubernetes) still have weak native support for GPUs in the design of their scheduling mechanisms.
Their scheduling systems are built mainly around CPU and memory resources, with the GPU managed as an external device through a plug-in (e.g., the device plugin mechanism). As a result, the GPU cannot be flexibly scheduled as a resource on equal footing with CPU and memory, and is not incorporated into the fine-grained resource management and control architecture based on control groups (cgroups). In addition, node resource evaluation is generally based on the overall resource condition of a node, does not consider multi-dimensional resource fragmentation within a single GPU, and thus struggles to achieve accurate resource matching and bin-packing optimization in a shared environment. Fourth, in terms of dynamic resource management, existing scheduling methods generally rely on static resource quotas for scheduling decisions and lack a sensing and feedback mechanism for actual resource occupation while tasks run. When a task's declared resource demand differs significantly from its actual usage, the resource view maintained b