CN-122019200-A - GPU (graphics processing unit) computing power resource scheduling method, system, equipment and medium
Abstract
The application provides a method, system, equipment, and medium for scheduling GPU computing power resources. Task information of at least one task to be scheduled is obtained, and a corresponding task profile is constructed based on the task information; resource state information of each computing node in the GPU cluster is obtained, and a corresponding node state is constructed based on the resource state information; the scheduling priority of the task to be scheduled is determined according to the task profile and the node states; and a target GPU resource combination for executing the task is then determined according to the task profile, the node states, and the scheduling priority. Under conditions of large-scale GPU cluster operation, multiple task types, and dynamically changing resource demands, the method thereby selects a target GPU resource combination that improves overall resource utilization, reduces inter-task interference, and shortens task completion time while guaranteeing timely execution of high-priority tasks.
Inventors
- LANG FENG
Assignees
- 杭州羿贝科技有限公司
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-04-15
Claims (20)
- 1. A GPU computing power resource scheduling method, comprising: acquiring task information of at least one task to be scheduled, and constructing a corresponding task profile based on the task information; acquiring resource state information of each computing node in a GPU cluster, and constructing a corresponding node state based on the resource state information; determining a scheduling priority of the task to be scheduled according to the task profile and the node state; and determining a target GPU resource combination for executing the task to be scheduled according to the task profile, the node state, and the scheduling priority.
- 2. The GPU computing power resource scheduling method of claim 1, wherein constructing the corresponding task profile based on the task information comprises: analyzing the task configuration of the task to be scheduled to acquire task type information, computing power demand information, and timeliness demand information; performing a pre-run analysis on the task to be scheduled, and collecting GPU runtime metrics and I/O runtime metrics; and generating the task profile corresponding to the task to be scheduled according to the task configuration, the GPU runtime metrics, and the I/O runtime metrics.
- 3. The GPU computing power resource scheduling method of claim 2, wherein performing the pre-run analysis on the task to be scheduled comprises: allocating pre-run GPU resources to the task to be scheduled from a preset GPU resource pool; running the task to be scheduled for a pre-run stage of a preset duration, and collecting video memory usage information and computing load information; determining a video memory peak, an average utilization, and a computational density characteristic of the task to be scheduled according to the video memory usage information and the computing load information; and writing the video memory peak, the average utilization, and the computational density characteristic into the task profile.
- 4. The GPU computing power resource scheduling method according to claim 2 or 3, wherein the GPU runtime metrics comprise at least one of a video memory usage curve, stream processor utilization, special matrix operation unit utilization, warp execution efficiency, instruction type distribution, global video memory bandwidth utilization, cache hit rate, and inter-GPU interconnect bandwidth occupancy.
- 5. The GPU computing power resource scheduling method according to claim 2 or 3, wherein the I/O runtime metrics comprise at least one of disk input/output throughput and network input/output throughput.
- 6. The GPU computing power resource scheduling method of claim 1, wherein constructing the corresponding node state based on the resource state information comprises: for each computing node, acquiring the video memory usage state, computing load state, and health state of each GPU on the computing node; for each computing node, acquiring topology information between the GPUs and between the GPUs and the central processing unit in the computing node; and generating the node state corresponding to each computing node according to the video memory usage state, the computing load state, the health state, and the topology information.
- 7. The GPU computing power resource scheduling method of claim 6, wherein the video memory usage state comprises used video memory capacity, remaining video memory capacity, and a video memory fragmentation index.
- 8. The GPU computing power resource scheduling method of claim 6, wherein the computing load state comprises the current stream processor utilization and a list of running tasks.
- 9. The GPU computing power resource scheduling method of claim 6, wherein the health state comprises at least one of temperature information, power consumption information, and error statistics.
- 10. The GPU computing power resource scheduling method of claim 6, wherein the topology information comprises at least one of the interconnect type between the GPUs, the interconnect bandwidth, and the affinity relationship between the GPUs and the central processing unit.
- 11. The GPU computing power resource scheduling method of claim 2, wherein determining the scheduling priority of the task to be scheduled according to the task profile and the node state comprises: acquiring a static priority of the task to be scheduled according to the task configuration; determining a time weight according to the timeliness requirement and waiting duration of the task to be scheduled; determining a critical path weight according to the position of the task to be scheduled in a task dependency graph; determining a resource adaptation weight according to the resource adaptation between the task profile and the node state; and determining the scheduling priority of the task to be scheduled based on the static priority, the time weight, the critical path weight, and the resource adaptation weight.
- 12. The GPU computing power resource scheduling method of claim 11, wherein determining the resource adaptation weight according to the resource adaptation between the task profile and the node state comprises: determining task resource characteristics based on the computational intensity information, video memory demand information, and interconnect bandwidth demand information reflected in the task profile; determining node resource characteristics based on the available GPU video memory, available computing power, and interconnect bandwidth reflected in the node state; and determining the resource adaptation weight based on the similarity between the task resource characteristics and the node resource characteristics.
- 13. The GPU computing power resource scheduling method of claim 1, wherein determining the target GPU resource combination for executing the task to be scheduled according to the task profile, the node state, and the scheduling priority comprises: determining an affinity between the task and each candidate GPU resource combination according to the task profile and the node state corresponding to the candidate GPU resource combination; for each candidate GPU resource combination, determining the interference that would arise when the task to be scheduled is scheduled onto the candidate GPU resource combination, according to the task profiles of the tasks already running on it; determining a matching score for the candidate GPU resource combination based on the scheduling priority, the affinity, and the interference; and determining the target GPU resource combination from the plurality of candidate GPU resource combinations according to the matching scores.
- 14. The GPU computing power resource scheduling method of claim 13, wherein determining the affinity between the task and the candidate GPU resource combination comprises: determining a topology adaptation degree based on the parallelism demand information and interconnect bandwidth demand information in the task profile and the topology information corresponding to the candidate GPU resource combination; determining a computing power adaptation degree based on the computing characteristic information and video memory characteristic information in the task profile and the available computing power information and available video memory information corresponding to the candidate GPU resource combination; determining a locality adaptation degree based on the data locality demand information in the task profile and the data locations of the nodes corresponding to the candidate GPU resource combination; and determining the affinity based on the topology adaptation degree, the computing power adaptation degree, and the locality adaptation degree.
- 15. The GPU computing power resource scheduling method of claim 13, wherein determining the interference that would arise when scheduling the task to be scheduled onto the candidate GPU resource combination comprises: determining a resource conflict degree according to the resource occupation pattern in the task profile and the resource occupation patterns in the task profiles of the tasks currently running on the candidate GPU resource combination; determining the performance impact under concurrent execution according to the current video memory fragmentation index, current stream processor load, and interconnect bandwidth occupancy in the node state; and determining the interference based on the resource conflict degree and the performance impact.
- 16. The GPU computing power resource scheduling method of claim 1, further comprising: re-acquiring the task profiles of the running tasks and the tasks to be scheduled at a preset period, and re-acquiring the resource state information of each computing node in the GPU cluster; recalculating the scheduling priorities and corresponding matching scores of the running tasks and the tasks to be scheduled based on the updated task profiles and the updated node states; and determining whether to perform scheduling adjustment on at least one running task according to the recalculated scheduling priorities and matching scores.
- 17. The GPU computing power resource scheduling method of claim 16, wherein the scheduling adjustment on at least one running task comprises: determining at least one preemptible low-priority running task when a high-priority task cannot acquire the required GPU resources within a preset time limit; performing a state save operation on the low-priority running task to generate a task checkpoint corresponding to the low-priority running task; after the state save operation is completed, releasing the GPU resources allocated to the low-priority running task; allocating the GPU resources to the high-priority task, and executing scheduling based on the task profile of the high-priority task and the corresponding node state; and when GPU resources become available again, resuming the low-priority running task according to the task checkpoint and the updated node state.
- 18. The GPU computing power resource scheduling method of claim 17, wherein determining at least one preemptible low-priority running task comprises: determining a task preemption cost based on the scheduling priority of the running task, task duration information, and task checkpoint interval information; and selecting, from a plurality of low-priority running tasks, a target low-priority running task whose task preemption cost is not higher than a preset threshold as the preemptible low-priority running task.
- 19. The GPU computing power resource scheduling method of claim 1, further comprising: establishing parallelism capability information for at least one task to be scheduled, wherein the parallelism capability information characterizes how performance varies with the number of GPUs; determining an initial number of GPUs for the task to be scheduled according to the current GPU cluster load and the parallelism capability information; and dynamically adjusting the number of GPUs allocated to the task to be scheduled based on GPU cluster load changes and the parallelism capability information while the task is running.
- 20. The GPU computing power resource scheduling method of claim 19, wherein establishing the parallelism capability information for at least one task to be scheduled comprises: in a pre-run analysis stage, executing a plurality of pre-run experiments on the task to be scheduled under different GPU count configurations; collecting the throughput and single-step execution duration of each pre-run experiment; determining performance indexes corresponding to the different GPU counts based on the throughput and single-step execution duration of each pre-run experiment; and associating the performance indexes with the corresponding GPU counts to generate the parallelism capability information.
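The weighted priority of claims 11 and 12 and the matching score of claim 13 can be illustrated with a minimal sketch. The weight vector, the feature vectors, the use of cosine similarity for the resource adaptation weight, and the exact combination formulas are hypothetical simplifications; the patent does not fix these forms:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TaskProfile:
    static_priority: float       # from the task configuration (claim 11)
    urgency: float               # timeliness requirement
    wait_time: float             # how long the task has queued
    critical_path_depth: float   # position in the task dependency graph
    features: List[float]        # compute intensity, memory need, bandwidth need

@dataclass
class NodeState:
    features: List[float]        # available compute, free memory, free bandwidth

def cosine(a, b):
    # similarity between task and node resource-feature vectors (claim 12)
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def scheduling_priority(task, node, w=(1.0, 0.5, 0.3, 0.2)):
    # claim 11: combine static priority, time weight, critical-path weight,
    # and resource adaptation weight; w is an illustrative weight vector
    time_w = task.urgency * task.wait_time
    adapt_w = cosine(task.features, node.features)
    return (w[0] * task.static_priority + w[1] * time_w
            + w[2] * task.critical_path_depth + w[3] * adapt_w)

def matching_score(priority, affinity, interference):
    # claim 13: priority and affinity raise the score, interference lowers it
    return priority * affinity - interference
```

A candidate GPU resource combination with the highest `matching_score` would then be chosen as the target combination.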
Description
GPU (graphics processing unit) computing power resource scheduling method, system, equipment and medium

Technical Field

The application relates to computing power scheduling technology, and in particular to a method, system, equipment, and medium for scheduling GPU computing power resources.

Background

Conventional GPU cluster scheduling generally adopts a coarse-grained strategy based mainly on queue order or priority, with scheduling decisions driven by limited information such as task submission time, static resource requests, remaining node video memory, or the number of idle GPUs. In practical large-scale multi-tenant clusters, resource scheduling based on such coarse parameters often leads to high-priority tasks waiting in queues, low actual GPU utilization, and critical tasks being assigned to nodes with unsuitable topology or severe resource fragmentation, resulting in overall computing power waste and unstable task completion times.

Disclosure of Invention

The application provides a method, system, equipment, and medium for scheduling GPU (graphics processing unit) computing power resources, to solve the technical problems of overall computing power waste and unstable task completion times caused by resource scheduling decisions based on coarse parameters.
In a first aspect, the present application provides a GPU computing power resource scheduling method, comprising: acquiring task information of at least one task to be scheduled, and constructing a corresponding task profile based on the task information; acquiring resource state information of each computing node in the GPU cluster, and constructing a corresponding node state based on the resource state information; determining the scheduling priority of the task to be scheduled according to the task profile and the node state; and determining a target GPU resource combination for executing the task to be scheduled according to the task profile, the node state, and the scheduling priority.

In a second aspect, the present application provides a GPU computing power resource scheduling system, comprising:
- a task information acquisition module, configured to acquire task information of at least one task to be scheduled and construct a corresponding task profile based on the task information;
- a state information acquisition module, configured to acquire the resource state information of each computing node in the GPU cluster and construct a corresponding node state based on the resource state information;
- a priority determining module, configured to determine the scheduling priority of the task to be scheduled according to the task profile and the node state;
- a matching score determining module, configured to determine a matching score between the task to be scheduled and a target GPU resource combination according to the task profile, the node state, and the scheduling priority; and
- a processing resource scheduling module, configured to schedule the task to be scheduled onto the corresponding target GPU resource combination for execution according to the matching score.
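The four method steps of the first aspect can be sketched as a minimal scheduling loop. The profile and state fields and the greedy, memory-only placement in step 4 are hypothetical simplifications; the patent's actual selection uses affinity and interference scores rather than a first-fit rule:

```python
def build_task_profile(task):
    # step 1 (simplified): the patent combines task configuration with
    # pre-run GPU and I/O metrics; here only priority and memory need are kept
    return {"id": task["id"],
            "priority": task.get("priority", 0),
            "mem_gb": task.get("mem_gb", 0)}

def build_node_state(node):
    # step 2 (simplified): the patent also tracks load, health, and topology
    return {"id": node["id"], "free_mem_gb": node["free_mem_gb"]}

def schedule(tasks, nodes):
    profiles = [build_task_profile(t) for t in tasks]
    states = [build_node_state(n) for n in nodes]
    profiles.sort(key=lambda p: -p["priority"])   # step 3: priority order
    plan = {}
    for p in profiles:                            # step 4: pick a fitting node
        for s in states:
            if s["free_mem_gb"] >= p["mem_gb"]:
                plan[p["id"]] = s["id"]
                s["free_mem_gb"] -= p["mem_gb"]
                break
    return plan
```

With one 10 GB node and two 8 GB tasks, only the higher-priority task is placed, mirroring how priority governs contention for scarce GPUs.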
In a third aspect, the present application provides an electronic device, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any one of the possible methods described in the first aspect by executing the executable instructions.

In a fourth aspect, the present application provides a computer-readable storage medium having computer-executable instructions stored therein, which, when executed by a processor, carry out any one of the possible methods described in the first aspect.

According to the GPU computing power resource scheduling method, system, equipment, and medium, task information of at least one task to be scheduled is obtained and a corresponding task profile is constructed from it; resource state information of all computing nodes in the GPU cluster is obtained and corresponding node states are constructed from it; the scheduling priority of the task to be scheduled is determined from the task profile and the node states; and the target GPU resource combination for executing the task is then determined from the task profile, the node states, and the scheduling priority. Thus, under large-scale GPU cluster operation with diverse task types and dynamically changing resource demands, task profiles and node states that truly reflect task resource requirements and cluster resource status are constructed before the scheduling decision, and the matching degree between the scheduling priority of the task to be scheduled and the candidate GPU resource combinations is quantified on this basis, so that a target GPU resource combination is selected that improves overall resource utilization, reduces inter-task interference, and shortens task completion time while guaranteeing timely execution of high-priority tasks.
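Under the dynamically changing conditions described above, claims 17 and 18 preempt low-cost running tasks to free GPUs for a blocked high-priority task. A minimal sketch; the field names, the additive cost formula, and the threshold value are hypothetical, and the checkpoint is a stand-in flag rather than a real state save:

```python
def preemption_cost(task):
    # claim 18 (hypothetical form): low-priority, short-running tasks with
    # frequent checkpoints are cheap to preempt
    return task["priority"] + task["runtime_s"] / max(task["ckpt_interval_s"], 1)

def preempt_for(high_task, running, cost_threshold):
    # claim 17: choose preemptible low-priority tasks under the cost threshold,
    # checkpoint them, and free their GPUs for the high-priority task
    victims = sorted(
        (t for t in running
         if t["priority"] < high_task["priority"]
         and preemption_cost(t) <= cost_threshold),
        key=preemption_cost)
    freed, chosen = 0, []
    for v in victims:
        if freed >= high_task["gpus"]:
            break
        v["checkpointed"] = True      # stand-in for the state save operation
        freed += v["gpus"]
        chosen.append(v)
    # preempt only if enough GPUs can actually be freed
    return chosen if freed >= high_task["gpus"] else []
```

When GPUs later become free again, the checkpointed tasks would be resumed from their saved state, as claim 17 describes.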