CN-122019168-A - GPU cluster resource allocation method, device, equipment and storage medium
Abstract
The application discloses a method, device, equipment, and storage medium for allocating GPU cluster resources, relating to the technical field of GPU cluster resource allocation. The method generates a hardware performance description vector by fusing static hardware parameters with dynamic micro-benchmark data, overcoming the limitation of characterizing GPU performance by static parameters alone and accurately matching the actual running capability of GPUs of different architectures in a heterogeneous cluster. It generates a task feature vector by extracting task computation features and resource requirements, achieving a quantitative description of the resource consumption of inference tasks. A machine learning model integrates information across the hardware, task, and load dimensions, effectively capturing the performance degradation caused by multi-task parallel operation and improving the accuracy of execution-time prediction. The predicted execution time then drives scheduling decision operations comprising candidate node screening, comprehensive cost evaluation, and resource reservation with backfill, thereby improving task execution efficiency and cluster load balance.
Inventors
- CHEN YANQI
- WU CHUNPENG
- YE QINGHE
- JIANG CONGFENG
- WANG YUE
- LIU JUNMING
- LI YONG
- WANG YUNXIAO
Assignees
- Hangzhou Dianzi University
- China Electric Power Research Institute Co., Ltd.
- Information & Telecommunication Company, State Grid Shandong Electric Power Company
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-29
Claims (10)
- 1. A method for GPU cluster resource allocation, the method comprising: obtaining performance data of a heterogeneous GPU cluster and generating a hardware performance description vector based on the performance data, wherein the performance data comprises static hardware parameters and dynamic hardware parameters, and the dynamic hardware parameters are obtained by executing dynamic micro-benchmark tests on each GPU node; extracting computation features and resource requirements of an inference task to be scheduled, and generating a task feature vector; collecting the current load state of each GPU node, and inputting the hardware performance description vector, the task feature vector, and the current load state into a pre-trained machine learning model to obtain the estimated execution time of the inference task to be scheduled on each GPU node; and executing a scheduling decision operation based on the estimated execution time.
- 2. The method of claim 1, wherein the static hardware parameters comprise the CUDA core count, boost frequency, and video memory capacity; the dynamic micro-benchmark tests comprise a GEMM compute-power test and a video memory bandwidth test and are used to obtain the dynamic hardware parameters, which comprise a measured compute-power value and a measured video memory bandwidth value; the computation features comprise FLOPs, operator types, and parameter count; and the resource requirements comprise a peak video memory occupancy.
- 3. The method of claim 1, wherein inputting the hardware performance description vector, the task feature vector, and the current load state into a pre-trained machine learning model to obtain the estimated execution time of the task to be scheduled on each GPU node comprises: standardizing the hardware performance description vector, the task feature vector, and the current load state, and splicing the processed data into a unified input vector; and inputting the unified input vector into the pre-trained machine learning model, capturing performance degradation characteristics of multi-task parallel operation through the machine learning model, and outputting the estimated execution time of the inference task to be scheduled on each GPU node.
- 4. The method of claim 1, wherein executing the scheduling decision operation based on the estimated execution time comprises: screening candidate nodes based on node resource hard constraints and model mutual exclusion rules; calculating a comprehensive cost score for each candidate node by combining the estimated execution time, the data transmission time, and a load balancing factor, and selecting the node with the optimal comprehensive cost score; and, if no suitable node is screened out, executing a resource reservation or backfill operation according to the resource demand type of the task, thereby completing the scheduling of the inference task to be scheduled.
- 5. The method of claim 4, wherein screening candidate nodes based on node resource hard constraints and model mutual exclusion rules comprises: determining the node resource hard constraints and the model mutual exclusion rules, wherein the node resource hard constraints comprise a video memory capacity adaptation constraint and a CPU utilization threshold constraint, and the model mutual exclusion rules prohibit two video-memory-intensive models from being scheduled onto the same GPU; filtering out nodes that do not meet the video memory capacity adaptation requirement based on the video memory capacity adaptation constraint, and filtering out nodes whose CPU utilization has reached the threshold based on the CPU utilization threshold constraint; and eliminating nodes at risk of a scheduling conflict between two video-memory-intensive models on the same GPU based on the model mutual exclusion rules, to obtain the candidate nodes.
- 6. The method of claim 4, wherein the resource reservation or backfill operation comprises: judging whether the video memory occupancy and computation amount of the inference task to be scheduled exceed resource demand thresholds, classifying the task as a large-resource-demand task if so, and as a small-resource-demand task if not; for a large-resource-demand task, checking whether a resource reservation can be created on the node with the optimal comprehensive cost score, and if so, creating a resource window period during which no other models are scheduled, and executing the large-resource-demand task after the resource window period takes effect; and, for a small-resource-demand task, searching for a backfill gap within a reserved resource window period, and if one exists, calculating the expected run time of the inference task to be scheduled and, if the expected run time is shorter than the backfill gap duration, scheduling the small-resource-demand task into the backfill gap of the corresponding node for execution; if no backfill gap is found, returning the small-resource-demand task to the task queue.
- 7. The method of claim 1, further comprising a preparation step that is completed before the performance data of the heterogeneous GPU cluster is obtained for the first time and that is performed in addition to the subsequent scheduling process, the preparation step comprising: executing static hardware parameter acquisition and the dynamic micro-benchmark tests when the GPU cluster is initialized or a new node is added, and generating a hardware performance description vector for the corresponding node; and, before the performance data of the heterogeneous GPU cluster is obtained for the first time, collecting historical operation data of different model combinations on the heterogeneous GPU cluster, constructing a training data set, training the machine learning model with the training data set, and subsequently updating the training data set with new historical operation data to iteratively train the machine learning model.
- 8. A GPU cluster resource allocation device, the device comprising: an acquisition unit configured to obtain performance data of a heterogeneous GPU cluster and generate a hardware performance description vector based on the performance data, wherein the performance data comprises static hardware parameters and dynamic hardware parameters, the dynamic hardware parameters are obtained by executing dynamic micro-benchmark tests on each GPU node, and the acquisition unit is further configured to collect the current load state of each GPU node; a feature generation unit configured to extract the computation features and resource requirements of the inference task to be scheduled and generate a task feature vector; an inference unit configured to input the hardware performance description vector, the task feature vector, and the current load state into a pre-trained machine learning model to obtain the estimated execution time of the inference task to be scheduled on each GPU node; and an execution unit configured to execute a scheduling decision operation based on the estimated execution time.
- 9. A computing device comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to carry out the steps of the method according to any one of claims 1 to 7 when the computer program is executed.
- 10. A computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the method according to any one of claims 1 to 7.
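The standardization and splicing step of claim 3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the feature names are hypothetical, and a simple linear scorer stands in for the pre-trained machine learning model, whose weights here are arbitrary placeholders.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score standardization of one feature group (the 'standardized processing')."""
    mu, sigma = mean(values), pstdev(values) or 1.0  # guard against zero variance
    return [(v - mu) / sigma for v in values]

def unified_input(hw_vec, task_vec, load_vec):
    """Standardize each group, then splice them into one unified input vector."""
    return standardize(hw_vec) + standardize(task_vec) + standardize(load_vec)

def predict_exec_time(x, weights, bias):
    """Stand-in for the trained model: a linear scorer, clamped to be non-negative."""
    return max(0.0, sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical features: hardware [measured TFLOPS, bandwidth GB/s, VRAM GB],
# task [GFLOPs, parameters in millions, peak VRAM GB], load [GPU util %, mem util %].
x = unified_input([82.6, 1008.0, 24.0], [310.0, 7000.0, 14.0], [35.0, 40.0])
t = predict_exec_time(x, weights=[-0.8, -0.5, -0.1, 0.9, 0.6, 0.3, 0.7, 0.4], bias=2.0)
```

In practice the scorer would be replaced by the model trained on historical operation data (claim 7), which can learn the nonlinear slowdown caused by multi-task interference.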
Description
GPU cluster resource allocation method, device, equipment and storage medium
Technical Field
The present application relates to the field of GPU cluster resource allocation technologies, and in particular to a GPU cluster resource allocation method, device, equipment, and storage medium.
Background
With the growing scale and complexity of artificial intelligence inference tasks, heterogeneous GPU clusters have become the core infrastructure supporting large-scale inference deployment, thanks to their flexible hardware combinations and strong parallel computing capability; their scheduling efficiency directly determines cluster resource utilization and job completion efficiency. Existing GPU cluster resource allocation methods mostly rely on static GPU hardware parameters, lack deep extraction of task computation features, introduce no effective mechanism to predict multi-task parallel resource interference and execution time, lack precise screening rules and multidimensional cost evaluation for scheduling decisions, and base resource reservation and backfill on empirical estimation. As a result, the prior art cannot effectively perceive heterogeneous hardware performance differences and multi-task parallel interference, leading to low cluster resource utilization and excessively long overall job completion time, and failing to meet the efficient scheduling requirements of inference tasks on heterogeneous GPU clusters.
Disclosure of Invention
In order to solve the above problems, the present application provides a GPU cluster resource allocation method, device, equipment, and storage medium, including the following.
In a first aspect, the present application provides a GPU cluster resource allocation method, which includes: obtaining performance data of a heterogeneous GPU cluster, and generating a hardware performance description vector based on the performance data, wherein the performance data comprises static hardware parameters and dynamic hardware parameters, and the dynamic hardware parameters are obtained by executing dynamic micro-benchmark tests on each GPU node; extracting computation features and resource requirements of an inference task to be scheduled, and generating a task feature vector; collecting the current load state of each GPU node, and inputting the hardware performance description vector, the task feature vector, and the current load state into a pre-trained machine learning model to obtain the estimated execution time of the inference task to be scheduled on each GPU node; and executing a scheduling decision operation based on the estimated execution time.
Optionally, the static hardware parameters comprise the CUDA core count, boost frequency, and video memory capacity; the dynamic micro-benchmark tests comprise a GEMM compute-power test and a video memory bandwidth test and are used to obtain the dynamic hardware parameters, which comprise a measured compute-power value and a measured video memory bandwidth value; the computation features comprise FLOPs, operator types, and parameter count; and the resource requirements comprise a peak video memory occupancy.
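The fusion of static parameters with micro-benchmark measurements into a hardware performance description vector can be sketched as below. This is an illustrative assumption about one possible layout: the field names and the example node are hypothetical, and the measured GEMM throughput and bandwidth would in practice come from timed CUDA benchmark runs rather than being supplied as constants.

```python
from dataclasses import dataclass

@dataclass
class GPUProfile:
    # Static hardware parameters (from the device specification)
    cuda_cores: int
    boost_mhz: float
    vram_gb: float
    # Dynamic parameters measured by micro-benchmarks
    gemm_tflops: float      # achieved throughput of a timed GEMM run
    bandwidth_gbs: float    # measured video memory copy bandwidth

def perf_vector(p: GPUProfile) -> list:
    """Fuse static and measured parameters into one hardware performance description vector."""
    # Theoretical FP32 peak: 2 FLOPs per core per cycle (fused multiply-add)
    peak_tflops = 2 * p.cuda_cores * p.boost_mhz * 1e6 / 1e12
    efficiency = p.gemm_tflops / peak_tflops   # fraction of the paper spec actually achieved
    return [p.cuda_cores, p.boost_mhz, p.vram_gb,
            p.gemm_tflops, p.bandwidth_gbs, efficiency]

# Hypothetical node resembling an RTX 4090-class GPU
node = GPUProfile(cuda_cores=16384, boost_mhz=2520.0, vram_gb=24.0,
                  gemm_tflops=71.3, bandwidth_gbs=912.0)
vec = perf_vector(node)
```

The final efficiency entry is the kind of signal static parameters alone cannot provide: two GPUs with identical spec sheets can differ in achieved throughput, which is exactly what the dynamic micro-benchmarks are meant to expose.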
Optionally, inputting the hardware performance description vector, the task feature vector, and the current load state into a pre-trained machine learning model to obtain the estimated execution time of the inference task to be scheduled on each GPU node includes: standardizing the hardware performance description vector, the task feature vector, and the current load state, and splicing the processed data into a unified input vector; and inputting the unified input vector into the pre-trained machine learning model, capturing performance degradation characteristics of multi-task parallel operation through the machine learning model, and outputting the estimated execution time of the inference task to be scheduled on each GPU node.
Optionally, executing the scheduling decision operation based on the estimated execution time includes: screening candidate nodes based on node resource hard constraints and model mutual exclusion rules; calculating a comprehensive cost score for each candidate node by combining the estimated execution time, the data transmission time, and a load balancing factor, and selecting the node with the optimal comprehensive cost score; and, if no suitable node is screened out, executing a resource reservation or backfill operation according to the resource demand type of the task, thereby completing the scheduling of the inference task to be scheduled.
Optionally, screening the candidate nodes based on the node resource hard constraints and the model mutual exclusion rules includes: determining node resource hard constraint
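The candidate screening and comprehensive cost evaluation described above can be sketched as follows. This is a minimal sketch under stated assumptions: the node field names, the 90% CPU-utilization threshold, and the cost weights are illustrative choices, not values specified by the patent.

```python
def candidate_nodes(task, nodes):
    """Hard-constraint and mutual-exclusion screening of nodes."""
    out = []
    for n in nodes:
        if n["free_vram_gb"] < task["peak_vram_gb"]:
            continue  # video memory capacity adaptation constraint
        if n["cpu_util"] >= 0.90:
            continue  # CPU utilization threshold constraint (assumed threshold)
        if task["vram_intensive"] and n["has_vram_intensive_model"]:
            continue  # mutual exclusion: no two VRAM-intensive models on one GPU
        out.append(n)
    return out

def cost_score(pred_time_s, transfer_s, load_factor,
               w_exec=0.6, w_xfer=0.2, w_load=0.2):
    """Weighted comprehensive cost; the weights here are illustrative."""
    return w_exec * pred_time_s + w_xfer * transfer_s + w_load * load_factor

def best_node(task, nodes):
    cands = candidate_nodes(task, nodes)
    if not cands:
        return None  # caller falls through to resource reservation / backfill
    return min(cands, key=lambda n: cost_score(
        n["pred_time_s"], n["transfer_s"], n["load"]))

# Hypothetical cluster state: the second node lacks enough free VRAM and is filtered out.
nodes = [
    {"free_vram_gb": 20.0, "cpu_util": 0.40, "has_vram_intensive_model": False,
     "pred_time_s": 3.0, "transfer_s": 0.5, "load": 0.4},
    {"free_vram_gb": 10.0, "cpu_util": 0.40, "has_vram_intensive_model": False,
     "pred_time_s": 1.0, "transfer_s": 0.1, "load": 0.1},
]
task = {"peak_vram_gb": 14.0, "vram_intensive": True}
chosen = best_node(task, nodes)
```

Returning `None` when no candidate survives the filters is the point where the reservation-or-backfill branch of the scheduling decision would take over.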