CN-121501524-B - Automatic aggregation scheduling method for improving GPU computing power utilization rate
Abstract
The invention relates to the technical field of data processing, in particular to an automatic aggregation scheduling method for improving the computing power utilization rate of a Graphics Processing Unit (GPU). The method comprises the steps of: collecting GPU operation data and a computing task set; processing the operation data to construct a resource information base; extracting features of the computing task set to obtain standard task features, and classifying the computing task set based on the standard task features to generate a task queue; aggregating the GPU fragments that meet preset conditions based on the resource information base to form a local computing power pool; determining a scheduling mode according to the comparison result of the total computing power of the pool and the task computing power demand; and obtaining the matching accuracy of the task queue and the local computing power pool to judge whether the improvement effectiveness of the GPU computing power utilization rate meets the requirement. If not, the method judges whether to increase the elastic redundancy threshold of the GPU fragments; if not, it acquires the missing rate of the operation data and judges the acquisition integrity of the data; if not, it judges whether to reduce the node resource load adaptation factor; and if not, it determines the architecture fluctuation suppression threshold of the GPU fragments based on the computing power fluctuation amplitude. The invention improves the effectiveness of increasing the GPU computing power utilization rate.
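The abstract describes a fixed diagnostic cascade of threshold checks. The sketch below, in Python, is purely illustrative: the function name, parameter names, and concrete threshold values are assumptions not found in the patent text; only the decision order follows the abstract.

```python
def diagnose(match_accuracy, first_acc, second_acc,
             missing_rate, first_miss, second_miss,
             fluctuation, preset_fluctuation):
    """Illustrative sketch of the abstract's decision cascade."""
    # Effectiveness check against the second (higher) accuracy threshold.
    if match_accuracy >= second_acc:
        return "effectiveness meets requirement"
    # Accuracy at or below the first threshold: raise fragment redundancy.
    if match_accuracy <= first_acc:
        return "increase GPU fragment elastic redundancy threshold"
    # Data-integrity branch: missing rate between the two preset rates.
    if first_miss < missing_rate < second_miss:
        return "reduce node resource load adaptation factor"
    # Last resort: excessive computing power fluctuation in unit time.
    if fluctuation > preset_fluctuation:
        return "reduce GPU fragment architecture fluctuation suppression threshold"
    return "keep current parameters"
```

The branches mirror claims 1, 3 and 5; a real scheduler would additionally compute the adjustment amplitudes described in claims 1, 4 and 7.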
Inventors
- ZHANG MINGYUE
Assignees
- 北京基流科技股份有限公司
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-01-14
Claims (7)
- 1. An automatic aggregation scheduling method for improving the computing power utilization rate of a GPU (Graphics Processing Unit), characterized by comprising the following steps: respectively acquiring operation data of the GPU and a computing task set, processing the operation data to obtain a resource information base, extracting features of the computing task set to obtain standard task features, and classifying the computing task set based on the standard task features to obtain a task queue; aggregating the GPU fragments that meet preset conditions in all nodes based on the resource information base to obtain a local computing power pool, and determining the scheduling mode of the task queue based on the comparison result of the total computing power of the local computing power pool and the task computing power demand; acquiring the matching accuracy of the task queue and the local computing power pool, and determining, based on the matching accuracy, whether the improvement effectiveness of the GPU computing power utilization rate meets the requirement; if the improvement effectiveness of the GPU computing power utilization rate does not meet the requirement, determining whether the elastic redundancy threshold of the GPU fragments needs to be increased; if the elastic redundancy threshold of the GPU fragments does not need to be increased, acquiring the missing rate of the operation data in unit time to determine whether the acquisition integrity of the operation data meets the requirement; if the acquisition integrity of the operation data does not meet the requirement, determining whether the node resource load adaptation factor needs to be reduced; if the node resource load adaptation factor does not need to be reduced, determining the architecture fluctuation suppression threshold of the GPU fragments based on the computing power fluctuation amplitude of the local computing power pool in unit time; wherein determining whether the improvement effectiveness of the GPU computing power utilization rate meets the requirement based on the matching accuracy of the task queue and the local computing power pool comprises: comparing the matching accuracy of the task queue and the local computing power pool with a preset second accuracy; if the matching accuracy is greater than or equal to the preset second accuracy, determining that the improvement effectiveness of the GPU computing power utilization rate meets the requirement; if the matching accuracy is smaller than the preset second accuracy, determining that the improvement effectiveness of the GPU computing power utilization rate does not meet the requirement; and wherein determining whether the elastic redundancy threshold of the GPU fragments needs to be increased comprises: comparing the matching accuracy of the task queue and the local computing power pool with a preset first accuracy and the preset second accuracy respectively; if the matching accuracy is smaller than or equal to the preset first accuracy, determining that the elastic redundancy threshold of the GPU fragments needs to be increased; if the matching accuracy is greater than the preset first accuracy and smaller than the preset second accuracy, determining that the elastic redundancy threshold of the GPU fragments does not need to be increased; the increase amplitude of the elastic redundancy threshold of the GPU fragments being determined by the difference between the preset first accuracy and the matching accuracy of the task queue and the local computing power pool.
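Claim 1's two-threshold accuracy test can be read compactly as follows. This Python sketch is an illustration, not the patented implementation: the function and variable names are assumptions, and the claim only states that the increase amplitude is "determined by" the difference between the preset first accuracy and the matching accuracy, so the direct use of that difference here is one possible reading.

```python
def redundancy_threshold_decision(match_acc, first_acc, second_acc):
    """Two-threshold test from claim 1 (assumes first_acc < second_acc)."""
    if match_acc >= second_acc:
        # Improvement effectiveness already meets the requirement.
        return ("effectiveness ok", 0.0)
    if match_acc <= first_acc:
        # Increase amplitude tied to (first_acc - match_acc) per claim 1.
        return ("increase redundancy threshold", first_acc - match_acc)
    # first_acc < match_acc < second_acc: no increase; fall through to
    # the data-integrity check of claim 2.
    return ("no increase; check data integrity next", 0.0)
```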
- 2. The automatic aggregation scheduling method for improving the computing power utilization rate of a GPU according to claim 1, wherein determining whether the acquisition integrity of the operation data meets the requirement based on the missing rate of the operation data in unit time comprises: comparing the missing rate of the operation data in unit time with a preset first missing rate; if the missing rate of the operation data in unit time is smaller than or equal to the preset first missing rate, determining that the acquisition integrity of the operation data meets the requirement, and determining whether the elastic redundancy threshold of the GPU fragments meets the requirement; if the missing rate of the operation data in unit time is greater than the preset first missing rate, determining that the acquisition integrity of the operation data does not meet the requirement.
- 3. The automatic aggregation scheduling method for improving the computing power utilization rate of a GPU according to claim 2, wherein determining whether the node resource load adaptation factor needs to be reduced comprises: comparing the missing rate of the operation data in unit time with the preset first missing rate and a preset second missing rate respectively; if the missing rate of the operation data in unit time is greater than the preset first missing rate and smaller than the preset second missing rate, determining that the node resource load adaptation factor needs to be reduced; if the missing rate of the operation data in unit time is greater than or equal to the preset second missing rate, determining that the node resource load adaptation factor does not need to be reduced.
- 4. The automatic aggregation scheduling method for improving the computing power utilization rate of a GPU according to claim 3, wherein the reduction amplitude of the node resource load adaptation factor is determined by the difference between the missing rate of the operation data in unit time and the preset first missing rate.
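Claims 2 through 4 together define a single missing-rate test with two preset rates. The Python sketch below is illustrative only: names are assumptions, and claim 4 states only that the reduction amplitude is "determined by" the difference between the missing rate and the preset first missing rate, so returning that difference directly is one possible reading.

```python
def adaptation_factor_decision(miss_rate, first_miss, second_miss):
    """Missing-rate test of claims 2-4 (assumes first_miss < second_miss)."""
    if miss_rate <= first_miss:
        # Claim 2: acquisition integrity meets the requirement.
        return ("integrity ok", 0.0)
    if miss_rate < second_miss:
        # Claim 3: reduce the adaptation factor; claim 4 ties the
        # reduction amplitude to (miss_rate - first_miss).
        return ("reduce adaptation factor", miss_rate - first_miss)
    # Claim 3: at or above the second rate, no reduction; fall through
    # to the fluctuation check of claim 5.
    return ("no reduction; check fluctuation next", 0.0)
```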
- 5. The automatic aggregation scheduling method for improving the computing power utilization rate of a GPU according to claim 4, wherein determining the architecture fluctuation suppression threshold of the GPU fragments based on the computing power fluctuation amplitude of the local computing power pool in unit time comprises: comparing the computing power fluctuation amplitude of the local computing power pool in unit time with a preset fluctuation amplitude; if the computing power fluctuation amplitude of the local computing power pool in unit time is smaller than or equal to the preset fluctuation amplitude, determining that the validity of the operation data meets the requirement, keeping the architecture fluctuation suppression threshold of the GPU fragments unchanged, and determining whether the node resource load adaptation factor meets the requirement; if the computing power fluctuation amplitude of the local computing power pool in unit time is greater than the preset fluctuation amplitude, determining that the validity of the operation data does not meet the requirement, and reducing the architecture fluctuation suppression threshold of the GPU fragments.
- 6. The automatic aggregation scheduling method for improving the computing power utilization rate of a GPU according to claim 5, wherein the computing power fluctuation amplitude of the local computing power pool in unit time is the difference between the maximum computing power and the minimum computing power of the local computing power pool in unit time.
- 7. The automatic aggregation scheduling method for improving the computing power utilization rate of a GPU according to claim 6, wherein the reduction amplitude of the architecture fluctuation suppression threshold of the GPU fragments is determined by the difference between the computing power fluctuation amplitude of the local computing power pool in unit time and the preset fluctuation amplitude.
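Claims 5 through 7 define the fluctuation check: claim 6 defines the amplitude as maximum minus minimum computing power in unit time, and claim 7 ties the threshold reduction to the excess over the preset amplitude. The Python sketch below is illustrative; names are assumptions, and using the excess directly as the reduction amplitude is one possible reading of claim 7.

```python
def fluctuation_decision(power_samples, preset_amplitude):
    """Fluctuation check of claims 5-7 over one unit-time window."""
    # Claim 6: amplitude = max computing power - min computing power.
    amplitude = max(power_samples) - min(power_samples)
    if amplitude <= preset_amplitude:
        # Claim 5: data validity meets the requirement; keep the
        # suppression threshold unchanged.
        return ("data validity ok", 0.0)
    # Claims 5 and 7: reduce the suppression threshold, with the
    # reduction amplitude tied to (amplitude - preset_amplitude).
    return ("reduce architecture fluctuation suppression threshold",
            amplitude - preset_amplitude)
```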
Description
Automatic aggregation scheduling method for improving GPU computing power utilization rate
Technical Field
The invention relates to the technical field of data processing, in particular to an automatic aggregation scheduling method for improving the computing power utilization rate of a GPU.
Background
In the prior art, heterogeneous GPU clusters have become the core infrastructure supporting large-scale computing power demands. However, conventional GPU computing power aggregation scheduling technology still has limitations. On the one hand, GPU loads change rapidly and dynamically, so cross-node data synchronization is prone to delay and deviation; network jitter and hardware interruption at edge nodes easily cause data loss or corruption; and noise introduced by host scheduling, instance interference, and the like further degrades the raw monitoring data, so that scheduling decisions generated from such low-quality data cause resource allocation misjudgments. On the other hand, the conventional scheduling strategy lacks mechanisms for quantitative data quality management and dynamic parameter optimization, so the improvement of the GPU computing power utilization rate is insufficient.
CN119917224A discloses a cloud platform AI computing power scheduling method, device and equipment. After a virtual machine migration instruction is received, the method extracts a target model, target slots and target quantities from the VGPU information of the source virtual machine, wherein the target model is the model of the GPU that provides the VGPU for the source virtual machine, the target slots are the slots of the GPUs that provide the VGPU for the source virtual machine, the target quantity is the number of VGPUs that the GPU of each target slot provides for the source virtual machine, and the target slots correspond one-to-one with the target quantities. The method then determines alternative computing nodes from the other computing nodes and detects whether a first computing node exists among them, a first computing node being one whose GPUs have, in each target slot, a number of idle VGPUs greater than or equal to the corresponding target quantity; if a first computing node exists, the target computing node is determined from the first computing nodes. However, this method has the following problems: because the GPU monitoring data are not preprocessed, nodes are easily disturbed by short-term noise; and because there is no resource fragment aggregation mechanism, discrete GPU fragments cannot be integrated, causing resource waste and deviation in optimal node judgment, so the improvement effectiveness of the GPU computing power utilization rate is insufficient.
Disclosure of Invention
Therefore, the invention provides an automatic aggregation scheduling method for improving the computing power utilization rate of a GPU (Graphics Processing Unit), which is used for solving the problems in the prior art that, because GPU monitoring data are not preprocessed, nodes are easily disturbed by short-term noise, and because there is no resource fragment aggregation mechanism, discrete GPU fragments cannot be integrated, causing resource waste and deviation in optimal node judgment, so that the improvement effectiveness of the GPU computing power utilization rate is insufficient. In order to achieve the above object, the present invention provides an automatic aggregation scheduling method for improving the computing power utilization rate of a GPU, including: respectively acquiring operation data of the GPU and a computing task set, processing the operation data to obtain a resource information base, extracting features of the computing task set to obtain standard task features, and classifying the computing task set based on the standard task features to obtain a task queue; aggregating the GPU fragments that meet preset conditions in all nodes based on the resource information base to obtain a local computing power pool, and determining the scheduling mode of the task queue based on the comparison result of the total computing power of the local computing power pool and the task computing power demand; acquiring the matching accuracy of the task queue and the local computing power pool, and determining whether the improvement effectiveness of the GPU computing power utilization rate meets the requirement based on the matching accuracy; if the improvement effectiveness of the GPU computing power utilization rate does not meet the requirement, determining whether the elastic redundancy threshold of the GPU fragments needs to be increased; if the elastic redundancy threshold of the GPU fragment