
CN-121984933-A - GPU cluster scheduling method, device and equipment based on four-layer network topology

CN121984933A

Abstract

The invention discloses a GPU cluster scheduling method, device and equipment based on a four-layer network topology. The method constructs a resource portrait from historical data and integrates it with the four-layer topology state, from the server layer up to the super-convergence layer, to form system state information. An attack level is computed for each task, and a dual-weight preemption relation is established whose sole condition is that the attack level exceeds the defense level of a running task. Candidate placement schemes are generated from the system state in inside-to-outside order and checked layer by layer against compute and network resource capacity constraints. For each feasible scheme satisfying the constraints, a total score is obtained by summing a completion component, a scale component and a topology affinity; the preemption cost is deducted to obtain a net gain, and the scheme with the largest, strictly positive net gain is selected for execution. After a task is preempted, its defense level is automatically raised by one level, so the number of times a task can be preempted is strictly bounded. The method and device markedly improve GPU and HBM utilization, effectively reduce cross-layer network communication overhead, and ensure the stability and fairness of the scheduling process.
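The dual-weight preemption rule, net-gain selection, and defense-level bump described in the abstract can be sketched as follows. This is an illustrative reconstruction, not code from the patent; all names, the level range, and the scoring form are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    attack_level: int        # derived from resource demand and communication pattern
    defense_level: int = -1  # initialized to the attack level

    def __post_init__(self):
        if self.defense_level < 0:
            self.defense_level = self.attack_level

def may_preempt(new_task: Task, running: Task) -> bool:
    # Preemption is allowed only when the new task's attack level is
    # strictly greater than the running task's defense level.
    return new_task.attack_level > running.defense_level

def on_preempted(task: Task, max_level: int = 10) -> None:
    # After being preempted, the task's defense level rises one level,
    # strictly bounding how many times it can be preempted again.
    task.defense_level = min(task.defense_level + 1, max_level)

def choose_scheme(schemes):
    # Each scheme carries a total score (completion + scale + topology
    # affinity) and a preemption cost; execute the scheme with the
    # largest net gain, and only if that gain is strictly positive.
    best, best_gain = None, 0.0
    for s in schemes:
        gain = s["total_score"] - s["preempt_cost"]
        if gain > best_gain:
            best, best_gain = s, gain
    return best

victim = Task("running-job", attack_level=3)
newcomer = Task("new-job", attack_level=5)
ok = may_preempt(newcomer, victim)   # True: 5 > 3
on_preempted(victim)                 # defense level 3 -> 4
pick = choose_scheme([
    {"name": "A", "total_score": 8.0, "preempt_cost": 9.5},  # negative net gain
    {"name": "B", "total_score": 7.0, "preempt_cost": 4.0},  # net gain 3.0
])
```

Note that after one preemption the victim's defense level (4) already blocks a second preemption by the same newcomer (attack level 5 still wins, but a level-4 newcomer would not), which is how the bound on preemption counts emerges.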

Inventors

  • YU ZHIWEN
  • MA YUERUI
  • LUAN YAJIAN
  • PAN JUNBIN
  • HAN FENGJING
  • ZHANG HONGXI

Assignees

  • 云宙时代科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-24

Claims (10)

  1. A GPU cluster scheduling method based on a four-layer network topology, characterized by comprising the following steps: S1, constructing a resource portrait based on historical operation data of tasks and the GPU cluster, and collecting the four-layer network topology, where the four layers comprise, in order from inside to outside, a server layer, a cabinet layer, a convergence layer and a super-convergence layer; the two kinds of information jointly form the state information of the scheduling system, namely the total capacity and real-time utilization of computing resources and network communication resources at each layer; S2, calculating an attack level for the task to be scheduled according to its resource demand and communication characteristics, initializing its defense level to the attack level, and establishing a dual-weight relation under which preemption is allowed only when the attack level of the new task is greater than the defense level of a running task; S3, generating initial placement candidate schemes according to the system state information and the demand of the task to be scheduled, following the four-layer topology order from inside to outside, and for each candidate scheme, jointly checking at every layer of the four-layer topology that the used amount of computing and network communication resources plus the task demand does not exceed the total capacity, thereby performing hard-constraint filtering; S4, for each candidate scheme passing the hard constraints, calculating its completion component, scale component and topology affinity, summing them, and subtracting the preemption cost to obtain a net gain, then selecting the candidate scheme with the maximum net gain greater than zero as the final scheduling decision; S5, executing the final scheduling decision, terminating the corresponding preempted task when preemption is involved, immediately checking after execution whether any four-layer topology resource exceeds its limit, rolling back if so, and otherwise updating the system resource state; S6, after a task is preempted, raising its defense level by one level according to a preset rule, thereby realizing bounded control over the number of times the task can be preempted; S7, after ordering tasks that arrive in batches according to a preset rule, sequentially executing steps S2 to S5 for each task, and refreshing the system state after each scheduling as the input of the next scheduling.
  2. The GPU cluster scheduling method based on a four-layer network topology according to claim 1, wherein in step S4 the topology affinity is calculated according to a formula combining the following terms: a reward obtained when the resource utilization rate exceeds a preset threshold without violating the capacity constraint; a layout-uniformity reward obtained when the priority weights of the tasks carried by a server or cabinet are consistent; a resource-monopolization reward obtained when a server or cabinet is monopolized by a single task; and a penalty for cross-layer communication of the task caused by its placement locations; a cross-layer attenuation coefficient attenuates the reward terms when the task deployment crosses the convergence layer or the super-convergence layer.
  3. The GPU cluster scheduling method based on a four-layer network topology according to claim 1, wherein constructing the resource portrait in step S1 comprises collecting historical time-series data using a sliding window, eliminating abnormal values within the window according to the 3σ principle, discarding data preceding a mutation point when the change in the statistical characteristics of the resource usage pattern exceeds a preset mutation threshold, and finally describing the peak and average demand of each task on each resource dimension.
  4. The GPU cluster scheduling method based on a four-layer network topology according to claim 1, wherein when the initial placement candidate schemes are generated in step S3, on the premise of satisfying the capacity hard constraints, a resource combination is preferentially selected that, after the demand of the new task is stacked on it, brings the resource utilization of the target server, cabinet or convergence layer to or near a preset optimization threshold, so as to improve local communication efficiency.
  5. The GPU cluster scheduling method based on a four-layer network topology according to claim 1, wherein in step S3, when a candidate scheme needs to trigger preemption, the low-weight and medium-weight tasks satisfying the preemption condition are ordered by unit cost density to generate a preemption list, and within one scheduling decision it is ensured that the ratio of preempted low-weight tasks, and of preempted medium-weight tasks, to the total number of running tasks of the respective category does not exceed a preset threshold.
  6. The GPU cluster scheduling method based on a four-layer network topology according to claim 1, wherein the preset ordering rule for batch tasks in step S7 is that tasks are ordered in descending order of priority, and tasks with the same priority are ordered in descending order of the total scale of their GPU and HBM resource demands.
  7. The GPU cluster scheduling method based on a four-layer network topology according to claim 1, wherein calculating the attack level in step S2 comprises computing an initial score with a monotonically increasing scoring function based on the number of GPUs required by the task, the HBM capacity, and the estimated cross-layer communication overhead, and mapping the score into a predefined set of discrete levels to determine the attack level.
  8. The GPU cluster scheduling method based on a four-layer network topology according to claim 1, wherein calculating the preemption cost in step S4 comprises, for each task in the candidate scheme that is planned to be preempted, calculating its loss cost according to the length of time it has been running, its priority weight, and the size of the GPU and HBM resources it occupies; the sum of the loss costs of all preempted tasks is the total preemption cost of the scheme.
  9. A GPU cluster scheduling device based on a four-layer network topology, the device comprising: a resource portrait construction module, used for constructing a dynamic resource portrait based on historical operation data of tasks and the GPU cluster; a topology state acquisition module, used for acquiring the static total capacity and dynamic real-time utilization of computing and network communication resources at each level of the four-layer network topology, where the four layers comprise, in order from inside to outside, a server layer, a cabinet layer, a convergence layer and a super-convergence layer, and the resource portrait and the topology state together form the system state information; a task classification and preemption relation module, used for calculating an attack level for the task to be scheduled according to its resource demand and communication characteristics, initializing its defense level to the attack level, establishing a dual-weight relation under which preemption is allowed only when the attack level of the new task is greater than the defense level of a running task, and raising the defense level by one level after a task is preempted; a candidate scheme generation module, used for generating initial placement candidate schemes according to the system state information and task demand, following the four-layer topology order from inside to outside, and applying a threshold-filling strategy during generation, namely preferentially selecting resource combinations that are physically more compact and that, after the task is superimposed, bring each dimension of resource utilization up to a preset threshold; a joint constraint verification module, used for jointly verifying, for each candidate scheme and at every level of the four-layer topology, whether the used amount of computing and network communication resources plus the task demand exceeds the limit, and outputting all feasible candidate schemes; a preemption screening and proportion control module, used for ordering tasks satisfying the preemption condition by unit cost density to generate a preemption list when a feasible candidate scheme triggers preemption, and ensuring that the ratios of preempted low-weight and medium-weight tasks do not exceed a preset threshold; a gain evaluation and decision module, used for calculating the completion component, scale component and topology affinity of the feasible candidate schemes, subtracting the preemption cost from their sum to obtain a net gain, and selecting the scheme with the maximum net gain greater than zero as the final scheduling decision; a decision execution and rollback module, used for executing the final scheduling decision, immediately checking after execution whether any four-layer topology resource exceeds its limit, rolling back if so, and otherwise updating the system resource state; and a batch scheduling module, used for ordering batch-arriving tasks according to a preset rule and then iteratively invoking the modules from the task classification and preemption relation module through the decision execution and rollback module to complete batch scheduling.
  10. An electronic device, configured to implement the GPU cluster scheduling method based on a four-layer network topology of any one of claims 1 to 8, comprising: a processor, used for executing all computing tasks and implementing the GPU cluster scheduling method based on the four-layer network topology; and a memory, used for storing processor-executable instructions and static data.
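As a concrete illustration of claims 7 and 8, the sketch below computes an attack level by mapping a monotonically increasing score onto discrete levels, and sums per-victim loss costs into a scheme's total preemption cost. The weights, thresholds and field names are illustrative assumptions, not values from the patent.

```python
import bisect

def attack_level(num_gpus: int, hbm_gb: float, comm_cost: float,
                 thresholds=(10.0, 50.0, 200.0, 1000.0)) -> int:
    # Claim 7: a monotonically increasing score over the GPU count, HBM
    # capacity and estimated cross-layer communication overhead, mapped
    # onto a predefined set of discrete levels (0..4 here).
    score = 1.0 * num_gpus + 0.1 * hbm_gb + 0.5 * comm_cost
    return bisect.bisect_right(thresholds, score)

def loss_cost(runtime_s: float, weight: float, gpus: int, hbm_gb: float,
              alpha=1.0, beta=0.01, gamma=0.05) -> float:
    # Claim 8: the loss grows with how long the task has already run,
    # its priority weight, and the GPU/HBM resources it occupies.
    return weight * (alpha * runtime_s / 3600 + beta * gpus + gamma * hbm_gb / 80)

def total_preemption_cost(victims) -> float:
    # The scheme's total preemption cost is the sum over all victims.
    return sum(loss_cost(**v) for v in victims)

small = attack_level(1, 80, 0.0)       # score 9.0   -> level 0
large = attack_level(64, 5120, 300.0)  # score 726.0 -> level 3
cost = total_preemption_cost([
    {"runtime_s": 7200, "weight": 2.0, "gpus": 8, "hbm_gb": 640},
    {"runtime_s": 1800, "weight": 1.0, "gpus": 1, "hbm_gb": 80},
])
```

Because the score is monotone in every input, a task that needs more GPUs, more HBM, or more cross-layer traffic can never receive a lower level, which is the property claim 7 relies on when the level later gates preemption.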

Description

GPU cluster scheduling method, device and equipment based on a four-layer network topology

Technical Field

The present application relates to the field of computer technologies, and in particular to a method, an apparatus, and a device for GPU cluster scheduling based on a four-layer network topology.

Background

In high-performance computing fields such as artificial intelligence training and scientific computing, the GPU cluster has become an indispensable core computing infrastructure. Modern large-scale clusters commonly adopt a complex four-layer network topology (a server layer, a cabinet layer, a convergence layer and a super-convergence layer) interconnected with multiple network types, aiming to match the differing communication requirements of workloads ranging from single-machine tasks to thousand-GPU-scale distributed training. However, the very architecture that improves performance and flexibility also poses a serious challenge for resource scheduling: how to efficiently coordinate GPU, HBM and the several classes of network resources under strict multi-layer network capacity constraints, so as to achieve optimal task placement, has become a key problem for guaranteeing overall cluster performance. To address this challenge, existing GPU cluster scheduling schemes mainly follow two typical approaches, both with significant drawbacks. The first is a multi-round iterative scheme that adopts a lagging error-correction pattern of "first satisfy the compute requirement, then passively adjust according to network congestion". This scheme easily causes frequent task start-stop and system performance jitter; its task termination strategy, based on simple indicators such as GPU overhead and start time, struggles to account for service quality and task progress, and it lacks a global optimization perspective.
The second is a local greedy scheme that relies on a fixed set of priority rules to make sequential decisions, such as preferentially scheduling certain task types or preferentially filling local resources. This scheme is simple to implement, but tends to treat compute and communication resources in isolation, easily causing resource fragmentation and head-of-line blocking; it generally lacks a fine-grained, bounded preemption mechanism, and can hardly guarantee timely scheduling of high-priority tasks under intense resource contention. In summary, the prior art suffers from the following shortcomings: first, it lacks a unified collaborative optimization framework for multidimensional resources (GPU/HBM/network), easily producing a bucket effect of unbalanced resource utilization; second, its preemption mechanisms are coarsely designed, where the absence of preemption delays high-priority tasks while excessive preemption destroys system stability and fairness; third, under complex topology constraints, it is difficult to balance global optimization objectives against real-time decision requirements within a limited scheduling cycle; and fourth, its perception and modeling of the underlying four-layer network topology are weak, so the communication characteristics of tasks cannot be accurately matched and constrained against the physical network levels, and the high-speed interconnect capability cannot be fully exploited. Therefore, a new scheduling method that can deeply perceive the topology, intelligently coordinate resources, and balance efficiency against stability is needed.
Disclosure of Invention

The application provides a GPU cluster scheduling method based on a four-layer network topology, characterized by comprising the following steps: S1, constructing a resource portrait based on historical operation data of tasks and the GPU cluster, and collecting the four-layer network topology, where the four layers comprise, in order from inside to outside, a server layer, a cabinet layer, a convergence layer and a super-convergence layer; the two kinds of information jointly form the state information of the scheduling system, namely the total capacity and real-time utilization of computing resources and network communication resources at each layer; S2, calculating an attack level for the task to be scheduled according to its resource demand and communication characteristics, initializing its defense level to the attack level, and establishing a dual-weight relation under which preemption is allowed only when the attack level of the new task is greater than the defense level of a running task; S3, generating initial placement candidate schemes according to the system state information and the demand of the task to be scheduled, following the four-layer topology order from inside to outside, and for each candidate scheme, jointly checking that the total amount of task demand is not exceeded b