CN-121996422-A - K8s and Slurm mixed deployment control method
Abstract
The invention provides a K8s and Slurm mixed deployment management and control method, specifically a Kubernetes and Slurm mixed deployment management and control method based on dynamic resource awareness. The invention introduces an intelligent Node Agent that predicts load trends from both the real-time water level and historical load characteristics, and performs fine-grained resource arbitration at the Linux Cgroup layer between K8s (online) and Slurm (offline). This resolves the resource conflict and mutual invisibility problems of heterogeneous scheduling systems (K8s and Slurm) co-deployed on the same node, as well as the performance interference between online services and offline jobs, maximizing resource utilization while guaranteeing the quality of service (QoS) of online services. The invention offers high stability and safety, with interference resistance at the micro-architecture level.
Inventors
- ZHANG FAN
- LV XINHUI
- CAO WEI
- SUN MINGQIAN
- PEI CHEN
Assignees
- Fudan University (复旦大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-27
Claims (3)
- 1. A K8s and Slurm mixed deployment control method, characterized in that the control method is realized by a control system adopting a three-layer, dual-stack structure comprising a control plane layer, a node agent layer and a kernel isolation layer, wherein: the control plane layer is at the top, with the K8s Control Plane on one side and the Slurm controller (Slurm Controller) on the other, the two being logically isolated from each other; the K8s Control Plane issues scheduling instructions to the middle layer, and the Slurm controller issues jobs to the middle layer; the node agent layer is the middle layer, with a node agent deployed on each physical node; each node agent comprises a data acquisition unit, a heuristic decision engine and a resource executor, the output of the data acquisition unit feeding the heuristic decision engine, and the output of the heuristic decision engine feeding the resource executor as a computed quota; the data acquisition unit interfaces with the Kubelet Summary API and /sys/fs/cgroup to collect second-level resource metrics, the heuristic decision engine maintains a historical time-series window to compute a dynamic safety water level, and the resource executor directly operates the Cgroup v2 interface (cpu.max, memory.high) to apply suppression; the kernel isolation layer is at the bottom and comprises a high priority and a low priority, the high priority corresponding to online services and the low priority corresponding to offline jobs; the control method is an infinitely looping control loop with the following specific steps: (1) status snapshot: the Node Agent periodically pulls the most recent K8s resource usage records of each cycle; (2) heuristic prediction: a predicted value of the K8s resource demand at a future moment and its fluctuation rate are computed from the historical data; (3) water-level calculation: the predicted value and the fluctuation rate are combined to compute the safety quota reserved for Slurm; (4) policy issuing: the current resource usage is compared against the safety quota; if usage leaves headroom below the quota, the Slurm Cgroup limit is relaxed; if it exceeds the quota, the Slurm Cgroup limit is shrunk; (5) PSI check: if a sudden PSI increase of the online service is detected, the calculation step is skipped and step (6) "emergency fusing" is triggered directly; (6) emergency fusing: Slurm processes are frozen or terminated; (7) policy hold: the current scheduling policy is kept unchanged.
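Steps (1)–(7) of the claim can be sketched as a single iteration of the control loop. This is a minimal illustrative model, not the patented implementation: the function names, the PSI threshold, and the use of mean/standard deviation as the prediction are assumptions standing in for the unspecified heuristic.

```python
from statistics import mean, stdev

def decide(history, slurm_quota, total, psi_some, psi_threshold=0.4):
    """One iteration of a hypothetical control loop mirroring steps (1)-(7).

    history     -- recent K8s usage samples (step 1, status snapshot)
    slurm_quota -- the Slurm Cgroup quota currently in force
    total       -- total node resources
    psi_some    -- current PSI 'some' stall fraction of the online service
    Returns (action, new_slurm_quota)."""
    # Steps (5)/(6): a PSI spike bypasses the heuristic calculation
    # and triggers emergency fusing immediately.
    if psi_some >= psi_threshold:
        return ("freeze_slurm", 0.0)
    # Step (2): heuristic prediction from historical data.
    predicted = mean(history)
    jitter = stdev(history) if len(history) > 1 else 0.0
    # Step (3): safety quota reserved for Slurm.
    safe_quota = max(total - (predicted + jitter), 0.0)
    # Step (4): policy issuing -- relax or shrink the Slurm Cgroup limit.
    if safe_quota > slurm_quota:
        return ("relax_limit", safe_quota)
    if safe_quota < slurm_quota:
        return ("shrink_limit", safe_quota)
    # Step (7): keep the current policy unchanged.
    return ("keep", slurm_quota)
```

In a real agent this function would run inside an infinite loop, with the resource executor translating the returned quota into Cgroup v2 writes.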
- 2. The K8s and Slurm mixed deployment control method as claimed in claim 1, wherein step (4) adopts a dynamic buffering algorithm based on fluctuation-rate awareness, specifically: the Node Agent does not use the current instantaneous utilization directly, but calculates a predicted peak; the resource upper limit allocated to Slurm for the period is the node's total physical resources minus this predicted peak, the predicted peak being the sum of a trend item, a wave buffer item and a static bottom protection; the trend item computes the K8s load baseline using an exponentially weighted moving average (EMA); compared with a simple average, the EMA gives larger weight to the most recent data points and can respond more quickly to a sudden rise in K8s load; the wave buffer item is the standard deviation of the load in the history window, representing the "jitter degree" of the K8s service; the heuristic logic is: if the online service is very stable, i.e. StdDev is small, a small buffer is reserved so that Slurm can use more resources; if the online service fluctuates sharply, i.e. StdDev is large, the buffer is automatically enlarged and more resources are reserved against unpredicted spikes; the static bottom protection is a minimum reserved water level that prevents instantaneous overcommitment during cold start.
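The quota structure described in claim 2 (total minus trend, buffer, and floor) can be sketched as follows. The smoothing factor `alpha`, the buffer multiplier `k`, and the floor fraction are illustrative assumptions; the claim specifies only the three-term structure, not their values.

```python
from statistics import pstdev

def ema(samples, alpha=0.3):
    """Exponentially weighted moving average: recent points weigh more,
    so a sudden K8s load rise shifts the baseline faster than a mean."""
    value = samples[0]
    for s in samples[1:]:
        value = alpha * s + (1 - alpha) * value
    return value

def slurm_quota(samples, total, k=2.0, floor=0.1):
    """Quota_Slurm = Total - (Trend + Buffer + Floor), per claim 2.

    Trend  -- EMA of the K8s load baseline
    Buffer -- k * StdDev of the history window (grows with jitter)
    Floor  -- static minimum reserve against cold-start spikes
    (k, alpha and floor values are assumptions, not from the patent.)"""
    trend = ema(samples)
    buffer = k * pstdev(samples)    # stable service -> small buffer
    reserve = floor * total         # static bottom protection
    return max(total - (trend + buffer + reserve), 0.0)
```

Note how a perfectly flat load history yields only the trend and floor reservations, leaving the rest to Slurm, while a jittery history shrinks the Slurm quota automatically.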
- 3. The method of claim 1, wherein the heuristic loop handles millisecond/second-level resource allocation, while the PSI-driven emergency fusing mechanism of step (6) targets microsecond-level micro-architecture contention, specifically: for index monitoring, /proc/pressure/cpu and /proc/pressure/memory are read, using the "some" index, i.e. the fraction of time during which at least one task was stalled; when the PSI of the K8s Cgroup exceeds a threshold value, a severe resource conflict is judged to have occurred and subsequent actions are triggered, the triggered actions comprising ignoring the heuristic calculation result and directly reducing the cpu.max of the Slurm Cgroup to 10%, or triggering a cgroup.freeze suspend operation, until the PSI recovers to normal.
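The monitoring half of claim 3 amounts to parsing the kernel's PSI files and comparing the "some" stall average to a threshold. The sketch below parses the standard `/proc/pressure/*` line format; the 40% `avg10` threshold is an assumption for illustration, the patent leaves the value unspecified.

```python
def parse_psi(text):
    """Parse a /proc/pressure/{cpu,memory} file into nested dicts.

    Example line:
      some avg10=1.23 avg60=0.50 avg300=0.10 total=123456
    Result: {'some': {'avg10': 1.23, ...}, 'full': {...}}"""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

def should_fuse(psi, avg10_threshold=40.0):
    """Trigger emergency fusing when the 'some' avg10 stall percentage
    (share of the last 10s with at least one task stalled) exceeds the
    threshold. Threshold value is an assumption."""
    return psi["some"]["avg10"] >= avg10_threshold
```

When `should_fuse` fires, the resource executor would write the suppression (cpu.max to 10%, or cgroup.freeze) rather than the heuristic result.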
Description
K8s and Slurm mixed deployment control method
Technical Field
The invention belongs to the technical fields of cloud computing, big data, edge computing and artificial intelligence, and in particular relates to a K8s and Slurm mixed deployment management and control method, specifically a Kubernetes and Slurm mixed deployment management and control method based on dynamic resource awareness.
Background
In current data center architectures, there are generally two types of cluster management systems and workloads: one is represented by microservices, including various online and offline processing services, which generally run in a K8s cluster; the other is the offline batch job (Batch Job) represented by scientific computing and AI training, which generally runs in an HPC cluster such as Slurm. Current cluster management and job scheduling techniques mainly follow these paths: (1) Non-intrusive hosting schemes (e.g., patent CN115237547B), describing a non-intrusive HPC cluster hosting method that converts Kubernetes workloads into scripts or instructions executable by HPC systems (e.g., Slurm) through custom resources and configurators, implementing unified task delivery. (2) Heterogeneous job scheduling and plug-in-driven schemes (e.g., patent CN116661979A), describing a method of interfacing multiple compute clusters (K8s, Slurm, etc.) through a virtual node controller and a plug-in bus; different computing resources are abstracted into virtual nodes and tasks are distributed to the corresponding backend clusters through a scheduler. (3) Template-based job scheduling schemes (e.g., patent CN117093352A), describing a method for adapting different computational frameworks through job invocation templates, solving the complexity of parameter configuration when users submit jobs to different clusters.
(4) Deep-learning-specific scheduling schemes (e.g., patent CN108920259A), describing container scheduling methods for deep learning jobs that optimize container creation and task submission by monitoring container status and job identification.
Analysis of the prior art reveals the following defects in the deep K8s and Slurm fusion scenario: (1) Single-node resource "split brain" and conflict: the prior art (e.g., CN115237547B and CN116661979A) focuses on achieving "task subcontracting" or "unified hosting" at the control plane, i.e., letting K8s be an entry point to manage Slurm. But at the physical node level, if a K8s Pod and a Slurm Job run on the same machine at the same time, the two schedulers do not perceive the real-time CPU/memory occupied by each other. This can lead to severe resource overcommitment or conflict, and Slurm jobs can preempt resources of K8s critical (online) traffic, causing OOM (out of memory) or a drastic increase in latency. The invention provides a Node Agent mechanism that does not rely on control plane synchronization, but directly establishes dynamic arbitration at the physical node kernel layer through Cgroup v2. (2) Static resource views and imbalanced utilization: the prior art (e.g., patent CN108920259A) typically employs a static resource request mode; to ensure online traffic safety, a large amount of idle resources is usually reserved. The drawback is that static partitioning prevents Slurm jobs from fully utilizing resources during business troughs, and the overall cost is high. The present method introduces a heuristic dynamic water-level algorithm that calculates a real-time safety quota and, according to the historical fluctuation characteristics of the K8s service, can dynamically lend "fragmented and transient" idle resources to Slurm, greatly improving resource utilization.
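The kernel-level arbitration described above ultimately comes down to writing Cgroup v2 knobs for the Slurm slice. As a minimal sketch, the helper below renders the `cpu.max` and `memory.high` values the resource executor would write under a path such as `/sys/fs/cgroup/slurm/`; the path and the function itself are illustrative assumptions, only the knob formats follow the Cgroup v2 interface.

```python
def render_limits(cpu_cores, memory_high_bytes, period_us=100_000):
    """Render Cgroup v2 knob values for the Slurm slice.

    cpu.max takes '<quota_us> <period_us>': granting N cores means a
    quota of N * period microseconds of CPU time per period.
    memory.high is a byte count; exceeding it throttles (but does not
    kill) the offline jobs, matching the 'suppression' semantics."""
    return {
        "cpu.max": f"{int(cpu_cores * period_us)} {period_us}",
        "memory.high": str(int(memory_high_bytes)),
    }
```

An agent would write each value to the file of the same name inside the Slurm cgroup directory, e.g. shrinking the limit by rendering a smaller `cpu_cores` on the next loop iteration.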
(3) Lack of micro-architecture-level interference detection: the prior art (e.g., patent CN117093352A) only optimizes at the job submission and logical scheduling level and cannot perceive contention on underlying hardware (e.g., L3 cache, memory bandwidth). The drawback is that when an offline computing job (Slurm) performs high-intensity computation, even if CPU utilization is not full, memory bus pressure may cause the online service (K8s) to respond slowly. The invention introduces a PSI (Pressure Stall Information) based emergency fusing mechanism: when microsecond-level stalls of the online service are detected, the Node Agent can immediately freeze the Slurm processes, providing a stronger service reliability guarantee. The invention aims to solve the resource conflict and mutual invisibility of heterogeneous scheduling systems (K8s and Slurm) co-deployed on the same node and the performance interference between online services and offline jobs, maximizing the resource utilization rate and ensuring the quality of service (QoS) of online services.