CN-121722573-B - High-density server cluster energy efficiency optimization method based on resource scheduling
Abstract
The invention relates to the technical field of energy efficiency optimization for high-density servers, and in particular to a resource-scheduling-based method for optimizing the energy efficiency of a high-density server cluster. A global resource scheduling model is established from the node load characteristic values and the cluster energy efficiency optimization target; the model takes minimum total cluster power consumption as its objective function and takes resource capacity and quality-of-service requirements as constraints. The model is solved to obtain an optimized resource scheduling strategy that explicitly designates the tasks to be migrated, the target nodes and the time sequence of the migration. Control instructions are generated from the strategy and issued to the bottom-layer resource manager, which performs the dynamic task migration and the node state adjustment. The invention thereby achieves fine-grained, dynamic management of cluster energy efficiency.
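As a compact illustration, the scheduling model outlined above can be written as a mixed-integer program using the decision variables and constraints enumerated in claim 4. All symbols are assumptions introduced here for illustration and do not appear in the patent text: $x_{ij}$ is 1 when task $i$ runs on node $j$, $f_{jt}$ is the working frequency of node $j$ in time slice $t$, $r_i$ is the resource demand of task $i$, $C_j$ is the capacity of node $j$, $\ell_j$ is the load carried by node $j$, $\tau_i$ is the execution time of task $i$, and $D_i$ is its deadline.

```latex
\begin{aligned}
\min_{x,\,f}\quad & \sum_{j \in N} \sum_{t \in S} \Bigl( P^{\mathrm{static}}_j + P^{\mathrm{dyn}}_j\bigl(f_{jt}, \ell_j\bigr) \Bigr)\,\Delta t,
  \qquad \ell_j = \sum_{i \in T} r_i\, x_{ij} \\
\text{s.t.}\quad & \sum_{j \in N} x_{ij} = 1 && \forall i \in T \\
& \sum_{i \in T} r_i\, x_{ij} \le C_j && \forall j \in N \\
& f^{\min}_j \le f_{jt} \le f^{\max}_j && \forall j \in N,\ t \in S \\
& \tau_i(x, f) \le D_i && \forall i \in T \\
& x_{ij} \in \{0, 1\} && \forall i \in T,\ j \in N
\end{aligned}
```

The four constraint rows correspond, in order, to the task allocation, node capacity, hardware feasibility and task deadline constraints; the exact form of $P^{\mathrm{dyn}}_j$ and $\tau_i$ is left open here because the text only states that dynamic power grows with frequency, voltage and load.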
Inventors
- RUAN SHUAI
- DENG YULIN
Assignees
- 成都越煌欣科技有限公司
- 成都西川空间科技有限公司
- 成都牛德德网络科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260213
Claims (9)
- 1. A high-density server cluster energy efficiency optimization method based on resource scheduling, characterized by comprising the following steps: collecting multidimensional operation state data of each computing node in a high-density server cluster in real time, inputting the multidimensional operation state data into a pre-constructed load characteristic analysis model, quantitatively evaluating the real-time workload of each computing node based on the load characteristic analysis model, and generating a node load characteristic value representing the load level of the node; establishing a global resource scheduling model according to the node load characteristic value and a preset cluster energy efficiency optimization target, wherein the global resource scheduling model takes minimum total cluster power consumption as its objective function and takes the resource capacity of the computing nodes and the quality-of-service requirements of the tasks as constraints; solving the global resource scheduling model to obtain an optimized resource scheduling strategy, wherein the optimized resource scheduling strategy explicitly designates the computing tasks to be migrated, the target computing nodes and the execution time sequence of the task migration; generating a specific resource scheduling control instruction according to the optimized resource scheduling strategy, and issuing the resource scheduling control instruction to a bottom-layer resource manager of the high-density server cluster, the bottom-layer resource manager executing the dynamic migration of computing tasks and the adjustment of the working states of computing nodes; wherein generating the specific resource scheduling control instruction according to the optimized resource scheduling strategy comprises the following steps: parsing the optimized resource scheduling strategy and extracting a list of computing tasks to be migrated, the original computing node identifier of each task to be migrated, the identifier of the target computing node to which it is planned to migrate, and the migration time window in which the migration is planned to execute; querying the running context information of each computing task to be migrated on its original computing node, the running context information comprising memory page state, file descriptors, process state and network connection state; generating, according to the running context information, a checkpoint data creation instruction and a context serialization instruction for the computing task, so that the task state can be saved and restored during migration; generating, in combination with the migration time window, a detailed migration operation sequence for each task to be migrated, the operation sequence comprising suspending the task on the original computing node, creating a checkpoint, transmitting data to the target node, restoring the task state on the target node, and restarting the task; sorting and integrating the migration operation sequences of all tasks to be migrated according to their planned execution order, and inserting synchronization wait points to form a final executable scheduling plan; and translating the executable scheduling plan into a script or an application programming interface call sequence that the bottom-layer resource manager can recognize and execute, which constitutes the resource scheduling control instruction.
- 2. The resource-scheduling-based high-density server cluster energy efficiency optimization method according to claim 1, wherein collecting, in real time, the multidimensional operation state data of each computing node in the high-density server cluster specifically comprises: the multidimensional operation state data at least comprises the central processing unit utilization, memory occupancy, input/output throughput and current operating temperature of the computing node; a monitoring agent program deployed on each computing node periodically acquires, according to a preset sampling period, the readings of the hardware performance counters of that node, the readings comprising the number of active cycles of each central processing unit core, the number of cache misses and the access latency of the memory controller; at the same time, a system call interface of the computing node's operating system is invoked to obtain process-level resource consumption statistics, the statistics comprising, for each running process, the central processing unit time slices occupied, the size of the physical memory resident set, the number of disk read and write operations and the number of network connections; physical environment monitoring data is read through a sensor interface of the baseboard management controller of the computing node, the physical environment monitoring data comprising the central processing unit package temperature, the dynamic random access memory temperature, the mainboard temperature and the server air inlet and air outlet temperatures; the hardware performance counter readings, the process-level resource consumption statistics and the physical environment monitoring data are aligned by timestamp and packaged into a multidimensional operation state data packet with a uniform format; and the multidimensional operation state data packet generated by each computing node is transmitted in real time over the cluster's internal communication network to a centralized data aggregation node for storage and subsequent processing.
- 3. The resource-scheduling-based high-density server cluster energy efficiency optimization method according to claim 2, wherein inputting the multidimensional operation state data into the pre-constructed load characteristic analysis model, quantitatively evaluating the real-time workload of each computing node based on the load characteristic analysis model, and generating the node load characteristic value representing the load level of the node specifically comprises: preprocessing the aggregated multidimensional operation state data packets, including data cleaning to remove outliers and data normalization to eliminate the influence of differing units of measurement; extracting a group of predefined key load characteristic indexes from the preprocessed data, the key load characteristic indexes comprising a moving average of central processing unit utilization, the difference between the peak and valley values of memory occupancy, the input/output operation frequency and the temperature rise rate; feeding the key load characteristic indexes as an input vector into the pre-constructed load characteristic analysis model, the load characteristic analysis model being a gradient boosting decision tree model trained on historical data; the gradient boosting decision tree model outputting a comprehensive score according to the input key load characteristic indexes, the comprehensive score reflecting the composite load pressure level of the computing node at the current moment; and mapping the comprehensive score, according to the predefined score interval in which it falls and in combination with the upper limit of the hardware resource configuration of the computing node, into a normalized node load characteristic value, the node load characteristic value being a number between zero and one, where a larger value represents a higher load level.
- 4. The resource-scheduling-based high-density server cluster energy efficiency optimization method according to claim 3, wherein establishing the global resource scheduling model according to the node load characteristic value and the preset cluster energy efficiency optimization target specifically comprises: acquiring the node load characteristic values of all computing nodes in the cluster to form a cluster load state vector for the current moment; reading the preset cluster energy efficiency optimization target, which is expressed as minimizing the integral of total cluster energy consumption over a scheduling period while meeting the deadline and performance requirements of all computing tasks; constructing the decision variables of the global resource scheduling model, the decision variables comprising binary variables and continuous variables, where the binary variables represent the mapping between computing tasks and computing nodes and the continuous variables represent the working frequency or voltage of each computing node in each time slice of the scheduling period; establishing the constraint set of the global resource scheduling model, the constraint set comprising a task allocation constraint, under which each computing task must be allocated to exactly one computing node, a node capacity constraint, under which the sum of the resource demands of all computing tasks allocated to a computing node must not exceed the available resource capacity of that node, a hardware feasibility constraint, under which the working frequency or voltage of each computing node must lie within the dynamic adjustment range supported by its hardware, and a task deadline constraint, under which the time from the start of execution to the completion of each computing task must not exceed the specified deadline; and constructing the objective function of the global resource scheduling model, the objective function being a mathematical expression of total cluster energy consumption, where total cluster energy consumption is obtained by summing the static and dynamic power consumption of every computing node, and the dynamic power consumption is positively correlated with the working frequency, the voltage and the load carried by the computing node.
- 5. The resource-scheduling-based high-density server cluster energy efficiency optimization method according to claim 4, wherein solving the global resource scheduling model to obtain the optimized resource scheduling strategy specifically comprises: decomposing the global resource scheduling model with a Lagrangian relaxation algorithm into a master problem and a plurality of sub-problems, each sub-problem being associated with a single computing node or a single computing task; initializing the Lagrangian multipliers in the master problem and setting the termination conditions of the algorithm, the termination conditions comprising a maximum number of iterations or a duality gap smaller than a set threshold; in each iteration, fixing the values of the Lagrangian multipliers, solving the sub-problems in parallel to obtain a feasible task allocation scheme and the node working states under the current multipliers, and updating the Lagrangian multipliers in the master problem according to the sub-problem solutions using the subgradient method; repeating the iterative process until a termination condition is met, obtaining a set of Lagrangian multiplier values and the corresponding task allocation scheme for which the duality gap is smaller than the set threshold; and performing feasibility repair on the finally obtained task allocation scheme to ensure that it satisfies all original constraints, the repaired scheme being the optimized resource scheduling strategy.
- 6. The resource-scheduling-based high-density server cluster energy efficiency optimization method according to claim 5, wherein issuing the resource scheduling control instruction to the bottom-layer resource manager of the high-density server cluster, the bottom-layer resource manager executing the dynamic migration of computing tasks and the adjustment of the working states of computing nodes, specifically comprises: the bottom-layer resource manager receiving the resource scheduling control instruction and checking its security and validity; after the check passes, the bottom-layer resource manager first sending, in the order of the instruction sequence, a task pause and checkpoint creation command to the agent on the original computing node where the task to be migrated resides; the original computing node agent executing the command, suspending the execution of the designated computing task, serializing its memory state and processor state into a checkpoint file, and at the same time freezing the related input/output operations; the bottom-layer resource manager coordinating the original computing node agent and the target computing node agent and transmitting the checkpoint file and the related context data to the target computing node over the network; the target computing node agent, after receiving the data, re-creating the execution environment of the computing task on the target computing node according to the instruction, restoring the state of the computing task from the checkpoint file, and then resuming its execution; and, concurrently with or after the task migration, the bottom-layer resource manager sending instructions to adjust the working frequency or voltage to the relevant computing nodes according to the node working states specified in the scheduling strategy, so as to achieve the energy efficiency optimization target.
- 7. The resource-scheduling-based high-density server cluster energy efficiency optimization method according to claim 6, wherein the original computing node agent executing the command, suspending the execution of the designated computing task, serializing its memory state and processor state into a checkpoint file, and freezing the related input/output operations specifically comprises: the original computing node agent sending a pause signal to the process of the designated computing task so that the process enters an interruptible sleep state; traversing the virtual address space of the process and copying the contents of all resident memory pages into a pre-allocated buffer; reading the contents of the process's processor register set, including the general-purpose registers, program counter, stack pointer and flag register, and packaging them together with the memory page data; recording, for the file descriptors and network sockets opened by the process, the current file offsets and the socket state metadata; serializing the memory page data, the processor register data and the input/output metadata, and writing them into the checkpoint file on persistent storage; and, during checkpoint creation, blocking all new input/output requests of the process and waiting for already-issued input/output operations to complete, so as to ensure data consistency.
- 8. The resource-scheduling-based high-density server cluster energy efficiency optimization method of claim 3, wherein constructing the pre-constructed load characteristic analysis model comprises: collecting historical multidimensional operation state data of a plurality of computing nodes in the high-density server cluster over a historical period, and labeling the data to obtain a training data set with ground-truth load labels; preprocessing the training data set, the preprocessing comprising missing value imputation, outlier handling and data standardization; extracting, from the preprocessed training data set, the historical characteristic indexes corresponding to the key load characteristic indexes to form a set of training feature vectors; initializing a gradient boosting decision tree model framework and configuring the model hyperparameters, the hyperparameters comprising the number of base learners, the learning rate and the maximum tree depth; iteratively training the initialized gradient boosting decision tree model on the training feature vectors and their ground-truth load labels with the objective of minimizing prediction error; and evaluating the trained gradient boosting decision tree model on a validation data set and tuning the hyperparameters according to the evaluation results until model performance reaches a preset standard, thereby obtaining the pre-constructed load characteristic analysis model.
- 9. The resource-scheduling-based high-density server cluster energy efficiency optimization method of claim 4, further comprising, before constructing the decision variables of the global resource scheduling model, a step of constructing a computing node power consumption estimation model: establishing, based on the hardware specifications and historical operation data of the computing node, a functional relation between the power consumption of the computing node and its central processing unit utilization, memory occupancy, working frequency and ambient temperature; fitting each coefficient of the functional relation offline using multiple linear regression or a support vector regression algorithm, to obtain an individual power consumption estimation model for each computing node; and embedding the power consumption estimation model, in parametric form, into the objective function of the global resource scheduling model for calculating the dynamic power consumption of the computing node under different loads and configurations.
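The key load characteristic indexes named in claim 3 are simple statistics over a window of samples. A minimal sketch follows; the function names, the window length and the linear score-to-value mapping are assumptions made for illustration, and the gradient boosting model that produces the comprehensive score is deliberately left out.

```python
from typing import Dict, Sequence


def key_load_features(cpu: Sequence[float], mem: Sequence[float],
                      io_ops: Sequence[float], temp: Sequence[float],
                      period_s: float, window: int = 3) -> Dict[str, float]:
    """Compute the four indexes from equally spaced samples (one per period)."""
    recent = cpu[-window:]  # last `window` CPU utilization samples
    return {
        "cpu_moving_avg": sum(recent) / len(recent),
        "mem_peak_valley": max(mem) - min(mem),
        # Operations per second over the sampled interval.
        "io_frequency": sum(io_ops) / (len(io_ops) * period_s),
        # Average temperature change per second between first and last sample.
        "temp_rise_rate": (temp[-1] - temp[0]) / ((len(temp) - 1) * period_s),
    }


def to_load_value(score: float, score_max: float) -> float:
    """Assumed mapping: scale the score by a per-node ceiling and clamp to [0, 1]."""
    return min(1.0, max(0.0, score / score_max))
```

Here `score_max` stands in for whatever ceiling the node's hardware configuration implies; the claim describes an interval-based mapping whose details are not given, so a plain clamp is used instead.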
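The iteration recited in claim 5 can be sketched on a toy instance. The code below is a minimal illustration under stated assumptions, not the patented solver: it relaxes only the node capacity constraint, uses one sub-problem per task, a diminishing step size of 1/k, and a greedy repair. The function names, the cost matrix `cost[i][j]` (energy of running task i on node j) and the step rule are all hypothetical.

```python
from typing import List, Optional, Tuple


def node_loads(assign: List[int], demand: List[float], n_nodes: int) -> List[float]:
    loads = [0.0] * n_nodes
    for i, j in enumerate(assign):
        loads[j] += demand[i]
    return loads


def solve_subproblems(cost, demand, lam) -> List[int]:
    """With multipliers fixed, each task independently picks its cheapest node."""
    return [min(range(len(lam)), key=lambda j: cost[i][j] + lam[j] * demand[i])
            for i in range(len(demand))]


def repair(assign, cost, demand, cap) -> Optional[List[int]]:
    """Greedy feasibility repair: move tasks off overloaded nodes."""
    assign = list(assign)
    loads = node_loads(assign, demand, len(cap))
    for i in sorted(range(len(assign)), key=lambda t: -demand[t]):
        j = assign[i]
        if loads[j] <= cap[j]:
            continue
        options = [k for k in range(len(cap))
                   if k != j and loads[k] + demand[i] <= cap[k]]
        if options:
            k = min(options, key=lambda n: cost[i][n])
            loads[j] -= demand[i]
            loads[k] += demand[i]
            assign[i] = k
    ok = all(load <= c + 1e-9 for load, c in zip(loads, cap))
    return assign if ok else None


def lagrangian_schedule(cost, demand, cap, max_iter=200,
                        gap_tol=1e-3) -> Tuple[Optional[List[int]], float, float]:
    lam = [0.0] * len(cap)  # one multiplier per relaxed capacity constraint
    best_lb, best_ub, best_assign = float("-inf"), float("inf"), None
    for k in range(1, max_iter + 1):
        assign = solve_subproblems(cost, demand, lam)
        loads = node_loads(assign, demand, len(cap))
        # Dual value: a lower bound on the optimal cost (weak duality).
        lb = sum(cost[i][j] + lam[j] * demand[i] for i, j in enumerate(assign))
        lb -= sum(l * c for l, c in zip(lam, cap))
        best_lb = max(best_lb, lb)
        fixed = repair(assign, cost, demand, cap)
        if fixed is not None:
            ub = sum(cost[i][j] for i, j in enumerate(fixed))
            if ub < best_ub:
                best_ub, best_assign = ub, fixed
        if best_ub - best_lb <= gap_tol:  # duality gap small enough
            break
        # Subgradient step: the subgradient for node j is (load_j - cap_j).
        step = 1.0 / k
        lam = [max(0.0, l + step * (load - c))
               for l, load, c in zip(lam, loads, cap)]
    return best_assign, best_ub, best_lb
```

For integer problems the duality gap need not close, so on harder instances the loop may stop at `max_iter` and return the best repaired assignment found so far.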
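The plan-building steps of claim 1 (per-task operation sequences, ordering by planned execution time, synchronization wait points) can be sketched as follows. This is a minimal sketch: the `Migration` record, the command strings and the choice to place one wait point between consecutive tasks are assumptions made for illustration; the patent does not specify a plan format or a resource manager API.

```python
from dataclasses import dataclass
from typing import List

# Per-task operation order recited in claim 1.
STEPS = ("pause", "checkpoint", "transfer", "restore", "resume")


@dataclass(frozen=True)
class Migration:
    task: str
    source: str        # original computing node identifier
    target: str        # target computing node identifier
    window_start: int  # planned start of the migration window (arbitrary units)


def task_sequence(m: Migration) -> List[str]:
    """Expand one migration into its ordered operations."""
    ops = []
    for step in STEPS:
        if step == "transfer":
            ops.append(f"transfer {m.task} {m.source}->{m.target}")
        else:
            # Restore and resume run on the target node, the rest on the source.
            node = m.target if step in ("restore", "resume") else m.source
            ops.append(f"{step} {m.task} on {node}")
    return ops


def build_plan(migrations: List[Migration]) -> List[str]:
    """Order tasks by planned start time and insert a wait point between them."""
    plan: List[str] = []
    ordered = sorted(migrations, key=lambda m: (m.window_start, m.task))
    for idx, m in enumerate(ordered):
        if idx > 0:
            # Assumed policy: the next task starts only after the previous one resumed.
            plan.append(f"wait {ordered[idx - 1].task} resumed")
        plan.extend(task_sequence(m))
    return plan
```

Calling `build_plan` with two migrations yields eleven entries: five operations per task plus one wait point. A real translator would map each entry to a call understood by the resource manager in use.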
Description
High-density server cluster energy efficiency optimization method based on resource scheduling
Technical Field
The invention relates to the technical field of energy efficiency optimization for high-density servers, and in particular to a resource-scheduling-based method for optimizing the energy efficiency of a high-density server cluster.
Background
The invention belongs to the field of energy efficiency management for high-density server clusters. Current mainstream energy efficiency optimization methods rely mainly on a single index, such as CPU utilization, to judge load, and carry out task scheduling or node power control on that basis. Such methods treat load evaluation and resource scheduling as two relatively independent stages; the evaluation model is simple, and the scheduling objective is chiefly resource utilization. The prior art is limited in two ways. First, a simple threshold or static weighting scheme cannot accurately reflect the true overall state of a server running mixed workloads, so the evaluation result is one-sided. Second, scheduling strategies oriented toward load balancing generally lack closed-loop optimization of overall power consumption: idle nodes can usually be powered on or off only after task placement is complete, which is an after-the-fact remedy and cannot avoid energy efficiency bottlenecks in advance, at the time the scheduling decision is made. The key problem to be solved by the invention is to achieve accurate quantification of load and process-level control of scheduling.
To that end, an analysis model capable of fusing multidimensional real-time data is established; it outputs a unified load characteristic value and so overcomes the bias of single-index evaluation. A scheduling model that takes total system power consumption as its direct optimization target is also established; its output can guide the specific execution order of task migration, so that energy management extends from the resource allocation level to the dynamic operation process level and fine-grained energy efficiency control is achieved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a high-density server cluster energy efficiency optimization method based on resource scheduling. To achieve this purpose, the invention adopts the following technical scheme. The high-density server cluster energy efficiency optimization method based on resource scheduling comprises the following steps: collecting multidimensional operation state data of each computing node in a high-density server cluster in real time, inputting the multidimensional operation state data into a pre-constructed load characteristic analysis model, quantitatively evaluating the real-time workload of each computing node based on the load characteristic analysis model, and generating a node load characteristic value representing the load level of the node; establishing a global resource scheduling model according to the node load characteristic value and a preset cluster energy efficiency optimization target, wherein the global resource scheduling model takes minimum total cluster power consumption as its objective function and takes the resource capacity of the computing nodes and the quality-of-service requirements of the tasks as constraints; solving the global resource scheduling model to obtain an optimized resource scheduling strategy, wherein the optimized resource scheduling strategy explicitly designates the computing tasks to be migrated, the target computing nodes and the execution time sequence of the task migration; and generating a specific resource scheduling control instruction according to the optimized resource scheduling strategy, and issuing the resource scheduling control instruction to a bottom-layer resource manager of the high-density server cluster, the bottom-layer resource manager executing the dynamic migration of computing tasks and the adjustment of the working states of computing nodes.
As a further scheme of the invention, collecting, in real time, the multidimensional operation state data of each computing node in the high-density server cluster specifically comprises: the multidimensional operation state data at least comprises the central processing unit utilization, memory occupancy, input/output throughput and current operating temperature of the computing node; a monitoring agent program deployed on each computing node periodically acquires, according to a preset sampling period, the readings of the hardware performance counters of that node, the readings comprising the number of active cycles of each central processing unit core, the number of cache misses and