CN-122019179-A - Cloud data center resource collaborative optimization method with self-adaptive end-to-end service level


Abstract

The invention provides an end-to-end service-level-adaptive collaborative resource optimization method for cloud data centers, comprising: a workload prediction step based on historical task data; a step of dynamically allocating violation budgets and calculating resource quotas for each application type by dynamic programming under a global SLA violation budget constraint; a step of calculating virtual machine effective capacity and provisioning safely over-allocated virtual machines based on historical utilization and a safety margin factor; a virtual-machine-to-physical-machine placement step based on the anti-affinity principle; and a two-layer load-balancing task scheduling step based on deep reinforcement learning. The invention unifies workload prediction, SLA violation budget allocation, resource provisioning and placement, and online task scheduling in one framework, avoiding the globally suboptimal results caused by optimizing the resource allocation layer and the scheduling layer in isolation.

Inventors

Zuo Bijia
ZHANG YI
SUN JIN
WEI ZHIHUI

Assignees

Nanjing University of Science and Technology (南京理工大学)

Dates

Publication Date: 20260512
Application Date: 20260205

Claims (9)

1. An end-to-end service-level-adaptive collaborative resource optimization method for cloud data centers, characterized by comprising: a workload prediction step based on historical task data; a step of dynamically allocating violation budgets and calculating resource quotas for each application type by dynamic programming under a global SLA violation budget constraint; a step of calculating virtual machine effective capacity and provisioning safely over-allocated virtual machines based on historical utilization and a safety margin factor; a virtual-machine-to-physical-machine placement step based on the anti-affinity principle; and a two-layer load-balancing task scheduling step based on deep reinforcement learning.
2. The method of claim 1, wherein the workload prediction step employs a long short-term memory (LSTM) network model that takes the task arrival sequences of past cycles as input and outputs the predicted number of tasks at each time point of the next cycle, and the predicted sequence is treated as an empirical sample describing the uncertainty of future demand.
3. The method of claim 1, wherein the dynamic violation budget allocation step discretizes the global violation budget and distributes it among the application types, solves for the resource quota combination with minimum resource cost while satisfying the global violation constraint, and obtains the optimal violation budget and the number of virtual machines for each application type by backtracking through the matrix; the resource quota calculation takes the sorted demand values of the task sequence as input and maps the "allowed number of violations" to the high-demand time points that may go unmet, so that the SLA requirement is met in a probabilistic sense.
4. The method according to claim 1, wherein in the virtual machine effective capacity calculation step, the effective capacity is the nominal capacity multiplied by the historical average utilization and by the safety margin factor, and is clipped at an upper limit so that it does not exceed the nominal capacity.
5. The method according to claim 1 or 4, wherein the virtual machine placement step aims to reduce the number of virtual machines of the same application type on the same physical machine, and uses a round-robin or anti-affinity rule to improve fault tolerance while keeping the number of required physical machines at a minimum.
6. The method of claim 1, wherein the task scheduling step defines the state space as the combination of the virtual machine load vector, the physical machine load vector and the current task batch information; defines the action space as the index of a virtual machine in the set of virtual machines able to execute the current task type; and defines the reward function from a weighted sum of the virtual-machine-layer load variance and the physical-machine-layer load variance, the negative of the weighted sum being used as the immediate reward so that the reinforcement learning agent optimizes load balancing at the virtual machine layer and the physical machine layer simultaneously during training.
7. The method of claim 1 or 6, wherein the task scheduling step approximates the optimal action-value function with a deep Q-network, obtains state transition samples by interacting with the task environment, and updates the network parameters using temporal-difference targets.
8. The method according to any one of claims 1-7, wherein the method operates with a fixed duration as the decision period, and the steps of workload prediction, SLA budget allocation, virtual machine provisioning and placement, and reinforcement learning scheduling are repeated in each period, thereby achieving end-to-end adaptive optimization under dynamic load conditions.
9. A cloud resource collaborative optimization apparatus for implementing the method of any of claims 1-8, comprising: a prediction module for predicting future task demand based on historical load; an SLA budget and resource allocation module for allocating violation budgets and calculating resource quotas by dynamic programming under a global SLA constraint; a virtual machine provisioning and placement module for generating virtual machines and performing anti-affinity placement based on the safe over-allocation principle; and a task scheduling module for performing two-layer load-balancing scheduling based on deep reinforcement learning.
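The budget allocation and quota calculation of claim 3 can be illustrated with a minimal Python sketch. It is not the claimed implementation. It assumes that resource cost is simply the number of virtual machines, that the violation budget is counted in whole time points, and that each application type has a known per-VM effective capacity; the names quota_for_budget and allocate_budget are hypothetical.

```python
import math

def quota_for_budget(predicted, violations):
    """Demand to provision when the `violations` highest-demand points may go unmet."""
    ordered = sorted(predicted, reverse=True)
    return ordered[violations] if violations < len(ordered) else 0

def allocate_budget(predictions, vm_capacity, total_budget):
    """Split a global violation budget across application types by dynamic programming.

    predictions  -- one predicted demand sequence per application type
    vm_capacity  -- effective capacity of one VM for each application type (assumed known)
    total_budget -- global number of allowed violations (assumed integer)
    Returns (total_vms, per-app budgets, per-app VM counts).
    """
    n = len(predictions)
    # vms[i][k]: VMs application i needs if it may violate at k time points
    vms = [[math.ceil(quota_for_budget(p, k) / c) for k in range(total_budget + 1)]
           for p, c in zip(predictions, vm_capacity)]
    # best[i][b]: minimum VM count for the first i applications using at most b violations
    best = [[0] * (total_budget + 1)] + [[math.inf] * (total_budget + 1) for _ in range(n)]
    choice = [[0] * (total_budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(total_budget + 1):
            for k in range(b + 1):
                cost = best[i - 1][b - k] + vms[i - 1][k]
                if cost < best[i][b]:
                    best[i][b], choice[i][b] = cost, k
    # Backtrack through the choice matrix to recover each application's budget
    budgets, b = [0] * n, total_budget
    for i in range(n, 0, -1):
        budgets[i - 1] = choice[i][b]
        b -= choice[i][b]
    vm_counts = [vms[i][budgets[i]] for i in range(n)]
    return best[n][total_budget], budgets, vm_counts
```

For two application types with predicted sequences [10, 40, 20, 30] and [5, 5, 50, 5], a per-VM capacity of 10 each and a global budget of one violation, the sketch gives the single violation to the second type, whose one demand spike would otherwise require four extra virtual machines.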
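The effective capacity rule of claim 4 reduces to a product followed by an upper clip. The short Python sketch below shows that arithmetic only; the function name, parameter names and sample values are hypothetical and are not taken from the patent.

```python
def effective_capacity(nominal, avg_utilization, safety_margin):
    """Nominal capacity x historical average utilization x safety margin factor,
    clipped so that the result never exceeds the nominal capacity."""
    raw = nominal * avg_utilization * safety_margin
    return min(raw, nominal)

# A VM with nominal capacity 100, 60% historical utilization and a 1.2 safety factor
typical = effective_capacity(100, 0.6, 1.2)
# A combination whose raw product (135) exceeds the nominal capacity is clipped to 100
clipped = effective_capacity(100, 0.9, 1.5)
```

The clip matters only when the product of utilization and safety margin exceeds one; otherwise the effective capacity is strictly below the nominal value.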
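One possible reading of the round-robin placement in claim 5 is sketched below in Python. It assumes homogeneous physical machines with a fixed number of VM slots and one slot per virtual machine, neither of which is stated in the claims; place_round_robin is a hypothetical name. The sketch first fixes the minimum number of physical machines and then deals virtual machines out in rotation, so that virtual machines of one application type are spread as evenly as possible.

```python
import math
from collections import Counter

def place_round_robin(vm_counts, slots_per_pm):
    """Spread VMs over the minimum number of physical machines.

    vm_counts    -- mapping of application type to number of VMs
    slots_per_pm -- VM slots on each physical machine (assumed identical)
    Returns one Counter per physical machine: application type -> VMs placed.
    """
    total = sum(vm_counts.values())
    num_pms = math.ceil(total / slots_per_pm)  # minimum machines needed
    pms = [Counter() for _ in range(num_pms)]
    cursor = 0
    for app, count in vm_counts.items():
        for _ in range(count):
            # Consecutive VMs of the same type land on different machines
            pms[cursor % num_pms][app] += 1
            cursor += 1
    return pms
```

Because the rotation continues across application types, no machine receives more than its slot count, and an application with c virtual machines places at most ceil(c / num_pms) of them on any one machine.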
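The reward of claim 6 and the temporal-difference target of claim 7 can be written out as a small Python sketch. The weights, the discount factor and the function names are assumptions chosen for illustration; the claims do not give their values, and the neural network that would produce the next-state Q-values is omitted.

```python
from statistics import pvariance

def dual_layer_reward(vm_loads, pm_loads, w_vm=0.5, w_pm=0.5):
    """Negative weighted sum of VM-layer and PM-layer load variance.

    The weights w_vm and w_pm are illustrative assumptions.
    """
    return -(w_vm * pvariance(vm_loads) + w_pm * pvariance(pm_loads))

def td_target(reward, next_q_values, gamma=0.9, done=False):
    """Standard one-step Q-learning target: r + gamma * max_a' Q(s', a').

    next_q_values -- Q-value estimates for the VMs eligible in the next state
    gamma         -- discount factor (assumed value)
    """
    if done or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)

# Unbalanced VMs but balanced physical machines: only the VM term is penalized
r = dual_layer_reward([1, 3], [2, 2])
y = td_target(r, [1.0, 2.0])
```

A perfectly balanced system yields a reward of zero, the maximum possible value, so the agent is pushed toward schedules that even out load at both layers at once.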

Description

Cloud data center resource collaborative optimization method with self-adaptive end-to-end service level

Technical Field

The invention relates to the technical field of cloud computing and data center resource management, and in particular to an end-to-end service-level-adaptive collaborative resource optimization method that optimizes resource allocation and task scheduling simultaneously under the constraint of a Service Level Agreement (SLA). The method is suitable for resource allocation and task scheduling in public cloud, private cloud and hybrid cloud data centers.

Background

With the rapid development of cloud computing and virtualization technologies, cloud data centers have become the infrastructure that supports internet services and artificial intelligence applications. Cloud service providers usually sign a Service Level Agreement (SLA) with users to agree on key indicators such as availability, response latency, resource supply capacity and compensation for breach. Against this background, academia and industry have proposed a large number of resource allocation and task scheduling schemes aimed at improving resource utilization while meeting the SLA. One typical class of methods adopts a static or semi-static allocation strategy that reserves ample computing resources according to the historical peak or an empirical upper bound in order to avoid breach; it is easy to implement but often leaves resources idle for long periods. Another class introduces a prediction mechanism, estimating future load with a grey model, time series analysis, a long short-term memory (LSTM) network or the like, and then performs elastic scaling or capacity planning according to the prediction, so as to alleviate over-reservation. Such work focuses mainly on how many virtual machines or how much computing capacity to allocate to the system, and does not consider in depth how tasks are scheduled.
At the task scheduling layer, the prior art studies how to distribute tasks arriving in real time among virtual machines given a fixed resource scale. Classical methods include heuristic algorithms that target completion time or makespan (such as Min-Min and Max-Min), rule-based priority scheduling strategies, and the reinforcement-learning-based scheduling methods that have emerged in recent years; some of this work learns scheduling policies in complex dynamic environments through deep reinforcement learning and effectively improves single-layer scheduling performance. Some studies also loosely couple "prediction-driven resource allocation" with "online task scheduling": for example, a prediction model gives the virtual machine demand, and round-robin, greedy or simple reinforcement learning then assigns tasks at run time. Such schemes are feasible in engineering terms, but they generally treat resource allocation and task scheduling as two independent stages joined only by a simple interface. They lack a unified view in mathematical modeling and optimization objectives, make it difficult to describe explicitly how global SLA constraints jointly affect decisions at both layers, and do not fully account for the coupling between load distribution at the virtual machine layer and at the physical machine layer.
Although the prior art has achieved considerable results in load prediction, elastic resource configuration, single-layer task scheduling and the like, several shortcomings remain when viewed as a whole. On the one hand, the resource configuration layer and the task scheduling layer are mostly split into two independent optimization problems: the upper layer determines the resource supply scale from peak values or prediction results, and the lower layer assigns tasks under the resulting resource constraint. Without unified modeling and linked control, globally suboptimal outcomes arise easily: either the resource layer appears sufficient but the scheduling layer produces local hot spots that cause SLA violations, or the scheduling policy is intelligent but the resource layer is so conservative that overall utilization stays low. On the other hand, in most work the SLA constraint appears only in the form "the long-term violation rate does not exceed a certain threshold" and is rarely refined into an allocatable and measurable "violation budget". There is neither a mechanism for differentiated allocation across application types nor a tool for fine-grained trade-offs between resource cost and violation risk, so it is difficult to control the global SLA and the resource cost at the same time within one framework. In addition, the existing method generally lacks a safe excess allocation mechanism aiming a