
US-12625735-B2 - Proactive resource provisioning in large-scale cloud service with intelligent pooling


Abstract

The present application relates to a network, apparatus, and method for allocating clusters of computing nodes for programming jobs. A network includes a plurality of datacenters including computing resources configurable to instantiate nodes for executing programming jobs on a cluster. The computing resources at one of the datacenters are configured to: provision a live pool including a number of clusters, each cluster in the live pool including a plurality of nodes imaged with a configuration for executing the programming jobs in parallel on the cluster; receive a request from a user to execute a programming job; allocate a cluster from the live pool to the user for the programming job when the cluster is available; evict the cluster from the live pool; and provision a new cluster within the live pool to meet the number of clusters. The number of clusters may be optimized based on linear programming and machine-learning.
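As an illustration of the pooling mechanism summarized above (not part of the patent text), the allocate-evict-replenish cycle can be sketched in Python. All class, method, and parameter names here are assumptions for illustration, not the patent's API:

```python
from collections import deque

class LivePool:
    """Illustrative live pool: keeps `target_size` pre-imaged clusters warm.

    `provision_cluster` stands in for the slow imaging/configuration path
    described in the patent; these names are assumptions, not the patent's.
    """

    def __init__(self, target_size, provision_cluster):
        self.target_size = target_size
        self.provision_cluster = provision_cluster
        # Pre-provision the pool so requests can be served immediately.
        self.pool = deque(provision_cluster() for _ in range(target_size))

    def allocate(self):
        """Hand a warm cluster to a user, evict it from the pool, replenish."""
        if self.pool:
            cluster = self.pool.popleft()       # warm path: cluster is ready
        else:
            cluster = self.provision_cluster()  # cold path: user waits
        self.replenish()
        return cluster

    def replenish(self):
        # Provision new clusters until the target pool size is met again.
        while len(self.pool) < self.target_size:
            self.pool.append(self.provision_cluster())
```

Eviction on allocation plus replenishment keeps the pool at its target size, so most requests skip the slow provisioning path entirely.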

Inventors

  • Yiwen Zhu
  • Yoonjae Park
  • Niharika Dutta
  • Santhosh Kumar Ravindran
  • Alex Yeo
  • Harsha Nihanth Nagulapalli
  • Sumeet Khushalani
  • Arijit Tarafdar
  • Subramaniam Venkatraman Krishnan
  • Deepak Ravikumar
  • Andrew Francis Fogarty
  • Steve D. Suh

Assignees

  • Microsoft Technology Licensing, LLC

Dates

Publication Date
2026-05-12
Application Date
2023-01-10

Claims (17)

  1. An apparatus for provisioning resources for requests to execute programming jobs on a cluster of nodes, comprising: a memory storing computer-executable instructions; and at least one processor configured to execute the computer-executable instructions to: provision a live pool including a number of clusters, each cluster in the live pool including a plurality of nodes imaged with a configuration for executing the programming jobs in parallel on the cluster; receive a request from a user to execute a programming job; allocate a cluster from the live pool to the user for the programming job when the cluster is available; evict the cluster from the live pool in response to allocating the cluster to the user; provision a new cluster within the live pool to meet the number of clusters in response to evicting the cluster; and dynamically scale the number of clusters in the live pool based on a joint optimization of predicted allocation latency when no cluster is available and a predicted idle time of clusters in the live pool, wherein the joint optimization includes a linear program configured to optimize the predicted allocation latency when no cluster is available and the predicted idle time of the number of clusters in the live pool based on at least a rate of requests from users and a hyperparameter that weights the predicted allocation latency when no cluster is available or the predicted idle time of clusters in the live pool.
  2. The apparatus of claim 1, wherein the at least one processor is further configured to execute the instructions to cause the apparatus to: receive an indication of an acceptable allocation latency; calculate the number of clusters that optimizes the predicted allocation latency when no cluster is available and the predicted idle time for a plurality of values of the hyperparameter; and select one of the plurality of values of the hyperparameter that satisfies the acceptable allocation latency.
  3. The apparatus of claim 1, wherein the rate of requests from users is a predicted rate of requests for a time period based on a machine-learning model trained to forecast a time series.
  4. The apparatus of claim 3, wherein to dynamically scale the number of clusters, the at least one processor is configured to execute the instructions to cause the apparatus to: train the machine-learning model based on historical request rates to forecast the time series; and provide the forecasted time series to the linear program to optimize the number of clusters in the live pool.
  5. The apparatus of claim 3, wherein to dynamically scale the number of clusters, the at least one processor is configured to execute the instructions to cause the apparatus to: apply historical request rates to the linear program to determine a ground truth of historical optimal pool size for the historical request rates; and train the machine-learning model to forecast a time series of an optimal value of the number of clusters based on the historical optimal pool size.
  6. The apparatus of claim 3, wherein the machine-learning model is a hybrid model including a singular spectrum analysis (SSA) and a neural network with at least two layers.
  7. The apparatus of claim 1, wherein to provision the live pool, the at least one processor is configured to execute the instructions to cause the apparatus to establish an actively running programming job session on each cluster in the live pool.
  8. The apparatus of claim 1, wherein to allocate the cluster from the live pool to the user for the programming job, the at least one processor is configured to execute the instructions to cause the apparatus to configure the cluster to execute a batch job.
  9. A method of provisioning resources for requests to execute programming jobs on a cluster of nodes, comprising: provisioning a live pool including a number of clusters, each cluster in the live pool including a plurality of nodes imaged with a configuration for executing the programming jobs in parallel on the cluster; receiving a request from a user to execute a programming job; allocating a cluster from the live pool to the user for the programming job when the cluster is available; evicting the cluster from the live pool in response to allocating the cluster to the user; provisioning a new cluster within the live pool to meet the number of clusters in response to evicting the cluster; and dynamically scaling the number of clusters in the live pool based on a joint optimization of predicted allocation latency when no cluster is available and a predicted idle time of clusters in the live pool, wherein the joint optimization includes a linear program configured to optimize the predicted allocation latency when no cluster is available and the predicted idle time of the number of clusters in the live pool based on at least a rate of requests from users and a hyperparameter that weights the predicted allocation latency when no cluster is available or the predicted idle time of clusters in the live pool.
  10. The method of claim 9, further comprising: receiving an indication of an acceptable allocation latency; calculating the number of clusters that optimizes the predicted allocation latency when no cluster is available and the predicted idle time for a plurality of values of the hyperparameter; and selecting one of the plurality of values of the hyperparameter that satisfies the acceptable allocation latency.
  11. The method of claim 9, wherein the rate of requests from users is a predicted rate of requests for a time period based on a machine-learning model trained to forecast a time series.
  12. The method of claim 11, wherein dynamically scaling the number of clusters comprises: training the machine-learning model based on historical request rates to forecast the time series; and providing the forecasted time series to the linear program to optimize the number of clusters in the live pool.
  13. The method of claim 11, wherein dynamically scaling the number of clusters comprises: applying historical request rates to the linear program to determine a ground truth of historical optimal pool size for the historical request rates; and training the machine-learning model to forecast a time series of an optimal value of the number of clusters based on the historical optimal pool size.
  14. The method of claim 11, wherein the machine-learning model is a hybrid model including a singular spectrum analysis (SSA) and a neural network with at least two layers.
  15. The method of claim 9, wherein provisioning the live pool comprises establishing an actively running programming job session on each cluster in the live pool.
  16. The method of claim 9, wherein allocating the cluster from the live pool to the user for the programming job comprises configuring the cluster to execute a batch job.
  17. A wide area network, comprising: a plurality of datacenters, each datacenter including computing resources configurable to instantiate at least one node of a cluster for executing programming jobs on the cluster of nodes, wherein the computing resources at one or more of the plurality of datacenters are configured to: provision a live pool including a number of clusters, each cluster in the live pool including a plurality of nodes imaged with a configuration for executing the programming jobs in parallel on the cluster; receive a request from a user to execute a programming job; allocate a cluster from the live pool to the user for the programming job when the cluster is available; evict the cluster from the live pool in response to allocating the cluster to the user; provision a new cluster within the live pool to meet the number of clusters in response to evicting the cluster; and dynamically scale the number of clusters in the live pool based on a joint optimization of predicted allocation latency when no cluster is available and a predicted idle time of clusters in the live pool, wherein the joint optimization includes a linear program configured to optimize the predicted allocation latency when no cluster is available and the predicted idle time of the number of clusters in the live pool based on at least a rate of requests from users and a hyperparameter that weights the predicted allocation latency when no cluster is available or the predicted idle time of clusters in the live pool.
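The joint optimization recited in claims 1, 9, and 17 weighs predicted allocation latency against predicted pool idle time using a hyperparameter. The claims specify a linear program over the request rate; the sketch below (not the patent's formulation) stands in with a brute-force search over integer pool sizes and a deliberately simple cost model. The functions expected_costs and optimal_pool_size, and the cost model itself, are illustrative assumptions:

```python
def expected_costs(pool_size, request_rate, provision_time):
    """Toy cost model (an assumption, not the patent's): requests beyond the
    warm pool pay the full provisioning latency; warm clusters that outpace
    demand sit idle for the interval."""
    # Expected number of requests arriving within one provisioning interval.
    expected_requests = request_rate * provision_time
    misses = max(0.0, expected_requests - pool_size)  # requests with no warm cluster
    idle = max(0.0, pool_size - expected_requests)    # warm clusters left unused
    latency = misses * provision_time                 # latency paid on the cold path
    return latency, idle

def optimal_pool_size(request_rate, provision_time, alpha, max_size=64):
    """Minimize alpha * predicted latency + (1 - alpha) * predicted idle time.

    `alpha` plays the role of the weighting hyperparameter from claim 1; the
    patent solves a linear program, for which this integer search stands in.
    """
    best_size, best_cost = 0, float("inf")
    for n in range(max_size + 1):
        latency, idle = expected_costs(n, request_rate, provision_time)
        cost = alpha * latency + (1 - alpha) * idle
        if cost < best_cost:
            best_size, best_cost = n, cost
    return best_size
```

For instance, with request_rate=0.1 requests per second and provision_time=60 seconds, the expected demand over one provisioning interval is six clusters, and the search returns a pool size of 6 for any mid-range alpha.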

Description

BACKGROUND

A cloud network may be implemented on a wide area network (WAN) that includes computing resources spread across a geographic region and connected via communication links such as fiber optic cables or satellite connectivity. A cloud provider may host cloud applications for its clients. For example, a cloud provider may provide infrastructure as a service (IaaS) offerings such as virtual machines (VMs), platform as a service (PaaS) offerings such as databases and serverless computing, and software as a service (SaaS) offerings such as authentication platforms. The size of a WAN may vary greatly, from a small city to a global network. For example, a WAN may connect multiple offices of an enterprise, the customers of a regional telecommunications operator, or a global enterprise. The computing resources and connections within a WAN may be owned and controlled by the WAN operator.

Cloud computing has emerged as a top choice for executing big data analytic workloads in various business domains. To cope with rapidly increasing demand, cloud vendors have funneled sizable resources into their own managed programming job services. For instance, Apache Spark is an example of a programming job service that executes on a coordinated cluster of virtual machines (VMs). Cloud vendors have implemented Spark services as IaaS, providing a cluster computing infrastructure to users. The flexibility of such cloud offerings allows users to easily lease and release compute resources and, consequently, enjoy potentially significant cost savings. However, to provide such flexibility, service providers must address various challenges with respect to resource provisioning. The complexity of such multi-tenancy and flexibility introduces a long latency to access clusters (due to resource provisioning, network configuration, authentication, imaging, etc.), impacting the quality of service (QoS). Even with significant effort to reduce session preparation latency (such as starting jobs before all nodes are ready or introducing "healing" to initiate more VMs in case of long provisioning times), a long wait occurs between a user request and availability of the cluster. Observed cluster initialization time is typically greater than 60 seconds and can even exceed the execution time of the job itself. Accordingly, there is a need to reduce the time that a customer waits for a cluster to be available for executing a programming job.

SUMMARY

The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
In some aspects, the techniques described herein relate to an apparatus for provisioning resources for requests to execute programming jobs on a cluster of nodes, including: a memory storing computer-executable instructions; and at least one processor configured to execute the computer-executable instructions to: provision a live pool including a number of clusters, each cluster in the live pool including a plurality of nodes imaged with a configuration for executing the programming jobs in parallel on the cluster; receive a request from a user to execute a programming job; allocate a cluster from the live pool to the user for the programming job when the cluster is available; evict the cluster from the live pool; and provision a new cluster within the live pool to meet the number of clusters.

In some aspects, the techniques described herein relate to an apparatus, wherein the at least one processor is configured to execute the instructions to cause the apparatus to dynamically scale the number of clusters in the live pool based on a joint optimization of predicted allocation latency when no cluster is available and a predicted idle time of clusters in the live pool.

In some aspects, the techniques described herein relate to an apparatus, wherein the joint optimization includes a linear program configured to optimize the predicted allocation latency when no cluster is available and the predicted idle time of the number of clusters in the live pool based on at least a rate of requests from users and a hyperparameter that weights the predicted allocation latency when no cluster is available or the predicted idle time of clusters in the live pool.

In some aspects, the techniques described herein relate to an apparatus, wherein the at least one processor is further configured to execute the instructions to cause the apparatus to: receive an indication of an acceptable allocation latency; calculate the number of clusters that optimizes the predicted allocation latency when no cluster is available and the predicted idle time for a plurality of values of the hyperparameter; and select one of the plurality of values of the hyperparameter that satisfies the acceptable allocation latency.
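The hyperparameter-selection procedure in the last passage (also recited in claim 2) can be sketched on top of the hypothetical optimal_pool_size and expected_costs helpers from the sketch following the claims; again, the names and cost model are assumptions:

```python
def select_hyperparameter(acceptable_latency, request_rate, provision_time,
                          candidate_alphas=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Per claim 2: compute the optimal pool size for several values of the
    weighting hyperparameter, then select one whose predicted allocation
    latency satisfies the acceptable latency."""
    for alpha in sorted(candidate_alphas):  # low alpha penalizes idle time more
        n = optimal_pool_size(request_rate, provision_time, alpha)
        latency, _ = expected_costs(n, request_rate, provision_time)
        if latency <= acceptable_latency:
            return alpha, n
    raise ValueError("no candidate hyperparameter meets the acceptable latency")
```

Because the toy cost model can reach zero predicted latency, the first candidate often satisfies the constraint; with a realistic latency model the loop would trade pool idle cost against the acceptable-latency target, as claim 2 describes.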