
US-12627729-B2 - Automated server workload management using machine learning

US12627729B2

Abstract

Systems and methods are disclosed for managing workload among server clusters. According to certain embodiments, the system may include a memory storing instructions and a processor. The processor may be configured to execute the instructions to determine historical behaviors of the server clusters in processing a workload. The processor may also be configured to execute the instructions to construct cost models for the server clusters based at least in part on the historical behaviors. Each cost model is configured to predict a processor utilization demand of a workload. The processor may further be configured to execute the instructions to receive a workload and determine efficiencies of processing the workload by the server clusters based at least in part on at least one of the cost models or an execution plan of the workload.
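The cost model described in the abstract can be illustrated as a simple regression over historical observations. The following is a minimal sketch, not the patented implementation: it assumes a hypothetical single feature (workload size) and fits a least-squares line mapping that feature to observed processor utilization, then predicts utilization demand for a new workload. All names and data are illustrative.

```python
# Hypothetical sketch of a per-cluster cost model: fit a least-squares
# line mapping workload size to observed CPU utilization, then predict
# the utilization demand of a new workload. Illustrative only.

def fit_cost_model(history):
    """history: list of (workload_size, cpu_utilization) observations."""
    n = len(history)
    sx = sum(x for x, _ in history)
    sy = sum(y for _, y in history)
    sxx = sum(x * x for x, _ in history)
    sxy = sum(x * y for x, y in history)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return lambda size: slope * size + intercept

# Example: three historical runs of (input size, observed CPU %).
history = [(100, 12.0), (200, 22.0), (400, 42.0)]
predict = fit_cost_model(history)
print(round(predict(300), 1))  # → 32.0, predicted CPU demand
```

In practice the patent contemplates many machine and job metrics rather than a single feature, so a real model would be multivariate; the sketch only shows the prediction role a cost model plays.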

Inventors

  • Subodh Kumar
  • Santosh BARDWAJ

Assignees

  • CAPITAL ONE SERVICES, LLC

Dates

Publication Date
2026-05-12
Application Date
2024-05-03

Claims (20)

  1. A computing system implemented as a server cluster, wherein the server cluster comprises a part of a workload management system, the computing system comprising: one or more processors programmed to execute instructions that configure the one or more processors to: receive a request for capacity information for the server cluster; determine, based on the capacity information, a processing cost for the server cluster to process a workload based on an evaluation of processing costs for component tasks of the workload; transmit, to the workload management system, (i) metadata indicating that the server cluster is able to process one or more portions of the workload and (ii) the processing cost; receive one or more of the component tasks, wherein the component tasks are distributed to a set of server clusters selected based on a processing cost determined for each of the set of server clusters; and transmit, to the workload management system, a result of processing the one or more of the component tasks.
  2. The computing system of claim 1, further comprising: linked nodes configured to collectively run one or more applications.
  3. The computing system of claim 1, wherein the one or more processors are further programmed to: store historical performance data of the server cluster in processing one or more workloads; and analyze, using one or more machine learning models, the historical performance data to identify behaviors of the server cluster for processing the one or more workloads.
  4. The computing system of claim 3, wherein the capacity information comprises hardware capacities of hardware resources of the server cluster, wherein the behaviors are based on the hardware capacities of the hardware resources and utilizations of the hardware resources, and wherein a processing speed of the server cluster is based on the hardware capacities.
  5. The computing system of claim 4, wherein the processing cost being determined comprises the one or more processors being configured to: construct a cost model to predict the processing cost based on the hardware capacities and utilizations of the hardware resources.
  6. The computing system of claim 4, wherein one or more machine learning models are run to determine a correlation between the utilizations of the hardware resources and the hardware capacities of the hardware resources to process workloads.
  7. The computing system of claim 1, wherein the processing cost is a lowest processing cost computed from among alternative plans for executing the workload on the server cluster.
  8. The computing system of claim 1, wherein the one or more processors are further configured to: determine an importance level of the workload, wherein the component tasks are assigned based on the importance level of the workload.
  9. One or more non-transitory computer-readable media storing computer program instructions that, when executed by one or more processors of a server cluster, effectuate operations comprising: receiving a request for capacity information for the server cluster; determining, based on the capacity information, a processing cost for the server cluster to process a workload based on an evaluation of processing costs for component tasks of the workload; transmitting, to a workload management system, (i) metadata indicating that the server cluster is able to process one or more portions of the workload and (ii) the processing cost; receiving one or more of the component tasks, wherein the component tasks are distributed to a set of server clusters selected based on a processing cost determined for each of the set of server clusters; and transmitting, to the workload management system, a result of processing the one or more of the component tasks.
  10. The one or more non-transitory computer-readable media of claim 9, wherein the server cluster comprises linked nodes configured to collectively run one or more applications.
  11. The one or more non-transitory computer-readable media of claim 9, wherein the operations further comprise: storing historical performance data of the server cluster in processing one or more workloads; and analyzing, using one or more machine learning models, the historical performance data to identify behaviors of the server cluster for processing the one or more workloads.
  12. The one or more non-transitory computer-readable media of claim 11, wherein the capacity information comprises hardware capacities of hardware resources of the server cluster, wherein the behaviors are based on the hardware capacities of the hardware resources and utilizations of the hardware resources, and wherein a processing speed of the server cluster is based on the hardware capacities.
  13. The one or more non-transitory computer-readable media of claim 12, wherein determining the processing cost comprises: constructing a cost model to predict the processing cost based on the hardware capacities and utilizations of the hardware resources.
  14. The one or more non-transitory computer-readable media of claim 12, wherein one or more machine learning models are run to determine a correlation between the utilizations of the hardware resources and the hardware capacities of the hardware resources to process workloads.
  15. The one or more non-transitory computer-readable media of claim 9, wherein the processing cost is a lowest processing cost computed from among alternative plans for executing the workload on the server cluster.
  16. The one or more non-transitory computer-readable media of claim 9, wherein the operations further comprise: determining an importance level of the workload, wherein the component tasks are assigned based on the importance level of the workload.
  17. A method, comprising: receiving a request for capacity information for a server cluster; determining, based on the capacity information, a processing cost for the server cluster to process a workload based on an evaluation of processing costs for component tasks of the workload; transmitting, to a workload management system, (i) metadata indicating that the server cluster is able to process one or more portions of the workload and (ii) the processing cost; receiving one or more of the component tasks, wherein the component tasks are distributed to a set of server clusters selected based on a processing cost determined for each of the set of server clusters; and transmitting, to the workload management system, a result of processing the one or more of the component tasks.
  18. The method of claim 17, further comprising: storing historical performance data of the server cluster in processing one or more workloads; and analyzing, using one or more machine learning models, the historical performance data to identify behaviors of the server cluster for processing the one or more workloads.
  19. The method of claim 18, wherein the capacity information comprises hardware capacities of hardware resources of the server cluster, wherein the behaviors are based on the hardware capacities of the hardware resources and utilizations of the hardware resources, wherein a processing speed of the server cluster is based on the hardware capacities, and wherein determining the processing cost comprises: constructing a cost model to predict the processing cost based on the hardware capacities and utilizations of the hardware resources.
  20. The method of claim 19, further comprising: running one or more machine learning models to determine a correlation between the utilizations of the hardware resources and the hardware capacities of the hardware resources to process workloads.
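The flow recited in the claims above can be sketched in a few lines: each cluster reports a processing cost, which per claim 7 is the lowest cost among its alternative execution plans, and the workload management system distributes component tasks to the cheapest cluster. This is an illustrative sketch under simplifying assumptions (static costs, one winning cluster), not the patented implementation; all names are hypothetical.

```python
# Illustrative sketch of the claimed distribution flow. Each cluster
# "bids" its lowest-cost execution plan; the management system assigns
# each component task to the cluster with the lowest reported cost.

def cluster_cost(plan_costs):
    """Report the lowest processing cost among alternative plans (claim 7)."""
    return min(plan_costs)

def assign_tasks(tasks, cluster_bids):
    """cluster_bids: {cluster_name: reported processing cost}."""
    assignments = {}
    for task in tasks:
        # Pick the cluster that reported the lowest processing cost.
        best = min(cluster_bids, key=cluster_bids.get)
        assignments.setdefault(best, []).append(task)
    return assignments

bids = {"cluster-a": cluster_cost([5.0, 3.5, 4.2]),
        "cluster-b": cluster_cost([6.1, 4.0])}
print(assign_tasks(["t1", "t2"], bids))  # → {'cluster-a': ['t1', 't2']}
```

A fuller model would update each cluster's cost as tasks are assigned (so load spreads across clusters) and would weight assignments by the workload's importance level, as claims 8, 16, and the specification contemplate.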

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 18/183,928, filed Mar. 14, 2023, which is a continuation of U.S. patent application Ser. No. 17/711,936, filed Apr. 1, 2022, which is a continuation of U.S. patent application Ser. No. 16/743,819, filed Jan. 15, 2020, which is a continuation of U.S. patent application Ser. No. 15/870,262, filed Jan. 12, 2018, which is a continuation of U.S. patent application Ser. No. 15/337,486, filed Oct. 28, 2016, which claims the benefit of priority of U.S. Provisional Application No. 62/248,166, filed Oct. 29, 2015, which applications are incorporated herein in their entirety by reference.

TECHNICAL FIELD

The present disclosure provides an automated system and method for managing workload amongst multiple computers, processors, and/or clusters of computers/processors. In particular, the disclosed system and method address problems related to optimizing computer processing efficiency by, among other things, using machine learning to study the utilization and performance of the computing resources and distributing workload amongst the computing resources based on the study.

BACKGROUND

The “Big Data” environment refers to a computing environment running computationally intensive and data-intensive jobs that cannot feasibly be implemented in a traditional manner on a single computing system. Thus, the Big Data environment often employs multiple types and generations of computing systems organized into server clusters, grids, data centers, and clouds. In this highly heterogeneous environment, different workloads compete for available hard resources, such as central processing unit (CPU) capacity, memory, storage space, input/output (I/O) channels, and network bandwidth, as well as soft resources, such as available server processes. Workload management is thus essential to ensuring that the use of all resources is optimized and that the workload is run with maximum efficiency.

Traditionally, administrators of the Big Data environment monitor the environment and track any abnormalities. For example, in an environment containing multiple server clusters, the administrators may frequently move workloads from overloaded clusters to lightly used clusters. Also, for example, the administrators may use knowledge acquired over years to identify jobs that are inefficient and take corrective actions, such as terminating the jobs or providing recommendations about how to improve coding quality based on observed behaviors of the jobs. But as the Big Data environment becomes increasingly complex and ever-changing, the administrators face at least three challenges. First, accurate workload management requires analysis of multiple machine and job metrics. Hundreds of metrics and their correlations may be needed to paint a complete picture of workload complexities, software dependencies, resource utilizations, and hardware configurations. It may be impossible for the administrators to monitor these metrics with enough granularity to effectively account for abnormalities. Second, multiple tools are used to access data in the Big Data environment, and the different tools have different behaviors. Because different jobs may be coded using different tools, it is very difficult for the administrators to diagnose the coding quality of the jobs and to give useful recommendations. Third, Big Data systems change behavior when the underlying hardware configurations and capacities change, a continuing process that administrators cannot readily observe, much less account for. For example, if new server clusters are added into the environment or old servers in a cluster are replaced with new ones, the administrators cannot readily adjust their understanding of the hardware resources and thus cannot provide accurate advice. For the above reasons, current workload management in the Big Data environment is mainly reactive in nature.

Because there is no mechanism to predict how a job will behave in the environment and what the cost to process the job will be, existing systems can only take remedial measures after system anomalies are detected and many hours of computing power have been wasted. Moreover, because the skills and experience of administrators vary, it is impossible to provide consistent and automated guidance to manage the Big Data environment. In view of the shortcomings and problems with traditional workload management systems, an improved system and method for server workload management is desired.

SUMMARY

The disclosed embodiments provide methods and systems for automated server workload management using machine learning. In particular, the disclosed systems and methods predict a distributed computing system's efficiency in processing a workload by using machine learning algorithms to study the historical behavior of the system. Thus, proactive measures may be taken to manage the workload and prevent system abnormalities. Moreov