US-12620299-B2 - Server failure prediction using machine learning

Abstract

A method comprises collecting operational data corresponding to one or more servers, wherein the operational data comprises a plurality of values corresponding to at least one metric, and analyzing the operational data using one or more time series forecasting machine learning algorithms to predict a plurality of future values corresponding to the at least one metric. The plurality of the future values are compared to at least one threshold value for the at least one metric to determine whether at least a subset of the plurality of the future values satisfies one or more conditions associated with the at least one threshold value. An alert corresponding to operation of the one or more servers is automatically generated responsive to at least the subset of the plurality of the future values satisfying the one or more conditions. The alert is transmitted to at least one user device.

Inventors

Ooi Aik Hooi
Parminder Singh Sethi

Assignees

DELL PRODUCTS L.P.

Dates

Publication Date: 2026-05-05
Application Date: 2023-11-29

Claims (20)

1 . A method comprising: collecting operational data corresponding to one or more servers, wherein the operational data comprises a plurality of values corresponding to at least one metric; analyzing the operational data using one or more time series forecasting machine learning algorithms to predict a plurality of future values corresponding to the at least one metric; comparing the plurality of the future values to at least one threshold value for the at least one metric to determine whether at least a subset of the plurality of the future values satisfies one or more conditions associated with the at least one threshold value; and automatically generating an alert corresponding to operation of the one or more servers responsive to at least the subset of the plurality of the future values satisfying the one or more conditions, wherein the alert is transmittable to at least one user device; wherein the subset of the plurality of the future values comprises a grouping of the plurality of future values over a designated time period, and the one or more conditions comprises a data pattern generated by the grouping of the plurality of future values over the designated time period at least one of meeting and exceeding the at least one threshold value; wherein the steps of the method are executed by a processing device operatively coupled to a memory.
2 . The method of claim 1 wherein the alert identifies at least one issue with the one or more servers and at least one remedial action to address the at least one issue.
3 . The method of claim 1 wherein the at least one metric comprises at least one of input-output operations per second, throughput, latency, central processing unit utilization, storage utilization, memory utilization, bandwidth and replication lag time.
4 . The method of claim 1 wherein the operational data comprises historical data and live data.
5 . The method of claim 1 wherein: the one or more conditions comprises a majority of the data pattern generated by the grouping of the plurality of future values over the designated time period at least one of meeting and exceeding the at least one threshold value.
6 . The method of claim 1 wherein the one or more time series forecasting machine learning algorithms are based at least in part on a logistic growth function.
7 . The method of claim 1 wherein the operational data that is analyzed by the one or more time series forecasting machine learning algorithms is cumulative over a designated time period.
8 . The method of claim 1 wherein the one or more time series forecasting machine learning algorithms are configured to automatically identify a set of hyperparameters for predicting the plurality of future values over a future time period.
9 . The method of claim 1 further comprising generating a first logarithmic curve corresponding to the plurality of values over a past time period and generating a second logarithmic curve corresponding to the plurality of future values over a future time period.
10 . The method of claim 9 wherein the one or more conditions correspond to differences between the second logarithmic curve and a line representing a constant value over the future time period, wherein the constant value is equal to the at least one threshold value.
11 . An apparatus comprising: a processing device operatively coupled to a memory and configured: to collect operational data corresponding to one or more servers, wherein the operational data comprises a plurality of values corresponding to at least one metric; to analyze the operational data using one or more time series forecasting machine learning algorithms to predict a plurality of future values corresponding to the at least one metric; to compare the plurality of the future values to at least one threshold value for the at least one metric to determine whether at least a subset of the plurality of the future values satisfies one or more conditions associated with the at least one threshold value; and to automatically generate an alert corresponding to operation of the one or more servers responsive to at least the subset of the plurality of the future values satisfying the one or more conditions, wherein the alert is transmittable to at least one user device; wherein the subset of the plurality of the future values comprises a grouping of the plurality of future values over a designated time period, and the one or more conditions comprises a data pattern generated by the grouping of the plurality of future values over the designated time period at least one of meeting and exceeding the at least one threshold value.
12 . The apparatus of claim 11 wherein: the one or more conditions comprises a majority of the data pattern generated by the grouping of the plurality of future values over the designated time period at least one of meeting and exceeding the at least one threshold value.
13 . The apparatus of claim 11 wherein the one or more time series forecasting machine learning algorithms are configured to automatically identify a set of hyperparameters for predicting the plurality of future values over a future time period.
14 . The apparatus of claim 11 wherein the processing device is further configured to generate a first logarithmic curve corresponding to the plurality of values over a past time period and to generate a second logarithmic curve corresponding to the plurality of future values over a future time period.
15 . The apparatus of claim 14 wherein the one or more conditions correspond to differences between the second logarithmic curve and a line representing a constant value over the future time period, wherein the constant value is equal to the at least one threshold value.
16 . An article of manufacture comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes said at least one processing device to perform the steps of: collecting operational data corresponding to one or more servers, wherein the operational data comprises a plurality of values corresponding to at least one metric; analyzing the operational data using one or more time series forecasting machine learning algorithms to predict a plurality of future values corresponding to the at least one metric; comparing the plurality of the future values to at least one threshold value for the at least one metric to determine whether at least a subset of the plurality of the future values satisfies one or more conditions associated with the at least one threshold value; and automatically generating an alert corresponding to operation of the one or more servers responsive to at least the subset of the plurality of the future values satisfying the one or more conditions, wherein the alert is transmittable to at least one user device; wherein the subset of the plurality of the future values comprises a grouping of the plurality of future values over a designated time period, and the one or more conditions comprises a data pattern generated by the grouping of the plurality of future values over the designated time period at least one of meeting and exceeding the at least one threshold value.
17 . The article of manufacture of claim 16 wherein: the one or more conditions comprises a majority of the data pattern generated by the grouping of the plurality of future values over the designated time period at least one of meeting and exceeding the at least one threshold value.
18 . The article of manufacture of claim 16 wherein the one or more time series forecasting machine learning algorithms are configured to automatically identify a set of hyperparameters for predicting the plurality of future values over a future time period.
19 . The article of manufacture of claim 16 wherein the program code further causes said at least one processing device to perform the steps of generating a first logarithmic curve corresponding to the plurality of values over a past time period and generating a second logarithmic curve corresponding to the plurality of future values over a future time period.
20 . The article of manufacture of claim 19 wherein the one or more conditions correspond to differences between the second logarithmic curve and a line representing a constant value over the future time period, wherein the constant value is equal to the at least one threshold value.
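
As an illustration only, and not as part of the claims, the threshold conditions recited in claims 1, 5 and 10 can be sketched in Python. The function names and sample values are hypothetical, and reading "majority" as more than half of the grouped future values is an assumption made for this sketch; the claims do not fix a particular implementation.

```python
from typing import Sequence


def window_exceedances(forecast: Sequence[float], threshold: float) -> list[bool]:
    """Flag each grouped future value that meets or exceeds the threshold."""
    return [value >= threshold for value in forecast]


def majority_condition(forecast: Sequence[float], threshold: float) -> bool:
    """Return True when more than half of the grouped future values meet or
    exceed the threshold (an assumed reading of the majority condition)."""
    flags = window_exceedances(forecast, threshold)
    return sum(flags) * 2 > len(flags)


def threshold_differences(forecast: Sequence[float], threshold: float) -> list[float]:
    """Signed difference between each future value and the constant line
    whose value equals the threshold."""
    return [value - threshold for value in forecast]


if __name__ == "__main__":
    # Hypothetical predicted CPU utilization (%) over a designated time period.
    predicted = [70.0, 86.0, 91.0, 88.0, 79.0, 93.0]
    print(majority_condition(predicted, threshold=85.0))
    print(threshold_differences(predicted, threshold=85.0))
```

Evaluating the condition over a grouping of future values, rather than on a single value, is what lets an isolated spike be ignored while a sustained pattern triggers the alert.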

Description

FIELD

The field relates generally to information processing systems, and more particularly to server management in such information processing systems.

BACKGROUND

Monitoring platforms for databases and other types of systems collect data corresponding to device operation. Current monitoring mechanisms are reactive to operational problems. As a result, alerts about device issues do not reach administrators or technical support personnel until after the occurrence of device failure or degradation. Additionally, current reactive monitoring mechanisms can result in false alerts or alerts that may not be able to be acted on because detected operational data can change in a short period of time. For example, once a notification reaches an administrator, detected values may no longer be visible in a system.

SUMMARY

Embodiments provide a failure prediction and resolution recommendation platform in an information processing system. For example, in one embodiment, a method comprises collecting operational data corresponding to one or more servers, wherein the operational data comprises a plurality of values corresponding to at least one metric, and analyzing the operational data using one or more time series forecasting machine learning algorithms to predict a plurality of future values corresponding to the at least one metric. The plurality of the future values are compared to at least one threshold value for the at least one metric to determine whether at least a subset of the plurality of the future values satisfies one or more conditions associated with the at least one threshold value. An alert corresponding to operation of the one or more servers is automatically generated responsive to at least the subset of the plurality of the future values satisfying the one or more conditions. The alert is transmitted to at least one user device.
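
For illustration, the collect, predict, compare and alert sequence summarized above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the platform's implementation: a least-squares straight line stands in for the time series forecasting machine learning algorithms, the condition is simplified to any future value meeting or exceeding the threshold, and the `Alert` fields, function names and remedial-action text are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Alert:
    metric: str
    issue: str
    remedial_action: str
    steps_until_breach: int


def linear_forecast(values: Sequence[float], horizon: int) -> list[float]:
    """Stand-in forecaster: fit a least-squares line to the collected values
    and extend it `horizon` steps past the last observation."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n + step) for step in range(horizon)]


def predict_and_alert(
    metric: str, values: Sequence[float], threshold: float, horizon: int
) -> Optional[Alert]:
    """Forecast future values and return an Alert if any of them meets or
    exceeds the threshold (simplified condition, assumed for this sketch)."""
    future = linear_forecast(values, horizon)
    for step, value in enumerate(future, start=1):
        if value >= threshold:
            return Alert(
                metric=metric,
                issue=f"{metric} is forecast to reach {threshold} in {step} step(s)",
                remedial_action="placeholder: review capacity for this metric",
                steps_until_breach=step,
            )
    return None


if __name__ == "__main__":
    # Hypothetical storage utilization samples (%), one per collection interval.
    history = [50.0, 55.0, 60.0, 65.0, 70.0]
    print(predict_and_alert("storage_utilization", history, threshold=85.0, horizon=4))
```

Because the comparison runs on predicted rather than observed values, the alert in this sketch is raised before the metric actually reaches the threshold, which is the contrast with reactive monitoring drawn in the Background.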
Further illustrative embodiments are provided in the form of a non-transitory computer-readable storage medium having embodied therein executable program code that when executed by a processor causes the processor to perform the above steps. Still further illustrative embodiments comprise an apparatus with a processor and a memory configured to perform the above steps.

These and other features and advantages of embodiments described herein will become more apparent from the accompanying drawings and the following detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a failure prediction and resolution recommendation platform for predicting server failure and recommending remedial actions to avoid such failure in an illustrative embodiment.

FIG. 2 depicts an architecture of a failure prediction and resolution recommendation platform on a server in an illustrative embodiment.

FIG. 3 depicts a first graph illustrating plots of actual and predicted operational data with respect to a designated threshold for triggering an alert in an illustrative embodiment.

FIG. 4 depicts a second graph illustrating plots of actual and predicted operational data with respect to a designated threshold for triggering an alert in an illustrative embodiment.

FIG. 5 depicts a process for server failure prediction according to an illustrative embodiment.

FIGS. 6 and 7 show examples of processing platforms that may be utilized to implement at least a portion of an information processing system according to illustrative embodiments.

DETAILED DESCRIPTION

Illustrative embodiments will be described herein with reference to exemplary information processing systems and associated computers, servers, storage devices and other processing devices. It is to be appreciated, however, that embodiments are not restricted to use with the particular illustrative system and device configurations shown.
Accordingly, the term “information processing system” as used herein is intended to be broadly construed, so as to encompass, for example, processing systems comprising cloud computing and storage systems, as well as other types of processing systems comprising various combinations of physical and virtual processing resources. An information processing system may therefore comprise, for example, at least one data center or other type of cloud-based system that includes one or more clouds hosting tenants that access cloud resources. Such systems are considered examples of what are more generally referred to herein as cloud-based computing environments. Some cloud infrastructures are within the exclusive control and management of a given enterprise, and therefore are considered “private clouds.” The term “enterprise” as used herein is intended to be broadly construed, and may comprise, for example, one or more businesses, one or more corporations or any other one or more entities, groups, or organizations. An “entity” as illustratively used herein may be a person or system. On the other hand, cloud infrastructures that are used by multiple enterprises, and not necessarily controlled or managed by any of the mult