DE-112024002960-T5 - AUTOMATIC DETECTION AND RESOLUTION OF HIGH PROCESSOR LOAD IN NETWORK DEVICES

Abstract

A network management system can collect processor utilization statistics from one or more network devices. The network management system can determine aggregated processor utilization statistics for each of these devices over a specific time period. Based on an aggregated total processor utilization for a particular network device exceeding a baseline threshold, the system can analyze the aggregated processor utilization per process for that specific network device to identify one or more processes as the cause of the anomalous behavior of that network device. The network management system can then generate a corrective action to address the cause.

Inventors

Ruchit Rajkumar Mehta
Abhiram Madhugiri Shamsundar
Wenfeng Wang
Priti Pallavi

Assignees

JUNIPER NETWORKS, INC.

Dates

Publication Date: 20260513
Application Date: 20240823
Priority Date: 20230824

Claims (20)

A network management system comprising: a memory; and one or more processors coupled to the memory and configured to: collect processor utilization statistics of one or more network devices; determine, based on the processor utilization statistics, aggregated processor utilization statistics over a time window for a given network device of the one or more network devices; based on an aggregated total processor utilization for the given network device exceeding a baseline threshold, analyze the aggregated processor utilization per process for the given network device to identify one or more processes as the cause of anomalous behavior of the given network device; and generate a corrective action to address the cause.
The network management system according to Claim 1, wherein, to determine the aggregated processor utilization statistics for the given network device, the one or more processors are configured to determine, for the given network device, a number of instances within the time window in which the total processor utilization of the given network device exceeds a specified high processor utilization threshold.
The network management system according to Claim 2, wherein the one or more processors are further configured to: determine that the number of instances within the time window in which the total processor utilization of the given network device exceeds the specified high processor utilization threshold is greater than a high processor utilization frequency threshold; and, based on the determination that the number of instances within the time window in which the total processor utilization of the given network device exceeds the specified high processor utilization threshold is greater than the high processor utilization frequency threshold, determine that the aggregated total processor utilization for the given network device exceeds the baseline threshold.
The network management system according to one of Claims 1 to 3, wherein, to determine the aggregated processor utilization statistics for the given network device, the one or more processors are configured to determine at least one of the following values for the given network device: an average total processor utilization of the given network device within the time window or an average processor utilization of each process running on the given network device within the time window.
The network management system according to Claim 4, wherein, to analyze the aggregated processor utilization per process for the given network device, the one or more processors are further configured to: determine all network traffic routed through the given network device during the time window; and determine, based on all network traffic routed through the given network device during the time window and the average processor utilization of each process running on the given network device within the time window, the one or more processes as the cause of the anomalous behavior of the given network device.
The network management system according to Claim 5, wherein, to determine the one or more processes as the cause of the anomalous behavior of the given network device, the one or more processors are further configured to input all network traffic routed through the given network device during the time window and the average processor utilization of each process running on the given network device within the time window into an anomaly detection model to identify the one or more processes as the cause of the anomalous behavior of the given network device.
The network management system according to Claim 6, wherein the anomaly detection model is trained using machine learning to perform heuristic-based detection of anomalous behaviors that are the cause of high processor utilization by network devices.
The network management system according to Claim 6 or 7, wherein the anomaly detection model outputs an anomaly value, and wherein the one or more processors are further configured to: determine that the anomaly value output by the anomaly detection model is greater than an anomaly value threshold; and, based on the determination that the anomaly value is greater than the anomaly value threshold, determine that the high processor utilization of the given network device is caused by the anomalous behavior of the given network device.
The network management system according to one of Claims 1 to 8, wherein, to generate the corrective action, the one or more processors are configured to automatically terminate the one or more processes identified as the cause of the anomalous behavior of the given network device.
The network management system according to Claim 9, wherein the one or more processes include a system space process that has been whitelisted for automatic termination.
A method comprising: collecting, by one or more processors of a network management system, processor utilization statistics of one or more network devices; determining, by the one or more processors and based on the processor utilization statistics, aggregated processor utilization statistics over a time window for a given network device of the one or more network devices; based on an aggregated total processor utilization for the given network device exceeding a baseline threshold, analyzing, by the one or more processors, the aggregated processor utilization per process for the given network device to identify one or more processes as the cause of anomalous behavior of the given network device; and generating, by the one or more processors, a corrective action to address the cause.
The method according to Claim 11, wherein determining the aggregated processor utilization statistics for the given network device further includes determining, by the one or more processors and for the given network device, a number of instances within the time window in which the total processor utilization of the given network device exceeds a specified high processor utilization threshold.
The method according to Claim 12, further comprising: determining, by the one or more processors, that the number of instances within the time window in which the total processor utilization of the given network device exceeds the specified high processor utilization threshold is greater than a high processor utilization frequency threshold; and, based on the determination that the number of instances within the time window in which the total processor utilization of the given network device exceeds the specified high processor utilization threshold is greater than the high processor utilization frequency threshold, determining, by the one or more processors, that the aggregated total processor utilization for the given network device exceeds the baseline threshold.
The method according to one of Claims 11 to 13, wherein determining the aggregated processor utilization statistics over the time window based on the processor utilization statistics further comprises: determining, by the one or more processors and for the given network device, at least one of an average total processor utilization of the given network device within the time window or an average processor utilization of each process running on the given network device within the time window.
The method according to Claim 14, wherein analyzing the aggregated processor utilization per process for the given network device further comprises: determining, by the one or more processors, all network traffic routed through the given network device during the time window; and determining, by the one or more processors and based on all network traffic routed through the given network device during the time window and the average processor utilization of each process running on the given network device within the time window, the one or more processes as the cause of the anomalous behavior of the given network device.
The method according to Claim 15, wherein determining the one or more processes as the cause of the anomalous behavior of the given network device further includes inputting, by the one or more processors, all network traffic routed through the given network device during the time window and the average processor utilization of each process running on the given network device within the time window into an anomaly detection model to identify the one or more processes as the cause of the anomalous behavior of the given network device.
The method according to Claim 16, wherein the anomaly detection model is trained using machine learning to perform heuristic-based detection of anomalous behaviors that are the cause of high processor utilization by network devices.
The method according to Claim 16 or 17, wherein the anomaly detection model outputs an anomaly value, the method further comprising: determining, by the one or more processors, that the anomaly value output by the anomaly detection model is greater than an anomaly value threshold; and, based on the determination that the anomaly value is greater than the anomaly value threshold, determining, by the one or more processors, that the high processor utilization of the given network device is caused by the anomalous behavior of the given network device.
The method according to one of Claims 11 to 18, wherein generating the corrective action further includes automatically terminating, by the one or more processors, the one or more processes identified as the cause of the anomalous behavior of the given network device.
A non-transitory computer-readable storage medium containing instructions that, when executed by one or more processors of a network management system, cause the one or more processors to: collect processor utilization statistics of one or more network devices; determine, based on the processor utilization statistics, aggregated processor utilization statistics over a time window for a given network device of the one or more network devices; based on an aggregated total processor utilization for the given network device exceeding a baseline threshold, analyze the aggregated processor utilization per process for the given network device to identify one or more processes as the cause of anomalous behavior of the given network device; and generate a corrective action to address the cause.
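
As an illustration only, the anomaly-value check and the whitelist-gated termination recited in the claims above might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the action labels, the process names, and in particular `toy_model` (a hand-written stand-in for the trained anomaly detection model) are all hypothetical. The fallback of recommending an action to an administrator when a process is not whitelisted follows the summary in the description.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical model interface: takes the traffic routed through the device in
# the window and the per-process average CPU, returns (anomaly value, suspects).
Model = Callable[[int, Dict[str, float]], Tuple[float, List[str]]]


def toy_model(total_traffic_bytes: int,
              avg_per_process: Dict[str, float]) -> Tuple[float, List[str]]:
    """Stand-in for a trained model (assumption, for illustration only).

    High per-process CPU is treated as more suspicious when little traffic
    was routed through the device during the window.
    """
    suspects = [name for name, pct in avg_per_process.items() if pct > 50.0]
    top = max(avg_per_process.values(), default=0.0)
    score = top / 100.0 if total_traffic_bytes < 1_000_000 else top / 200.0
    return score, suspects


def plan_corrective_actions(avg_per_process: Dict[str, float],
                            total_traffic_bytes: int,
                            model: Model,
                            anomaly_threshold: float,
                            terminate_allowlist: Set[str]) -> List[Tuple[str, str]]:
    """Return (action, process) pairs, or an empty list if no anomaly is found."""
    score, suspects = model(total_traffic_bytes, avg_per_process)
    if score <= anomaly_threshold:
        # The anomaly value does not exceed the threshold: take no action.
        return []
    actions = []
    for name in suspects:
        # Only whitelisted processes are terminated automatically; everything
        # else is surfaced as a recommendation to an administrator.
        kind = "terminate" if name in terminate_allowlist else "recommend_to_admin"
        actions.append((kind, name))
    return actions
```

Passing the model in as a parameter keeps the threshold and whitelist logic separate from whatever model is actually deployed.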

Description

This application claims priority to US patent application no. 18/774,745, filed on July 16, 2024, which claims priority to Indian provisional patent application no. 202341056781, filed on August 24, 2023, the entire contents of which are hereby incorporated by reference.

TECHNICAL FIELD

The disclosure relates generally to computer networks and specifically to monitoring and troubleshooting in computer networks.

BACKGROUND

Commercial premises or sites, such as offices, hospitals, airports, stadiums, or retail stores, frequently install complex wireless network systems, including a network of wireless access points (APs), to provide wireless network services to one or more wireless client devices (or simply "clients"). APs are physical electronic devices that allow other devices to connect wirelessly to a wired network via various wireless network protocols and technologies, such as wireless local area network protocols that conform to one or more of the IEEE 802.11 standards (i.e., "WiFi"), Bluetooth/Bluetooth Low Energy (BLE), mesh network protocols like ZigBee, or other wireless network technologies.

Many different types of wireless client devices, such as laptops, smartphones, tablets, wearables, household appliances, and Internet of Things (IoT) devices, have wireless communication technology and can be configured to connect to a wireless access point, when within range of a compatible access point, to access a wired network. For a client device running a cloud-based application, such as a Voice over Internet Protocol (VoIP), streaming video, gaming, or video conferencing application, data is exchanged during an application session between the client device and one or more access points and one or more wired network devices, such as switches, routers, and/or gateways, to reach the cloud-based application server.

SUMMARY

In general, this disclosure describes techniques for detecting high processor utilization, such as high central processing unit (CPU) utilization, on network devices within a network and for remediating the detected high processor utilization on those network devices. High processor utilization can impair the routing efficiency of a network device. For example, high processor utilization can delay or prevent the processor's execution of routing system processes. If a network device's processor delays or fails to execute routing system processes, the network device, as well as other network devices directly connected to it, may react as if a network problem existed, causing a failover or even a catastrophic failure of a network site.

The network devices in a network can include switches, routers, gateways, or other suitable network devices capable of sending and receiving network traffic. A network can contain tens of thousands of network devices, and at any given time, hundreds or thousands of these devices may experience problems routing network traffic. Therefore, it can be time-consuming or even impractical for network administrators to manually determine which of the network devices experiencing routing problems are suffering from high processor utilization and to manually take remedial action to resolve this high utilization.

According to aspects of this disclosure, a cloud-based network management system (NMS) can monitor the processor utilization statistics of network devices in a network, including the processor utilization statistics of processes running on those network devices, to detect high processor utilization on one or more network devices.
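
As an illustration only, the detection step described here might be sketched as follows: the NMS aggregates the collected samples over a time window, counts how often the total utilization exceeds a high-utilization threshold, and compares that count with a frequency threshold, as recited in the claims. This is a minimal sketch; the `Sample` record, the function names, and the default values of 90 percent and 3 instances are assumptions chosen for the example and are not taken from the disclosure.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Sample:
    """One hypothetical utilization report from a network device."""
    timestamp: float               # seconds, any consistent clock
    total_cpu: float               # total processor utilization, percent
    per_process: Dict[str, float]  # process name -> utilization, percent


def aggregate_window(samples: List[Sample],
                     window_start: float,
                     window_end: float,
                     high_cpu_threshold: float = 90.0) -> Optional[dict]:
    """Aggregate the samples that fall inside [window_start, window_end)."""
    in_window = [s for s in samples if window_start <= s.timestamp < window_end]
    if not in_window:
        return None
    # Number of instances in which total utilization exceeds the high threshold.
    high_count = sum(1 for s in in_window if s.total_cpu > high_cpu_threshold)
    # Collect per-process readings; a process is averaged only over the
    # samples in which it was actually reported.
    readings = defaultdict(list)
    for s in in_window:
        for name, pct in s.per_process.items():
            readings[name].append(pct)
    return {
        "high_count": high_count,
        "avg_total": mean(s.total_cpu for s in in_window),
        "avg_per_process": {name: mean(v) for name, v in readings.items()},
    }


def exceeds_baseline(agg: dict, frequency_threshold: int = 3) -> bool:
    """True if the device was above the high threshold too often in the window."""
    return agg["high_count"] > frequency_threshold
```

Only devices for which `exceeds_baseline` returns true would proceed to the per-process analysis, which keeps the more expensive analysis limited to a small subset of a large network.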
The NMS can use the processor utilization statistics collected for each network device exhibiting high processor utilization to determine whether the high processor utilization is caused by anomalous behavior, such as high processor utilization by one or more processes running on the processor. The NMS can identify or trigger one or more remediation actions to address the anomalous behavior on each of the one or more network devices experiencing high CPU utilization. Such remediation actions can be assigned based on a root cause analysis of the processes causing the anomalous behavior. For example, the NMS can automatically terminate one or more processes running on the processor that are the cause of the high CPU utilization. In some cases, the NMS can also recommend one or more remediation actions to a network administrator to address the anomalous behavior causing the high CPU utilization. In some cases, severity ratings are also assigned to the remediation actions based on the duration and extent of the high CPU utilization.

The techniques of this disclosure offer one or more technical advantages and practical applications. These techniques can enable a cloud-based NMS to systematically detect high processor utilization of network devices within a network, which may be caused by ano