US-20260127058-A1 - IDENTIFYING AND REMEDIATING OVERHEATING DEVICES
Abstract
This disclosure describes techniques for intelligently detecting overheating devices in a network or data center and taking actions to address such overheating devices. This disclosure also describes evaluating heat dissipation information associated with components of devices in a network, making predictions about network disruptions based on the evaluation of the heat dissipation information, and taking actions to address, mitigate, or prevent such network disruptions. In one example, this disclosure describes a method that includes collecting, by a computing system, information about thermal metrics for a plurality of network devices; identifying, by the computing system and based on the information about the thermal metrics, a specific network device that changes temperature quickly; and taking action, by the computing system, to address effects of overheating associated with the specific network device.
Inventors
- Ganesh Byagoti Matad Sunkada
- Thayumanavan Sridhar
- Raja Kommula
- Rajendra Shivaram Yavatkar
Assignees
- JUNIPER NETWORKS, INC.
Dates
- Publication Date: 20260507
- Application Date: 20250929
- Priority Date: 20241108
Claims (20)
- 1. A computing system comprising processing circuitry and storage media, wherein the processing circuitry has access to the storage media and is configured to: collect information about thermal metrics for a plurality of network devices; identify, based on the information about thermal metrics, a specific network device that is at risk of overheating; and take action to address effects of overheating associated with the specific network device.
- 2. The computing system of claim 1, wherein to collect information about thermal metrics, the processing circuitry is further configured to: collect information about heat dissipation associated with each of the plurality of network devices.
- 3. The computing system of claim 2, wherein each of the plurality of network devices includes a plurality of components, and wherein to collect information about heat dissipation associated with each of the plurality of network devices, the processing circuitry is further configured to: collect, for each of the network devices, information about heat dissipation across the plurality of components included within each network device.
- 4. The computing system of claim 3, wherein to identify the specific network device, the processing circuitry is further configured to: assess, based on the information about heat dissipation, cooling efficiency of at least some of the plurality of components within each network device of the plurality of network devices.
- 5. The computing system of claim 4, wherein to identify the specific network device, the processing circuitry is further configured to: determine, based on the assessment, that the specific network device has a component at risk of failure.
- 6. The computing system of claim 1, wherein to collect information about thermal metrics, the processing circuitry is further configured to: collect temperature data from sensors associated with each of the plurality of network devices.
- 7. The computing system of claim 6, wherein each of the plurality of network devices has a chassis, and wherein to collect temperature data from the sensors, the processing circuitry is further configured to: collect temperature data from sensors placed at key locations on the chassis of each of the plurality of network devices.
- 8. The computing system of claim 1, wherein to collect the information about thermal metrics, the processing circuitry is further configured to: store the information in a time-series data store; and enable periodic time-series analysis, based on the stored information, of temperature metrics for each of the plurality of network devices.
- 9. The computing system of claim 1, wherein to identify a specific network device that is at risk of overheating, the processing circuitry is further configured to: identify a specific network device that overheats quickly.
- 10. The computing system of claim 1, wherein to take action to address the effects of overheating associated with the specific network device, the processing circuitry is further configured to: generate an alert providing information about overheating associated with the specific network device; and enable an administrator to take action.
- 11. The computing system of claim 10, wherein to generate the alert providing information, the processing circuitry is further configured to: include information recommending a rearrangement in which the specific network device is relocated to a location with better air circulation.
- 12. The computing system of claim 1, wherein to take action to address the effects of overheating associated with the specific network device, the processing circuitry is further configured to: reallocate a workload by removing the workload from the specific network device.
- 13. The computing system of claim 2, wherein to collect the information about heat dissipation associated with each of the plurality of network devices, the processing circuitry is further configured to: store time series data associated with heat dissipation metrics.
- 14. The computing system of claim 13, wherein to identify the specific network device that shows signs of overheating, the processing circuitry is further configured to: train a machine learning model, based on at least some of the time series data, to predict heat dissipation patterns for components within network devices; apply the machine learning model to predict that the specific network device has a component at risk of failure.
- 15. The computing system of claim 1, wherein to take action to address the effects of overheating associated with the specific network device, the processing circuitry is further configured to: send control signals to another system, instructing the other system to perform an operation to address the effects of overheating associated with the specific network device.
- 16. A method comprising: collecting, by a computing system, information about thermal metrics for a plurality of network devices; identifying, by the computing system and based on the information about thermal metrics, a specific network device that is at risk of overheating; and taking action, by the computing system, to address effects of overheating associated with the specific network device.
- 17. The method of claim 16, wherein collecting information about thermal metrics includes: collecting information about heat dissipation associated with each of the plurality of network devices.
- 18. The method of claim 17, wherein each of the plurality of network devices includes a plurality of components, and wherein collecting information about heat dissipation associated with each of the plurality of network devices includes: collecting, for each of the network devices, information about heat dissipation across the plurality of components included within each network device.
- 19. The method of claim 18, wherein identifying the specific network device includes: assessing, based on the information about heat dissipation, cooling efficiency of at least some of the components of the plurality of network devices.
- 20. Non-transitory computer-readable media comprising instructions that, when executed, cause processing circuitry of a computing system to: collect information about thermal metrics for a plurality of network devices; identify, based on the information about thermal metrics, a specific network device that is at risk of overheating; and take action to address effects of overheating associated with the specific network device.
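For illustration only, the collect, identify, and act sequence recited in claims 1 and 16, with the "overheats quickly" refinement of claim 9, might be sketched as follows. This is a minimal sketch under stated assumptions and not the claimed implementation: the function names, the least-squares slope used as the measure of how quickly a device heats, the 2.0 degrees C per minute limit, and the alert wording are all hypothetical and do not come from the disclosure.

```python
from typing import Dict, List, Tuple

# One temperature sample: (timestamp in seconds, temperature in degrees C).
Sample = Tuple[float, float]

# Hypothetical limit chosen for this sketch; the disclosure does not give a value.
RISE_LIMIT_C_PER_MIN = 2.0


def rise_rate_c_per_min(samples: List[Sample]) -> float:
    """Least-squares slope of temperature over time, in degrees C per minute."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return 0.0  # all samples share one timestamp; no slope can be computed
    slope_per_s = sum((t - mean_t) * (c - mean_c) for t, c in samples) / denom
    return slope_per_s * 60.0


def find_fast_heating(
    metrics: Dict[str, List[Sample]],
    limit: float = RISE_LIMIT_C_PER_MIN,
) -> List[str]:
    """Return, in sorted order, the devices whose temperature rises faster than the limit."""
    return sorted(d for d, s in metrics.items() if rise_rate_c_per_min(s) > limit)


def take_action(device: str) -> str:
    """Stand-in for the 'take action' step: build an alert for an administrator."""
    return f"ALERT: {device} is heating quickly; consider moving its workload"


if __name__ == "__main__":
    # Hypothetical readings collected from two devices over two minutes.
    collected = {
        "leaf-1": [(0.0, 40.0), (60.0, 40.5), (120.0, 41.0)],
        "leaf-2": [(0.0, 40.0), (60.0, 44.0), (120.0, 48.0)],
    }
    for name in find_fast_heating(collected):
        print(take_action(name))
```

A fitted slope is used here rather than a difference between the last two samples so that a single noisy reading is less likely to trigger an alert; a deployed system could substitute any other rate-of-change measure.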
Description
This application claims the benefit of India Provisional Patent Application No. 202441086013, which was filed on Nov. 7, 2024, the entire content of which is incorporated herein by reference.
TECHNICAL FIELD
This disclosure relates to computer networks and, more specifically, to managing heat generated in a data center.
BACKGROUND
Excessive heat can have significant detrimental effects on data centers. Elevated temperatures can lead to hardware failures, resulting in system outages and potential data loss. Additionally, high temperatures can compromise the performance of servers, causing slowdowns that affect the overall efficiency of the data center. Prolonged exposure to heat can accelerate the degradation of electronic components, leading to increased maintenance costs and the need for more frequent replacements. In general, inadequate thermal management poses serious risks to the reliability and operational continuity of data centers.
SUMMARY
This disclosure describes techniques for intelligently detecting overheating devices in a network or data center and taking actions to address such overheating devices. This disclosure also describes evaluating heat dissipation information associated with components of devices in a network, making predictions about network disruptions based on the evaluation of the heat dissipation information, and taking actions to address, mitigate, or prevent such network disruptions. In some examples, this disclosure describes operations performed by a computing system in accordance with one or more aspects of this disclosure.
In one specific example, this disclosure describes a method comprising: collecting, by a computing system, information about thermal metrics for a plurality of network devices; identifying, by the computing system and based on the information about the thermal metrics, a specific network device that changes temperature quickly; and taking action, by the computing system, to address effects of overheating associated with the specific network device.
In another example, this disclosure describes a method comprising: collecting, by a computing system, information about heat dissipation associated with each of a plurality of network devices, wherein each of the network devices includes a plurality of components, and wherein collecting the information about heat dissipation includes collecting, for each of the network devices, information about heat dissipation across the plurality of components included within each network device; assessing, by the computing system and based on the information about heat dissipation, cooling efficiency of at least some of the components of the plurality of network devices; and identifying, by the computing system and based on the assessment, a specific network device having a component with an increased risk of failure.
In another example, this disclosure describes a system comprising a storage system and processing circuitry having access to the storage system, wherein the processing circuitry is configured to carry out operations described herein.
In yet another example, this disclosure describes a computer-readable storage medium comprising instructions that, when executed, configure processing circuitry of a computing system to carry out operations described herein.
This Summary is intended to provide a brief overview of some of the subject matter described in this document. Accordingly, the above-described features are merely examples and should not be construed to narrow the scope or spirit of the subject matter described herein.
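As a purely illustrative aid, the heat dissipation example in this Summary (collect per-component information, assess cooling efficiency, identify a device having a component at increased risk) might be sketched as follows. This is a sketch under stated assumptions rather than the disclosed method: the Summary does not define how cooling efficiency is computed, so the proxy used here (temperature rise above inlet air per watt dissipated), the 0.5 degrees C per watt limit, and every name in the code are hypothetical.

```python
from typing import Dict, List, NamedTuple


class ComponentReading(NamedTuple):
    """One per-component reading; the field set is an assumption of this sketch."""
    device: str
    component: str
    power_w: float  # heat dissipated by the component, in watts
    temp_c: float   # component temperature, in degrees C
    inlet_c: float  # inlet air temperature at the device, in degrees C


# Hypothetical limit: a component running hotter than this per watt is treated
# as poorly cooled. The disclosure does not give a value.
MAX_C_PER_WATT = 0.5


def degrees_per_watt(r: ComponentReading) -> float:
    """Assumed cooling-efficiency proxy: temperature rise over inlet per watt."""
    if r.power_w <= 0:
        return 0.0  # idle or unpowered component; nothing to assess
    return (r.temp_c - r.inlet_c) / r.power_w


def devices_at_risk(
    readings: List[ComponentReading],
    limit: float = MAX_C_PER_WATT,
) -> Dict[str, List[str]]:
    """Map each device to its components whose proxy value exceeds the limit."""
    at_risk: Dict[str, List[str]] = {}
    for r in readings:
        if degrees_per_watt(r) > limit:
            at_risk.setdefault(r.device, []).append(r.component)
    return at_risk


if __name__ == "__main__":
    # Hypothetical readings for two devices.
    sample = [
        ComponentReading("spine-1", "asic", 200.0, 70.0, 25.0),
        ComponentReading("spine-1", "psu", 50.0, 60.0, 25.0),
        ComponentReading("leaf-3", "asic", 150.0, 55.0, 25.0),
    ]
    print(devices_at_risk(sample))
```

Normalizing by dissipated power is one way to separate a component that is hot because it is busy from one that is hot because it is poorly cooled; other definitions of cooling efficiency would fit the same structure.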
Other features, objects, and advantages of the disclosure will be apparent from the description and drawings, and from the claims.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 is a block diagram illustrating an example system including a data center in which examples of the techniques described herein may be implemented.
FIG. 2A, FIG. 2B, and FIG. 2C are conceptual diagrams of an arrangement of devices within racks in a data center, in accordance with one or more aspects of the present disclosure.
FIG. 3A and FIG. 3B are conceptual diagrams of devices within a rack in a data center, where heat dissipation information is collected from the devices, in accordance with one or more aspects of the present disclosure.
FIG. 4 is a flow diagram illustrating operations performed by an example controller in accordance with one or more aspects of the present disclosure.
FIG. 5 is a flow diagram illustrating operations performed by an example computing system in accordance with one or more aspects of the present disclosure.
DETAILED DESCRIPTION
FIG. 1 is a block diagram illustrating an example system 8 including a data center in which examples of the techniques described herein may be implemented. In general, data center 100 provides an operating environment for applications and services for one or more customer sites 11 (illustrated as “customers 11”) having one or more customer networks coupled to the data center