US-20260127059-A1 - SYSTEM AND METHOD FOR MONITORING AND PREDICTING HEALTH CONDITION OF MULTIPLE COMPONENTS

Abstract

A system and method for monitoring and predicting a health condition for a computing system having multiple components are provided. The method includes: acquiring a plurality of time sequences of data (e.g., telemetry data) associated with a plurality of components of the computing system, and processing the plurality of time sequences of data to generate one or more predictive outputs predicting whether any of the plurality of components will be abnormal (e.g., to malfunction). The method also includes determining one or more mitigation actions in response to determining that the one or more predictive outputs include a first predictive output indicating that a first component, of the plurality of components, will be in an abnormal condition (e.g., malfunction at a predicted time). The method further includes performing the one or more mitigation actions (e.g., prior to the predicted time). The first component may be a hardware component or other component.
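
The flow summarized in the abstract (predict, determine mitigation actions, perform them before the predicted time) can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the `Prediction` record, the `MITIGATIONS` table, the action identifiers, and the `perform` callback are all hypothetical names. The action labels only paraphrase the mitigation actions recited in claims 8 to 11, which the claims present as separate alternatives.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# Hypothetical action identifiers paraphrasing claims 8-11. The claims recite
# these as alternative embodiments; grouping them per type is an assumption.
MITIGATIONS: Dict[str, List[str]] = {
    "cpu": ["signal_high_priority_hardware_error", "flag_cpu_disabled_on_next_boot"],
    "storage": ["block_uefi_driver_load", "mark_disabled_in_acpi_table"],
    "pcie": ["block_uefi_driver_load", "mark_disabled_in_acpi_table"],
}


@dataclass(frozen=True)
class Prediction:
    """One predictive output for one component (hypothetical record layout)."""
    component_id: str
    component_type: str
    abnormal: bool
    predicted_time: float  # seconds on an arbitrary clock (assumption)


def mitigate(
    predictions: Iterable[Prediction],
    now: float,
    perform: Callable[[str, str], None],
) -> List[str]:
    """Perform mitigation actions for components predicted to become abnormal.

    Only predictions whose predicted time is still in the future are acted on,
    so every action is performed prior to the predicted time.
    """
    performed: List[str] = []
    for p in predictions:
        if not p.abnormal or p.predicted_time <= now:
            continue
        for action in MITIGATIONS.get(p.component_type, []):
            perform(p.component_id, action)
            performed.append(f"{p.component_id}:{action}")
    return performed
```

Passing the side effect in as `perform` keeps the selection logic testable without touching firmware or a BMC; a real system would replace it with platform-specific calls.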

Inventors

  • Chi Yuan Hsu

Assignees

  • Aivres Systems Inc.

Dates

Publication Date: 2026-05-07
Application Date: 2025-12-30

Claims (20)

  1. A method, comprising: acquiring a plurality of time sequences of data associated with a plurality of components of an electronic system; processing the plurality of time sequences of data to generate one or more predictive outputs indicating whether any of the plurality of components will be in an abnormal condition; determining, based on a first predictive output of the one or more predictive outputs, that at least a first component of the plurality of components will be in an abnormal condition at a predicted time; determining one or more mitigation actions to mitigate occurrence of the abnormal condition for the first component; and performing the one or more mitigation actions prior to the predicted time.
  2. The method of claim 1, wherein: the plurality of time sequences of data associated with the plurality of components of the electronic system comprises a first time sequence of data associated with a first component of the electronic system and a second time sequence of data associated with a second component of the electronic system, the first and second components being of different types, the first time sequence of data associated with the first component is sampled at a first sampling frequency, and the second time sequence of data associated with the second component is sampled at a second sampling frequency, the second sampling frequency being different from the first sampling frequency.
  3. The method of claim 1, wherein the plurality of components comprise one or more hardware components, the one or more hardware components comprising a central processing unit (CPU), a storage device, and/or a peripheral component interconnect express (PCIe) device, and wherein acquiring the plurality of time sequences of data associated with the plurality of components comprises: acquiring one or more time sequences of CPU telemetry data associated with the CPU, acquiring one or more time sequences of storage telemetry data associated with the storage device, and/or acquiring one or more time sequences of PCIe telemetry data associated with the PCIe device.
  4. The method of claim 3, wherein acquiring one or more time sequences of CPU telemetry data comprises: acquiring a first time sequence of CPU telemetry data from a machine check architecture (MCA) bank of the CPU that comprises one or more model-specific registers (MSRs), the first time sequence of CPU telemetry data comprising a series of corrected errors associated with the CPU or a series of uncorrected errors associated with the CPU, acquiring a second time sequence of CPU telemetry data from one or more error count registers of the CPU, the second time sequence of CPU telemetry data comprising a series of error counts associated with the CPU, the series of error counts comprising a first series of a total number of errors corrected for a memory controller of the CPU that couples the CPU with a memory and/or a second series of a total number of errors corrected for a QuickPath Interconnect (QPI) or an Ultra Path Interconnect (UPI) that couples the CPU with an additional CPU, acquiring a third time sequence of CPU telemetry data associated with the CPU, the third time sequence of CPU telemetry data being collected using a thermal sensor and comprising a series of temperature values associated with a temperature of the CPU, and/or acquiring a fourth time sequence of CPU telemetry data associated with the CPU from the one or more MSRs or from power management firmware of the CPU, the fourth time sequence of CPU telemetry data comprising a series of current operating frequencies, voltages, or power consumptions, associated with the CPU, wherein the first time sequence, the second time sequence, the third time sequence, and the fourth time sequence, of CPU telemetry data, correspond to a same time period or correspond to different time periods.
  5. The method of claim 3, wherein acquiring one or more time sequences of storage telemetry data comprises: acquiring a first time sequence of storage telemetry data reflecting a variation in a percentage of available space for the storage device, acquiring a second time sequence of storage telemetry data reflecting a variation in a total number of media and data integrity errors detected for the storage device, acquiring a third time sequence of storage telemetry data reflecting a variation in a percentage of a life used for the data storage, acquiring a fourth time sequence of storage telemetry data reflecting a variation in a temperature of the data storage, and/or acquiring a fifth time sequence of storage telemetry data reflecting a variation in a critical warning for a state of the data storage, wherein the first time sequence, the second time sequence, the third time sequence, the fourth time sequence, and the fifth time sequence, of storage telemetry data, correspond to a same time period or correspond to different time periods.
  6. The method of claim 3, wherein acquiring one or more time sequences of PCIe telemetry data comprises: acquiring a first time sequence of PCIe telemetry data reflecting a series of corrected errors associated with the PCIe device, acquiring a second time sequence of PCIe telemetry data reflecting a series of uncorrected errors associated with the PCIe device, acquiring a third time sequence of PCIe telemetry data reflecting a variation in a link speed of a PCIe link connected to the PCIe device, and/or acquiring a fourth time sequence of PCIe telemetry data reflecting a variation in a bandwidth of a PCIe link connected to the PCIe device, wherein the first time sequence, the second time sequence, the third time sequence, and the fourth time sequence, of PCIe telemetry data, correspond to a same time period or correspond to different time periods.
  7. The method of claim 1, wherein processing the plurality of time sequences of telemetry data, to generate one or more predictive outputs comprises: processing a respective time sequence, of the plurality of time sequences of telemetry data, using an exponentially weighted moving average (EWMA) approach or a simple moving average (SMA) approach, to determine a trend of telemetry data variation for the respective time sequence.
  8. The method of claim 1, wherein the first component is a CPU, and wherein determining one or more mitigation actions to mitigate the predicted abnormal condition of the first component comprises: generating a high priority hardware error signal to a BMC or to an operating system, and transmitting the high priority hardware error signal to the BMC or to the operating system.
  9. The method of claim 1, wherein the first component is a CPU, and wherein determining one or more mitigation actions to mitigate the predicted abnormal condition of the first component comprises: flagging the CPU such that operation of the CPU is prohibited next time the electronic system is initiated.
  10. The method of claim 1, wherein the first component is a storage device or a PCIe device, and wherein determining one or more mitigation actions to mitigate the predicted abnormal condition of the first component comprises: prohibiting loading of a UEFI driver for the storage device or the PCIe device.
  11. The method of claim 1, wherein the first component is a storage device or a PCIe device, and wherein determining one or more mitigation actions to mitigate the predicted abnormal condition of the first component comprises: flagging the storage device or the PCIe device as “disabled” or “non present” in an ACPI table, to prevent an operating system from attempting to access the storage device or the PCIe device.
  12. The method of claim 1, wherein the plurality of components comprise a power supply unit (PSU), a system fan, or a voltage regulator module (VRM).
  13. The method of claim 12, wherein processing the plurality of time sequences of data, to generate one or more predictive outputs comprises: processing the plurality of time sequences of data, using one or more machine learning (ML) models, to generate the one or more predictive outputs.
  14. The method of claim 12, wherein acquiring the plurality of time sequences of data comprises: acquiring one or more time sequences of PSU telemetry data associated with the PSU, acquiring one or more time sequences of fan telemetry data associated with a fan, and/or acquiring one or more time sequences of VRM telemetry data associated with the VRM.
  15. The method of claim 14, wherein acquiring the one or more time sequences of PSU telemetry data comprises: acquiring a first time sequence of PSU telemetry data reflecting a variation in a value for an electrical parameter of the PSU, the electrical parameter being an input voltage, an output voltage, a current, or a power, acquiring a second time sequence of PSU telemetry data reflecting a variation in a value for a temperature of the PSU, acquiring a third time sequence of PSU telemetry data reflecting a variation in a status of a power of the PSU, acquiring a fourth time sequence of PSU telemetry data reflecting a variation in a malfunction indicator of the PSU, and/or acquiring a fifth time sequence of PSU telemetry data reflecting a variation in warnings of the PSU, wherein the first time sequence, the second time sequence, the third time sequence, the fourth time sequence, and the fifth time sequence, of PSU telemetry data, correspond to a same time period or correspond to different time periods.
  16. The method of claim 14, wherein acquiring the one or more time sequences of fan telemetry data comprises: acquiring a first time sequence of fan telemetry data reflecting a variation in a Revolutions Per Minute (RPM) of the fan, acquiring a second time sequence of fan telemetry data reflecting a variation in duty cycle associated with a Pulse Width Modulation (PWM) controlling signal that is transmitted to a fan, acquiring a third time sequence of fan telemetry data reflecting a variation in an operating status of the fan, acquiring a fourth time sequence of fan telemetry data reflecting a variation in a current of the fan, and/or acquiring a fifth time sequence of fan telemetry data reflecting a variation in a power of the fan, wherein the first time sequence, the second time sequence, the third time sequence, the fourth time sequence, and the fifth time sequence, of fan telemetry data, correspond to a same time period or correspond to different time periods.
  17. The method of claim 14, wherein acquiring the one or more time sequences of VRM telemetry data comprises: acquiring a first time sequence of VRM telemetry data from a temperature sensor within a VRM region in proximity to one or more core hardware components of the electronic system, the first time sequence of VRM telemetry data reflecting a variation in a value of a temperature associated with the VRM, acquiring a second time sequence of VRM telemetry data reflecting a variation in an output voltage associated with the VRM, acquiring a third time sequence of VRM telemetry data reflecting a variation in an output current associated with the VRM, and/or acquiring a fourth time sequence of VRM telemetry data reflecting a variation in a phase health condition associated with the VRM, wherein the first time sequence, the second time sequence, the third time sequence, and the fourth time sequence, of VRM telemetry data, correspond to a same time period or correspond to different time periods.
  18. A method, comprising: acquiring a plurality of time sequences of data associated with a plurality of components of a computing system; processing the plurality of time sequences of data, to generate one or more predictive outputs indicating whether any of the plurality of components will be in an abnormal condition; determining, based on a first predictive output of the generated one or more predictive outputs, that a first component will be in an abnormal condition; determining one or more mitigation actions to prevent the first component from entering the abnormal condition; and performing the one or more mitigation actions.
  19. The method of claim 18, further comprising: generating an entry in a database to record the first predictive output and/or the one or more mitigation actions determined for the first component.
  20. A system, comprising: at least one processor, and memory storing instructions that, when executed by the at least one processor, cause the at least one processor to: acquire a plurality of time sequences of data associated with a plurality of hardware components of a server system; process the plurality of time sequences of telemetry data, to generate one or more predictive outputs predicting whether any of the plurality of hardware components is to be in an abnormal condition; determine, based on a first predictive output of the generated one or more predictive outputs, that a first hardware component will be in an abnormal condition; determine one or more mitigation actions to prevent the first hardware component from being in the abnormal condition; and perform the one or more mitigation actions.
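
Claim 7 names two smoothing approaches, EWMA and SMA, for determining a trend in a telemetry sequence. The sketch below shows one common formulation of each. The smoothing factor `alpha`, the window size, and the `trend` labelling rule are assumptions for illustration; the claims do not specify them.

```python
from collections import deque
from typing import Deque, Iterable, List


def sma(values: Iterable[float], window: int) -> List[float]:
    """Simple moving average over the most recent `window` samples.

    Early outputs average over fewer samples until the window fills.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    buf: Deque[float] = deque(maxlen=window)
    out: List[float] = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out


def ewma(values: Iterable[float], alpha: float) -> List[float]:
    """Exponentially weighted moving average.

    s_0 = x_0 and s_t = alpha * x_t + (1 - alpha) * s_(t-1).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    out: List[float] = []
    for v in values:
        out.append(v if not out else alpha * v + (1.0 - alpha) * out[-1])
    return out


def trend(smoothed: List[float]) -> str:
    """Hypothetical trend label: compare the last smoothed value to the first."""
    if len(smoothed) < 2 or smoothed[-1] == smoothed[0]:
        return "flat"
    return "rising" if smoothed[-1] > smoothed[0] else "falling"
```

For example, `trend(ewma(cpu_temps, alpha=0.3))` would label a hypothetical list of CPU temperature samples as rising, falling, or flat. A larger `alpha` makes the EWMA react faster to recent samples, while a larger SMA window smooths more at the cost of lag.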

Description

TECHNICAL FIELD

The present disclosure relates generally to systems and methods for monitoring or predicting health condition(s) of multiple hardware components.

BACKGROUND

With the rapid development in technologies such as cloud computing, big data, and artificial intelligence, the scale and complexity of data centers continue to increase, and there is a high demand on computing systems, such as server system(s). The reliability, availability, and serviceability (RAS) of a server system are key factors for evaluating a performance of the server system. Hardware malfunction is a typical reason for outage or service interruption of a computing system, e.g., server system. Traditional approaches to handling server malfunction are usually passive, namely, to perform diagnosis and repair after the occurrence of a malfunction. These approaches are not only time-consuming, but also possibly result in severe data loss and economic damages. As a result, there is a need to develop methods and systems for proactively predicting and/or preventing occurrence of hardware malfunction.

SUMMARY

Techniques are described herein for monitoring and/or predicting health condition(s) for a computing system (or an electronic system) that includes multiple components (e.g., CPU, storage device, fan, power supply unit, etc.). According to one aspect of the present disclosure, a method is provided.
In various embodiments, the method includes: acquiring a plurality of time sequences of data (e.g., telemetry data) associated with a plurality of hardware components of an electronic system; processing the plurality of time sequences of data (e.g., telemetry data) to generate one or more predictive outputs indicating whether any of the plurality of hardware components will be in an abnormal condition (e.g., malfunction); determining or predicting, based on a first predictive output of the one or more predictive outputs, that at least a first hardware component of the plurality of hardware components will be in an abnormal condition (e.g., malfunction) at a predicted time; determining one or more mitigation actions to mitigate the abnormal condition (e.g., malfunction) of the first hardware component; and performing the one or more mitigation actions prior to the predicted time. In some embodiments, the plurality of hardware components includes one or more of: a central processing unit (CPU), a storage device, and a peripheral component interconnect express (PCIe) device. In some embodiments, acquiring the plurality of time sequences of telemetry data associated with the plurality of hardware components comprises: acquiring one or more time sequences of CPU telemetry data associated with the CPU; acquiring one or more time sequences of storage telemetry data associated with the storage device; and/or acquiring one or more time sequences of PCIe telemetry data associated with the PCIe device. In some embodiments, acquiring one or more time sequences of CPU telemetry data includes: acquiring a first time sequence of CPU telemetry data from a machine check architecture (MCA) bank of the CPU that comprises one or more model-specific registers (MSRs), where the first time sequence of CPU telemetry data includes a series of corrected errors associated with the CPU or a series of uncorrected errors associated with the CPU.
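
The method above depends on a "predicted time" at which a component is expected to become abnormal. The disclosure text in this section does not say how that time is computed, so the following is only one plausible sketch: fit a straight line to a time-stamped telemetry sequence by least squares and extrapolate to the time at which it reaches a threshold. The function names, the threshold, and the lead time are all hypothetical.

```python
from typing import Optional, Sequence


def predict_crossing_time(
    times: Sequence[float], values: Sequence[float], threshold: float
) -> Optional[float]:
    """Estimate when a rising telemetry value reaches `threshold`.

    Fits value = slope * t + intercept by ordinary least squares and returns
    the time at which the fitted line equals `threshold`. Returns None when
    the fitted trend is flat or falling, or when all timestamps are identical.
    """
    n = len(times)
    if n != len(values) or n < 2:
        raise ValueError("need two or more (time, value) samples of equal length")
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    var_t = sum((t - t_mean) ** 2 for t in times)
    if var_t == 0:
        return None
    cov = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = cov / var_t
    if slope <= 0:
        return None
    intercept = v_mean - slope * t_mean
    return (threshold - intercept) / slope


def mitigation_deadline(predicted_time: float, lead_time: float) -> float:
    """Latest time to act so that mitigation happens before the predicted time."""
    return predicted_time - lead_time
```

With samples taken at 0, 10, 20, and 30 seconds reading 60, 65, 70, and 75 degrees and a hypothetical threshold of 90 degrees, the fitted slope is 0.5 degrees per second and the estimated crossing is at 60 seconds. Because the fit uses explicit timestamps, sequences sampled at different frequencies can each be handled independently.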
In some embodiments, additionally or alternatively, acquiring one or more time sequences of CPU telemetry data includes: acquiring a second time sequence of CPU telemetry data from one or more error count registers of the CPU, where the second time sequence of CPU telemetry data includes a series of error counts associated with the CPU. In some embodiments, the series of error counts includes a first series of a total number of errors corrected for a memory controller of the CPU that couples the CPU with a memory and/or a second series of a total number of errors corrected for a QuickPath Interconnect (QPI) or an Ultra Path Interconnect (UPI) that couples the CPU with an additional CPU. In some embodiments, additionally or alternatively, acquiring one or more time sequences of CPU telemetry data includes: acquiring a third time sequence of CPU telemetry data associated with the CPU, where the third time sequence of CPU telemetry data is collected using a thermal sensor and comprises a series of temperature values associated with a temperature of the CPU. In some embodiments, additionally or alternatively, acquiring one or more time sequences of CPU telemetry data includes: acquiring a fourth time sequence of CPU telemetry data associated with the CPU from the one or more MSRs or from power management firmware of the CPU. In some embodiments, the fourth time sequence of CPU telemetry data includes a series of current operating frequencies, voltages, or power consumptions, associated with the CPU. In some embodiments, the first time sequence, the second time sequence, the third time sequence, and the fourth time sequence, of CPU