US-20260127064-A1 - BMC/BIOS RAS SYSTEM

Abstract

A BMC/BIOS RAS system includes a computing device including a processing system coupled to a BMC device and a computing component. A host processing device in the processing system identifies that the computing component has reached a first error threshold and, in response, notifies the BMC device. In response to being notified, the BMC device identifies and stores error telemetry data associated with at least one error that occurred in the computing component. One of the host processing device or the BMC device identifies that the computing component has reached a second error threshold and, in response, notifies a BIOS included in the processing system. In response to being notified, the BIOS identifies and logs that the computing component has reached the second error threshold.

Inventors

Wei Liu
Ching-Lung Chao

Assignees

DELL PRODUCTS L.P.

Dates

Publication Date: 20260507
Application Date: 20241104

Claims (20)

1. A Baseboard Management Controller (BMC)/Basic Input/Output System (BIOS) Reliability, Availability, and Serviceability (RAS) system, comprising: a computing device; a computing component included in the computing device; a Baseboard Management Controller (BMC) device included in the computing device; and a processing system that is included in the computing device and coupled to the computing component and the BMC device, wherein the processing system includes a host processing device that is configured to: identify that the computing component has reached a first error threshold; and notify, in response to identifying that the computing component has reached the first error threshold, the BMC device, wherein the BMC device is configured to: identify and store, in response to being notified, error telemetry data associated with at least one error that occurred in the computing component, and wherein one of the host processing device or the BMC device is configured to: identify that the computing component has reached a second error threshold; and notify, in response to identifying that the computing component has reached the second error threshold, a Basic Input/Output System (BIOS) that is included in the processing system, wherein the BIOS is configured to: identify and log, in response to being notified, that the computing component has reached the second error threshold.
2. The system of claim 1, wherein the first error threshold is different than the second error threshold.
3. The system of claim 2, wherein the first error threshold is one error, and wherein the second error threshold is a plurality of errors.
4. The system of claim 1, wherein the computing component is a memory device.
5. The system of claim 1, wherein the BIOS is configured to: transmit, in response to being notified, an error warning message to the host processing device, and wherein the host processing device is configured to: perform, in response to receiving the error warning message, an error threshold action.
6. The system of claim 1, wherein the error telemetry data identifies a location of the at least one error that occurred in the computing component.
7. An Information Handling System (IHS), comprising: a processing system including a host processing device and a Basic Input/Output System (BIOS); and a memory system that is coupled to the processing system and that includes instructions that, when executed by the host processing device, cause the host processing device to provide a host processing engine that is configured to: identify that a computing component that is coupled to the processing system has reached a first error threshold; notify, in response to identifying that the computing component has reached the first error threshold, a Baseboard Management Controller (BMC) device that is coupled to the processing system to cause the BMC device to identify and store error telemetry data associated with at least one error that occurred in the computing component; identify that the computing component has reached a second error threshold; and notify, in response to identifying that the computing component has reached the second error threshold, the BIOS, wherein memory system includes instructions that, when executed by the BIOS, cause the BIOS to provide a BIOS engine that is configured to: identify and log, in response to being notified, that the computing component has reached the second error threshold.
8. The IHS of claim 7, wherein the first error threshold is different than the second error threshold.
9. The IHS of claim 8, wherein the first error threshold is one error, and wherein the second error threshold is a plurality of errors.
10. The IHS of claim 7, wherein the computing component is a memory device.
11. The IHS of claim 7, wherein the BIOS engine is configured to: transmit, in response to being notified, an error warning message to the host processing engine, and wherein the host processing engine is configured to: perform, in response to receiving the error warning message, an error threshold action.
12. The IHS of claim 11, wherein the error threshold action includes at least one of: configuring a backup component for the computing component; and preventing, via an operating system provided by the host processing device, access to at least a portion of the computing component that is associated with the error.
13. The IHS of claim 7, wherein the error telemetry data identifies a location of the at least one error that occurred in the computing component.
14. A method for performing Reliability, Availability, and Serviceability (RAS) operations using a Baseboard Management Controller (BMC) and a Basic Input/Output System (BIOS) in a computing device, comprising: identifying, by a host processing device in a processing system, that a computing component that is coupled to the processing system has reached a first error threshold; notifying, by the host processing device in response to identifying that the computing component has reached the first error threshold, a Baseboard Management Controller (BMC) device that is coupled to the processing system; identifying and storing, by the BMC device, error telemetry data associated with at least one error that occurred in the computing component; identifying, by the one of the host processing device or the BMC device, that the computing component has reached a second error threshold; notifying, by one of the host processing device or the BMC device in response to identifying that the computing component has reached the second error threshold, the BIOS; and identifying and logging, by the BIOS in response to being notified, that the computing component has reached the second error threshold.
15. The method of claim 14, wherein the first error threshold is different than the second error threshold.
16. The method of claim 15, wherein the first error threshold is one error, and wherein the second error threshold is a plurality of errors.
17. The method of claim 14, wherein the computing component is a memory device.
18. The method of claim 14, further comprising: transmitting, by the BIOS in response to being notified, an error warning message to the host processing device; and performing, by the host processing device in response to receiving the error warning message, an error threshold action.
19. The method of claim 14, wherein the error threshold action includes at least one of: configuring a backup component for the computing component; and preventing, via an operating system provided by the host processing device, access to at least a portion of the computing component that is associated with the error.
20. The method of claim 14, wherein the error telemetry data identifies a location of the at least one error that occurred in the computing component.
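
For illustration only, and not as part of the claims, the two-threshold notification flow recited above can be modeled in a few lines of Python. This is a minimal sketch under stated assumptions: the class and method names (BmcDevice, Bios, HostProcessingDevice, report_error) are hypothetical, the second threshold value of 3 is an arbitrary stand-in for "a plurality of errors", and the choice to notify the BMC on every error at or above the first threshold while notifying the BIOS once is an interpretation rather than something the claims specify.

```python
from dataclasses import dataclass, field


@dataclass
class BmcDevice:
    """Hypothetical stand-in for the BMC: stores error telemetry when notified."""
    telemetry: list = field(default_factory=list)

    def notify(self, component: str, location: str) -> None:
        # Telemetry here is only the error location (cf. claims 6, 13, 20).
        self.telemetry.append({"component": component, "location": location})


@dataclass
class Bios:
    """Hypothetical stand-in for the BIOS: logs that the second threshold was reached."""
    log: list = field(default_factory=list)

    def notify(self, component: str, error_count: int) -> None:
        self.log.append(f"{component}: second error threshold reached ({error_count} errors)")


@dataclass
class HostProcessingDevice:
    """Hypothetical host-side counter that routes notifications by threshold."""
    bmc: BmcDevice
    bios: Bios
    first_threshold: int = 1   # one error, as in claims 3, 9, 16
    second_threshold: int = 3  # assumed value for "a plurality of errors"
    counts: dict = field(default_factory=dict)

    def report_error(self, component: str, location: str) -> None:
        count = self.counts.get(component, 0) + 1
        self.counts[component] = count
        if count >= self.first_threshold:
            # First threshold: the BMC is notified and stores telemetry.
            self.bmc.notify(component, location)
        if count == self.second_threshold:
            # Second threshold: the BIOS is notified once and logs the event.
            self.bios.notify(component, count)


host = HostProcessingDevice(bmc=BmcDevice(), bios=Bios())
for row in range(3):
    host.report_error("DIMM_A1", f"row {row}")
print(len(host.bmc.telemetry), len(host.bios.log))
```

In this sketch the BIOS is involved only when the second threshold is reached, while every counted error produces a telemetry record in the BMC.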

Description

BACKGROUND
The present disclosure relates generally to information handling systems, and more particularly to using a Baseboard Management Controller (BMC) device and a Basic Input/Output System (BIOS) to perform Reliability, Availability and Serviceability (RAS) operations for an information handling system.
As the value and use of information continues to increase, individuals and businesses seek additional ways to process and store information. One option available to users is information handling systems. An information handling system generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
Information handling systems such as, for example, server devices, networking devices (e.g., switch devices), storage systems, and/or other computing devices known in the art, often perform Reliability, Availability, and Serviceability (RAS) operations that generally provide for the monitoring of the operation of their components to identify errors and other issues with those components and generate logs for those errors and other issues, which may allow for the analysis of failures or other unavailability of those components, as well as the prediction or other determination of future unavailability of those components. For example, consider conventional RAS operations performed for a memory device in a server device. In the event a correctable error occurs in the memory device, that correctable error may be identified by the Central Processing Unit (CPU) in the server device and, in response, the CPU may generate a System Management Interrupt (SMI) to pass control of the server device to a Basic Input/Output System (BIOS) in the server device. In response to the SMI, the BIOS may identify the correctable error that occurred in the memory device, log the occurrence of that correctable error in a Baseboard Management Controller (BMC) in the server device, and in some situations transmit a corresponding error warning message to the CPU that may cause the CPU to perform an error action (e.g., performing memory mirroring or memory sparing), and/or transmit a corresponding error warning message to an operating system provided by the server device that may cause the operating system to perform an error action (e.g., performing page off-lining or Post Package Repair (PPR)).
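The conventional flow described above can be sketched as a small Python model to make the per-error SMI cost visible. This is an illustrative sketch only: the class name ConventionalServer, the method correctable_error, and the warn_after value are hypothetical, and the model assumes one SMI per correctable error with a warning message sent once an assumed error count is reached.

```python
from dataclasses import dataclass, field


@dataclass
class ConventionalServer:
    """Hypothetical model of the conventional SMI-driven RAS flow."""
    warn_after: int = 3  # assumed error count at which the BIOS sends a warning
    smi_count: int = 0
    bmc_log: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    def correctable_error(self, location: str) -> None:
        # The CPU raises an SMI, passing control to the BIOS.
        self.smi_count += 1
        # The BIOS identifies the error and logs it in the BMC.
        self.bmc_log.append(location)
        # In some situations the BIOS also transmits an error warning message.
        if len(self.bmc_log) >= self.warn_after:
            self.warnings.append(f"warning after error at {location}")


server = ConventionalServer()
for row in range(5):
    server.correctable_error(f"row {row}")
print(server.smi_count, len(server.bmc_log), len(server.warnings))
```

Under these assumptions the number of SMIs grows one-for-one with the number of correctable errors, because every logging step passes through the BIOS.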
As will be appreciated by one of skill in the art in possession of the present disclosure, in order to generate a relatively complete record of correctable errors that occur in memory devices, as well as perform appropriate error actions in response to one or more correctable errors, such conventional RAS operations require many SMIs. As will be appreciated by one of skill in the art in possession of the present disclosure, the SMI provisioning operations by the CPU discussed above have an associated latency that results from the need for the CPU to prepare to enter a System Management Mode (SMM) in which the BIOS may perform the operations described above, enter that SMM, and then exit that SMM once the BIOS has completed the operations described above, and such latency scales with the number of processor cores in the CPU that must each enter and exit the SMM. Furthermore, while the CPU is in the SMM, operating system runtime issues can occur such as, for example, network packet losses, Watch Dog Timer (WDT) timeouts, and/or other SMM issues that would be apparent to one of skill in the art in possession of the present disclosure. Further still, a BIOS SMI handler in the BIOS that handles the SMIs discussed above is a relatively complex piece of code, and the SMM described above is a relatively privileged mode of the computing device that can raise security concerns, as malicious parties can exploit (and have exploited) the ability to trigger SMIs and enter the SMM to gain unauthorized access to critical components in the computing device. A solution to the issues described above has been proposed, and provides for the handling of RAS ope