CN-122003668-A - Individual power cycle control of accelerator modules configured on nodes


Abstract

Disclosed herein is a system for implementing a management controller on a node or network server that is dedicated to monitoring the individual health of a plurality of accelerator modules configured on the node. Based on the monitored health, the management controller is configured to implement autonomous power cycle control of the individual accelerator modules. The autonomous power cycle control may be implemented without violating requirements of standards established for the accelerator modules (e.g., Open Compute Project requirements, peripheral component interconnect express (PCIe) interface requirements).

Inventors

S. C. Pawa
K. Li
K. Kothari
S. S. Labor Deshpande

Assignees

Microsoft Technology Licensing, LLC

Dates

Publication Date: 20260508
Application Date: 20241110
Priority Date: 20231214

Claims (20)

1. A method implemented by a management controller (104), the management controller (104) configured to monitor health of individual ones of a plurality of accelerator modules (106 (1 to N)) configured on a node (102), the method comprising: receiving (502) a first signal (204) identifying an individual accelerator module (106 (1)); determining (504) that the first signal indicates an error (126) affecting the individual accelerator module; in response to determining that the first signal indicates the error affecting the individual accelerator module, sending (506) a second signal (302) to a general purpose input output expander on an I2C bus (304) to power cycle the individual accelerator module without power cycling other accelerator modules of the plurality of accelerator modules; and instructing (508) the general purpose input output expander to generate an asynchronous power cycle signal (308) via a switch or gate (310 (1)) of a multiplexer/demultiplexer (306) and to send the asynchronous power cycle signal (308) to the individual accelerator module but not to the other accelerator modules.
2. The method of claim 1, wherein determining that the first signal indicates the error affecting the individual accelerator module is based on the first signal being received from a general purpose input output pin of the individual accelerator module, the general purpose input output pin being dedicated to error signaling for power cycling purposes.
3. The method of claim 1 or claim 2, wherein the switch or gate disconnects power to the individual accelerator module and then reconnects the power to the individual accelerator module without disconnecting power to the other accelerator modules or the node.
4. The method of any one of claims 1 to 3, further comprising informing an operating system and the other accelerator modules of the error affecting the individual accelerator module via a system management bus to ensure that the operating system and the other accelerator modules do not send traffic to or interact with the individual accelerator module.
5. A baseboard management controller (104), the baseboard management controller (104) configured to monitor health of individual ones of a plurality of accelerator modules (106 (1 to N)) configured on a node (102) by performing operations comprising: receiving (502) a first signal (204) identifying an individual accelerator module (106 (1)); determining (504) that the first signal indicates an error (126) affecting the individual accelerator module; and in response to determining that the first signal indicates the error affecting the individual accelerator module, sending (506) a second signal (302) to power cycle the individual accelerator module without power cycling other accelerator modules of the plurality of accelerator modules.
6. The baseboard management controller of claim 5, wherein determining that the first signal indicates the error affecting the individual accelerator module is based on the first signal being received from a general purpose input output pin of the individual accelerator module, the general purpose input output pin being dedicated to error signaling for power cycling purposes.
7. The baseboard management controller of claim 5 or claim 6, wherein: the second signal is sent to a general purpose input output expander on an I2C bus, and the operations further include instructing the general purpose input output expander to generate an asynchronous power cycle signal via a switch or gate of a multiplexer/demultiplexer and to send the asynchronous power cycle signal to the individual accelerator module but not to the other accelerator modules.
8. The baseboard management controller of claim 7, wherein the switch or gate disconnects power to the individual accelerator module and then reconnects the power to the individual accelerator module without disconnecting power to the other accelerator modules or the node.
9. The baseboard management controller of any one of claims 5 to 8, wherein the first signal is received via a system management bus or an I2C bus.
10. The baseboard management controller of any one of claims 5 to 9, wherein the operations further comprise notifying an operating system and the other accelerator modules of the error affecting the individual accelerator module via a system management bus to ensure that the operating system and the other accelerator modules do not send traffic to or interact with the individual accelerator module.
11. The baseboard management controller of any one of claims 5 to 10, wherein: the node includes a plurality of central processing units, and each of the plurality of central processing units is coupled to a plurality of accelerator modules via a printed circuit board.
12. A method implemented by a management controller (104), the management controller (104) configured to monitor health of individual ones of a plurality of accelerator modules (106 (1 to N)) configured on a node (102), the method comprising: receiving (502) a first signal (204) identifying an individual accelerator module (106 (1)); determining (504) that the first signal indicates an error (126) affecting the individual accelerator module; and in response to determining that the first signal indicates the error affecting the individual accelerator module, sending (506) a second signal (302) to power cycle the individual accelerator module without power cycling other accelerator modules of the plurality of accelerator modules.
13. The method of claim 12, wherein determining that the first signal indicates the error affecting the individual accelerator module is based on the first signal being received from a general purpose input output pin of the individual accelerator module, the general purpose input output pin being dedicated to error signaling for power cycling purposes.
14. The method of claim 12 or claim 13, wherein: the second signal is sent to a general purpose input output expander on an I2C bus, and the method further includes instructing the general purpose input output expander to generate an asynchronous power cycle signal via a switch or gate of a multiplexer/demultiplexer and to send the asynchronous power cycle signal to the individual accelerator module but not to the other accelerator modules.
15. The method of claim 14, wherein the switch or gate disconnects power to the individual accelerator module and then reconnects the power to the individual accelerator module without disconnecting power to the other accelerator modules or the node.
16. The method of any of claims 12 to 15, wherein the first signal is received via a system management bus or an I2C bus.
17. The method of any one of claims 12 to 16, further comprising informing an operating system and the other accelerator modules of the error affecting the individual accelerator module via a system management bus to ensure that the operating system and the other accelerator modules do not send traffic to or interact with the individual accelerator module.
18. The method of any one of claims 12 to 17, wherein the management controller comprises a baseboard management controller configured on a printed circuit board and the plurality of accelerator modules.
19. The method of any of claims 12 to 18, wherein the plurality of accelerator modules comprises a plurality of graphics processing units.
20. The method of any one of claims 12 to 19, wherein: the node includes a plurality of central processing units, and each of the plurality of central processing units is coupled to a plurality of accelerator modules via a printed circuit board.
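
The flow recited in the claims (receive a first signal, determine that it indicates an error, then power cycle only the affected module through a general purpose input output expander on an I2C bus) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the publication: the classes `FakeI2CBus`, `GpioExpander`, and `ManagementController`, the expander address `0x20`, and the one-bit-per-module gate layout are all hypothetical. The bus is simulated in memory, so no hardware is touched, and the `notified` list merely stands in for the system management bus notification.

```python
from dataclasses import dataclass, field


@dataclass
class FakeI2CBus:
    """Hypothetical stand-in for an I2C bus; records writes instead of driving hardware."""
    writes: list = field(default_factory=list)

    def write_byte(self, address: int, value: int) -> None:
        self.writes.append((address, value))


@dataclass
class GpioExpander:
    """Hypothetical GPIO expander: bit i of the output byte gates power to module i."""
    bus: FakeI2CBus
    address: int
    num_modules: int
    state: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Assume every module starts powered (all gate bits set).
        self.state = (1 << self.num_modules) - 1

    def power_cycle(self, module_index: int) -> None:
        if not 0 <= module_index < self.num_modules:
            raise ValueError(f"no such module: {module_index}")
        mask = 1 << module_index
        # Disconnect power for this module only; other bits are left untouched.
        self.state &= ~mask
        self.bus.write_byte(self.address, self.state)
        # Real hardware would wait here before reconnecting power.
        self.state |= mask
        self.bus.write_byte(self.address, self.state)


@dataclass
class ManagementController:
    expander: GpioExpander
    notified: list = field(default_factory=list)  # stand-in for an SMBus notification

    def handle_signal(self, module_index: int, error: bool) -> bool:
        """Receive a first signal, check it for an error, and power cycle one module."""
        if not error:
            return False
        self.notified.append(module_index)
        self.expander.power_cycle(module_index)
        return True


bus = FakeI2CBus()
amc = ManagementController(GpioExpander(bus, address=0x20, num_modules=4))
amc.handle_signal(module_index=2, error=True)
print([bin(value) for _, value in bus.writes])  # ['0b1011', '0b1111']
```

In this sketch only the gate bit for module 2 changes between the two writes, which mirrors the claimed behavior of cycling one accelerator module while the others and the node stay powered.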

Description

Individual power cycle control of accelerator modules configured on nodes

Background

Many different types of physical computing devices, such as network servers, are configured with a plurality of central processing units (CPUs) (alternatively referred to as processing cores). A physical device, such as a network server, is referred to herein as a node. To enhance the performance of applications and/or virtual resources (e.g., virtual machines, containers) on a node, a CPU may be associated with multiple accelerator modules (alternatively referred to as coprocessors). Examples of accelerator modules include graphics processing units (GPUs) and network accelerators. Unfortunately, when an individual accelerator module encounters an error, the standardized approach to resolving the error requires the entire node to be power cycled. For example, when an individual accelerator module configured according to the Open Compute Project (e.g., an Open Compute Project accelerator module (OAM)) enters a kernel panic or hung state, the entire node requires a power cycle event, such as a reboot, in which the node is powered down and powered back up via an alternating current (AC) cycle. The Open Compute Project is an organization in which different manufacturers and/or vendors cooperate and share technologies related to accelerator modules to achieve compatibility and extensibility. In another example, according to the standard, an accelerator module connected via a peripheral component interconnect express (PCIe) interface is not allowed to implement power cycle events (e.g., function level reboot, reset). Thus, when an individual accelerator module connected via a PCIe interface enters a kernel panic or hung state, the node must be powered down and powered back up, so that a power cycle event such as a reboot is performed for the entire node.
In other words, the affected accelerator module is not hot pluggable (i.e., the node cannot power cycle only the affected accelerator module). The power cycle event for the entire node may take up to 30 minutes to complete. During this time, none of the CPUs and/or accelerator modules configured on the node (even those not affected by the error) can execute applications and/or virtual resources belonging to cloud tenants and/or cloud management processes. Instead, this time is spent turning off the power to the node, waiting a few minutes for the hardware components of the node to cool down, reconnecting the power to the node, restarting the operating system, and then restarting the CPUs and accelerator modules that were executing applications and/or virtual resources when the power was turned off. Power cycling the entire node can therefore negatively impact applications and/or virtual resources executing on the unaffected accelerator modules. As a result, an error often affects multiple cloud tenants rather than a single cloud tenant (e.g., customer), which increases the annual interruption rate (AIR). AIR is a metric that cloud platforms and cloud platform tenants track closely for quality and service purposes. For example, increased AIR typically translates into revenue loss due to longer downtime, longer debug cycles, unavailable virtual machines, and the like.

Disclosure of Invention

The technology disclosed herein introduces a management controller on a node or network server that is dedicated to monitoring the individual health of a plurality of accelerator modules configured on the node. Based on the monitored health, the management controller is configured to implement autonomous power cycle control of the individual accelerator modules.
The autonomous power cycle control is implemented without violating the requirements of standards established for the accelerator modules (e.g., Open Compute Project requirements, peripheral component interconnect express (PCIe) interface requirements). Because the management controller is dedicated to monitoring the accelerator modules, the management controller may be referred to as an accelerator management controller (AMC). The management controller or AMC may be implemented via a baseboard management controller (BMC), discrete modules, or other types of management modules. For example, a BMC is a service processor that is able to monitor the physical state of device memory and other hardware/firmware components (e.g., accelerator modules, such as a graphics processing unit (GPU) or network accelerator) using sensors and/or other mechanisms. The BMC is configured on a printed circuit board (e.g., universal baseboard, motherboard) of the node and may enable communications associated with monitoring via a shared or dedicated network interface card (NIC). To this end, the management controller described herein is configured to receive a signal indicative of the health of an in