US-20260126839-A1 - REDUCED VOLTAGE REGULATOR MODULE PHASES
Abstract
A method, according to one approach, includes: detecting a failed phase of a first VRM which causes the first VRM to have a number of functioning phases that is outside a predetermined range. The first VRM is included in a processor system architecture having a plurality of VRMs respectively associated with a plurality of chips. The method also includes causing any workloads on a first chip associated with the first VRM to be offloaded. Moreover, a controlled shutdown of the first VRM and the first chip is performed, while the plurality of VRMs respectively associated with the plurality of chips in the processor system architecture, excluding the first VRM and the first chip, remain operational.
Inventors
- Justin Henspeter
- Eric Jason Fluhr
- Gregory Scott Still
- Eric Marz
Assignees
- INTERNATIONAL BUSINESS MACHINES CORPORATION
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-11-01
Claims (20)
- 1. A method comprising: in a processor system architecture having a plurality of voltage regulator modules (VRMs) respectively associated with a plurality of chips, detecting a failed phase of a first VRM which causes the first VRM to have a number of functioning phases that is outside a predetermined range; causing any workloads on a first chip associated with the first VRM to be offloaded; and causing a controlled shutdown of the first VRM and the first chip to be performed, wherein the plurality of VRMs respectively associated with the plurality of chips in the processor system architecture, excluding the first VRM and the first chip, remain operational.
- 2. The method of claim 1, wherein the predetermined range is based at least in part on an amount of the functioning phases associated with satisfying a performance standard of the first chip associated with the first VRM.
- 3. The method of claim 1, further comprising: transmitting a notification indicating the first VRM has the number of functioning phases that is outside the predetermined range.
- 4. The method of claim 1, wherein the processor system architecture includes at least eight VRMs that are respectively associated with at least eight chips.
- 5. The method of claim 1, further comprising: monitoring a current supplied by the respective VRMs; and in response to the supply current provided by the first VRM being outside a second predetermined range, detecting the first VRM as having a number of functioning phases that is outside the predetermined range.
- 6. The method of claim 1, wherein the causing the controlled shutdown of the first VRM and the first chip to be performed includes: in response to the workloads being offloaded from the first chip, causing an output voltage of the first VRM to be disabled.
- 7. The method of claim 6, wherein the controlled shutdown of the first VRM and the first chip is performed without firmware intervention.
- 8. The method of claim 1, wherein the predetermined range includes four or more phases.
- 9. A computer program product comprising: one or more computer-readable storage media; and program instructions stored on the one or more storage media to perform operations comprising: in a processor system architecture having a plurality of voltage regulator modules (VRMs) respectively associated with a plurality of chips, detecting a failed phase of a first VRM which causes the first VRM to have a number of functioning phases that is outside a predetermined range; causing any workloads on a first chip associated with the first VRM to be offloaded; and causing a controlled shutdown of the first VRM and the first chip to be performed, wherein the plurality of VRMs respectively associated with the plurality of chips in the processor system architecture, excluding the first VRM and the first chip, remain operational.
- 10. The computer program product of claim 9, wherein the predetermined range is based at least in part on an amount of the functioning phases associated with satisfying a performance standard of the first chip associated with the first VRM.
- 11. The computer program product of claim 9, wherein the operations further comprise: transmitting a notification indicating the first VRM has the number of functioning phases that is outside the predetermined range.
- 12. The computer program product of claim 9, wherein the processor system architecture includes at least eight VRMs that are respectively associated with at least eight chips.
- 13. The computer program product of claim 9, wherein the operations further comprise: monitoring a current supplied by the respective VRMs; and in response to the supply current provided by the first VRM being outside a second predetermined range, detecting the first VRM as having a number of functioning phases that is outside the predetermined range.
- 14. The computer program product of claim 9, wherein the causing the controlled shutdown of the first VRM and the first chip to be performed includes: in response to the workloads being offloaded from the first chip, causing an output voltage of the first VRM to be disabled.
- 15. The computer program product of claim 14, wherein the controlled shutdown of the first VRM and the first chip is performed without firmware intervention.
- 16. The computer program product of claim 9, wherein the predetermined range includes four or more phases.
- 17. A computer system comprising: a processor set having an architecture which includes a plurality of voltage regulator modules (VRMs) respectively associated with a plurality of chips; one or more computer-readable storage media; and program instructions stored on the one or more storage media to cause the processor set to perform operations comprising: detecting a failed phase of a first VRM which causes the first VRM to have a number of functioning phases that is outside a predetermined range; causing any workloads on a first chip associated with the first VRM to be offloaded; and causing a controlled shutdown of the first VRM and the first chip to be performed, wherein the plurality of VRMs respectively associated with the plurality of chips in the processor set architecture, excluding the first VRM and the first chip, remain operational.
- 18. The computer system of claim 17, wherein the operations further comprise: monitoring a current supplied by the respective VRMs; and in response to the supply current provided by the first VRM being outside a second predetermined range, detecting the first VRM as having a number of functioning phases that is outside the predetermined range.
- 19. The computer system of claim 17, wherein the predetermined range is based at least in part on an amount of the functioning phases associated with satisfying a performance standard of the first chip associated with the first VRM.
- 20. The computer system of claim 17, wherein the controlled shutdown of the first VRM and the first chip is performed without firmware intervention.
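The current-based detection recited in claims 5, 13 and 18 can be illustrated with a minimal sketch. The claims do not specify the bounds of the second predetermined range, the units, or how readings are collected, so the `CurrentRange` type, the `flag_vrms_by_current` function, the VRM identifiers and the ampere values below are all hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CurrentRange:
    """Hypothetical 'second predetermined range' for VRM supply current."""
    low_amps: float
    high_amps: float

    def contains(self, amps: float) -> bool:
        # Inclusive bounds are an assumption; the claims do not say.
        return self.low_amps <= amps <= self.high_amps


def flag_vrms_by_current(readings: dict[str, float],
                         allowed: CurrentRange) -> list[str]:
    """Return the ids of VRMs whose supply current is outside the range.

    Per the claims, such a VRM is then treated as having a number of
    functioning phases that is outside the (first) predetermined range.
    """
    return [vrm_id
            for vrm_id, amps in sorted(readings.items())
            if not allowed.contains(amps)]


# Illustrative readings only; not taken from the source.
readings = {"vrm0": 410.0, "vrm1": 395.5, "vrm2": 120.0}
flagged = flag_vrms_by_current(readings, CurrentRange(300.0, 500.0))
print(flagged)  # prints ['vrm2']
```

In this sketch the current check is only a trigger: a flagged VRM would then go through the offload and controlled-shutdown steps of the independent claims.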
Description
BACKGROUND

The present invention relates to supply voltages, and more specifically, this invention relates to power supply voltages.

Pluggable voltage regulator modules (VRMs) are generally used to deliver voltage and current to subsystems. For instance, VRMs are used to supply voltage and/or current to subsystems in servers. When voltage and/or current availability is of particular importance, pluggable VRMs can be designed to be phase redundant. Phase redundancy allows for one or more phases (e.g., power stages) to fail, while seamlessly isolating the failed phase(s) from adjacent (e.g., parallel) phases and allowing the system to continue to operate without fault.

VRM designs consider the minimum number of phases (N) that are associated with supporting a given application under the “worst-case” loading conditions. An additional number of “redundant” phases may also be added to the design for resilience in failure conditions. For example, in a traditionally designed VRM, 2 phases can be lost while still supporting the worst-case loading conditions for the underlying system. As supply voltages become more complex, the number of redundant phases associated with maintaining similar resilience increases as well.

SUMMARY

A method, according to one approach, includes: detecting a failed phase of a first VRM which causes the first VRM to have a number of functioning phases that is outside a predetermined range. The first VRM is included in a processor system architecture having a plurality of VRMs respectively associated with a plurality of chips. The method also includes causing any workloads on a first chip associated with the first VRM to be offloaded. Moreover, a controlled shutdown of the first VRM and the first chip is performed, while the plurality of VRMs respectively associated with the plurality of chips in the processor system architecture, excluding the first VRM and the first chip, remain operational.
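The method summarized above (detect a failed phase, offload the chip's workloads, then shut down only that VRM and chip) can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation described in the application: the `Chip` and `Vrm` types, the `handle_phase_failure` function, the treatment of the predetermined range as a simple lower bound (`min_phases`), and the round-robin offload policy are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chip:
    name: str
    workloads: list[str] = field(default_factory=list)
    online: bool = True


@dataclass
class Vrm:
    name: str
    chip: Chip                 # each VRM supplies exactly one chip
    functioning_phases: int
    output_enabled: bool = True


def handle_phase_failure(vrms: list[Vrm], failed: Vrm, min_phases: int) -> bool:
    """Record one failed phase on `failed`; shut it down if out of range.

    Returns True when the VRM and its chip were taken offline.
    Assumption: 'outside the predetermined range' means fewer than
    `min_phases` functioning phases.
    """
    failed.functioning_phases -= 1
    if failed.functioning_phases >= min_phases:
        return False  # still within range; keep operating

    # Offload workloads to the remaining online chips.
    # Round-robin placement is an assumption made for this sketch.
    targets = [v.chip for v in vrms if v is not failed and v.chip.online]
    if not targets:
        raise RuntimeError("no online chip available to receive workloads")
    for i, job in enumerate(failed.chip.workloads):
        targets[i % len(targets)].workloads.append(job)
    failed.chip.workloads.clear()

    # Controlled shutdown: only after offload, disable this VRM's output.
    failed.output_enabled = False
    failed.chip.online = False
    return True


vrms = [Vrm(f"vrm{i}", Chip(f"chip{i}", [f"job{i}"]), functioning_phases=4)
        for i in range(3)]
shut_down = handle_phase_failure(vrms, vrms[0], min_phases=4)
print(shut_down, vrms[0].output_enabled, vrms[1].chip.workloads)
# prints: True False ['job1', 'job0']
```

Because each chip has its own VRM in this model, disabling one output leaves the other VRMs and chips untouched, which is the property the summary emphasizes.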
A computer program product, according to another approach, includes: one or more computer-readable storage media, and program instructions that are stored on the one or more storage media to perform the foregoing method.

A computer system, according to yet another approach, includes: a processor set having an architecture which includes a plurality of VRMs respectively associated with a plurality of chips. The computer system also includes one or more computer-readable storage media, along with program instructions stored on the one or more storage media to cause the processor set to perform the foregoing method.

Other aspects and implementations of the present invention will become apparent from the following detailed description, which, when taken in conjunction with the drawings, illustrate by way of example the principles of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of a computing environment, in accordance with one approach.
FIG. 2A is a representational view of a processor system, in accordance with one approach.
FIG. 2B is a representational view of a progression, in accordance with one approach.
FIG. 3 is a flowchart of a method, in accordance with one approach.

DETAILED DESCRIPTION

The following description is made for the purpose of illustrating the general principles of the present invention and is not meant to limit the inventive concepts claimed herein. Further, particular features described herein can be used in combination with other described features in each of the various possible combinations and permutations.

Unless otherwise specifically defined herein, all terms are to be given their broadest possible interpretation including meanings implied from the specification as well as meanings understood by those skilled in the art and/or as defined in dictionaries, treatises, etc.
It must also be noted that, as used in the specification and the appended claims, the singular forms “a,” “an” and “the” include plural referents unless otherwise specified. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The following description discloses several preferred approaches of systems, methods and computer program products for reducing the number of redundant supply voltage phases associated with maintaining operation of underlying chip modules. Approaches herein are able to achieve this by separating the supply voltages such that each chip is provided with a different (e.g., individual) power supply voltage. As a result, a single chip may be taken offline in a controlled manner without impacting performance of the remaining chips in the processor system. Approaches herein are thereby able to achieve a unique power delivery configuration that differs from conventional