CN-122019283-A - Hardware fault positioning method, server system and server

Abstract

The application provides a hardware fault locating method, a server system, and a server, and relates to the technical field of servers. In the method, when the BMC detects that an abnormal hardware functional module exists in the server system, the CPLD acquires hardware signal data corresponding to that module in real time and stores it in a nonvolatile memory chip. When the BMC later detects that the abnormality of the module is reproduced, it reads the stored hardware signal data from the nonvolatile memory chip. The hardware signal data can therefore be obtained without disassembling the machine and used to locate the root cause of the hardware fault, which speeds up root-cause location and improves the efficiency of hardware fault locating.

Inventors

  • BAI ERHU
  • ZHAO JI
  • GUO PEIXUAN

Assignees

  • 上海远图未来信息技术有限公司

Dates

Publication Date
2026-05-12
Application Date
2025-12-30

Claims (10)

  1. A hardware fault locating method, characterized in that the method is applied to a server system, the server system comprising a baseboard management controller (BMC), a complex programmable logic device (CPLD), and a nonvolatile memory chip that are communicatively connected in sequence, the method comprising: in response to the BMC detecting that an abnormal hardware functional module exists in the server system, sending a hardware signal acquisition instruction to the CPLD; in response to the hardware signal acquisition instruction, acquiring, by the CPLD, hardware signal data corresponding to the abnormal hardware functional module in real time according to the hardware signal acquisition instruction, and transmitting the hardware signal data to the nonvolatile memory chip for storage over a first communication link; and in response to the BMC detecting that the abnormality of the abnormal hardware functional module is reproduced, reading, by the BMC, the hardware signal data corresponding to the abnormal hardware functional module from the nonvolatile memory chip over a second communication link, wherein the hardware signal data is used to locate the root cause of the hardware fault of the abnormal hardware functional module.
  2. The hardware fault locating method of claim 1, wherein the hardware signal acquisition instruction carries communication link identification information, the method further comprising: in response to the hardware signal acquisition instruction, switching, by the CPLD, the current communication link to the first communication link, which connects the CPLD and the nonvolatile memory chip, according to the communication link identification information.
  3. The hardware fault locating method of claim 1, wherein reading the hardware signal data corresponding to the abnormal hardware functional module from the nonvolatile memory chip over the second communication link, in response to the BMC detecting that the abnormality of the abnormal hardware functional module is reproduced, comprises: in response to the BMC detecting that the abnormality of the abnormal hardware functional module is reproduced, sending, by the BMC, a communication link switching instruction to the CPLD; in response to the communication link switching instruction, switching, by the CPLD, the current communication link to the second communication link, which connects the BMC and the nonvolatile memory chip; and reading, by the BMC, the hardware signal data corresponding to the abnormal hardware functional module from the nonvolatile memory chip over the second communication link.
  4. The hardware fault locating method of any one of claims 1 to 3, wherein the hardware signal acquisition instruction further carries a hardware signal type and a sampling frequency, and acquiring, by the CPLD in response to the hardware signal acquisition instruction, the hardware signal data corresponding to the abnormal hardware functional module in real time according to the hardware signal acquisition instruction comprises: in response to the hardware signal acquisition instruction, acquiring, by the CPLD, the hardware signal data corresponding to the abnormal hardware functional module in real time according to the hardware signal type and the sampling frequency.
  5. The hardware fault locating method of any one of claims 1 to 3, wherein transmitting the hardware signal data to the nonvolatile memory chip for storage over the first communication link comprises: transmitting, by the CPLD, the hardware signal data to the nonvolatile memory chip over the first communication link; and storing, by the nonvolatile memory chip, the hardware signal data by category in separate storage regions.
  6. The hardware fault locating method of any one of claims 1 to 3, wherein, in response to the BMC detecting that the abnormality of the abnormal hardware functional module is reproduced, the method further comprises: sending, by the BMC, a hardware signal acquisition stop instruction to the CPLD; and in response to the hardware signal acquisition stop instruction, stopping, by the CPLD, acquisition of the hardware signal data corresponding to the abnormal hardware functional module.
  7. A server system, characterized by comprising a baseboard management controller (BMC), a complex programmable logic device (CPLD), and a nonvolatile memory chip that are communicatively connected in sequence; the BMC is configured to monitor the running state of the hardware functional modules in the server system in real time, and to send a hardware signal acquisition instruction to the CPLD when it detects that an abnormal hardware functional module exists; the CPLD is configured to, in response to the hardware signal acquisition instruction, acquire hardware signal data corresponding to the abnormal hardware functional module in real time according to the hardware signal acquisition instruction, and to transmit the hardware signal data to the nonvolatile memory chip for storage over a first communication link; the BMC is further configured to, in response to detecting that the abnormality of the abnormal hardware functional module is reproduced, read the hardware signal data corresponding to the abnormal hardware functional module from the nonvolatile memory chip over a second communication link, wherein the hardware signal data is used to locate the root cause of the hardware fault of the abnormal hardware functional module.
  8. The server system of claim 7, wherein the CPLD comprises a serial peripheral interface (SPI) controller module and a communication switch switching module, and the hardware signal acquisition instruction carries communication link identification information, a hardware signal type, and a sampling frequency; the SPI controller module is configured to acquire the hardware signal data corresponding to the abnormal hardware functional module in real time according to the hardware signal type and the sampling frequency, and to transmit the hardware signal data to the nonvolatile memory chip for storage over the first communication link; the communication switch switching module is configured to switch the current communication link to the first communication link, which connects the SPI controller module and the nonvolatile memory chip, according to the communication link identification information.
  9. The server system of claim 8, wherein the CPLD further comprises a random access module, the random access module and the BMC being connected by a third communication link; the random access module is configured to receive the hardware signal acquisition instruction, send the communication link identification information to the communication switch switching module, and send the hardware signal type and the sampling frequency to the SPI controller module; the random access module is further configured to receive, over the third communication link, a communication link switching instruction sent by the BMC, and to send the communication link switching instruction to the communication switch switching module, so that the communication switch switching module switches the current communication link to the second communication link, which connects the BMC and the nonvolatile memory chip, according to the communication link switching instruction.
  10. A server, characterized by comprising a processor and a memory communicatively connected to the processor, wherein the processor executes computer-executable instructions stored in the memory to implement the hardware fault locating method of any one of claims 1 to 6.
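
Claims 2, 4 and 5 say what the hardware signal acquisition instruction carries (communication link identification information, a hardware signal type, a sampling frequency) and that the nonvolatile memory chip stores data by category in separate regions, but they fix no encoding or memory map. The following is only an illustrative Python sketch of one way these could look. The byte layout, the `SignalType` categories, the numeric link identifiers, `REGION_SIZE`, and every class name are assumptions made for the example, not details taken from the application.

```python
import struct
from dataclasses import dataclass
from enum import IntEnum


class LinkId(IntEnum):
    """Communication link identification (numeric values are assumptions)."""
    CPLD_TO_FLASH = 1  # the "first communication link"
    BMC_TO_FLASH = 2   # the "second communication link"


class SignalType(IntEnum):
    """Hypothetical signal categories; the claims do not enumerate them."""
    POWER = 0
    TEMPERATURE = 1
    VOLTAGE = 2


REGION_SIZE = 0x1000  # assumed number of bytes reserved per signal type


@dataclass(frozen=True)
class AcquireInstruction:
    link_id: LinkId
    signal_type: SignalType
    sampling_hz: int

    def pack(self) -> bytes:
        # Assumed layout: 1-byte link id, 1-byte signal type,
        # 4-byte big-endian sampling frequency.
        return struct.pack(">BBI", self.link_id, self.signal_type, self.sampling_hz)

    @classmethod
    def unpack(cls, raw: bytes) -> "AcquireInstruction":
        link_id, signal_type, sampling_hz = struct.unpack(">BBI", raw)
        return cls(LinkId(link_id), SignalType(signal_type), sampling_hz)


class RegionedFlash:
    """In-memory stand-in for the nonvolatile memory chip, one region per signal type."""

    def __init__(self) -> None:
        self._regions = {t: bytearray() for t in SignalType}

    def append(self, signal_type: SignalType, sample: bytes) -> int:
        """Store a sample in its type's region and return the absolute address used."""
        region = self._regions[signal_type]
        if len(region) + len(sample) > REGION_SIZE:
            raise ValueError("region full")
        address = int(signal_type) * REGION_SIZE + len(region)
        region.extend(sample)
        return address

    def read_region(self, signal_type: SignalType) -> bytes:
        return bytes(self._regions[signal_type])


# Round-trip an instruction and store one made-up two-byte voltage sample.
instruction = AcquireInstruction(LinkId.CPLD_TO_FLASH, SignalType.VOLTAGE, 1000)
decoded = AcquireInstruction.unpack(instruction.pack())
flash = RegionedFlash()
flash.append(decoded.signal_type, b"\x0c\x1f")
```

Keeping one region per signal type means a later read can fetch a single category without scanning unrelated samples, which is one plausible reading of the classified storage in claim 5.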

Description

Hardware fault positioning method, server system and server

Technical Field

The present application relates to the field of server technologies, and in particular to a hardware fault locating method, a server system, and a server.

Background

In the deployment and maintenance of server products, especially AI servers, rapidly locating and analyzing hardware faults is a core requirement for keeping the server system running stably. Current AI servers generally adopt liquid cooling to cope with the heat dissipation pressure caused by high-density computing, and their structural complexity is significantly higher than that of traditional servers. For example, the mainboard, the power module, the network interface card, and other hardware components of a liquid-cooled server need to be integrated into a closed liquid cooling pipeline system, so collecting hardware signals and locating faults depend on a built-in monitoring module such as a baseboard management controller (BMC).
Currently, locating server hardware faults relies mainly on the one-key logging function of the BMC. Specifically, the BMC generates a log by monitoring in real time the running state of the hardware functional modules (such as the power supply working state, the core component temperature, and the voltage stability), and preliminarily locates the faulty abnormal hardware functional module based on the log.
However, locating server hardware faults through the BMC log in this way suffers from low locating efficiency.

Disclosure of Invention

The application provides a hardware fault locating method, a server system, and a server, to solve the problem in the related art that locating server hardware faults through the BMC log is inefficient.
In a first aspect, the application provides a hardware fault locating method applied to a server system, the server system comprising a BMC, a complex programmable logic device (CPLD), and a nonvolatile memory chip that are communicatively connected in sequence. The method comprises: in response to the BMC detecting that an abnormal hardware functional module exists in the server system, sending a hardware signal acquisition instruction to the CPLD; the CPLD, in response to the hardware signal acquisition instruction, acquiring hardware signal data corresponding to the abnormal hardware functional module in real time according to the hardware signal acquisition instruction, and transmitting the hardware signal data to the nonvolatile memory chip for storage over the first communication link; and, in response to the BMC detecting that the abnormality of the abnormal hardware functional module is reproduced, reading the hardware signal data corresponding to the abnormal hardware functional module from the nonvolatile memory chip over the second communication link, wherein the hardware signal data is used to locate the root cause of the hardware fault of the abnormal hardware functional module.
In one possible implementation, the hardware signal acquisition instruction carries communication link identification information, and the method further comprises: in response to the hardware signal acquisition instruction, switching the current communication link to the first communication link, which connects the CPLD and the nonvolatile memory chip, according to the communication link identification information.
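
The order of operations in the first aspect can be sketched as a small simulation. This is a hedged illustration rather than the applicant's implementation: the `Bmc` and `Cpld` classes, their method names, the `Link` enum, and the use of a Python list as the nonvolatile memory chip are all assumptions made for the example. The stop and link-switch steps on reproduction follow claims 3 and 6.

```python
from enum import Enum


class Link(Enum):
    NONE = "none"
    FIRST = "cpld-flash"   # first communication link: CPLD <-> nonvolatile memory
    SECOND = "bmc-flash"   # second communication link: BMC <-> nonvolatile memory


class Cpld:
    """Toy CPLD: owns the link switch and writes samples over the first link."""

    def __init__(self, flash: list) -> None:
        self.flash = flash
        self.link = Link.NONE
        self.acquiring = False

    def on_acquire_instruction(self, link_id: Link) -> None:
        # Switch to the link named in the instruction, then start acquiring.
        self.link = link_id
        self.acquiring = True

    def on_stop_instruction(self) -> None:
        self.acquiring = False

    def on_switch_instruction(self, link: Link) -> None:
        self.link = link

    def sample(self, value: int) -> bool:
        """Store one sample; refuse unless acquiring over the first link."""
        if not self.acquiring or self.link is not Link.FIRST:
            return False
        self.flash.append(value)
        return True


class Bmc:
    """Toy BMC: starts capture on an anomaly and reads back on reproduction."""

    def __init__(self, cpld: Cpld, flash: list) -> None:
        self.cpld = cpld
        self.flash = flash

    def on_anomaly_detected(self) -> None:
        self.cpld.on_acquire_instruction(Link.FIRST)

    def on_anomaly_reproduced(self) -> list:
        self.cpld.on_stop_instruction()
        self.cpld.on_switch_instruction(Link.SECOND)
        if self.cpld.link is not Link.SECOND:
            raise RuntimeError("BMC does not own the memory link")
        return list(self.flash)


flash: list = []
cpld = Cpld(flash)
bmc = Bmc(cpld, flash)

bmc.on_anomaly_detected()
for reading in (1195, 1201, 873):   # made-up voltage readings in millivolts
    cpld.sample(reading)
captured = bmc.on_anomaly_reproduced()
late_write = cpld.sample(900)       # rejected: the BMC now owns the link
```

The sketch lets only one party drive the memory at a time, which is one way to read the link switching described here: the CPLD writes while the first link is selected, and the BMC reads only after the switch to the second link.
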
In one possible implementation, reading the hardware signal data corresponding to the abnormal hardware functional module from the nonvolatile memory chip over the second communication link, in response to the BMC detecting that the abnormality of the abnormal hardware functional module is reproduced, comprises: in response to the BMC detecting that the abnormality of the abnormal hardware functional module is reproduced, the BMC sending a communication link switching instruction to the CPLD; the CPLD, in response to the communication link switching instruction, switching the current communication link to the second communication link, which connects the BMC and the nonvolatile memory chip; and the BMC reading the hardware signal data corresponding to the abnormal hardware functional module from the nonvolatile memory chip over the second communication link.
In one possible implementation, the hardware signal acquisition instruction further carries a hardware signal type and a sampling frequency, and the CPLD, in response to the hardware signal acquisition instruction, acquiring hardware signal data corresponding to the abnormal hardware functional module in real time according to the hardware signal acquisition instruction comprises: the CPLD responds to the hardware signal acquisition instruction and acquires hardw