CN-121255000-B - DPU and server cluster collaborative power-down method, system, DPU, equipment and storage medium

Abstract

The application provides a DPU and server cluster cooperative power-down method and system, a DPU, a device and a storage medium. The DPU comprises at least one data processing core and a back-end Baseboard Management Controller (BMC), which communicates through a dedicated BMC management network with the front-end BMCs integrated in a plurality of front-end servers. The back-end BMC is configured to maintain a front-end server state table and, in response to a power-down trigger event, to control the front-end servers to power down in sequence according to the front-end server state table and then execute the power-down operation of the DPU. By implementing the application, a plurality of front-end servers can be powered down in an orderly manner, in cooperation with the back-end DPU, so that the safety of the whole system is improved.

Inventors

  • LIU JINGTAO
  • CHEN ANQING
  • LIU PEIBAO

Assignees

  • 深圳云豹智能股份有限公司

Dates

Publication Date
20260508
Application Date
20251201

Claims (13)

  1. A data processing unit (DPU), comprising: at least one data processing core for providing acceleration services to one or more external front-end servers; and a back-end baseboard management controller (BMC) coupled with the at least one data processing core and communicating, through a dedicated BMC management network, with the front-end BMC integrated in each of a plurality of front-end servers; wherein the back-end BMC is configured to: maintain a front-end server state table containing an entry for each front-end server coupled to the DPU; in a DPU initialization stage, obtain FRU information of each front-end BMC, and compare the mainboard serial number and the BMC serial number in the FRU information with the records in the front-end server state table to verify the validity of the state table; and, in response to a power-down trigger event, control the front-end servers to power down in sequence according to the front-end server state table, and execute the power-down operation of the DPU after confirming that all the front-end servers have powered down safely.
  2. The DPU of claim 1, wherein the back-end BMC comprises: a state table maintenance module for maintaining the front-end server state table, the state table dynamically recording the state information of each front-end server coupled with the DPU; and a power-down cooperative processing module for, in response to a power-down trigger event, obtaining the current power state of each front-end server according to the front-end server state table, initiating the power-down flow of the front-end servers that are not yet powered down, and executing the power-down operation of the DPU after confirming that all the front-end servers have powered down safely.
  3. The DPU of claim 2, wherein the state table maintenance module further comprises: a state table storage unit for storing the front-end server state table, wherein the front-end server state table comprises an entry for each front-end server, and each entry comprises a mainboard serial number, a power state field, a DPU service state and BMC information; a state verification unit for, in the DPU initialization stage, acquiring FRU information of each front-end server BMC through an IPMI command, comparing the mainboard serial number and the BMC serial number in the FRU information with the records in the front-end server state table, verifying the validity of each state table entry, and marking an entry as invalid when it does not match; and a state updating unit for receiving, through the dedicated BMC management network, the power state change events or service state change events actively reported by each front-end server BMC, and synchronously updating the corresponding fields of the state table.
  4. The DPU of claim 2 or 3, wherein the power-down cooperative processing module further comprises: a power-down sequence judging unit configured to sequentially power down the front-end servers whose power state is ON according to the real-time state data of each front-end server provided by the state table maintenance module, to shut down the DPU's services to the front-end servers after confirming that all the front-end servers have powered down safely, and then to execute the power-down operation of the DPU; and a timer management unit configured to set a front-end cluster power-down timeout threshold and a DPU service shutdown timeout threshold, and to trigger a forced power-down flow upon timeout.
  5. A method for cooperative power-down of a DPU and a server cluster, applied to a system in which a back-end DPU serves multiple front-end servers, the method comprising: maintaining a front-end server state table in a back-end baseboard management controller (BMC) integrated with the back-end DPU, wherein the front-end server state table dynamically records state information of each front-end server coupled with the back-end DPU; in the initialization stage of the back-end DPU, acquiring, by the back-end BMC, FRU information of each front-end BMC, and comparing the mainboard serial number and the BMC serial number in the FRU information with the records in the front-end server state table to check the validity of the entries in the front-end server state table; while the system is running, receiving, by the back-end BMC through a dedicated BMC management network, power state change events actively reported by each front-end BMC, and dynamically updating the front-end server state table according to the power state change events; and, in response to a power-down trigger event, obtaining, by the back-end BMC, the current power state of each front-end server according to the front-end server state table, initiating the power-down flow of the front-end servers that are not yet powered down, and executing the power-down operation of the back-end DPU after confirming that all the front-end servers have powered down safely.
  6. The method of claim 5, wherein the step of maintaining a front-end server state table further comprises: establishing the front-end server state table, wherein the front-end server state table comprises an entry for each front-end server, and each entry comprises a mainboard serial number, a power state field, a back-end DPU service state and front-end BMC information; in the initialization stage of the back-end DPU, obtaining FRU information of the front-end BMC of each front-end server through an IPMI command, comparing the mainboard serial number and the BMC serial number in the FRU information with the records in the front-end server state table, verifying the validity of each state table entry, and marking an entry as invalid when the mainboard serial number and the BMC serial number do not match; and receiving, through the dedicated BMC management network, the power state change events or service state change events actively reported by the front-end BMCs of the front-end servers, and synchronously updating the corresponding fields of the state table.
  7. The method according to claim 5 or 6, wherein obtaining the current power state of each front-end server according to the front-end server state table and initiating the power-down flow of the front-end servers that are not yet powered down comprises: traversing, by the back-end DPU, the front-end server state table and selecting the front-end server entries whose power state is ON; sending an orderly power-down command to the front-end BMC of each target front-end server through the dedicated BMC management network, and starting a front-end cluster power-down timer; before the front-end cluster power-down timer expires, continuously monitoring the power-down completion events reported by the front-end BMCs of the front-end servers, and updating the power state of the corresponding state table entry each time an event is received; if the front-end cluster power-down timer expires and a front-end server has not completed power-down, triggering, by the back-end DPU, a forced power-down instruction and sending a hard power-off command to the front-end BMC of that front-end server through the dedicated BMC management network; and, after the forced power-down is completed, updating the power state in the corresponding entry of the front-end server state table.
  8. The method of claim 5, wherein executing the power-down operation of the back-end DPU after confirming that all front-end servers have powered down safely further comprises shutting down the back-end DPU's services to the front-end servers after that confirmation: after confirming that all front-end servers have powered down safely, the back-end BMC notifies the back-end DPU through IPMI to close all front-end server services, and starts a service-closing completion event timer; after receiving the instruction to close all front-end server services, the back-end DPU closes the back-end service of each front-end server and, each time a back-end service is closed, notifies the back-end BMC through an IPMI message; and, after receiving the event, the back-end BMC updates the DPU service state of the corresponding entry in the front-end server state table.
  9. The method of claim 8, wherein executing the power-down operation of the back-end DPU further comprises: the back-end BMC powers down the back-end DPU and starts a power-down event timer; if the back-end BMC does not detect that the back-end DPU has powered down before the timer expires, the back-end DPU is forcibly powered down; and, after receiving the power-down completion of the back-end DPU, the back-end BMC updates the DPU service states in the front-end server state table for all front-end servers, and completes the power-down of the back-end DPU.
  10. A DPU and server cluster cooperative power-down system, comprising: a back-end DPU, being the data processing unit (DPU) as claimed in any one of claims 1 to 4; a plurality of front-end servers, each configured with a front-end baseboard management controller (BMC); and a dedicated BMC management network for establishing communication connections between the back-end BMC of the back-end DPU and the front-end BMCs of the front-end servers; wherein the back-end BMC is configured to, in response to a power-down trigger event, obtain the current power state of each front-end server according to the front-end server state table, initiate the power-down flow of the front-end servers that are not yet powered down, and execute the power-down operation of the back-end DPU after confirming that all the front-end servers have powered down safely.
  11. The system of claim 10, wherein the front-end BMC further comprises: an FRU information feedback unit for responding to an information query command from the back-end BMC and feeding back FRU information to the back-end BMC; a state change processing unit for actively sending a state change notification to the back-end BMC through the dedicated BMC management network when the power state of the managed front-end server changes; and a power-down processing unit for responding to a power-down instruction or a forced power-down instruction from the back-end BMC, executing the power-down operation of the front-end server, and feeding back the execution result to the back-end BMC.
  12. An electronic device, characterized in that it is deployed with a data processing unit (DPU) according to any of claims 1 to 4, or with a DPU and server cluster cooperative power-down system according to claim 10 or 11.
  13. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the DPU and server cluster cooperative power-down method of any of claims 5 to 9.
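
The state table and its initialization-time check described in claims 3 and 6 can be pictured with a minimal Python sketch. This is an illustration only, not the claimed implementation: the names `StateEntry`, `validate_entries` and `apply_power_event`, the server ids and the serial numbers are all hypothetical, and the `fru_by_server` dictionary merely stands in for whatever an IPMI FRU read would return.

```python
from dataclasses import dataclass
from enum import Enum


class PowerState(Enum):
    ON = "on"
    OFF = "off"


@dataclass
class StateEntry:
    """One state-table entry per front-end server (field names are illustrative)."""
    board_serial: str
    bmc_serial: str
    power_state: PowerState = PowerState.ON
    dpu_service_active: bool = True
    valid: bool = True


def validate_entries(table, fru_by_server):
    """Mark an entry invalid when its recorded serials differ from the FRU data.

    `fru_by_server` maps a server id to a (board_serial, bmc_serial) tuple and
    is an assumed stand-in for the result of an IPMI FRU query.
    """
    invalid = []
    for server_id, entry in table.items():
        fru = fru_by_server.get(server_id)
        entry.valid = fru == (entry.board_serial, entry.bmc_serial)
        if not entry.valid:
            invalid.append(server_id)
    return invalid


def apply_power_event(table, server_id, new_state):
    """Update the power-state field when a front-end BMC reports a change."""
    table[server_id].power_state = new_state


# Hypothetical two-server table; fe1 has had its mainboard swapped.
table = {
    "fe0": StateEntry("MB-001", "BMC-001"),
    "fe1": StateEntry("MB-002", "BMC-002"),
}
fru = {"fe0": ("MB-001", "BMC-001"), "fe1": ("MB-999", "BMC-002")}
invalid_ids = validate_entries(table, fru)
apply_power_event(table, "fe0", PowerState.OFF)
```

In this sketch a missing FRU reply is treated the same as a mismatch, which is a design assumption; the claims only state that a non-matching entry is marked invalid.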

Description

DPU and server cluster collaborative power-down method, system, DPU, equipment and storage medium

Technical Field

The present application relates to the field of data processing units (DPUs), and in particular to a method, a system, a DPU, a device and a storage medium for cooperative power-down of a DPU and a server cluster.

Background

With the evolution of data center architecture, data processing units (DPUs) are widely used to offload and accelerate I/O (Input/Output) intensive tasks such as networking, storage and security, so that the central processing unit (CPU) resources of servers can be released to focus on business computing.

FIG. 1 is a schematic architecture diagram of a DPU serving multiple front-end servers. In existing data center hardware architecture, one high-performance back-end DPU can provide network services or storage resources for multiple front-end servers at the same time. In this "one-to-many" service architecture, the back-end DPU becomes a key dependency for the proper functioning of all front-end servers. The operating systems and application programs of the front-end servers perform data reads and writes and network communication through the back-end DPU. If the back-end DPU is powered down accidentally or out of order while the front-end servers are still running, all front-end servers immediately lose their storage or network connection, which may lead to the following risks in practical applications:

Service interruption: running application programs crash due to I/O errors;

Data corruption or loss: ongoing write operations fail to complete, resulting in file system corruption or data inconsistencies;

System downtime: the operating system may crash due to the interruption of the critical I/O path.
However, existing server power management generally relies on the independent baseboard management controller (BMC) of each server and cannot perform cooperative processing, so there is a risk that the DPU is powered down directly while the front-end servers have not yet been powered down.

Disclosure of Invention

The technical problem to be solved by the application is to provide a method, a system, a DPU, a device and a storage medium for cooperative power-down of a DPU and a server cluster, which can power down a plurality of front-end servers in an orderly manner and in cooperation with the back-end DPU, thereby improving the safety of the whole system.

To solve the above technical problem, as an aspect of the present application, there is provided a data processing unit (DPU) including:

at least one data processing core for providing acceleration services to one or more external front-end servers;

a back-end baseboard management controller (BMC) coupled with the at least one data processing core and communicating, through a dedicated BMC management network, with the front-end BMC integrated in each of a plurality of front-end servers;

The back-end BMC is configured to maintain a front-end server state table and, in response to a power-down trigger event, to control the front-end servers to power down in sequence according to the front-end server state table and then execute the power-down operation of the DPU.
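
The ordering constraint (front-end servers first, DPU last) can be sketched in Python as follows. This is a hedged illustration rather than the patented implementation: the function `cooperative_power_down` and the callbacks `soft_off`, `hard_off`, `is_off` and `power_off_dpu` are hypothetical placeholders for commands a back-end BMC would send, and the timeout with a forced fallback follows the behavior described in claims 4 and 7.

```python
import time


def cooperative_power_down(power_states, soft_off, hard_off, is_off,
                           power_off_dpu, timeout_s=1.0, poll_s=0.01):
    """Power down every front-end server that is ON, then the DPU.

    power_states: dict mapping server id -> "ON" or "OFF" (the state table).
    soft_off / hard_off / is_off / power_off_dpu: hypothetical callbacks that
    would wrap the real BMC commands.
    Returns the list of servers that needed a forced power-off.
    """
    pending = [sid for sid, state in power_states.items() if state == "ON"]
    for sid in pending:
        soft_off(sid)  # orderly power-down request

    forced = []
    deadline = time.monotonic() + timeout_s
    while pending:
        for sid in list(pending):
            if is_off(sid):
                power_states[sid] = "OFF"
                pending.remove(sid)
        if not pending:
            break
        if time.monotonic() >= deadline:
            # Timer expired: fall back to a hard power-off for the stragglers.
            for sid in pending:
                hard_off(sid)
                power_states[sid] = "OFF"
                forced.append(sid)
            pending = []
            break
        time.sleep(poll_s)

    # The DPU is powered down only once no front-end server is still ON.
    assert all(state == "OFF" for state in power_states.values())
    power_off_dpu()
    return forced


# Simulated cluster: fe1 ignores the orderly command and must be forced off.
states = {"fe0": "ON", "fe1": "ON", "fe2": "OFF"}
acked = set()
log = []
forced = cooperative_power_down(
    states,
    soft_off=lambda sid: acked.add(sid) if sid != "fe1" else None,
    hard_off=lambda sid: log.append(("hard_off", sid)),
    is_off=lambda sid: sid in acked,
    power_off_dpu=lambda: log.append(("dpu_off", None)),
    timeout_s=0.05,
)
```

The sketch polls for completion for simplicity; the text instead describes front-end BMCs actively reporting power-down completion events, which an event-driven version would consume in place of `is_off`.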
Wherein, the back-end BMC includes:

a state table maintenance module for maintaining the front-end server state table, the state table dynamically recording the state information of each front-end server coupled with the DPU;

and a power-down cooperative processing module for, in response to a power-down trigger event, obtaining the current power state of each front-end server according to the front-end server state table, initiating the power-down flow of the front-end servers that are not yet powered down, and executing the power-down operation of the DPU after confirming that all the front-end servers have powered down safely.

Wherein the state table maintenance module further comprises:

a state table storage unit for storing the front-end server state table, wherein the front-end server state table comprises a mainboard serial number, a power state field, a DPU service state and BMC information corresponding to each front-end server;

a state verification unit for acquiring field replaceable unit (FRU) information of each front-end server BMC through Intelligent Platform Management Interface (IPMI) commands in the DPU initialization stage, comparing the mainboard serial number and the BMC serial number in the FRU information with the records in the front-end server state table, verifying the validity of each state table entry, and marking the corresponding entry as an invalid state when the main board serial number and the BMC serial number ar