CN-121996495-A - AI host system and maintenance management optimization method thereof
Abstract
The invention provides an AI host system and a maintenance management optimization method thereof, belonging to the technical field of operation and maintenance management and hardware control of computer systems. The system provides local centralized control through a main control board and a touch screen, has a fault memory function that supports automatic parameter recovery after a power failure, and achieves accurate temperature control and load-predictive heat dissipation through multipoint temperature monitoring and a linked heat dissipation mechanism. The invention improves operation and maintenance efficiency and operating stability in high-temperature environments, and is suitable for high-density AI computing scenarios.
Inventors
- YU ZHICHUN
- ZHANG XIAOQING
Assignees
- 广州芯伟达智能科技有限公司
- 科创智联科技有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-02
Claims (10)
- 1. An AI host system, comprising a chassis and, integrated within the chassis: a plurality of AI hosts for executing AI inference tasks; a main control board having a nonvolatile memory; a local centralized control interface, connected to the main control board, for displaying the running states of all the AI hosts and receiving user control instructions for a single host or a plurality of hosts; a fault memory module, functionally implemented by the main control board and its nonvolatile memory, for capturing and storing fault parameters when an AI host fault is detected and automatically recovering to the pre-fault state after the system is powered on again; and an accurate temperature monitoring and linked heat dissipation module, comprising a plurality of temperature sensors arranged in the AI host region and a heat dissipation unit controlled by the main control board; wherein the main control board is configured to execute both the fault memory logic and a hierarchical linked heat dissipation strategy.
- 2. The system of claim 1, wherein the fault memory module is configured to perform the following process: when a hardware error, temperature overrun, or power abnormality of an AI host is detected, immediately interrupting the current task, capturing the device identifier, working mode, and operation configuration parameters of the host, and storing them as a fault parameter packet in the nonvolatile memory; and state recovery: after the system is powered on again, first checking the fault records in the nonvolatile memory and, if a valid fault event is identified, automatically loading the operation configuration parameters stored before the fault to the corresponding AI host, triggering the host to execute a power-on self-check, and resuming normal operation after the self-check passes.
- 3. The system of claim 1, wherein, in the accurate temperature monitoring and linked heat dissipation module, the main control board is configured to execute the following strategies: threshold-linked control: multi-level temperature thresholds are preset, and the main control board compares the temperature value acquired by each sensor with the current threshold in real time and outputs a PWM signal according to the comparison, thereby dynamically adjusting the running speed of the heat dissipation unit; and the main control board predicts the short-term heating trend of the AI hosts by analyzing their real-time computing load and raises the operating intensity of the heat dissipation unit in advance based on the prediction result.
- 4. The system of claim 3, wherein the multi-level temperature thresholds comprise: a first threshold, below which the heat dissipation unit runs at low speed; a second threshold, wherein the heat dissipation unit runs at medium speed when the temperature is above the first threshold but below the second threshold; a third threshold, wherein the heat dissipation unit runs at high speed when the temperature is above the second threshold but below the third threshold; and a fourth threshold, above which the heat dissipation unit runs at super-high speed and a system-level alarm is triggered.
- 5. The system of claim 1, wherein the load-predictive heat dissipation comprises: the main control board periodically obtaining the CPU utilization and memory occupancy of each AI host, predicting the temperature rise inside the chassis over a future period using a pre-stored heat generation calculation model, and, if the predicted temperature exceeds the next-level threshold, raising the fan speed to the corresponding level in advance.
- 6. A maintenance management optimization method for an AI host system, applied to the system according to any one of claims 1 to 5, the method comprising: monitoring the state of the AI hosts through the main control board and, when a fault is detected, capturing the current fault parameters and storing them in the nonvolatile memory; and an accurate temperature monitoring and linked heat dissipation step, comprising acquiring temperature data inside the chassis in real time through a plurality of temperature sensors and uploading the data to the main control board over an I2C bus, wherein the main control board executes a hierarchical linked heat dissipation strategy according to the temperature data and the AI host load prediction, and dynamically controls the operation of the heat dissipation unit.
- 7. The method of claim 6, wherein restoring to the pre-fault state comprises automatically loading the operation configuration parameters stored before the fault and performing a self-check procedure to verify the system state, normal operation resuming only after the self-check passes.
- 8. The method of claim 6, wherein the hierarchical linked heat dissipation strategy comprises: comparing the acquired temperature value with preset multi-level temperature thresholds and outputting a PWM signal according to the comparison result to adjust the rotation speed of the heat dissipation unit; and predicting the heating trend of the AI hosts based on their real-time load data and adjusting the operating intensity of the heat dissipation unit in advance according to that trend, thereby achieving preventive heat dissipation.
- 9. The method of claim 8, wherein the advance adjustment comprises the main control board immediately raising the fan speed to the target level when it predicts that the local zone temperature will exceed the next-level threshold within the next 30 seconds, rather than waiting for the actual temperature to reach the threshold.
- 10. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 6 to 9.
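The tiered control of claims 4 and 8 and the 30-second look-ahead of claims 5 and 9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the threshold temperatures, the PWM duty cycles, the linear heat model with its coefficients, and every function name are hypothetical, since the claims give no numeric values. The claims also leave the band between the third and fourth thresholds unspecified; the sketch assumes the fan stays at high speed there. Load is passed as a single aggregate fraction for brevity, whereas claim 5 collects it per host.

```python
from bisect import bisect_left

# Hypothetical thresholds in degrees Celsius; the claims give no numeric values.
THRESHOLDS_C = (45.0, 60.0, 75.0, 85.0)
# Speed level per temperature band. The band between the third and fourth
# thresholds is not specified in the claims; "high" is assumed here.
LEVELS = ("low", "medium", "high", "high", "super_high")
# Hypothetical PWM duty cycle (percent) for each speed level.
PWM_DUTY = {"low": 30, "medium": 55, "high": 80, "super_high": 100}
HORIZON_S = 30.0  # look-ahead window taken from claim 9


def band(temp_c: float) -> int:
    """Return how many thresholds the temperature strictly exceeds (0 to 4)."""
    return bisect_left(THRESHOLDS_C, temp_c)


def predict_temp(temp_c: float, cpu_util: float, mem_util: float) -> float:
    """Assumed linear heat model: the rise rate scales with load (0.0 to 1.0)."""
    # Hypothetical coefficients, in degrees Celsius per second at full load.
    rise_per_s = 0.10 * cpu_util + 0.02 * mem_util
    return temp_c + rise_per_s * HORIZON_S


def control_step(sensor_temps_c, cpu_util, mem_util):
    """Return (speed level, PWM duty in percent, alarm flag) for one cycle."""
    hottest = max(sensor_temps_c)  # react to the hottest monitored zone
    actual = band(hottest)
    predicted = band(predict_temp(hottest, cpu_util, mem_util))
    # Preventive step: if the forecast crosses the next threshold,
    # raise the fan now instead of waiting for the real temperature.
    level = LEVELS[max(actual, predicted)]
    alarm = actual == len(THRESHOLDS_C)  # only a measured overrun raises the alarm
    return level, PWM_DUTY[level], alarm
```

With sensors reading 58 and 50 degrees at 90% CPU and 50% memory load, the assumed model forecasts roughly 61 degrees, so the sketch selects the high level even though the measured temperature is still in the medium band.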
Description
AI host system and maintenance management optimization method thereof
Technical Field
The invention relates to the technical field of operation and maintenance management and hardware control of computer systems, and in particular to an AI host system with intelligent fault memory and accurate linked heat dissipation functions, and to a maintenance management optimization method thereof.
Background
In scenarios such as security monitoring and edge computing that require high-density deployment of AI hosts, system reliability and maintainability are critical. However, existing systems have significant shortcomings in operation and maintenance management and state control.
First, in terms of fault recovery, conventional out-of-band management (such as a BMC) can only switch the hardware on and off remotely; it cannot record and restore the software configuration and running state of the service layer. When a component fails or the system suffers an unexpected power loss, operation and maintenance personnel often have to go on site to restore power, reconfigure the parameters manually, and start the hosts one by one. This is inefficient and error-prone, prolongs service interruption, and cannot meet the requirement of continuous 7x24 operation.
Second, in terms of heat dissipation management, conventional systems mostly respond passively or apply coarse control based on a single temperature point. Because multiple AI hosts share concentrated heat dissipation in a high-density deployment, local hot spots form easily. Conventional schemes cannot sense them accurately or respond in time, so equipment is prone to overheating damage and a shortened service life. Meanwhile, independent temperature control by multiple hosts may cause conflicting heat dissipation policies and reduce overall heat dissipation efficiency.
Therefore, there is an urgent need for an AI host system and method that can implement intelligent operation and maintenance, fast fault recovery, and precise thermal management.
Disclosure of Invention
The invention provides an AI host system and a maintenance management optimization method thereof, which offer high operation and maintenance efficiency and good system stability and can effectively prevent overheating faults. To achieve the above purpose, the invention adopts the following technical scheme.
In a first aspect, the present invention provides an AI host system comprising a chassis and, integrated within the chassis: a plurality of AI hosts for executing AI inference tasks; a main control board having a nonvolatile memory; a local centralized control interface, connected to the main control board, for displaying the running states of all the AI hosts and receiving user control instructions for a single host or a plurality of hosts; a fault memory module, functionally implemented by the main control board and its nonvolatile memory, for capturing and storing fault parameters when an AI host fault is detected and automatically recovering to the pre-fault state after the system is powered on again; and an accurate temperature monitoring and linked heat dissipation module, comprising a plurality of temperature sensors arranged in the AI host region and a heat dissipation unit controlled by the main control board. The main control board is configured to execute both the fault memory logic and a hierarchical linked heat dissipation strategy.
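The capture-and-recover behavior of the fault memory module can be illustrated with a short sketch. It is a hedged example rather than the patented design: a JSON file stands in for the main control board's nonvolatile memory, and the names `FaultPacket`, `FaultMemory`, `apply_config`, and `self_check`, as well as the sample host and configuration values, are hypothetical. The packet fields follow claim 2 (device identifier, working mode, operation configuration parameters). Whether records are cleared after a successful recovery is not stated in the source, so the sketch leaves them in place.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class FaultPacket:
    device_id: str   # identifier of the faulty AI host
    work_mode: str   # working mode at the moment of the fault
    config: dict     # operation configuration parameters to restore
    fault_type: str  # e.g. "hardware_error", "over_temperature", "power_abnormal"


class FaultMemory:
    """File-backed stand-in for the main control board's nonvolatile memory."""

    def __init__(self, path: str):
        self.path = path

    def _load(self) -> dict:
        try:
            with open(self.path, encoding="utf-8") as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return {}  # no valid fault record stored

    def capture(self, packet: FaultPacket) -> None:
        """Store the packet; write-then-rename so a power cut cannot corrupt it."""
        records = self._load()
        records[packet.device_id] = asdict(packet)
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(records, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def recover(self, apply_config, self_check) -> list:
        """On power-up: reload each saved config, self-check, list restored hosts."""
        restored = []
        for device_id, record in self._load().items():
            apply_config(device_id, record["config"])
            if self_check(device_id):
                restored.append(device_id)
        return restored


# Usage: capture a fault, then simulate a reboot with a fresh FaultMemory object.
with tempfile.TemporaryDirectory() as workdir:
    nvm_path = os.path.join(workdir, "nvm.json")
    FaultMemory(nvm_path).capture(
        FaultPacket("host-03", "inference", {"batch_size": 8}, "over_temperature")
    )
    applied = {}
    restored = FaultMemory(nvm_path).recover(
        apply_config=lambda dev, cfg: applied.update({dev: cfg}),
        self_check=lambda dev: True,  # assume the power-on self-check passes
    )
```

Writing to a temporary file and renaming it is one way to keep the stored record intact if power is lost during the write, which matters here because power abnormality is itself one of the fault triggers.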
Preferably, the fault memory module is configured to perform the following process: when a hardware error, temperature overrun, or power abnormality of an AI host is detected, immediately interrupting the current task, capturing the device identifier, working mode, and operation configuration parameters of the host, and storing them as a fault parameter packet in the nonvolatile memory; and state recovery: after the system is powered on again, first checking the fault records in the nonvolatile memory and, if a valid fault event is identified, automatically loading the operation configuration parameters stored before the fault to the corresponding AI host, triggering the host to execute a power-on self-check, and resuming normal operation after the self-check passes.
Preferably, in the accurate temperature monitoring and linked heat dissipation module, the main control board is configured to execute the following strategies: threshold-linked control: multi-level temperature thresholds are preset, and the main control board compares the temperature value acquired by each sensor with the current threshold in real time and outputs a PWM signal according to the comparison, so that the running speed of the heat dissipation unit is dynamically adj