CN-121996518-A - Heat dissipation performance monitoring system and method of GPU cluster, electronic equipment and storage medium

Abstract

The invention relates to a heat dissipation performance monitoring system and method for a GPU cluster, an electronic device and a storage medium. It belongs to the technical field of data processing and addresses training interruptions and reduced efficiency that arise when uneven heat dissipation in a GPU cluster forces some GPUs (graphics processing units) to reduce power. The heat dissipation performance monitoring method comprises acquiring system metrics of the GPU cluster from each GPU in real time, calculating the temperature change rate and the heat dissipation efficiency coefficient of each GPU from the system metrics, and generating alarm information when the temperature change rate is abnormal and/or the heat dissipation efficiency coefficient is abnormal. Because the heat dissipation efficiency coefficient and the temperature change rate are calculated and their abnormalities are marked, degraded nodes can be identified automatically and located faster.

Inventors

  • Request for anonymity

Assignees

  • 摩尔线程智能科技(北京)股份有限公司

Dates

Publication Date
2026-05-08
Application Date
2025-12-26

Claims (20)

  1. A method for monitoring the heat dissipation performance of a GPU cluster, characterized by comprising: acquiring system metrics of the GPU cluster from each GPU in real time; calculating a temperature change rate and a heat dissipation efficiency coefficient of each GPU according to the system metrics; and generating alarm information when the temperature change rate is abnormal and/or the heat dissipation efficiency coefficient is abnormal.
  2. The method for monitoring the heat dissipation performance of a GPU cluster according to claim 1, wherein calculating the temperature change rate and the heat dissipation efficiency coefficient of each GPU according to the system metrics comprises: reading the GPU core temperature, the coolant temperature and the GPU power consumption in the system metrics in real time; calculating the heat dissipation efficiency coefficient in real time based on the GPU core temperature, the coolant temperature and the GPU power consumption; obtaining a temperature curve of each GPU based on the GPU core temperature; and calculating the temperature change rate according to the temperature curve of each GPU.
  3. The method for monitoring the heat dissipation performance of a GPU cluster according to claim 2, further comprising: calculating, from the heat dissipation efficiency coefficients of all GPUs in the GPU cluster, the mean of the heat dissipation efficiency coefficients of all GPUs in the GPU cluster and the standard deviation of the heat dissipation efficiency coefficients of the GPUs; and taking the difference between that mean and three times that standard deviation as a heat dissipation efficiency coefficient threshold.
  4. The method for monitoring the heat dissipation performance of a GPU cluster according to claim 3, further comprising: when the heat dissipation efficiency coefficient is greater than or equal to the heat dissipation efficiency coefficient threshold, determining that the heat dissipation efficiency coefficient is abnormal and marking the corresponding GPU with a heat dissipation efficiency abnormality; and when the heat dissipation efficiency coefficient is smaller than the heat dissipation efficiency coefficient threshold, displaying the heat dissipation efficiency coefficient and the corresponding GPU identifier.
  5. The method for monitoring the heat dissipation performance of a GPU cluster according to claim 1, further comprising: when the temperature change rate is greater than a set temperature change rate threshold, determining that the temperature change rate is abnormal and marking the corresponding GPU with a temperature change rate abnormality; and when the temperature change rate is smaller than or equal to the set temperature change rate threshold, displaying the temperature change rate and the corresponding GPU identifier.
  6. The method of any of claims 1-5, wherein generating the alarm information comprises: generating temperature change rate abnormality alarm information according to the temperature change rate abnormality mark and the corresponding GPU number; and/or generating heat dissipation efficiency coefficient abnormality alarm information according to the heat dissipation efficiency coefficient abnormality mark and the corresponding GPU number.
  7. The method for monitoring the heat dissipation performance of a GPU cluster according to claim 3, further comprising: calculating the standard deviation and the mean of the heat dissipation efficiency coefficients of the GPU cluster based on the heat dissipation efficiency coefficients of all GPUs in the GPU cluster; calculating a node difference index based on that standard deviation and that mean; and determining and displaying, according to the node difference index, the distribution of heat dissipation efficiency coefficients among different GPU clusters in a laboratory.
  8. A heat dissipation performance monitoring system for a GPU cluster, comprising: a data acquisition module for acquiring system metrics of the GPU cluster from each GPU in real time; and a monitoring device for calculating the temperature change rate and the heat dissipation efficiency coefficient of each GPU according to the system metrics, and for generating alarm information when the temperature change rate is determined to be abnormal and/or the heat dissipation efficiency coefficient is determined to be abnormal.
  9. The heat dissipation performance monitoring system of a GPU cluster according to claim 8, wherein the system metrics comprise a GPU core temperature, a coolant temperature and GPU power consumption, and the monitoring device comprises a heat dissipation analysis module, wherein the heat dissipation analysis module is used for calculating the heat dissipation efficiency coefficient based on the GPU core temperature, the coolant temperature and the GPU power consumption acquired in real time, calculating a node difference index based on the heat dissipation efficiency coefficient of each GPU, and storing the heat dissipation efficiency coefficient and the node difference index.
  10. The system of claim 9, wherein the heat dissipation efficiency coefficient represents the ratio of the temperature difference between the GPU core temperature and the coolant temperature to the GPU power consumption.
  11. The heat dissipation performance monitoring system of a GPU cluster according to claim 10, wherein the monitoring device further comprises a display module and an alarm module; the heat dissipation analysis module further comprises an abnormality detection marking sub-module, and the abnormality detection marking sub-module further comprises a heat dissipation efficiency comparison unit, wherein, when the heat dissipation efficiency coefficient is greater than or equal to the heat dissipation efficiency coefficient threshold, the heat dissipation efficiency coefficient is provided to both the display module and the alarm module; and when the heat dissipation efficiency coefficient is smaller than the heat dissipation efficiency coefficient threshold, the heat dissipation efficiency coefficient is provided to the display module.
  12. The system according to claim 11, wherein the heat dissipation efficiency coefficient threshold is the difference between the mean of the heat dissipation efficiency coefficients of the GPU cluster and three times the standard deviation of the heat dissipation efficiency coefficients of the GPUs.
  13. The heat dissipation performance monitoring system of a GPU cluster according to claim 11, wherein the abnormality detection marking sub-module further comprises a heat dissipation abnormality marking unit configured to, when the heat dissipation efficiency coefficient is greater than or equal to the heat dissipation efficiency coefficient threshold, mark a heat dissipation efficiency abnormality of the current GPU and provide the heat dissipation efficiency abnormality mark of the current GPU and the corresponding current GPU number to both the alarm module and the display module.
  14. The heat dissipation performance monitoring system of a GPU cluster according to claim 11, wherein the heat dissipation analysis module further comprises a temperature curve calculation sub-module, wherein the temperature curve calculation sub-module is used for obtaining the temperature curve of each GPU based on the GPU core temperature, calculating the temperature change rate of each GPU according to the temperature curve of each GPU, and providing the temperature change rate to the abnormality detection marking sub-module.
  15. The heat dissipation performance monitoring system of a GPU cluster according to claim 14, wherein the abnormality detection marking sub-module further comprises a temperature change comparison unit, wherein, when the temperature change rate is smaller than or equal to a temperature change rate threshold, the temperature change rate of each GPU is provided to the display module; and when the temperature change rate is greater than the temperature change rate threshold, the abnormal temperature change rate of the current GPU is provided to both the display module and the alarm module.
  16. The heat dissipation performance monitoring system of a GPU cluster according to claim 14, wherein the abnormality detection marking sub-module further comprises a temperature abnormality marking unit, wherein the temperature abnormality marking unit is used for, when the temperature change rate is greater than a temperature change rate threshold, marking a temperature change rate abnormality of the current GPU and providing the temperature change rate abnormality mark of the current GPU and the corresponding current GPU number to the alarm module and the display module.
  17. The system according to claim 10, wherein the heat dissipation analysis module further comprises a node difference calculation sub-module for calculating, as the node difference index, the ratio of the standard deviation of the heat dissipation efficiency coefficients of the GPU cluster to the mean of the heat dissipation efficiency coefficients of the GPU cluster; the standard deviation of the heat dissipation efficiency coefficients of the GPU cluster represents the standard deviation of the heat dissipation efficiency coefficients of the GPUs in the GPU cluster; the mean of the heat dissipation efficiency coefficients of the GPU cluster represents the mean of the heat dissipation efficiency coefficients of all GPUs in the GPU cluster; and the heat dissipation analysis module is also used for determining, according to the node difference index, the distribution of heat dissipation efficiency coefficients among different GPU clusters in a laboratory and providing that distribution to a display module of the monitoring device.
  18. The heat dissipation performance monitoring system of a GPU cluster according to claim 11, wherein the GPU cluster comprises a time-series database for storing system metrics from the data acquisition device; the monitoring device also comprises a data reading module for reading the system metrics from the time-series database and providing them to the heat dissipation analysis module, the alarm module and the display module respectively, wherein the system metrics also comprise the computing power utilization of the GPU and heat dissipation device parameters, and the heat dissipation device parameters include the water pressure of a water pump, the rotational speed of a fan, and the flow rate of a flowmeter.
  19. The heat dissipation performance monitoring system of a GPU cluster according to claim 18, wherein the alarm module is configured to: raise an alarm, by means of an alarm indicator or a notification message, according to the temperature change rate abnormality mark and the corresponding current GPU number and/or the heat dissipation efficiency abnormality mark and the corresponding current GPU number; and, when the GPU core temperature, the GPU power consumption, the computing power utilization or a heat dissipation device parameter exceeds its corresponding threshold, raise an alarm for the GPU core temperature, the GPU power consumption or the computing power utilization together with the corresponding GPU, or for the heat dissipation device parameter together with the corresponding heat dissipation device.
  20. The heat dissipation performance monitoring system of a GPU cluster according to claim 18, wherein the display module is configured to display the heat dissipation efficiency coefficient and the temperature change rate of each GPU in the GPU cluster on a monitoring panel, wherein the heat dissipation efficiency abnormality mark and/or the temperature change rate abnormality mark and the position of the corresponding abnormal card are displayed in different colors; and the display module is also used for displaying the system metrics on a monitoring panel.
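The quantities defined in claims 3, 10, 12 and 17 can be illustrated with a minimal Python sketch. It assumes the heat dissipation efficiency coefficient is η = (T_core − T_coolant) / P as stated in claim 10, the threshold is mean − 3 × standard deviation as stated in claims 3 and 12, and the node difference index is standard deviation / mean as stated in claim 17. The function names, the units, and the use of the population standard deviation are assumptions made for illustration only; the claims do not specify them.

```python
from statistics import mean, pstdev
from typing import NamedTuple, Sequence


class ClusterStats(NamedTuple):
    mean_eta: float               # mean of the coefficients over the cluster
    std_eta: float                # standard deviation of the coefficients
    eta_threshold: float          # mean - 3 * std (claims 3 and 12)
    node_difference_index: float  # std / mean (claim 17)


def efficiency_coefficient(core_temp_c: float, coolant_temp_c: float,
                           power_w: float) -> float:
    """Ratio of (core temperature - coolant temperature) to power (claim 10)."""
    if power_w <= 0:
        raise ValueError("GPU power consumption must be positive")
    return (core_temp_c - coolant_temp_c) / power_w


def cluster_statistics(etas: Sequence[float]) -> ClusterStats:
    """Cluster-wide statistics over the per-GPU coefficients.

    Assumption: population standard deviation (pstdev); the source does not
    say whether the population or the sample form is intended.
    """
    mu = mean(etas)
    if mu == 0:
        raise ValueError("mean coefficient is zero; index is undefined")
    sigma = pstdev(etas)
    return ClusterStats(mu, sigma, mu - 3 * sigma, sigma / mu)
```

For example, `efficiency_coefficient(70.0, 30.0, 400.0)` gives a coefficient of 0.1 degrees Celsius per watt, and `cluster_statistics` can then be applied to the list of coefficients of all GPUs in the cluster.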

Description

Heat dissipation performance monitoring system and method of GPU cluster, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a system and method for monitoring the heat dissipation performance of a GPU cluster, an electronic device, and a storage medium.
Background
Large-scale graphics processing unit (GPU) clusters generally adopt a liquid-cooling or air-cooling heat dissipation scheme, but installation errors, equipment aging, uneven refrigerant distribution and other factors cause differences in heat dissipation performance among nodes. Prometheus, an open-source monitoring system, can collect hardware metrics through Exporter plug-ins, but it does not natively support differential analysis or anomaly localization of GPU heat dissipation performance; natively, only central processing unit (CPU) information can be obtained. In the related art, metrics such as the temperature and utilization of a single GPU are obtained through nvidia-smi commands, but this approach only provides raw data, lacks detection of heat dissipation performance degradation based on historical data, and also lacks automatic marking and root-cause localization of abnormal cards. Therefore, in the related art, differences in heat dissipation performance cannot be quantified: existing monitoring only displays the temperature of a single card, no heat dissipation capability evaluation model is established, and nodes with degraded heat dissipation cannot be identified.
Disclosure of Invention
In view of the above analysis, the embodiments of the invention aim to provide a heat dissipation performance monitoring system and method for a GPU cluster, an electronic device and a storage medium, to solve the problems that differences in heat dissipation performance cannot be quantified, that existing monitoring only displays the temperature of a single card, that no heat dissipation capability evaluation model is established, and that nodes with degraded heat dissipation cannot be identified.
In one aspect, an embodiment of the invention provides a heat dissipation performance monitoring method for a GPU cluster, comprising: acquiring system metrics of the GPU cluster from each GPU in real time; calculating the temperature change rate and the heat dissipation efficiency coefficient of each GPU from the system metrics; and generating alarm information when the temperature change rate is abnormal and/or the heat dissipation efficiency coefficient is abnormal.
The beneficial effects of this technical solution are that degraded nodes can be identified automatically by calculating the heat dissipation efficiency coefficient η, which speeds up localization. In addition, abnormal cards can be isolated automatically, which avoids interruption of whole-machine training caused by a single overheated node and improves the utilization of the cluster's computing power.
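The alarm step of the method can be sketched in Python as follows. The comparison directions follow claims 4 and 5 (rate greater than its threshold; coefficient greater than or equal to its threshold). How the temperature change rate is derived from the temperature curve is not specified in the source, so the slope between the last two samples is used here purely as an assumption; the function names, message texts and data layout are likewise hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def temperature_change_rate(curve: Sequence[Tuple[float, float]]) -> float:
    """Slope in degrees Celsius per second between the last two samples.

    `curve` holds (timestamp_seconds, core_temp_celsius) pairs in ascending
    time order. Using only the last two samples is an assumption.
    """
    if len(curve) < 2:
        raise ValueError("need at least two samples")
    (t0, temp0), (t1, temp1) = curve[-2], curve[-1]
    if t1 <= t0:
        raise ValueError("timestamps must be strictly increasing")
    return (temp1 - temp0) / (t1 - t0)


def generate_alarms(gpus: Dict[str, Tuple[float, float]],
                    rate_threshold: float,
                    eta_threshold: float) -> List[str]:
    """Return one alarm message per abnormality found.

    `gpus` maps a GPU identifier to (temperature_change_rate, eta).
    """
    alarms = []
    for gpu_id, (rate, eta) in gpus.items():
        if rate > rate_threshold:    # claim 5: strictly greater
            alarms.append(f"{gpu_id}: temperature change rate abnormal "
                          f"({rate:.3f})")
        if eta >= eta_threshold:     # claim 4: greater than or equal
            alarms.append(f"{gpu_id}: heat dissipation efficiency "
                          f"coefficient abnormal ({eta:.3f})")
    return alarms
```

A GPU that trips both conditions produces two messages, matching the "and/or" wording of the method.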
In a further improvement of the method, calculating the temperature change rate and the heat dissipation efficiency coefficient of each GPU from the system metrics comprises: reading the GPU core temperature, the coolant temperature and the GPU power consumption in the system metrics in real time; calculating the heat dissipation efficiency coefficient in real time based on the GPU core temperature, the coolant temperature and the GPU power consumption; obtaining the temperature curve of each GPU based on the GPU core temperature; and calculating the temperature change rate according to the temperature curve of each GPU.
In a further improvement of the method, the method further comprises: calculating, from the heat dissipation efficiency coefficients of all GPUs in the GPU cluster, the mean of the heat dissipation efficiency coefficients of all GPUs in the GPU cluster and the standard deviation of the heat dissipation efficiency coefficients of the GPUs; and taking the difference between that mean and three times the standard deviation of the heat dissipation efficiency coefficients of all GPUs as the heat dissipation efficiency coefficient threshold.
In a further improvement of the method, the method further comprises: when the heat dissipation efficiency coefficient is greater than or equal to the heat dissipation efficiency coefficient threshold, determining that the heat dissipation efficiency coefficient is abnormal and marking the corresponding GPU with a heat dissipation efficiency abnormality; and displaying the heat dissipation efficiency coeffic