CN-121984834-A - Heterogeneous computing network resource abnormity monitoring and early warning method
Abstract
A heterogeneous computing network resource anomaly monitoring and early warning method can avoid fragmentation across separate monitoring systems, assists root cause localization, improves the efficiency of anomaly investigation, and helps determine the scope of an anomaly's impact on tenant services based on the root cause and the resource topology. The method comprises: step 1, uniformly collecting and processing heterogeneous resource monitoring data based on a unified monitoring model and index processing rules; step 2, performing monitoring index anomaly detection based on index monitoring thresholds; step 3, looking up multi-index anomaly joint detection rules and performing joint detection analysis on each single anomalous monitoring index in combination with the resource topology, or directly outputting the anomalous index if no matching rule exists; step 4, auditing the resource topology, inferring the anomaly impact range, and outputting the inference result; and step 5, issuing a unified anomaly early warning according to the inference result.
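The five steps of the abstract can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the vendor names, metric names ("index" in the source's wording), thresholds, joint rule, topology, and subscriber table are all hypothetical, the topology is a plain dictionary rather than a graph database, and an anomaly whose joint rule matches but does not trigger is simply reported on its own.

```python
from collections import deque

# Every table below is hypothetical illustration data, not a value from the patent.
NORMALIZE = {("vendorC", "rx_err"): "port.rx_errors",          # step 1 mapping rules
             ("vendorD", "retrans"): "nic.retransmits",
             ("vendorA", "gpu_temp_c"): "accelerator.temperature"}
THRESHOLDS = {"port.rx_errors": 100, "nic.retransmits": 50,    # step 2 thresholds
              "accelerator.temperature": 85}
JOINT_RULES = {"port.rx_errors": {"peer_metric": "nic.retransmits",
                                  "label": "link degradation"}}
DEPENDENTS = {"switch-1": ["nic-1"], "nic-1": ["gpu-1"], "gpu-1": ["job-42"]}
SUBSCRIBERS = {"job-42": ["tenant-a@example.com"]}


def normalize(raw):
    """Step 1: map vendor-specific names onto the unified model; drop unmapped samples."""
    return [{"resource": r["resource"], "metric": NORMALIZE[(r["vendor"], r["name"])],
             "value": r["value"]}
            for r in raw if (r["vendor"], r["name"]) in NORMALIZE]


def detect(samples):
    """Step 2: keep samples whose value exceeds the configured threshold."""
    return [s for s in samples
            if s["value"] > THRESHOLDS.get(s["metric"], float("inf"))]


def joint_detect(anomalies):
    """Step 3: join anomalies on topologically adjacent resources when a rule matches."""
    keys = {(a["resource"], a["metric"]) for a in anomalies}
    findings, covered = [], set()
    for a in anomalies:                      # pass 1: rules that match and trigger
        rule = JOINT_RULES.get(a["metric"])
        if rule is None:
            continue
        for peer in DEPENDENTS.get(a["resource"], []):
            if (peer, rule["peer_metric"]) in keys:
                findings.append({"root": a["resource"], "label": rule["label"]})
                covered |= {(a["resource"], a["metric"]), (peer, rule["peer_metric"])}
                break
    for a in anomalies:                      # pass 2: output the remaining anomalies directly
        if (a["resource"], a["metric"]) not in covered:
            findings.append({"root": a["resource"], "label": a["metric"]})
    return findings


def impact_range(root):
    """Step 4: breadth-first walk over dependents to collect affected resources."""
    seen, queue = [], deque([root])
    while queue:
        for dep in DEPENDENTS.get(queue.popleft(), []):
            if dep not in seen:
                seen.append(dep)
                queue.append(dep)
    return seen


def notify(findings):
    """Step 5: build one notification per subscriber of each affected resource."""
    return [(who, f"{f['label']} at {f['root']} affects {res}")
            for f in findings
            for res in impact_range(f["root"])
            for who in SUBSCRIBERS.get(res, [])]


raw = [{"vendor": "vendorC", "name": "rx_err", "resource": "switch-1", "value": 250},
       {"vendor": "vendorD", "name": "retrans", "resource": "nic-1", "value": 90},
       {"vendor": "vendorA", "name": "gpu_temp_c", "resource": "gpu-1", "value": 70}]
notes = notify(joint_detect(detect(normalize(raw))))
print(notes)
```

In this sample the switch and network card anomalies are merged into a single finding rooted at the switch, and the tenant job reachable through the topology is the only subscribed resource in the impact range.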
Inventors
- AN PING
- SHEN YINGNAN
- JIA LEI
Assignees
- 北京直真科技股份有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260129
Claims (9)
- 1. A heterogeneous computing network resource anomaly monitoring and early warning method, characterized by comprising the following steps: Step 1, uniformly collecting and processing heterogeneous resource monitoring data based on a unified monitoring model and index processing rules, and outputting standardized monitoring index data; Step 2, performing monitoring index anomaly detection based on index monitoring thresholds, and outputting anomalous monitoring index object information; Step 3, looking up multi-index anomaly joint detection rules for the anomalous monitoring index object information, performing joint detection analysis on each single anomalous monitoring index in combination with the resource topology, and outputting a joint detection result, or directly outputting the anomalous index if no matching rule exists; Step 4, auditing the resource topology according to the joint detection result or the anomalous index, inferring the anomaly impact range, and outputting the inference result; and Step 5, issuing a unified anomaly early warning according to the inference result, which comprises generating an anomaly notification from subscription information, anomaly information, and the inference result, and sending the anomaly notification to the relevant subscribers according to an anomaly subscription rule.
- 2. The heterogeneous computing network resource anomaly monitoring and early warning method according to claim 1, wherein the anomaly subscription rule, the multi-index anomaly joint detection rules, the index monitoring thresholds, and the unified monitoring model and index processing rules are all provided by a monitoring model and business rule management module.
- 3. The heterogeneous computing network resource anomaly monitoring and early warning method according to claim 1, characterized in that the resource topology in step 3 and step 4 is constructed using a graph database in the computing network topology, which includes constructing a resource topology model with resource entities as nodes and association relationships as edges, and adding dynamic and static attributes to the nodes and the edges.
- 4. The heterogeneous computing network resource anomaly monitoring and early warning method according to claim 3, wherein maintenance of the resource topology model comprises classifying topology data into levels according to its change frequency and change impact, with different levels adopting different update schemes, so as to ensure the accuracy of the topology data and reduce the probability of topology anomalies caused by anomalous update data.
- 5. The heterogeneous computing network resource anomaly monitoring and early warning method according to claim 3, wherein the nodes and the edges are classified into a static level, a quasi-static level, and a dynamic level; the operation strategy of the static level is automatic addition, with deletion marking subject to manual audit; the operation strategy of the quasi-static level is automatic addition, with automatic deletion marking when the object is absent from n collections, n being a positive integer; the operation strategy of the dynamic level is real-time deletion marking according to collection results; and objects marked for deletion are automatically cleaned up on a schedule according to system policy.
- 6. The heterogeneous computing network resource anomaly monitoring and early warning method according to claim 3, wherein the resource topology model comprises a topology relation update flow with the following steps: Step A1, inputting topology data information; Step A2, parsing the information of the object to be updated; Step A3, judging whether the object is marked for deletion; if not, completing the model update and ending; if so, proceeding to step A4; Step A4, judging whether the object is at the static level; if so, marking it and submitting it for manual verification, then proceeding to step A7 if the verification passes and ending if it does not; if the object is not at the static level, proceeding to step A5; Step A5, judging whether the object is at the quasi-static level; if so, judging whether the detection count is greater than a set value, proceeding to step A7 if it is, and otherwise incrementing the detection count by 1 and ending; if the object is not at the quasi-static level, proceeding to step A6; Step A6, judging whether the object is at the dynamic level; if not, ending; if so, proceeding to step A7; and Step A7, deleting the marked object and ending.
- 7. The heterogeneous computing network resource anomaly monitoring and early warning method according to claim 1, wherein step 1 includes a data normalization flow with the following steps: Step B1, inputting a vendor's original monitoring index object after data parsing; Step B2, querying normalization rules based on the identification tuple of the vendor's original index object; Step B3, judging whether a rule is matched; if not, ending; if so, proceeding to step B4; Step B4, judging whether the rule's execution condition is met; if not, caching the rule, continuing to wait for input, and returning to step B1; if so, proceeding to step B5; and Step B5, normalizing according to the processing rule and ending.
- 8. The heterogeneous computing network resource anomaly monitoring and early warning method according to claim 1, wherein step 2 comprises a threshold-based anomaly detection flow with the following steps: Step C1, inputting monitoring index object information; Step C2, querying the corresponding static threshold and dynamic threshold information according to the index object type and the index attribute name; Step C3, judging whether the attribute value of the monitoring index exceeds a threshold; if not, ending; if so, proceeding to step C4; and Step C4, recording the detection result on the monitoring index object, outputting the monitoring index object, and ending.
- 9. The heterogeneous computing network resource anomaly monitoring and early warning method according to claim 1, wherein step 3 includes a threshold-based multi-index anomaly joint detection flow with the following steps: Step D1, inputting the detected anomalous index object information; Step D2, querying for a joint detection rule, namely a joint detection rule that is currently being executed, based on the input object information; Step D3, judging whether the joint detection rule and the anomalous index object are successfully matched; if not, ending; if so, proceeding to step D4; Step D4, waiting for data to arrive for the delay time configured in the detection rule; Step D5, after the waiting time has elapsed, querying the latest index values of the related indexes of the nodes based on the joint detection rule; Step D6, judging whether the anomaly joint detection rule is triggered; if not, ending; if so, proceeding to step D7; Step D7, outputting the anomaly; and Step D8, detecting whether the object is anomalous; if so, ending; if not, outputting a joint detection result and ending.
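The topology relation update flow of claim 6 (steps A1 to A7) can be sketched as below. This is a hedged illustration, not the patent's implementation: the `TopoObject` fields, the update dictionary layout, the return strings, and the synchronous `approve` callback that stands in for manual verification are all assumptions. The sketch also deletes immediately at step A7, whereas claim 5 describes scheduled cleanup of marked objects.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    STATIC = "static"
    QUASI_STATIC = "quasi-static"
    DYNAMIC = "dynamic"


@dataclass
class TopoObject:
    object_id: str
    level: Level
    missed: int = 0               # detection count used by the quasi-static branch
    pending_review: bool = False  # set when a static object awaits manual verification


def apply_update(topology, update, max_missed=3, approve=lambda obj: False):
    """Apply one topology update (A1) to `topology`, a dict of id -> TopoObject."""
    obj_id, level = update["id"], Level(update["level"])     # A2: parse the update
    if not update.get("marked_deleted", False):              # A3: plain update
        topology[obj_id] = TopoObject(obj_id, level)          # upsert resets the counter
        return "updated"
    obj = topology.get(obj_id)
    if obj is None:                                           # nothing to delete
        return "ignored"
    if obj.level is Level.STATIC:                             # A4: manual verification
        obj.pending_review = True
        if approve(obj):
            del topology[obj_id]                              # A7
            return "deleted"
        return "awaiting-review"
    if obj.level is Level.QUASI_STATIC:                       # A5: count misses first
        if obj.missed > max_missed:
            del topology[obj_id]                              # A7
            return "deleted"
        obj.missed += 1
        return "miss-counted"
    if obj.level is Level.DYNAMIC:                            # A6: delete right away
        del topology[obj_id]                                  # A7
        return "deleted"
    return "ignored"                                          # A6 "not dynamic" exit


topology = {}
apply_update(topology, {"id": "pod-1", "level": "dynamic"})
print(apply_update(topology, {"id": "pod-1", "level": "dynamic", "marked_deleted": True}))
```

Keeping the level on the stored object, rather than trusting the level carried by the deletion update, means a mislabelled update cannot bypass the manual verification required for static objects; that choice is part of the sketch, not something the claim specifies.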
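The data normalization flow of claim 7 (steps B1 to B5) can be sketched in the same spirit. The rule table, vendor and metric names, and output names are hypothetical. The sketch interprets the "rule execution condition" as "all raw inputs the rule needs have arrived" and caches the partial inputs next to the pending rule; the claim itself does not define the condition or the cache contents.

```python
# Hypothetical normalization rules keyed by (vendor, object type, raw metric name).
# Each rule lists the raw inputs it needs and how to combine them.
MEM_RULE = {"id": "mem_util", "needs": ("mem_used_mb", "mem_total_mb"),
            "out": "accelerator.memory_utilization",
            "fn": lambda v: v["mem_used_mb"] / v["mem_total_mb"]}
TEMP_RULE = {"id": "temp", "needs": ("temp_dc",),
             "out": "accelerator.temperature_c",
             "fn": lambda v: v["temp_dc"] / 10}   # deci-degrees to degrees
RULES = {("vendorA", "gpu", "mem_used_mb"): MEM_RULE,
         ("vendorA", "gpu", "mem_total_mb"): MEM_RULE,
         ("vendorB", "npu", "temp_dc"): TEMP_RULE}


class Normalizer:
    def __init__(self):
        # (rule id, resource) -> raw values collected so far for a pending rule
        self.cache = {}

    def feed(self, raw):
        """B1: accept one parsed vendor object; return a normalized object or None."""
        rule = RULES.get((raw["vendor"], raw["type"], raw["name"]))   # B2: query rules
        if rule is None:                                              # B3: no match, end
            return None
        key = (rule["id"], raw["resource"])
        pending = self.cache.setdefault(key, {})
        pending[raw["name"]] = raw["value"]
        if not all(name in pending for name in rule["needs"]):        # B4: wait for input
            return None
        values = self.cache.pop(key)
        return {"resource": raw["resource"], "metric": rule["out"],   # B5: normalize
                "value": rule["fn"](values)}


n = Normalizer()
base = {"vendor": "vendorA", "type": "gpu", "resource": "gpu-1"}
print(n.feed({**base, "name": "mem_used_mb", "value": 4096}))   # rule pending
print(n.feed({**base, "name": "mem_total_mb", "value": 8192}))  # rule can now execute
```

The first call returns nothing because only one of the two inputs is available; the second call completes the pair and yields the normalized utilization for `gpu-1`.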
Description
Heterogeneous computing network resource anomaly monitoring and early warning method
Technical Field
The invention belongs to the technical field of computing power and network resource monitoring, and particularly relates to a heterogeneous computing network resource anomaly monitoring and early warning method.
Background
In a large-scale computing power cluster, heterogeneous computing devices (e.g., GPUs and NPUs from different vendors and with different architectures; GPU: Graphics Processing Unit; NPU: Neural Processing Unit), high-speed network devices (e.g., switches and network cards supporting the RoCEv2 or InfiniBand protocols; RoCEv2: RDMA over Converged Ethernet version 2; RDMA: Remote Direct Memory Access), and storage devices are typically managed by separate monitoring systems. For example, computing power resources from different vendors are monitored, with early warning, through the GPU management tools that each vendor provides. Network device monitoring indexes are collected through interfaces such as SNMP, telemetry, and gNMI, covering both basic network device indexes and RoCE/InfiniBand-specific indexes (SNMP: Simple Network Management Protocol; gNMI: gRPC Network Management Interface). For storage devices, data such as IOPS, throughput, read/write latency, and disk health status are obtained through management interfaces provided by equipment vendors (such as SMI-S, CLI, and REST API) to implement anomaly monitoring (IOPS: Input/Output Operations Per Second; REST: Representational State Transfer; API: Application Programming Interface; CLI: Command-Line Interface; SMI-S: Storage Management Initiative Specification).
Each monitoring system is usually deployed independently: collected data are stored in the system's own database, and alarms are triggered by its own alarm rule engine. Operation and maintenance personnel have to switch among multiple monitoring platforms and manually correlate and analyze fault root causes. Some more advanced monitoring systems manage these data sources centrally, but only aggregate the individual service monitoring capabilities in a simple way. In the prior art, the monitoring functions of different services are split from one another and lack a unified service model and a flexible collection framework. They mostly focus on threshold alarms for single-device performance indexes, so causal or associative relations among cross-device and cross-index anomalies are hard to discover and root cause localization is difficult. They also lack the ability to infer the anomaly impact range, so the impact of an anomaly on tenant tasks cannot be viewed intuitively.
Disclosure of Invention
In view of the defects and shortcomings of the prior art, the invention provides a heterogeneous computing network resource anomaly monitoring and early warning method. By performing unified modeling, unified collection and processing, and unified monitoring of heterogeneous computing network resources, the method can avoid fragmentation across separate monitoring systems. By introducing a topology-based joint detection mechanism to discover causal or associative relations between cross-device and cross-index anomalies, it assists root cause localization and improves the efficiency of anomaly investigation. By inferring the anomaly impact range from the root cause and the topology, it further helps determine the scope of an anomaly's impact on tenant services.
The technical scheme of the invention is as follows: a heterogeneous computing network resource anomaly monitoring and early warning method, characterized by comprising the following steps: Step 1, uniformly collecting and processing heterogeneous resource monitoring data based on a unified monitoring model and index processing rules, and outputting standardized monitoring index data; Step 2, performing monitoring index anomaly detection based on index monitoring thresholds, and outputting anomalous monitoring index object information; Step 3, looking up multi-index anomaly joint detection rules for the anomalous monitoring index object information, performing joint detection analysis on each single anomalous monitoring index in combination with the resource topology, and outputting a joint detection result, or directly outputting the anomalous index if no matching rule exists; Step 4, auditing the resource topology according to the joint detection result or the abnormal inde