CN-122027440-A - Machine room fault node identification method and system based on network topology


Abstract

The invention relates to the technical field of network operation and maintenance and communication monitoring, and discloses a machine room fault node identification method and system based on network topology. The method comprises: acquiring real-time network topology data and link transmission indexes, and constructing a weighted directed graph and a network structure evolution sequence; performing time sequence projection on historical data extracted from the sequence to obtain a node state vector; calculating a deviation value and, if the deviation value exceeds its threshold, extracting deviation features and searching to obtain a potential abnormal region; obtaining a dominant index through clustering grouping and evaluating propagation influence to determine a priority sequence; locating an initial fault source and tracking congested links; mapping the service influence range through logical addresses; extracting link security risk features to calculate a correction coefficient, adjusting the abnormal threshold and evaluating a positioning difficulty level; and finally retrieving similar cases to generate a repair instruction and feeding back adjustments to the model parameters to obtain an optimized stable configuration. The method enables accurate identification of machine room faults and closed-loop optimization.
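
The deviation check summarized above could look roughly like the following minimal sketch. It assumes the Euclidean distance named in claim 1; the node names, vectors, and threshold value are hypothetical and chosen only for illustration.

```python
import math


def deviation_value(state, historical_mean):
    """Euclidean distance between a node state vector and its historical mean."""
    if len(state) != len(historical_mean):
        raise ValueError("vectors must have the same dimension")
    return math.dist(state, historical_mean)


def flag_deviating_nodes(states, historical_means, threshold):
    """Return {node: deviation} for nodes whose deviation exceeds the threshold."""
    flagged = {}
    for node, state in states.items():
        d = deviation_value(state, historical_means[node])
        if d > threshold:
            flagged[node] = d
    return flagged


# Hypothetical data: two switches with 2-dimensional state vectors.
states = {"sw-1": [0.2, 0.1], "sw-2": [3.0, 4.0]}
means = {"sw-1": [0.2, 0.1], "sw-2": [0.0, 0.0]}
flagged = flag_deviating_nodes(states, means, threshold=1.0)
```

Only the flagged nodes would then be passed on to deviation feature extraction and the local subgraph search.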

Inventors

ZHONG RUI
WEI ZIKAI
HUANG SHUMEI
WANG LIJUN
LI RUI
WANG JIAYUN
YU WEI

Assignees

Chengdu University of Information Technology (成都信息工程大学)

Dates

Publication Date: 2026-05-12
Application Date: 2026-04-13

Claims (10)

1. A machine room fault node identification method based on network topology, characterized by comprising the following steps: acquiring real-time network topology data and link transmission indexes, constructing a weighted directed graph by using the network topology data and the link transmission indexes, and extracting time sequence features from the weighted directed graph to obtain a network structure evolution sequence; extracting historical rule data from the network structure evolution sequence and performing time sequence space projection to obtain a node state vector; calculating the Euclidean distance between the node state vector and a preset historical mean value to obtain a deviation value, and, if the deviation value exceeds a preset deviation threshold, extracting a deviation feature from the node state vector and performing a local subgraph search according to the deviation feature to obtain a potential abnormal region; performing node clustering grouping on the potential abnormal region to obtain a grouping dominant index, and performing propagation influence evaluation according to the grouping dominant index to obtain a fault identification priority sequence; locating an initial fault source according to the fault identification priority sequence, performing path simulation tracking on the initial fault source to obtain a marked congestion link, performing service entity matching according to the marked congestion link to obtain a service bearing entity, and performing risk quantification evaluation on the service bearing entity to obtain a service continuous influence range; extracting link security risk features within the service continuous influence range, calculating a correction coefficient according to the link security risk features, correcting the abnormal threshold by using the correction coefficient to obtain a corrected abnormal threshold, and performing a quantitative evaluation of positioning ambiguity based on the corrected abnormal threshold to obtain a final positioning difficulty level; and making a comprehensive repair scheme decision according to the final positioning difficulty level to obtain a repair instruction, executing the repair instruction, and performing parameter feedback adjustment to obtain an optimized network stable configuration.
2. The machine room fault node identification method based on network topology according to claim 1, wherein constructing a weighted directed graph by using the network topology data and the link transmission indexes, and extracting time sequence features from the weighted directed graph to obtain a network structure evolution sequence, comprises: performing message parsing on the network topology data to obtain a physical connection relation, and constructing an adjacency matrix that reflects the physical connection relation; performing weighted fusion on the link transmission indexes to obtain comprehensive link weights, and assigning the comprehensive link weights to the adjacency matrix to obtain the weighted directed graph; performing graph embedding on the weighted directed graph by using the DeepWalk algorithm to obtain a graph feature vector; and obtaining a historical graph feature vector, calculating the Euclidean distance between the graph feature vector and the historical graph feature vector, and marking mutation points according to the Euclidean distance to obtain the network structure evolution sequence.
3. The machine room fault node identification method based on network topology according to claim 1, wherein extracting historical rule data from the network structure evolution sequence and performing time sequence space projection to obtain a node state vector comprises: performing sliding window segmentation on the network structure evolution sequence to obtain a continuous topology snapshot set; extracting the change values of adjacency relation weights between adjacent moments from the continuous topology snapshot set to obtain the historical rule data; and inputting the historical rule data into a pre-trained long short-term memory (LSTM) network model for time sequence embedding to obtain the node state vector.
4. The machine room fault node identification method based on network topology according to claim 1, wherein, if the deviation value exceeds a preset deviation threshold, extracting a deviation feature from the node state vector and performing a local subgraph search according to the deviation feature to obtain a potential abnormal region comprises: if the deviation value exceeds the preset deviation threshold, performing dimension decomposition on the node state vector by principal component analysis to obtain the deviation feature; mapping the deviation feature onto the weighted directed graph to obtain an abnormal pattern description; and matching the abnormal pattern description against the weighted directed graph under a preset search radius condition to obtain an abnormal substructure, and determining the coverage of the abnormal substructure as the potential abnormal region containing the node state vector.
5. The machine room fault node identification method based on network topology according to claim 1, wherein performing node clustering grouping on the potential abnormal region to obtain a grouping dominant index, and performing propagation influence evaluation according to the grouping dominant index to obtain a fault identification priority sequence, comprises: constructing a feature similarity matrix from the node state vectors in the potential abnormal region, and inputting the feature similarity matrix into a preset density clustering model for grouping to obtain abnormal feature cluster groups, each having a topological adjacent density; calculating the variance contribution rate of each index in each abnormal feature cluster group, and screening the indexes according to the variance contribution rate to obtain a grouping dominant index having a numerical strength; performing a weighted summation of the numerical strength of the grouping dominant index and the topological adjacent density of the abnormal feature cluster group to obtain a fault propagation influence value; and sorting the abnormal feature cluster groups in descending order of the fault propagation influence value to obtain a fault identification priority sequence containing the abnormal feature clusters.
6. The machine room fault node identification method based on network topology according to claim 1, wherein locating an initial fault source according to the fault identification priority sequence, performing path simulation tracking on the initial fault source to obtain a marked congestion link, performing service entity matching according to the marked congestion link to obtain a service bearing entity, and performing risk quantification evaluation on the service bearing entity to obtain a service continuous influence range, comprises: performing physical coordinate mapping on the abnormal feature cluster with the highest priority in the fault identification priority sequence to obtain physical coordinates, and determining the node corresponding to the physical coordinates as the initial fault source; performing dynamic path tracking on the initial fault source to obtain the marked congestion link; extracting a virtual local area network (VLAN) identifier from the marked congestion link, and searching a pre-established service configuration library for a service bearing entity that matches the VLAN identifier and includes a redundancy state; and performing fault probability aggregation on the redundancy state of the service bearing entity to obtain a cumulative outage probability, and determining the service continuous influence range if the cumulative outage probability exceeds a preset service risk threshold.
7. The machine room fault node identification method based on network topology according to claim 1, wherein extracting link security risk features within the service continuous influence range, calculating a correction coefficient according to the link security risk features, correcting the abnormal threshold by using the correction coefficient to obtain a corrected abnormal threshold, and performing a quantitative evaluation of positioning ambiguity based on the corrected abnormal threshold to obtain a final positioning difficulty level, comprises: acquiring the link security risk features within the service continuous influence range; calculating an abnormal fluctuation amplitude according to the link security risk features, and performing least squares curve fitting on the abnormal fluctuation amplitude to obtain a cumulative deviation distribution model; matching a corresponding protection coefficient according to the link encryption level in the link security risk features, and performing mapping according to the protection coefficient and the cumulative deviation distribution model to obtain a threshold correction coefficient; and proportionally adjusting a preset original node abnormal threshold according to the threshold correction coefficient to obtain a corrected node abnormal threshold, and performing ambiguity calculation according to the corrected node abnormal threshold to obtain the final positioning difficulty level.
8. The machine room fault node identification method based on network topology according to claim 1, wherein making a comprehensive repair scheme decision according to the final positioning difficulty level to obtain a repair instruction, executing the repair instruction, and performing parameter feedback adjustment to obtain an optimized network stable configuration, comprises: performing similar case matching in a pre-established historical database according to the final positioning difficulty level to obtain a similar abnormal pattern case and a corresponding repair time-consumption weight; performing quantitative prediction of execution cost according to the repair time-consumption weight to obtain an optimal operation logic, and converting the optimal operation logic into the repair instruction; executing the repair instruction, acquiring a real-time performance index, and calculating a feedback deviation value between the real-time performance index and a preset stop-loss target; and extracting node weight parameters from the weighted directed graph, and iteratively fine-tuning the node weight parameters according to the feedback deviation value to obtain the optimized network stable configuration.
9. The network topology-based machine room fault node identification method of claim 7, wherein the link security risk feature comprises a security link topology attribute and a link encryption level.
10. A machine room fault node identification system based on network topology, characterized by comprising: a topology evolution monitoring module, configured to acquire real-time network topology data and link transmission indexes, construct a weighted directed graph by using the network topology data and the link transmission indexes, and extract time sequence features from the weighted directed graph to obtain a network structure evolution sequence; a time sequence state quantization module, configured to extract historical rule data from the network structure evolution sequence and perform time sequence space projection to obtain a node state vector; an abnormal pattern defining module, configured to calculate the Euclidean distance between the node state vector and a preset historical mean value to obtain a deviation value, and, if the deviation value exceeds a preset deviation threshold, extract a deviation feature from the node state vector and perform a local subgraph search according to the deviation feature to obtain a potential abnormal region; a priority assessment module, configured to perform node clustering grouping on the potential abnormal region to obtain a grouping dominant index, and perform propagation influence evaluation according to the grouping dominant index to obtain a fault identification priority sequence; a service impact assessment module, configured to locate an initial fault source according to the fault identification priority sequence, perform path simulation tracking on the initial fault source to obtain a marked congestion link, perform service entity matching according to the marked congestion link to obtain a service bearing entity, and perform risk quantification evaluation on the service bearing entity to obtain a service continuous influence range; a threshold dynamic correction module, configured to extract link security risk features within the service continuous influence range, calculate a correction coefficient according to the link security risk features, correct the abnormal threshold by using the correction coefficient to obtain a corrected abnormal threshold, and perform a quantitative evaluation of positioning ambiguity based on the corrected abnormal threshold to obtain a final positioning difficulty level; and a closed-loop repair optimization module, configured to make a comprehensive repair scheme decision according to the final positioning difficulty level to obtain a repair instruction, execute the repair instruction, and perform parameter feedback adjustment to obtain an optimized network stable configuration.
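
As an illustration of the priority evaluation described in claim 5, the weighted summation and descending sort could be sketched as follows. The claims do not specify the weights, so the values 0.6 and 0.4, the `ClusterGroup` type, and the sample cluster names are hypothetical; the clustering and variance-contribution steps are assumed to have already produced the two inputs.

```python
from dataclasses import dataclass


@dataclass
class ClusterGroup:
    name: str
    dominant_strength: float   # numerical strength of the grouping dominant index
    adjacency_density: float   # topological adjacent density of the cluster group


def propagation_influence(group, w_strength=0.6, w_density=0.4):
    """Weighted sum of dominant-index strength and adjacency density.

    The weights are illustrative assumptions, not values from the patent.
    """
    return w_strength * group.dominant_strength + w_density * group.adjacency_density


def priority_sequence(groups, w_strength=0.6, w_density=0.4):
    """Sort cluster groups by fault propagation influence, highest first."""
    return sorted(
        groups,
        key=lambda g: propagation_influence(g, w_strength, w_density),
        reverse=True,
    )


# Hypothetical cluster groups produced by an earlier density clustering step.
groups = [
    ClusterGroup("rack-A", dominant_strength=0.9, adjacency_density=0.2),
    ClusterGroup("rack-B", dominant_strength=0.5, adjacency_density=0.9),
    ClusterGroup("rack-C", dominant_strength=0.1, adjacency_density=0.3),
]
order = [g.name for g in priority_sequence(groups)]
```

With these sample values the densely connected group ranks ahead of the group with the stronger dominant index, which shows how the choice of weights shapes the resulting fault identification priority sequence.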

Description

Machine room fault node identification method and system based on network topology

Technical Field

The invention relates to the technical field of network operation and maintenance and communication monitoring, in particular to a machine room fault node identification method and system based on network topology.

Background

In modern information technology, the network machine room serves as the core hub connecting devices and services, and its stable operation bears directly on the business continuity and data security of enterprises. As digital transformation deepens, machine room network structures grow increasingly complex and evolve dynamically, and the rapid, accurate identification of fault nodes has become an indispensable part of fault prediction and health management systems. Accurate fault location not only greatly shortens service interruption time, but also turns passive response into proactive defense against sudden problems by capturing subtle shifts in the network structure, thereby ensuring the efficient operation of large-scale data centers.

The prior art generally relies on preset rules or simple index monitoring: the operating state data of devices such as switches are collected in real time through a gateway or monitoring software and compared with a fixed threshold. When a specific index (such as traffic or packet loss rate) exceeds its preset limit, an alarm is triggered, and operation and maintenance personnel then manually check the physical state of the related links and devices one by one, guided by the alarm information, to pin down the fault source.
However, because the traditional method relies excessively on static rules, it cannot fully account for the dynamic adjustment of connection relations among network nodes or the knock-on effects of abnormal propagation. In particular, when the failure of a key device causes cascading performance degradation across multiple nodes, the deep abnormal patterns hidden in the structure evolution trajectory are difficult to capture effectively, which often leads to false or missed detections. In summary, the prior art suffers from low fault node identification accuracy and poor troubleshooting efficiency in dynamic environments.

Disclosure of Invention

The invention provides a machine room fault node identification method and system based on network topology, which are used for solving the problems of low fault node identification accuracy and poor troubleshooting efficiency in a dynamic environment.

In order to solve the above technical problems, the present invention provides a machine room fault node identification method based on network topology, including: acquiring real-time network topology data and link transmission indexes, constructing a weighted directed graph by using the network topology data and the link transmission indexes, and extracting time sequence features from the weighted directed graph to obtain a network structure evolution sequence; extracting historical rule data from the network structure evolution sequence and performing time sequence space projection to obtain a node state vector; calculating the Euclidean distance between the node state vector and a preset historical mean value to obtain a deviation value, and, if the deviation value exceeds a preset deviation threshold, extracting a deviation feature from the node state vector and performing a local subgraph search according to the deviation feature to obtain a potential abnormal region; performing node clustering grouping on the potential abnormal region to obtain a grouping dominant index, and performing propagation influence evaluation according to the grouping dominant index to obtain a fault identification priority sequence; locating an initial fault source according to the fault identification priority sequence, performing path simulation tracking on the initial fault source to obtain a marked congestion link, performing service entity matching according to the marked congestion link to obtain a service bearing entity, and performing risk quantification evaluation on the service bearing entity to obtain a service continuous influence range; extracting link security risk features within the service continuous influence range, calculating a correction coefficient according to the link security risk features, correcting the abnormal threshold by using the correction coefficient to obtain a corrected abnormal threshold, and performing a quantitative evaluation of positioning ambiguity based on the corrected abnormal threshold to obtain a final positioning difficulty level; and making a comprehensive repair scheme decision according to the final positioning difficulty level to obtain a rep