CN-121979716-A - End-to-end metadata closed-loop management method and system


Abstract

The application relates to the technical field of data management and provides an end-to-end metadata closed-loop management method and system that address the difficulty of autonomous operation and maintenance in the prior art. The method comprises: obtaining call chain topology data, time sequence data and temperature and humidity data; performing space-time alignment on the call chain topology data and the time sequence data to obtain a space-time tensor; extracting physical features from the temperature and humidity data to obtain a physical vector; inputting the space-time tensor and the physical vector into a space-time graph neural network for dynamic fusion to generate a root cause analysis report; performing time sequence causal inference with a dynamic Bayesian network based on the root cause analysis report to obtain a causal structure diagram; performing decision planning with Monte Carlo tree search based on the causal structure diagram to obtain candidate action sequences; calculating a stability risk coefficient, a resource consumption cost and a system recovery probability for each candidate action sequence; performing collaborative evaluation to obtain an optimal action sequence; and issuing the optimal action sequence to an execution engine. The application can realize end-to-end autonomous operation and maintenance and accurate intervention for microservice performance bottlenecks.

Inventors

Yang Liu
Pei Ning
Zhang Haoyu
Liao Chenguang
Si Bingnan
Wang Wei
Zhao Junqing

Assignees

中铁电气化局集团有限公司

Dates

Publication Date: 2026-05-05
Application Date: 2026-03-31

Claims (10)

1. An end-to-end metadata closed-loop management method, comprising: acquiring call chain topology data of microservices, time sequence data of each service instance, and cabinet-level temperature and humidity data of a data center; performing space-time alignment processing on the call chain topology data and the time sequence data to obtain a space-time tensor, and performing physical feature extraction on the temperature and humidity data to obtain a physical vector; inputting the space-time tensor and the physical vector into a space-time graph neural network for dynamic fusion to generate a root cause analysis report; performing, based on the root cause analysis report, time sequence causal inference by using a dynamic Bayesian network to generate a causal structure diagram; performing, based on the causal structure diagram, decision planning through a Monte Carlo tree search algorithm to generate a plurality of candidate action sequences, and calculating a corresponding stability risk coefficient, resource consumption cost and system recovery probability for each candidate action sequence; and performing a collaborative evaluation operation on each candidate action sequence by combining the stability risk coefficient, the resource consumption cost and the system recovery probability to obtain an optimal action sequence, and issuing the optimal action sequence to an execution engine to complete the management operation on the metadata.
2. The end-to-end metadata closed-loop management method according to claim 1, wherein the performing, based on the root cause analysis report, time sequence causal inference by using a dynamic Bayesian network to generate a causal structure diagram comprises: defining the service nodes, physical nodes and performance indexes in the root cause analysis report as first variable nodes, second variable nodes and third variable nodes of the dynamic Bayesian network, respectively; determining initial dependency relationships among the first variable nodes, the second variable nodes and the third variable nodes according to the contribution degree distribution vectors in the root cause analysis report, and constructing a network structure of the dynamic Bayesian network in an initial time slice according to the initial dependency relationships; dividing a time axis into a plurality of consecutive time slices, constructing a time slice network of the dynamic Bayesian network for each time slice by using the network structure as a template, and connecting the time slice networks in chronological order to form a complete topological structure of the dynamic Bayesian network; performing iterative message passing in the complete topological structure by using a belief propagation algorithm to update the conditional probability distribution of each of the first variable nodes, the second variable nodes and the third variable nodes, and calculating causal strength coefficients among the variable nodes according to the updated conditional probability distributions; and screening, from the complete topological structure, causal paths whose causal strength coefficients are greater than or equal to a preset threshold, and collecting all such causal paths to form the causal structure diagram.
3. The end-to-end metadata closed-loop management method according to claim 2, wherein the constructing a network structure of the dynamic Bayesian network in an initial time slice according to the initial dependency relationships comprises: calculating a mutual information value between every two variable nodes according to the contribution degree distribution vectors, and using the mutual information values as association metrics between the variable nodes; performing causal direction inference on the variable nodes by using the association metrics as prior weights and applying a causal discovery algorithm based on an additive noise model, to generate a set of edges with candidate directions; performing iterative optimization through a tabu search algorithm with the edge set as the search space, wherein each iteration starts from the current structure, performs a neighborhood operation of adding an edge, deleting an edge or reversing an edge, and records the recently performed neighborhood operations in a tabu list; and stopping the iteration when a preset number of iterations is reached, and using the structure obtained in the final iteration as the network structure of the dynamic Bayesian network in the initial time slice.
4. The end-to-end metadata closed-loop management method according to claim 1, wherein the inputting the space-time tensor and the physical vector into a space-time graph neural network for dynamic fusion to generate a root cause analysis report comprises: inputting the space-time tensor into the space-time graph neural network, capturing spatial dependencies among service nodes through a graph convolution layer of the space-time graph neural network to output spatial features, capturing temporal dependencies of each service node through a temporal convolution layer of the space-time graph neural network to output temporal features, and concatenating the spatial features and the temporal features to obtain a space-time feature vector of each service node; mapping the physical vector to the corresponding service nodes according to a preset mapping relationship between service instances and physical hosts to obtain a physical feature vector of each service node; using the space-time feature vector as a query and the physical feature vector as a key and a value, calculating a cross-modal attention weight of each service node through a multi-head attention layer of the space-time graph neural network, and weighting the physical feature vector according to the cross-modal attention weight to obtain a weighted physical feature of each service node; dynamically calculating a fusion coefficient from a learnable gating weight vector in a fusion layer of the space-time graph neural network, and performing a weighted summation of the space-time feature vector and the weighted physical feature according to the fusion coefficient to obtain a fused feature vector of each service node; and applying nonlinear processing with a softmax function to the fused feature vector through a fully connected layer of the space-time graph neural network to generate, for each service node, a contribution degree distribution vector over a plurality of preset root cause types, and collecting the contribution degree distribution vectors of all the preset root cause types to form the root cause analysis report.
5. The end-to-end metadata closed-loop management method according to claim 1, wherein the performing, based on the causal structure diagram, decision planning through a Monte Carlo tree search algorithm to generate a plurality of candidate action sequences comprises: using the system state of the current service cluster as a root node, and using the causal paths in the causal structure diagram as state transition constraints when expanding the root node, to construct a Monte Carlo search tree; determining, at each search node of the Monte Carlo search tree, a plurality of initial actions executable at that search node according to the corresponding time lag parameters and confidence intervals in the causal structure diagram; and iteratively executing an expansion operation from the root node until a preset number of iterations is reached, wherein each expansion operation comprises: simulating, by using an upper confidence bound algorithm and according to the causal strength coefficients in the causal structure diagram, the state transition after executing the initial action corresponding to the current node, and generating a new child node; when the expansion operation reaches a preset simulation depth, setting a random selection probability for each initial action from the current child node according to the causal strength coefficients, randomly selecting initial actions according to the random selection probabilities until a leaf node is reached to obtain an accumulated return value of the simulation, and back-propagating the accumulated return value to update the average return value of each node; and after the preset number of iterations is reached, selecting, from all the child nodes of the root node, a plurality of candidate action sequences whose average return values satisfy a preset return condition.
6. The end-to-end metadata closed-loop management method according to claim 1, wherein the performing a collaborative evaluation operation on each candidate action sequence by combining the stability risk coefficient, the resource consumption cost and the system recovery probability to obtain an optimal action sequence comprises: using the stability risk coefficient, the resource consumption cost and the system recovery probability as three evaluation dimensions of the candidate action sequences to construct an evaluation vector for each candidate action sequence; inputting the evaluation vectors into a pre-constructed multi-objective optimization model, wherein the multi-objective optimization model takes minimizing the stability risk coefficient, minimizing the resource consumption cost and maximizing the system recovery probability as its optimization objectives; modeling the solving process of the multi-objective optimization model as a multi-objective Markov decision process, and searching the state space of the multi-objective Markov decision process, through non-dominated sorting with a Pareto front method according to the optimization objectives, for a candidate solution set satisfying the Pareto optimality condition; and computing, for each candidate solution in the candidate solution set, a weighted sum according to preset decision preference weights to obtain a preference score, and selecting the candidate action sequence corresponding to the candidate solution with the highest preference score as the optimal action sequence.
7. The end-to-end metadata closed-loop management method according to claim 1, wherein the performing space-time alignment processing on the call chain topology data and the time sequence data to obtain a space-time tensor comprises: extracting the hierarchical depth and the calling relationships of each service instance in the service call chain from the call chain topology data, and generating a position encoding vector for each service instance according to the hierarchical depth and the calling relationships; constructing a third-order tensor by using the time sequence data of each service instance at each time point as a first dimension, the service instances as a second dimension, and the position encoding vectors as a third dimension; adjusting the numerical distribution of different service instances in the third-order tensor at the same time point through a preset spatial constraint term to obtain an adjusted third-order tensor; and performing Tucker decomposition on the adjusted third-order tensor, retaining a core tensor and three factor matrices, and alternately optimizing the core tensor and the three factor matrices with the objective of minimizing the reconstruction error to generate the space-time tensor.
8. An end-to-end metadata closed-loop management system, comprising: an acquisition module configured to acquire call chain topology data of microservices, time sequence data of each service instance, and cabinet-level temperature and humidity data of a data center; an extraction module configured to perform space-time alignment processing on the call chain topology data and the time sequence data to obtain a space-time tensor, and to perform physical feature extraction on the temperature and humidity data to obtain a physical vector; a fusion module configured to input the space-time tensor and the physical vector into a space-time graph neural network for dynamic fusion to generate a root cause analysis report; an inference module configured to perform, based on the root cause analysis report, time sequence causal inference by using a dynamic Bayesian network to generate a causal structure diagram; a calculation module configured to perform, based on the causal structure diagram, decision planning through a Monte Carlo tree search algorithm to generate a plurality of candidate action sequences, and to calculate a corresponding stability risk coefficient, resource consumption cost and system recovery probability for each candidate action sequence; and an evaluation module configured to perform a collaborative evaluation operation on each candidate action sequence by combining the stability risk coefficient, the resource consumption cost and the system recovery probability to obtain an optimal action sequence, and to issue the optimal action sequence to an execution engine to complete the management operation on the metadata.
9. An electronic device, comprising: a memory for storing a computer program; and a processor for implementing the steps of the end-to-end metadata closed-loop management method according to any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, the computer program being capable of implementing an end-to-end metadata closed-loop management method according to any one of claims 1 to 7 when executed by a processor.
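
As an illustrative aside on the collaborative evaluation recited in claim 6, the following Python sketch shows one possible way to filter candidate action sequences by Pareto dominance and then rank the survivors with preference weights. It is a simplification under stated assumptions, not the claimed implementation: it replaces the multi-objective Markov decision process with a static non-dominated filter, assumes all three evaluation values are already normalized to [0, 1], and every name in it (`dominates`, `pareto_front`, `preference_score`, `select_optimal`) is hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical evaluation vector: (stability risk, resource cost, recovery probability),
# each assumed to be normalized to the range [0, 1].
Evaluation = Tuple[float, float, float]


def dominates(a: Evaluation, b: Evaluation) -> bool:
    """Return True if a is no worse than b on every objective and better on one.

    Risk and cost are minimized; recovery probability is maximized.
    """
    no_worse = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
    strictly_better = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
    return no_worse and strictly_better


def pareto_front(evals: Sequence[Evaluation]) -> List[int]:
    """Return the indices of candidates that no other candidate dominates."""
    return [
        i
        for i, a in enumerate(evals)
        if not any(dominates(b, a) for j, b in enumerate(evals) if j != i)
    ]


def preference_score(e: Evaluation, weights: Evaluation) -> float:
    """Weighted sum in which low risk, low cost and high recovery score higher."""
    w_risk, w_cost, w_recovery = weights
    return w_risk * (1.0 - e[0]) + w_cost * (1.0 - e[1]) + w_recovery * e[2]


def select_optimal(evals: Sequence[Evaluation], weights: Evaluation) -> int:
    """Pick the Pareto-optimal candidate with the highest preference score."""
    front = pareto_front(evals)
    return max(front, key=lambda i: preference_score(evals[i], weights))
```

With four made-up candidates `[(0.2, 0.5, 0.9), (0.3, 0.6, 0.8), (0.1, 0.9, 0.7), (0.4, 0.2, 0.6)]`, the second is dominated by the first and drops out, and the preference weights then decide among the remaining three.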

Description

End-to-end metadata closed-loop management method and system

Technical Field

The present application relates to the field of data management technologies, and in particular to an end-to-end metadata closed-loop management method and system.

Background

Where a unified data foundation exposes data interface services externally, full-link performance monitoring and end-to-end autonomous operation and maintenance of the microservice system have important application prospects for guaranteeing service level agreements and improving system stability. At present, the field of microservice governance relies mainly on traditional observability techniques such as distributed tracing, metric monitoring and log collection, and performs anomaly detection and alerting through preset thresholds or simple association rules.

However, the existing methods have two shortcomings. On the one hand, traditional monitoring struggles to fuse multi-modal data such as call chain topology, dynamic load and the underlying physical environment, and therefore cannot accurately capture performance bottlenecks under complex dependency relationships. On the other hand, existing schemes stop at root cause alerts and lack autonomous decision-making and closed-loop execution of an optimal intervention strategy, which results in low fault recovery efficiency and an insufficient degree of automation.

Disclosure of Invention

The application aims to provide an end-to-end metadata closed-loop management method and system to address the difficulty of autonomous operation and maintenance in the prior art.
In order to solve the above technical problem, in a first aspect, the present application provides an end-to-end metadata closed-loop management method, including: acquiring call chain topology data of microservices, time sequence data of each service instance, and cabinet-level temperature and humidity data of a data center; performing space-time alignment processing on the call chain topology data and the time sequence data to obtain a space-time tensor, and performing physical feature extraction on the temperature and humidity data to obtain a physical vector; inputting the space-time tensor and the physical vector into a space-time graph neural network for dynamic fusion to generate a root cause analysis report; performing, based on the root cause analysis report, time sequence causal inference by using a dynamic Bayesian network to generate a causal structure diagram; performing, based on the causal structure diagram, decision planning through a Monte Carlo tree search algorithm to generate a plurality of candidate action sequences, and calculating a corresponding stability risk coefficient, resource consumption cost and system recovery probability for each candidate action sequence; and performing a collaborative evaluation operation on each candidate action sequence by combining the stability risk coefficient, the resource consumption cost and the system recovery probability to obtain an optimal action sequence, and issuing the optimal action sequence to an execution engine to complete the management operation on the metadata.
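
To make the decision planning step more concrete, the following Python sketch shows the standard UCB1 selection rule and the return back-propagation used in generic Monte Carlo tree search. It is offered only as a minimal sketch under stated assumptions: the `Node` class, the functions `ucb1`, `select_child` and `backpropagate`, the exploration constant `c`, and the action names in the usage note are all hypothetical, and the causal-path constraints and causal strength coefficients described in the application are not modeled here.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical search-tree node holding visit statistics for one action."""
    action: Optional[str] = None
    visits: int = 0
    total_return: float = 0.0
    children: List["Node"] = field(default_factory=list)

    @property
    def mean_return(self) -> float:
        # Average return observed through this node (0.0 before any visit).
        return self.total_return / self.visits if self.visits else 0.0


def ucb1(parent: Node, child: Node, c: float = math.sqrt(2)) -> float:
    """Upper confidence bound: mean return plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try unvisited actions first
    bonus = c * math.sqrt(math.log(max(parent.visits, 1)) / child.visits)
    return child.mean_return + bonus


def select_child(parent: Node) -> Node:
    """Choose the child with the highest UCB1 value."""
    return max(parent.children, key=lambda child: ucb1(parent, child))


def backpropagate(path: List[Node], reward: float) -> None:
    """Add one simulated return to every node on the path from root to leaf."""
    for node in path:
        node.visits += 1
        node.total_return += reward
```

For example, given a root visited 10 times with one child visited 8 times (mean return 0.5) and another visited 2 times (mean return 0.4), `select_child` picks the less-visited child because its exploration bonus is larger; a child with no visits would be picked before either.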
Optionally, the performing, based on the root cause analysis report, time sequence causal inference by using a dynamic Bayesian network to generate a causal structure diagram includes: defining the service nodes, physical nodes and performance indexes in the root cause analysis report as first variable nodes, second variable nodes and third variable nodes of the dynamic Bayesian network, respectively; determining initial dependency relationships among the first variable nodes, the second variable nodes and the third variable nodes according to the contribution degree distribution vectors in the root cause analysis report, and constructing a network structure of the dynamic Bayesian network in an initial time slice according to the initial dependency relationships; dividing a time axis into a plurality of consecutive time slices, constructing a time slice network of the dynamic Bayesian network for each time slice by using the network structure as a template, and connecting the time slice networks in chronological order to form a complete topological structure of the dynamic Bayesian network; performing iterative message passing in the complete topological structure by using a belief propagation algorithm to update the conditional probability distribution of each of the first variable nodes, the second variable nodes and the third variable nodes, and calculating causal strength coefficients among the variable nodes according to the updated conditional probability distributions; and screening causal paths that are greater than or equal to a preset threshold from the complete topological structure according to the causal strength