CN-121981231-A - Automatic data lineage tracking method and system based on an AI data middle platform
Abstract
The invention relates to the technical field of electric data processing, and in particular to a method and system for automatically tracking data lineage (data "blood relationships") based on an AI data middle platform. The method comprises acquiring metadata from the AI data middle platform and parsing task logs of a computing engine to construct an initial directed acyclic graph, where the initial directed acyclic graph comprises data nodes and explicit lineage edges connecting the data nodes, and calculating a flow entropy coupling index between any two data nodes. By constructing a flow entropy coupling index based on update-time correlation and a logical evolution divergence based on the similarity of key-entity distributions, and combining them with a graph contrastive learning model, the invention automatically identifies implicit dependencies between upstream and downstream nodes even when no explicit SQL syntax structure is available, and thereby achieves complete end-to-end, full-link tracking of data lineage in complex AI feature-engineering scenarios.
Inventors
- LI CAIQING
- Qin Chunjia
- HUANG YU
- ZENG MINHUA
Assignees
- 广东知一数据有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-03-27
Claims (10)
- 1. A method for automatically tracking data lineage based on an AI data middle platform, characterized by comprising: obtaining metadata from the AI data middle platform and parsing task logs of a computing engine to construct an initial directed acyclic graph, wherein the initial directed acyclic graph comprises data nodes and explicit lineage edges connecting the data nodes; calculating a flow entropy coupling index between any two data nodes, wherein the flow entropy coupling index is positively correlated with the sum, over a plurality of time slices in a set observation period, of the products of the two nodes' update strengths, and negatively correlated with the logarithm of the delay between the two nodes' update operations; calculating a logical evolution divergence between any two data nodes, wherein the logical evolution divergence characterizes the difference between the frequency distributions of the common key entities contained in data sampled from the two nodes; training a prediction model based on graph contrastive learning, taking node pairs whose flow entropy coupling index is greater than a first threshold and whose logical evolution divergence is less than a second threshold as positive samples and the remaining node pairs as negative samples; and predicting the unconnected node pairs in the initial directed acyclic graph with the trained prediction model, and determining, according to the prediction result, whether to add implicit lineage edges to the initial directed acyclic graph.
- 2. The method for automatically tracking data lineage based on an AI data middle platform according to claim 1, wherein the update strength is calculated by: obtaining the number of bytes updated at each data node in each time slice within a specified observation period; normalizing the updated byte counts to obtain an update strength in the range [0,1]; and setting the update strength to 0 if the node is not updated in the corresponding time slice.
- 3. The method for automatically tracking data lineage based on an AI data middle platform according to claim 1, wherein calculating the logical evolution divergence further comprises: automatically identifying ID fields in the data subset as key entities by using regular expressions and high-cardinality feature identification.
- 4. The method for automatically tracking data lineage based on an AI data middle platform according to claim 1, wherein training the prediction model comprises: generating a vector representation for each data node using a graph neural network encoder; constructing a contrastive loss function; and minimizing the loss function so that the vector representations of positive sample pairs move closer together in the embedding space and the vector representations of negative sample pairs move farther apart.
- 5. The method for automatically tracking data lineage based on an AI data middle platform according to claim 4, wherein constructing the contrastive loss function comprises: making the contrastive loss function inversely related to the lineage confidence score of the node pair and positively related to the sum of the exponentials of the cosine similarities between the node's vector and the vectors of all negative sample nodes.
- 6. The method for automatically tracking data lineage based on an AI data middle platform according to claim 5, wherein a lineage confidence score is calculated for each node pair, the lineage confidence score being positively correlated with the flow entropy coupling index and negatively correlated with the logical evolution divergence.
- 7. The method for automatically tracking data lineage based on an AI data middle platform according to claim 1, wherein determining, according to the prediction result, whether to add implicit lineage edges to the initial directed acyclic graph comprises: adding an implicit lineage edge to the initial directed acyclic graph in response to the prediction result being higher than a set threshold.
- 8. The method for automatically tracking data lineage based on an AI data middle platform according to claim 1, wherein constructing the initial directed acyclic graph comprises: subscribing to the metadata catalog of the AI data middle platform and defining data tables or their internal fields as data nodes; monitoring task logs of the computing engine layer, extracting job IDs and defining them as processing nodes; and parsing the SQL statements in the task logs with an SQL parsing engine and establishing explicit lineage edges between the corresponding data nodes according to the data flow.
- 9. The method for automatically tracking data lineage based on an AI data middle platform according to claim 1, further comprising: providing a visual interface and, in response to a user's query request for any data node, displaying a full-link lineage graph comprising the explicit lineage edges and the implicit lineage edges.
- 10. A system for automatically tracking data lineage based on an AI data middle platform, comprising a processor and a memory, wherein the memory stores a computer program, and the processor executes the computer program to implement the method for automatically tracking data lineage based on an AI data middle platform according to any one of claims 1 to 9.
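The relationships stated in claims 4 to 6 can be illustrated with a minimal Python sketch. The claims only fix the direction of each dependency, so the concrete formulas here are assumptions: `lineage_confidence` uses a hypothetical ratio form, and `contrastive_loss` is a confidence-weighted InfoNCE-style loss with a hypothetical `temperature` parameter. The graph neural network encoder is omitted; node embeddings are taken as given.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lineage_confidence(coupling, divergence):
    """Assumed form: rises with the coupling index, falls with the divergence."""
    return coupling / (1.0 + divergence)


def contrastive_loss(anchor, positive, negatives, confidence, temperature=0.5):
    """Assumed confidence-weighted InfoNCE-style loss for one anchor node.

    The loss shrinks as the confidence score grows and grows with the sum of
    exponentials of the cosine similarities to the negative sample nodes.
    """
    if confidence <= 0:
        raise ValueError("confidence must be positive")
    pos_term = confidence * math.exp(cosine(anchor, positive) / temperature)
    neg_term = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos_term / (pos_term + neg_term))


# Toy embeddings standing in for the encoder output (hypothetical values).
anchor, positive = [1.0, 0.0], [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

low = contrastive_loss(anchor, positive, negatives, lineage_confidence(0.5, 0.4))
high = contrastive_loss(anchor, positive, negatives, lineage_confidence(2.0, 0.1))
print(f"loss at low confidence:  {low:.4f}")
print(f"loss at high confidence: {high:.4f}")
```

In this sketch a pair with a higher confidence score yields a smaller loss, and each additional negative sample that is similar to the anchor increases the loss, which matches the monotonic relationships the claims describe.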
Description
Automatic data lineage tracking method and system based on an AI data middle platform
Technical Field
The invention relates to the technical field of electric data processing. More particularly, the invention relates to a method and system for automatically tracking data lineage based on an AI data middle platform.
Background
In building and operating today's enterprise-level AI data middle platforms, the data lineage graph is an important basic tool for guaranteeing data quality, tracing data sources, and evaluating the impact of data changes. As business complexity grows, the data middle platform aggregates massive volumes of multi-source heterogeneous data. Systems typically subscribe to the task logs of a computing engine (such as Apache Spark or Flink) and use regular expressions and an SQL parsing engine to statically analyze the standard SQL statements in those logs, constructing explicit dependencies between data table nodes and processing task nodes and displaying the data flow paths in a visual interface. However, in practical AI application development and data processing scenarios, techniques that rely solely on static SQL parsing often cannot meet the requirements of full-link tracking. This is because, in an AI data middle platform, the data processing links contain not only standard SQL queries but also a large number of complex feature engineering, cleaning and transformation, and model training steps, which are implemented as custom scripts (i.e., "black box operators") written in programming languages such as Python or Scala. In such scenarios, once data has been processed by a black box operator, its table structure and field names often change fundamentally, for example from business fields to feature vectors, and no parsable SQL syntax structure is available.
Existing parsing techniques cannot see through this black box logic to identify the dependencies between upstream and downstream nodes. As a result, the generated lineage graph contains a large number of isolated data islands whose associations cannot be identified, the lineage links break at key stages such as feature engineering, and complete end-to-end tracing cannot be achieved.
Disclosure of Invention
The invention provides a method and system for automatically tracking data lineage based on an AI data middle platform, and aims to solve the problem in the related art that existing parsing techniques cannot see through black box logic to identify the dependencies between upstream and downstream nodes, which leaves a large number of isolated data islands in the generated lineage graph, breaks the lineage links at key stages such as feature engineering, and prevents complete end-to-end tracing.
In a first aspect, the invention provides a method for automatically tracking data lineage based on an AI data middle platform, comprising: obtaining metadata from the AI data middle platform and parsing task logs of a computing engine to construct an initial directed acyclic graph, wherein the initial directed acyclic graph comprises data nodes and explicit lineage edges connecting the data nodes; calculating a flow entropy coupling index between any two data nodes, wherein the flow entropy coupling index is positively correlated with the sum, over a plurality of time slices in a set observation period, of the products of the two nodes' update strengths, and negatively correlated with the logarithm of the delay between the two nodes' update operations; calculating a logical evolution divergence between any two data nodes, wherein the logical evolution divergence is calculated, using a variant of the KL divergence, from the frequency differences of the common key entities contained in data sampled from the two nodes; training a prediction model based on graph contrastive learning, taking node pairs whose flow entropy coupling index is greater than a first threshold and whose logical evolution divergence is less than a second threshold as positive samples and the remaining node pairs as negative samples; and using the trained prediction model to predict the unconnected node pairs in the initial directed acyclic graph and determining, according to the prediction result, whether to add implicit lineage edges to the initial directed acyclic graph.
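The two indices of the first aspect, and the rule for selecting positive samples, can be sketched in Python as follows. The text specifies only correlations and "a variant of the KL divergence", so the closed forms here are assumptions: update strength is normalized by the peak byte count, the coupling index divides the co-update sum by `1 + ln(1 + delay)`, and the divergence is taken to be the Jensen-Shannon divergence over the shared key entities. All function names and threshold values are hypothetical.

```python
import math
from collections import Counter


def update_strengths(bytes_per_slice):
    """Scale updated-byte counts per time slice into [0, 1]; idle slices stay 0."""
    peak = max(bytes_per_slice, default=0)
    if peak == 0:
        return [0.0 for _ in bytes_per_slice]
    return [b / peak for b in bytes_per_slice]


def flow_entropy_coupling(strength_a, strength_b, delay_seconds):
    """Assumed form: co-update sum damped by the logarithm of the update delay."""
    co_update = sum(a * b for a, b in zip(strength_a, strength_b))
    return co_update / (1.0 + math.log1p(delay_seconds))


def logical_evolution_divergence(entities_a, entities_b):
    """Assumed KL variant: Jensen-Shannon divergence over common key entities."""
    count_a, count_b = Counter(entities_a), Counter(entities_b)
    common = set(count_a) & set(count_b)
    if not common:
        return math.log(2)  # upper bound of the JS divergence: nothing shared
    total_a = sum(count_a[k] for k in common)
    total_b = sum(count_b[k] for k in common)
    js = 0.0
    for key in common:
        p, q = count_a[key] / total_a, count_b[key] / total_b
        m = (p + q) / 2
        js += 0.5 * p * math.log(p / m) + 0.5 * q * math.log(q / m)
    return js


def is_positive_pair(coupling, divergence, first_threshold, second_threshold):
    """Positive sample: strong coupling AND low divergence."""
    return coupling > first_threshold and divergence < second_threshold


# Toy observation period with four time slices (hypothetical byte counts).
upstream = update_strengths([0, 400, 0, 800])
downstream = update_strengths([0, 200, 0, 400])
coupling = flow_entropy_coupling(upstream, downstream, delay_seconds=0)
divergence = logical_evolution_divergence(
    ["u1", "u1", "u2"], ["u1", "u1", "u2", "u9"]
)
print(coupling, divergence, is_positive_pair(coupling, divergence, 1.0, 0.1))
```

A longer delay between the two nodes' updates lowers the coupling index in this sketch, and samples whose shared key entities appear with the same relative frequencies give a divergence of zero, so such a pair would be labeled a positive sample for training.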
By fusing the physical-side flow entropy coupling index with the logical evolution divergence and combining them with graph contrastive learning, the method can accurately uncover the implicit dependencies between upstream and downstream data nodes even when no explicit syntax structure is available, thereby connecting isolated data islands and realizing automatic end-to-end, full-link lineage tracking under the complex AI data process