CN-121542311-B - Data weight quantization method and system based on operator-level data lineage analysis

Abstract

The application discloses a data weight quantization method and system based on operator-level data lineage analysis. The method parses the SQL statement to be analyzed for a target scenario into an abstract syntax tree; traverses the abstract syntax tree to extract multiple operators and construct the dependency relationships among them; constructs an operator directed acyclic graph from the operators and the dependency relationships; and, based on that graph, constructs an operator-level lineage storage model comprising operator nodes, table nodes and the association relationships between nodes. The attributes of each association relationship include an edge weight value, obtained as the weighted sum of a data flow weight, an operator complexity weight, a dependency relationship weight and a path depth weight. The edge weight value is then normalized to obtain a normalized edge weight value, thereby realizing data weight quantization. The application can provide fine-grained, quantifiable lineage analysis support for locating key operators in data quality tracing and query optimization during data processing.

Inventors

  • REN XIAOLI
  • WANG YAZHEN
  • LI XIAOYONG
  • SHAO CHENGCHENG
  • ZHU XIANG
  • REN KAIJUN
  • DENG KEFENG
  • LIN JIARUN
  • TAN JIAMING
  • CHEN XINYU

Assignees

  • National University of Defense Technology, Chinese People's Liberation Army (中国人民解放军国防科技大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-22

Claims (9)

  1. A data weight quantization method based on operator-level data lineage analysis, the method comprising: parsing the SQL statement to be analyzed for a target scenario into an abstract syntax tree; traversing the abstract syntax tree, extracting a plurality of operators, and constructing the dependency relationships among the plurality of operators; constructing an operator directed acyclic graph based on the plurality of operators and the dependency relationships; constructing, based on the operator directed acyclic graph, an operator-level lineage storage model comprising operator nodes, table nodes and association relationships between nodes, wherein the attributes of each association relationship comprise an edge weight value obtained by weighted summation of a data flow weight, an operator complexity weight, a dependency relationship weight and a path depth weight, which comprises: calculating the data flow weight from the obtained number of operator output data rows and number of operator input data rows; presetting a complexity score for each operator and taking the complexity score as the operator complexity weight; calculating an importance score for each operator in the operator directed acyclic graph using PageRank, and calculating the dependency relationship weight based on the importance score; calculating the path depth weight from the number of upstream tables and the number of downstream tables corresponding to each operator, wherein the number of upstream tables is the number of tables from the input table to the operator, and the number of downstream tables is the number of tables from the operator to the output table; and performing weighted summation of the data flow weight, the operator complexity weight, the dependency relationship weight and the path depth weight to obtain the edge weight value; and normalizing the edge weight value to obtain a normalized edge weight value, thereby realizing data weight quantization.
  2. The data weight quantization method based on operator-level data lineage analysis according to claim 1, wherein constructing the operator directed acyclic graph based on the plurality of operators and the dependency relationships comprises: initializing an empty operator list and an empty dependency graph, wherein the dependency graph takes the form of an adjacency list whose keys are operator unique IDs and whose values are lists of the predecessor operator IDs of each operator; adding each operator to the operator list, and adding the unique ID of every operator on which the current operator depends to the predecessor operator ID list in the dependency graph, to construct an initial operator directed acyclic graph; and performing cycle detection on the initial operator directed acyclic graph, wherein, if a cycle is detected, the dependency relationship between the pair of operators that closes the cycle is corrected to obtain a corrected dependency relationship, and the operator unique ID corresponding to the corrected dependency relationship is added to the predecessor operator ID list, the operator directed acyclic graph being obtained when the initial operator directed acyclic graph contains no cycle.
  3. The data weight quantization method based on operator-level data lineage analysis according to claim 1, wherein constructing, based on the operator directed acyclic graph, the operator-level lineage storage model comprising operator nodes, table nodes and association relationships between nodes comprises: constructing operator nodes and table nodes based on the operator directed acyclic graph, wherein the attributes of an operator node comprise an operator unique identifier, an operator type enumeration value, the SQL fragment of the operator, an input data source list, an output field mapping, a complexity score and an execution order, and the attributes of a table node comprise a table unique identifier, the full name of the table and a table field list; constructing the association relationships between nodes based on the operator directed acyclic graph, wherein the association relationships comprise a dependency relationship between operator nodes, a data flow relationship between table nodes, and a data flow relationship between operator nodes; establishing single-attribute indexes on the operator type enumeration value, the complexity score and the execution order of the operator nodes; and constructing the operator-level lineage storage model based on the operator nodes, the table nodes, the association relationships between nodes and the single-attribute indexes.
  4. The data weight quantization method based on operator-level data lineage analysis according to claim 1, wherein calculating the data flow weight from the obtained number of operator output data rows and number of operator input data rows comprises: for a data reduction operator, calculating the data flow weight corresponding to the data reduction operator from the obtained number of output data rows and number of input data rows of the data reduction operator; and for a data reconfiguration operator, calculating the data flow weight corresponding to the data reconfiguration operator from the obtained number of left input data rows, number of right input data rows and number of output data rows of the data reduction operator.
  5. The data weight quantization method based on operator-level data lineage analysis according to claim 1, wherein calculating the path depth weight from the number of upstream tables and the number of downstream tables corresponding to each operator comprises: obtaining the number of upstream tables and the number of downstream tables connected to each operator; adding the number of upstream tables and the number of downstream tables to obtain a sum; and multiplying the sum by a preset coefficient to obtain the path depth weight.
  6. The data weight quantization method based on operator-level data lineage analysis according to claim 1, wherein performing weighted summation of the data flow weight, the operator complexity weight, the dependency relationship weight and the path depth weight to obtain the edge weight value comprises: obtaining the weight coefficients corresponding to the data flow weight, the operator complexity weight, the dependency relationship weight and the path depth weight; assembling the data flow weight, the operator complexity weight, the dependency relationship weight and the path depth weight into a feature vector; inputting the feature vector into a trained logistic regression model to obtain target weight coefficients; and performing weighted summation of the data flow weight, the operator complexity weight, the dependency relationship weight and the path depth weight based on the target weight coefficients to obtain the edge weight value.
  7. A data weight quantization system based on operator-level data lineage analysis, the system comprising: a data parsing unit, configured to parse the SQL statement to be analyzed for a target scenario into an abstract syntax tree; an operator extraction unit, configured to traverse the abstract syntax tree, extract a plurality of operators, and construct the dependency relationships among the plurality of operators; a first construction unit, configured to construct an operator directed acyclic graph based on the plurality of operators and the dependency relationships; a second construction unit, configured to construct, based on the operator directed acyclic graph, an operator-level lineage storage model comprising operator nodes, table nodes and association relationships between nodes, wherein the attributes of each association relationship comprise an edge weight value obtained by weighted summation of a data flow weight, an operator complexity weight, a dependency relationship weight and a path depth weight, which comprises: calculating the data flow weight from the obtained number of operator output data rows and number of operator input data rows; presetting a complexity score for each operator and taking the complexity score as the operator complexity weight; calculating an importance score for each operator in the operator directed acyclic graph using PageRank, and calculating the dependency relationship weight based on the importance score; calculating the path depth weight from the number of upstream tables and the number of downstream tables corresponding to each operator, wherein the number of upstream tables is the number of tables from the input table to the operator, and the number of downstream tables is the number of tables from the operator to the output table; and performing weighted summation of the data flow weight, the operator complexity weight, the dependency relationship weight and the path depth weight to obtain the edge weight value; and a weight quantization unit, configured to normalize the edge weight value to obtain a normalized edge weight value, thereby realizing data weight quantization.
  8. An electronic device comprising at least one control processor and a memory communicatively coupled to the at least one control processor, the memory storing instructions executable by the at least one control processor to enable the at least one control processor to perform the data weight quantization method based on operator-level data lineage analysis of any one of claims 1 to 6.
  9. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the data weight quantization method based on operator-level data lineage analysis of any one of claims 1 to 6.
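
The graph construction described in claim 2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names and operator IDs are hypothetical, and the "correction" applied when a cycle is found (rejecting the offending dependency and recording it) is an assumption, since the claim does not say how the dependency is corrected.

```python
def creates_cycle(preds, op_id, pred_id):
    """Return True if adding the edge pred_id -> op_id would close a cycle.

    That happens exactly when op_id is already an ancestor of pred_id,
    or when both IDs name the same operator.
    """
    stack, seen = [pred_id], set()
    while stack:
        node = stack.pop()
        if node == op_id:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(preds.get(node, []))
    return False


def build_operator_dag(operators, dependencies):
    """Build the operator list and the predecessor adjacency list.

    operators:    iterable of operator unique IDs
    dependencies: iterable of (predecessor_id, operator_id) pairs
    Returns (operator_list, preds, rejected), where preds maps each
    operator ID to the list of its predecessor operator IDs.
    """
    operator_list = []
    preds = {}
    for op_id in operators:
        operator_list.append(op_id)
        preds[op_id] = []

    rejected = []
    for pred_id, op_id in dependencies:
        if creates_cycle(preds, op_id, pred_id):
            # Assumed correction policy: drop the edge and keep a record.
            rejected.append((pred_id, op_id))
            continue
        preds[op_id].append(pred_id)
    return operator_list, preds, rejected


# Hypothetical operators extracted from one SQL statement.
ops = ["filter_1", "join_1", "agg_1", "project_1"]
deps = [
    ("filter_1", "join_1"),
    ("join_1", "agg_1"),
    ("agg_1", "project_1"),
    ("project_1", "filter_1"),  # would close a cycle, so it is rejected
]
operator_list, preds, rejected = build_operator_dag(ops, deps)
```

Checking each edge before it is added keeps the graph acyclic at every step, so no separate repair pass over a finished graph is needed. A real system might instead redirect or reorder the offending dependency.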

Description

Data weight quantization method and system based on operator-level data lineage analysis

Technical Field

The application relates to the technical field of data management, and in particular to a data weight quantization method and system based on operator-level data lineage analysis.

Background

Data lineage is the set of relationships that data forms over its whole life cycle, from generation through processing to circulation; it records the source, transformation and dependencies of data. Traditional data lineage analysis falls mainly into two types, table-level lineage and field-level lineage, and constructs the lineage relationships among data by parsing SQL statements, ETL scripts or data processing jobs.

However, the prior-art methods have significant limitations. First, in terms of analysis granularity, they cannot reveal the actual transformation logic applied to data inside an SQL statement. Although database query optimizers generate internal operator-based execution plans (with operators such as PROJECT, JOIN, FILTER and AGGREGATE) when executing SQL, these plans are transient and serve only query execution; their information is neither persisted nor modeled, so data governance personnel cannot access or use this fine-grained process information. Second, in terms of relationship characterization, most existing lineage is a qualitative description and lacks a quantitative evaluation mechanism. Faced with a complex data pipeline, it is therefore impossible to determine objectively which stage of processing has the greatest downstream impact, or which operator is the critical propagation node for a data quality problem.

Finally, in terms of quantization attempts, existing methods generally rely on a single indicator (such as the number of data rows) and fail to consider comprehensively multidimensional factors such as dynamic changes in data flow, the computational complexity of the operators themselves, the centrality of operators in the overall data flow topology, and the complexity of processing paths.

Therefore, there is an urgent need for a technique that can transform the operator-level execution logic inside a database into a persistent, queryable and quantifiable data governance model, so as to solve the technical problems of coarse-grained lineage analysis and the lack of effective quantitative evaluation.

Disclosure of Invention

The application aims to provide a data weight quantization method and system based on operator-level data lineage analysis, which can provide fine-grained, quantifiable lineage analysis support for locating key operators in data quality tracing and query optimization during data processing.
In a first aspect, an embodiment of the present application provides a data weight quantization method based on operator-level data lineage analysis, the method comprising: parsing the SQL statement to be analyzed for a target scenario into an abstract syntax tree; traversing the abstract syntax tree, extracting a plurality of operators, and constructing the dependency relationships among the plurality of operators; constructing an operator directed acyclic graph based on the plurality of operators and the dependency relationships; constructing, based on the operator directed acyclic graph, an operator-level lineage storage model comprising operator nodes, table nodes and association relationships between nodes, wherein the attributes of each association relationship comprise an edge weight value obtained by weighted summation of a data flow weight, an operator complexity weight, a dependency relationship weight and a path depth weight; and normalizing the edge weight value to obtain a normalized edge weight value, thereby realizing data weight quantization.
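
To make the weighted summation and normalization concrete, the following Python sketch computes one raw weight per operator and rescales the results. Several parts are assumptions rather than the application's method: the data flow weight is taken as the output/input row ratio, normalization is min-max, the complexity scores and coefficients are placeholder values (claim 6 obtains coefficients from a trained logistic regression model instead), and all function and operator names are hypothetical. The path depth weight follows claim 5, and the dependency weight uses a plain power-iteration PageRank.

```python
def data_flow_weight(input_rows, output_rows):
    # Assumed form: fraction of input rows that survive the operator.
    return output_rows / input_rows if input_rows else 0.0


def path_depth_weight(upstream_tables, downstream_tables, coefficient=0.1):
    # Claim 5: (upstream + downstream) * preset coefficient (0.1 is a placeholder).
    return (upstream_tables + downstream_tables) * coefficient


def pagerank(preds, damping=0.85, iterations=50):
    """Power-iteration PageRank over a predecessor adjacency list."""
    nodes = list(preds)
    n = len(nodes)
    out_degree = {v: 0 for v in nodes}
    for v in nodes:
        for p in preds[v]:
            out_degree[p] += 1
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by operators without successors is spread evenly.
        dangling = sum(rank[v] for v in nodes if out_degree[v] == 0)
        rank = {
            v: (1 - damping) / n
            + damping * (dangling / n
                         + sum(rank[p] / out_degree[p] for p in preds[v]))
            for v in nodes
        }
    return rank


def raw_weights(stats, preds, complexity_scores, coeffs):
    """Weighted sum of the four component weights for every operator."""
    ranks = pagerank(preds)
    weights = {}
    for op_id, s in stats.items():
        parts = {
            "flow": data_flow_weight(s["input_rows"], s["output_rows"]),
            "complexity": complexity_scores[s["type"]],
            "dependency": ranks[op_id],
            "depth": path_depth_weight(s["upstream"], s["downstream"]),
        }
        weights[op_id] = sum(coeffs[name] * value for name, value in parts.items())
    return weights


def min_max_normalize(weights):
    # Assumed normalization: rescale to [0, 1]; equal inputs all map to 1.0.
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        return {k: 1.0 for k in weights}
    return {k: (v - lo) / (hi - lo) for k, v in weights.items()}


# Hypothetical operators, statistics, scores and coefficients.
preds = {"filter_1": [], "join_1": ["filter_1"], "agg_1": ["join_1"]}
stats = {
    "filter_1": {"type": "FILTER", "input_rows": 1000, "output_rows": 250,
                 "upstream": 1, "downstream": 2},
    "join_1": {"type": "JOIN", "input_rows": 500, "output_rows": 400,
               "upstream": 2, "downstream": 1},
    "agg_1": {"type": "AGGREGATE", "input_rows": 400, "output_rows": 20,
              "upstream": 2, "downstream": 1},
}
complexity_scores = {"FILTER": 0.2, "JOIN": 0.8, "AGGREGATE": 0.6}
coeffs = {"flow": 0.3, "complexity": 0.3, "dependency": 0.2, "depth": 0.2}
normalized = min_max_normalize(raw_weights(stats, preds, complexity_scores, coeffs))
```

Here each operator gets a single raw weight; how that value is attached to the individual association relationships of the storage model is left open by this passage, so the sketch stops at the per-operator level.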
Compared with the prior art, the first aspect of the application has the following beneficial effects: The method parses the SQL statement to be analyzed for a target scenario into an abstract syntax tree; traverses the abstract syntax tree, extracts a plurality of operators and constructs the dependency relationships among them; constructs an operator directed acyclic graph based on the operators and the dependency relationships; constructs, based on the operator directed acyclic graph, an operator-level lineage storage model comprising operator nodes, table nodes and association relationships between nodes, in which the attributes of each association relationship include an edge weight value obtained by weighted summation of a data flow weight, an operator complexity weight, a dependency relationship weight and a path depth weight; and normalizes the edge weight value to obtain a normalized edge weight value, thereby realizing data weight quantization. Therefore, by constructing the dependency relationships among operators, constructing the operator directed acyclic graph and constructing the operator-level lineage storage model,