CN-121092638-B - Method and system for automatically finding association relation between data tables
Abstract
The invention discloses a method and a system for automatically discovering association relations among data tables. The method comprises: collecting, from a plurality of target source databases, fields that may serve as association conditions; extracting the unique values of each field as its feature values and caching them in an in-memory database using a key-value structure, in which the key is a feature value and the value is the full qualifier, recorded as a character string, of the field from which the feature value originates, values for the same key being stored by appending; cross-comparing the feature value sets of the fields and identifying equality or inclusion relations among the fields through set operations; introducing mathematical group theory to automatically classify the equality or inclusion relations among the fields into inclusion groups or equivalence groups, the fields within a group no longer being compared with one another; and automatically mapping the classified data set relations into association relations among the data tables to construct a structured data table relation library. The invention enables automatic discovery of associations between data tables in a multi-source heterogeneous environment, improves efficiency, reduces reliance on manual work, and supports data integration and analysis.
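The caching step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: a plain dictionary stands in for the in-memory database, and the function names (build_cache, candidate_pairs) and the fully qualified field names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cache(field_values):
    """Map each feature value (key) to the full qualifiers of its source fields.

    A dict stands in for the in-memory database (an assumption of this sketch).
    """
    cache = defaultdict(list)
    for qualifier, values in field_values.items():
        for value in set(values):            # keep unique values only
            cache[value].append(qualifier)   # append to the same key, never overwrite
    return cache


def candidate_pairs(cache):
    """Fields that share at least one feature value become fields to be compared."""
    pairs = set()
    for sources in cache.values():
        for left, right in combinations(sorted(sources), 2):
            pairs.add((left, right))
    return pairs


# Hypothetical fields from two source databases.
fields = {
    "crm.public.customer.id": [1, 2, 3, 3],
    "erp.dbo.orders.customer_id": [2, 3],
    "erp.dbo.product.sku": ["A1", "B2"],
}
cache = build_cache(fields)
pairs = candidate_pairs(cache)
```

Because every key lists all fields that contain that value, the pairs worth comparing fall out of a single pass over the cache, and fields with no shared value (here the sku field) are never compared.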
Inventors
- LIU MUDONG
- YANG XINYU
Assignees
- 河北雄安众信通慧科技有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20250815
Claims (10)
- 1. A method for automatically discovering association relationships between data tables, comprising: for a plurality of target source databases, respectively collecting fields that may serve as data table association conditions; extracting the unique values of each field as the feature values of the field and caching them in an in-memory database, wherein the storage format is a key-value structure in which the key is a feature value of the field and the value is the full qualifier of the source of the feature value, the full qualifier being recorded as a character string, and wherein, when a new feature value is subsequently cached, the value corresponding to an identical existing key is added to the key-value structure by appending; cross-comparing the feature value sets of the fields and identifying equality or inclusion relations among the fields through set operations, comprising: determining two fields that share at least one identical feature value as fields to be compared, computing the intersection of the feature value sets of the two fields, computing for each field the difference set between its feature value set and the intersection, and judging the association relation of the two fields according to the difference sets; introducing mathematical group theory to automatically group the equality or inclusion relations among fields into inclusion groups or equivalence groups, wherein the fields within a group are no longer compared with one another; and automatically mapping the grouped data set relationships into association relationships among the data tables to construct a structured data table relationship library.
- 2. The method of claim 1, wherein extracting the unique values of the field as the feature values of the field and caching them in the in-memory database comprises: when the number of unique values extracted from a field exceeds a preset number threshold, randomly sampling the unique values of the field to form the feature values, and caching the feature values in the in-memory database.
- 3. The method of claim 2, wherein judging the association relation of the two fields according to the difference sets comprises: if the difference sets of both fields are empty, judging that an equality relation holds between the two fields; if the difference set of only one field is empty, judging that an inclusion relation holds between the two fields, the field whose difference set is empty being the subset field; and if the difference sets of both fields are non-empty, further judging the association relation of the two fields in combination with a pass rate threshold.
- 4. The method of claim 3, wherein, in the case that neither of the feature value sets of the two fields is formed by sampling, if the difference sets of both fields are non-empty, further judging the association relation of the two fields in combination with the pass rate threshold comprises: calculating the intersection pass rate of each of the two fields according to intersection pass rate = number of feature values in the intersection / number of feature values of the field itself, and judging the association relation of the two fields according to the intersection pass rates: if the intersection pass rates of both fields are > the pass rate threshold, judging that an equality relation holds between the two fields and that the respective difference parts of the two fields have data quality problems; if the intersection pass rate of only one field is > the pass rate threshold, judging that an inclusion relation holds between the two fields, the field with the higher intersection pass rate being the subset field, and that the difference part of the subset field has a data quality problem; if the intersection pass rates of both fields are <= the pass rate threshold, judging that the two fields are unrelated.
- 5. The method of claim 3, wherein, in the case that one of the feature value sets of the two fields is formed by sampling and the other is not, if the difference sets of both fields are non-empty, further judging the association relation of the two fields in combination with the pass rate threshold comprises: assuming that the feature value set of the first field of the two fields is formed by sampling and the feature value set of the second field is not formed by sampling; randomly extracting a certain number of feature values from the difference set of the first field according to the sampling proportion, verifying them in the source database where the second field is located, and calculating the difference set verification pass rate of the first field; for the second field, calculating its intersection pass rate according to intersection pass rate = number of feature values in the intersection / number of feature values of the field itself; and judging the association relation of the two fields according to the difference set verification pass rate of the first field and the intersection pass rate of the second field: if both pass rates are > the pass rate threshold, judging that an equality relation holds between the two fields; if only one pass rate is > the pass rate threshold, judging that an inclusion relation holds between the two fields, the field with the higher pass rate being the subset field, and that the part of the subset field that fails difference set verification has a data quality problem; if both pass rates are <= the pass rate threshold, judging that the two fields are unrelated.
- 6. The method of claim 3, wherein, in the case that the feature value sets of both fields are formed by sampling, if the difference sets of both fields are non-empty, further judging the association relation of the two fields in combination with the pass rate threshold comprises: randomly extracting a certain number of feature values from the difference set of each of the two fields according to the sampling proportion, verifying them in the source database where the compared field is located, calculating the difference set verification pass rate of each of the two fields, and judging the association relation of the two fields according to the difference set verification pass rates: if the difference set verification pass rates of both fields are > the pass rate threshold, judging that an equality relation holds between the two fields and that the parts of the two fields that fail difference set verification have data quality problems; if the difference set verification pass rate of only one field is > the pass rate threshold, judging that an inclusion relation holds between the two fields, the field with the higher pass rate being the subset field, and that the part of the subset field that fails difference set verification has a data quality problem; if the difference set verification pass rates of both fields are <= the pass rate threshold, judging that the two fields are unrelated.
- 7. The method of any of claims 1-6, wherein introducing mathematical group theory to automatically group the equality or inclusion relations among fields into inclusion groups or equivalence groups, the fields within a group no longer being compared with one another, comprises: for fields A, B, C, and D, performing the following grouping process by introducing mathematical group theory: if A ⊇ B, then B is a subset of A and is grouped into the inclusion group of A; if A ⊇ B and B ⊇ A are both satisfied, judging that A = B and grouping A and B into the same equivalence group; if B = A, C = A, and D = A are known, deducing B = C, B = D, and C = D by transitivity of the equality relation, and grouping B, C, and D into the same equivalence group as A, wherein the fields in the equivalence group are mutually equal and are no longer compared with one another; if A ⊇ B, A ⊇ C, and A ⊇ D, grouping B, C, and D into the inclusion group of A, wherein the fields in the inclusion group of A are all subsets of A and are no longer compared with one another; if A ⊇ B but A ⊉ C and A ⊉ D, then C and D need not be compared with B, only the relation between C and D is compared, and the comparison with A is not repeated.
- 8. The method of claim 1, wherein respectively collecting, for a plurality of target source databases, fields that may serve as data table association conditions comprises: filtering out data types that are not suitable for use as data table association conditions; determining, as fields that may serve as data table association conditions, primary keys, unique values, data combination items with record uniqueness, and fields whose data type satisfies the conditions and whose number of non-empty records, as a percentage of the total number of records, is greater than a set ratio; and, when the source database is an oversized database, reducing the amount of collected data by defining field sampling conditions.
- 9. A system for automatically discovering association relationships between data tables, comprising: a sub-source acquisition module, configured to respectively collect, for a plurality of target source databases, fields that may serve as data table association conditions; a feature value caching module, configured to extract the unique values of each field as the feature values of the field and cache them in an in-memory database, wherein the storage format is a key-value structure in which the key is a feature value of the field and the value is the full qualifier of the source of the feature value, the full qualifier being recorded as a character string whose format comprises, in order, the database access, the object, the data table, and the field, and wherein, when a new feature value is subsequently cached, the value corresponding to an identical existing key is added to the key-value structure by appending; a cross comparison module, configured to cross-compare the feature value sets of the fields and identify equality or inclusion relations among the fields through set operations, comprising: determining two fields that share at least one identical feature value as fields to be compared, computing the intersection of the feature value sets of the two fields, computing for each field the difference set between its feature value set and the intersection, and judging the association relation of the two fields according to the difference sets; a group theory grouping module, configured to introduce mathematical group theory to automatically group the equality or inclusion relations among fields into inclusion groups or equivalence groups, wherein the fields within a group are no longer compared with one another; and a structured mapping module, configured to automatically map the grouped data set relationships into association relationships among the data tables and construct a structured data table relationship library.
- 10. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor which, when executed by the at least one processor, cause the at least one processor to perform the method of any of claims 1-8.
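The difference-set judgment of claims 3 and 4 and the equivalence grouping of claim 7 can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the claimed implementation: it covers only the non-sampled case, the default threshold of 0.9 and all function and label names are hypothetical, the "both pass rates exceed the threshold" reading of the equality branch is an interpretation of the claim text, and a union-find (disjoint-set) structure is used here as one conventional way to apply transitivity of equality.

```python
def classify(a, b, threshold=0.9):
    """Judge the relation of two non-sampled feature value sets."""
    common = a & b
    diff_a, diff_b = a - common, b - common
    if not diff_a and not diff_b:
        return "equal"
    if not diff_a:                       # empty difference set marks the subset field
        return "a_subset_of_b"
    if not diff_b:
        return "b_subset_of_a"
    # Both difference sets are non-empty: fall back to intersection pass rates.
    rate_a = len(common) / len(a)
    rate_b = len(common) / len(b)
    if rate_a > threshold and rate_b > threshold:
        return "equal_with_quality_issues"
    if rate_a > threshold:               # higher pass rate marks the subset field
        return "a_subset_of_b_with_quality_issues"
    if rate_b > threshold:
        return "b_subset_of_a_with_quality_issues"
    return "unrelated"


def equivalence_groups(equal_pairs):
    """Merge pairwise equalities into equivalence groups using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right in equal_pairs:
        parent[find(left)] = find(right)

    groups = {}
    for field in list(parent):
        groups.setdefault(find(field), set()).add(field)
    return list(groups.values())


relation = classify({2, 3}, {1, 2, 3})
groups = equivalence_groups([("B", "A"), ("C", "A"), ("D", "A")])
```

Once B, C, and D land in the same group as A, a new field only needs to be compared against one representative of the group, which is the saving the claims describe as fields within a group no longer being compared with one another.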
Description
Method and system for automatically finding association relation between data tables
Technical Field
The invention belongs to the fields of intelligent data analysis, data integration, and data management, and particularly relates to a method and a system for automatically discovering association relations among data tables. It is suitable for application scenarios such as relational database management systems (RDBMS), multi-source heterogeneous data integration, automated data modeling and management, intelligent data pipelines and data lineage, and automated data quality and relationship discovery.
Background
In a data-driven business environment, enterprises and institutions often run multiple business systems, and their data is scattered across different databases, forming "data islands". Existing data integration schemes, such as data warehouses, business intelligence (BI) tools, and ETL (Extract-Transform-Load) flows, depend heavily on data engineers or business experts manually configuring the association relationships between tables, which is not only costly but also ill-suited to the complexity of multi-source heterogeneous data environments.
Existing techniques for discovering association relationships between data tables have the following problems: (1) heavy reliance on manual analysis: the data integration process depends heavily on manual experience and manual configuration, which is inefficient and error-prone; (2) insufficient automation: most approaches rely on primary and foreign keys, metadata, or training sets and are limited to identifying "strong relationships" within a single system, so "weak relationships" or "implicit relationships" across a multi-source environment cannot be discovered; (3) a high threshold for data application: data analysis, cross-table queries, and data integration applications are constrained by the problem of opaque table relationships, so these applications are inefficient and the usability of data assets is limited. Therefore, a technical solution is needed that can automatically identify multiple complex relationships (such as equivalence, inclusion, and weak equivalence) based on the data content itself, construct a traceable relationship library, and remain compatible with multi-source heterogeneous environments.
Disclosure of Invention
The invention provides a method and a system for automatically discovering association relations among data tables, which aim to solve the following technical problems: 1) reducing the dependence of data association analysis on manual experience and labor cost; 2) automatically identifying weak relationships and implicit relationships among data tables in a multi-source heterogeneous environment; 3) ensuring that relationship identification results are traceable and verifiable; and 4) improving the efficiency of data integration and data analysis and providing a basis for the subsequent automatic generation of data links and for intelligent data integration.
In order to achieve the above purpose, the present invention proposes the following technical scheme: According to a first aspect of the present invention, a method for automatically discovering association relationships between data tables is provided, comprising: for a plurality of target source databases, respectively collecting fields that may serve as data table association conditions; extracting the unique values of each field as the feature values of the field and caching them in an in-memory database, wherein the storage format is a key-value structure in which the key is a feature value of the field and the value is the full qualifier of the source of the feature value, the full qualifier being recorded as a character string, and wherein, when a new feature value is subsequently cached, the value corresponding to an identical existing key is added to the key-value structure by appending; cross-comparing the feature value sets of the fields and identifying equality or inclusion relations among the fields through set operations, comprising: determining two fields that share at least one identical feature value as fields to be compared, computing the intersection of the feature value sets of the two fields, computing for each field the difference set between its feature value set and the intersection, and judging the association relation of the two fields according to the difference sets; introducing mathematical group theory to automatically group the equality or inclusion relations among fields into inclusion groups or equivalence groups, wherein the fields within a group are no longer compared with one another; and automatically mapping the grouped data set relationships into association relationships among the data tables to construct a structured data table relationship library. According to a second aspect of the present invention, a system for automatically discovering association relationships between data tables is provide