US-12625877-B2 - Systems and methods for organizing data of different local data schemas based on similarity rankings, scoring, and signatures

Abstract

Systems and methods for organizing data of different local data schemas based on similarity rankings, scoring, and signatures are disclosed. According to an aspect, a system for identifying linkages among entities is disclosed. The system includes an entity linkage manager configured to analyze a concept for data schemas to determine similarity scores for entities of local data schemas with respect to entities of a global data schema. The entity linkage manager is also configured to map the entities of the local data schemas to the entities of the global data schema based on the determined similarity scores. Further, the entity linkage manager is configured to associate data within at least one database based on the mapping for use in accessing related data.

Inventors

  • George Karabatis
  • Andreas Behrend
  • Leonard Traeger

Assignees

  • UNIVERSITY OF MARYLAND, BALTIMORE COUNTY

Dates

Publication Date
2026-05-12
Application Date
2024-03-11

Claims (16)

  1. A system for identifying linkages among entities, the system comprising: a computing device including an entity linkage manager configured to:
     analyze a concept for data schemas to determine similarity scores for entities of local data schemas with respect to entities of a global data schema;
     determine similarity scores for the entities of the local data schemas;
     select the entities that are linkable based on the determined similarity scores for the entities of the local data schemas;
     block profiles of some of the entities of the local data schemas based on the similarity scores;
     generate and store a map of the entities of the local data schemas to the entities of the global data schema based on the selected entities that are linkable and blocked profiles of some of the entities of the local data schemas;
     associate data within at least one database based on the map for use in accessing related data;
     receive, from a user interface, a search query that identifies a search parameter associated with the concept;
     use the map to determine that data associated with the search parameter is stored at one of the entities of the local data schemas and the global data schema;
     send the search parameter to the one of the entities in response to the determination that that data associated with the search parameter is stored at one of the entities of the local data schemas and the global data schema;
     receive, from the one of the entities, a result of sending the search parameter to the one of the entities; and
     present the result via the user interface.
  2. The system of claim 1, wherein the entity linkage manager is configured to determine the similarity scores for the entities of the local data schemas by utilization of fuzzy string analysis, synonym intersection analysis, data type analysis, and/or constraint similarity analysis.
  3. The system of claim 2, wherein the entity linkage manager is configured to rank the entities based on the determined similarity scores.
  4. The system of claim 3, wherein the entity linkage manager is configured to sort and filter the determined similarity scores for identifying top-ranked entities with low scores.
  5. The system of claim 4, wherein the entity linkage manager is configured to sort and filter the determined similarity scores based on a configurable threshold.
  6. The system of claim 4, wherein the entity linkage manager is configured to block profiles of some of the entities based on the configurable threshold.
  7. The system of claim 1, wherein the entity linkage manager is configured to determine the similarity scores for the entities by quantifying each entity's degree of dispersion.
  8. The system of claim 1, wherein the entity linkage manager is configured to determine the similarity scores for the entities by quantifying local deviations of entity signatures from surrounding entity signatures.
  9. A method for identifying linkages among entities, the method comprising: at a computing device:
     analyzing a concept for data schemas to determine similarity scores for entities of local data schemas with respect to entities of a global data schema;
     determining similarity scores for the entities of the local data schemas;
     selecting the entities that are linkable based on the determined similarity scores for the entities of the local data schemas;
     blocking profiles of some of the entities of the local data schemas based on the similarity scores;
     generating and storing a map of the entities of the local data schemas to the entities of the global data schema based on the selected entities that are linkable and blocked profiles of some of the entities of the local data schemas;
     associating data within at least one database based on the mapping for use in accessing related data;
     receiving, from a user interface, a search query that identifies a search parameter associated with the concept;
     using the map to determine that data associated with the search parameter is stored at one of the entities of the local data schemas and the global data schema;
     sending the search parameter to the one of the entities in response to the determination that that data associated with the search parameter is stored at one of the entities of the local data schemas and the global data schema;
     receiving, from the one of the entities, a result of sending the search parameter to the one of the entities; and
     presenting the result via the user interface.
  10. The method of claim 9, further comprising determining the similarity scores for the entities of the local data schemas by utilization of fuzzy string analysis, synonym intersection analysis, data type analysis, and/or constraint similarity analysis.
  11. The method of claim 10, further comprising ranking the entities based on the determined similarity scores.
  12. The method of claim 11, further comprising sorting and filtering the determined similarity scores for identifying top-ranked entities with low scores.
  13. The method of claim 12, further comprising sorting and filtering the determined similarity scores based on a configurable threshold.
  14. The method of claim 12, further comprising blocking profiles of some of the entities based on the configurable threshold.
  15. The method of claim 9, further comprising determining the similarity scores for the entities by quantifying each entity's degree of dispersion.
  16. The method of claim 9, further comprising determining the similarity scores for the entities by quantifying local deviations of entity signatures from surrounding entity signatures.
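
The scoring, ranking, and threshold filtering recited in the claims above can be illustrated with a minimal sketch. It is not the patented implementation: the equal weighting of the four analyses, the `SYNONYMS` table, the attribute dictionaries, and all function names are assumptions made for illustration, and Python's `difflib.SequenceMatcher` merely stands in for whatever fuzzy string analysis the system uses.

```python
from difflib import SequenceMatcher

# Hypothetical synonym sets, for illustration only.
SYNONYMS = {
    "customer": {"customer", "client", "buyer"},
    "client": {"client", "customer", "patron"},
    "product": {"product", "item", "good"},
}


def fuzzy_score(a: str, b: str) -> float:
    """Fuzzy string similarity of two names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: set, b: set) -> float:
    """Overlap of two sets; two empty sets count as identical."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def synonym_score(a: str, b: str) -> float:
    """Synonym intersection: overlap of the two names' synonym sets."""
    sa = SYNONYMS.get(a.lower(), {a.lower()})
    sb = SYNONYMS.get(b.lower(), {b.lower()})
    return jaccard(sa, sb)


def total_score(glob: dict, loc: dict) -> float:
    """Average of the four analyses (equal weights are an assumption)."""
    parts = [
        fuzzy_score(glob["name"], loc["name"]),
        synonym_score(glob["name"], loc["name"]),
        1.0 if glob["type"] == loc["type"] else 0.0,   # data type analysis
        jaccard(glob["constraints"], loc["constraints"]),  # constraint similarity
    ]
    return sum(parts) / len(parts)


def rank_candidates(glob: dict, local_attrs: list, threshold: float = 0.5) -> list:
    """Rank local attributes by score and drop those below the threshold."""
    scored = [(loc["name"], total_score(glob, loc)) for loc in local_attrs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(name, score) for name, score in scored if score >= threshold]


# Hypothetical global and local attribute descriptions.
GLOBAL_ATTR = {"name": "customer", "type": "VARCHAR", "constraints": {"NOT NULL"}}
LOCAL_ATTRS = [
    {"name": "product", "type": "INTEGER", "constraints": {"PRIMARY KEY"}},
    {"name": "client", "type": "VARCHAR", "constraints": {"NOT NULL"}},
]

for name, score in rank_candidates(GLOBAL_ATTR, LOCAL_ATTRS):
    print(f"{name}: {score:.2f}")
```

With the default threshold, only the local attribute `client` survives as a candidate for the global attribute `customer`; lowering `threshold` to 0.0 keeps both candidates in ranked order. How the claimed system weights the analyses and treats blocked profiles is not specified in the claims, so those choices here are placeholders.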

Description

CROSS REFERENCE TO RELATED APPLICATION

This application claims priority to U.S. Provisional Patent Application No. 63/451,068, filed Mar. 9, 2023, and titled “Systems and Methods of Generating Mappings of Heterogeneous Relational Schemas Using Unsupervised Learning”, the content of which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The presently disclosed subject matter relates generally to data management and analysis systems. Particularly, the presently disclosed subject matter relates to systems and methods for organizing data of different local data schemas based on similarity rankings, scoring, and signatures.

BACKGROUND

Large organizations typically have access to large datasets from various sources. Data management and analysis systems can assist people with organizing, accessing, and sorting through such data. However, as time passes, the data sets usually grow in size, and it therefore becomes increasingly difficult to make use of them. As a result, there is an increasing interest in mastering the data of an organization and in enriching it with external information, thus improving reporting capabilities and knowledge extraction. Human labor can manually export and import external data only up to a practical limit. On the other hand, a sustainable consolidation of heterogeneous databases becomes time-consuming and may even lead to failures due to complexity. The “Data Discovery Problem” summarizes how the exponential growth of data, the desire to consolidate data independently of domain, heterogeneous schema architectures, unstructured, unclean, and incomplete data, and limited resources of domain and information technology (IT) knowledge create the need for heterogeneous database resolution. To reduce the complexity of O(NN) when consolidating multiple databases with each other, a mediating global schema serves to abstract and map the local heterogeneous databases and schemas, simplifying the mappings between them to O(2N).
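
The effect of a mediating global schema on the number of mappings can be illustrated by simply enumerating them. The sketch below is an assumption-laden illustration, not the source's complexity analysis: it counts one directed mapping per ordered pair of local schemas for direct consolidation, and one mapping in each direction between every local schema and the global schema for the mediated case. The schema names and function names are hypothetical.

```python
from itertools import permutations

GLOBAL = "global"  # hypothetical name for the mediating global schema


def pairwise_mappings(local_schemas: list) -> list:
    """Direct consolidation: one directed mapping per ordered pair of schemas."""
    return list(permutations(local_schemas, 2))


def mediated_mappings(local_schemas: list) -> list:
    """Mediated consolidation: each local schema maps to and from the global schema."""
    mappings = []
    for schema in local_schemas:
        mappings.append((schema, GLOBAL))
        mappings.append((GLOBAL, schema))
    return mappings


SCHEMAS = ["A", "B", "C", "D", "E"]  # hypothetical local schemas
print(len(pairwise_mappings(SCHEMAS)), len(mediated_mappings(SCHEMAS)))
```

Under these counting assumptions the mediated case grows as 2N with the number N of local schemas, while the direct case grows much faster, which is the motivation the paragraph above gives for the global schema.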

However, the use of machine learning to automate mappings between a global concept, for instance “Customers”, and local concepts such as table “Customer” in schema A and table “Client” in schema B has not been fully exploited. An increasing number of approaches describe how machine learning techniques and similarity measures between database concepts can enhance Data Integration (DI) and how both disciplines function together in a “natural synergy” to create a “large homogeneous collection of data”. DI systems aim to reach a complete, concise, and consistent homogeneous database state. A key prerequisite to the construction of a DI system is the process of entity resolution or matching, data matching, and record linkage. The research community has contributed applications of machine learning algorithms, but without schema mapping in their scope. In the real world, schema mapping remains “an unavoidable problem” for its downstream phases within a DI pipeline. Nonetheless, schema mapping is often treated completely independently and handcrafted during the preprocessing step for machine learning pipelines.

There is a continuing need for improved techniques for entity resolution within large collections of data across databases. Particularly, there is a need for reducing the processing burden and other costs associated with searching large and different databases towards identification and verification of true entity linkages.

BRIEF DESCRIPTION OF THE DRAWINGS

Having thus described the presently disclosed subject matter in general terms, reference will now be made to the accompanying Drawings, which are not necessarily drawn to scale, and wherein:

FIG. 1 is a diagram of modules of an example architecture, split into design and run time in accordance with embodiments of the present disclosure;

FIG. 2 is a diagram depicting a Knowledge Graph, which may be visualized in accordance with embodiments of the present disclosure;

FIG. 3 depicts a table showing mapping for global attribute Customer Mail Id using total score for N ranked local attribute concepts;

FIG. 4 represents a knowledge graph of the global and the two local schemas and their automated mappings;

FIG. 5 is a graph that shows an overview of the accuracy of example mappings generated by systems disclosed herein in comparison to the experts' mappings;

FIG. 6 depicts an excerpt of a multi-sourced Entity Linkage problem between three entity collections sampled from three schemas of database vendors Oracle (E1), MySQL (E2), and SAP HANA (E3), storing data about customers and products;

FIG. 7 depicts the increasing heterogeneity of the entity collections, which are represented as colored ovals containing unlinkable entities, while the linked entities are shown in the white overlapping oval;

FIG. 8 is a diagram showing entity linkage workflow with scoping in accordance with embodiments of the present disclosure;

FIG. 9 is a diagram showing an agent for entity ranking in scoping in accordance with embodiments of the present disclosure;

FIGS. 10 and 11 plot the best p