US-12619683-B2 - Artificial intelligence (AI) based data matching and alignment


Abstract

An Artificial Intelligence (AI)-based data matching and alignment system identifies similar data sources for a target data source from a data corpus and generates a knowledge graph that enables downstream applications seamless access to data in the data corpus. The system extracts column features at different levels for the target data source and a plurality of data sources from the data corpus. Feature matrices are built from the features of the target data source and the plurality of data sources. Candidate data sources similar to the target data source are filtered from the plurality of data sources using the feature matrices. The tree-based similarity is estimated and K Nearest Neighbor (KNN) graphs are built to identify columns from the candidate data sources that are similar to columns of the target data source to build the knowledge graph.
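As a rough illustration of the feature-matrix step described in the abstract, the sketch below computes a handful of per-column features and stacks them into one row per column. The specific features (share of numeric cells, character-class ratios, mean cell length) and the names `column_features` and `feature_matrix` are assumptions chosen for illustration; the actual feature set used by the system is not given here.

```python
import string


def column_features(values):
    """Return a small, hypothetical feature vector for one column of cells."""
    cells = [str(v) for v in values]
    chars = "".join(cells)
    n_cells = max(len(cells), 1)  # guard against empty columns
    n_chars = max(len(chars), 1)

    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    return [
        sum(is_number(c) for c in cells) / n_cells,                # data type: share of numeric cells
        sum(ch.isdigit() for ch in chars) / n_chars,               # character distribution: digits
        sum(ch.isalpha() for ch in chars) / n_chars,               # character distribution: letters
        sum(ch in string.punctuation for ch in chars) / n_chars,   # character distribution: punctuation
        len(chars) / n_cells,                                      # mean cell length
    ]


def feature_matrix(table):
    """Build (column names, matrix) with one feature row per column.

    `table` maps a column name to its list of cell values.
    """
    names = list(table)
    return names, [column_features(table[name]) for name in names]


# Hypothetical target data source with two columns.
target = {
    "customer_id": ["1001", "1002", "1003"],
    "email": ["a@x.com", "b@y.org", "c@z.net"],
}
names, matrix = feature_matrix(target)
```

Each row of `matrix` describes one column, so two data sources can be compared column by column or, after aggregation, source by source.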

Inventors

Neda ABOLHASSANI
Maziyar Baran Pouyan
Teresa Sheausan Tung
Andrew Fano
Sayantan Mitra

Assignees

ACCENTURE GLOBAL SOLUTIONS LIMITED

Dates

Publication Date: 2026-05-05
Application Date: 2021-08-25
Priority Date: 2021-05-04

Claims (17)

1 . An Artificial Intelligence (AI) based data matching and alignment system, comprising: at least one processor; a non-transitory processor-readable medium storing machine-readable instructions that cause the processor to: receive a request for identifying similar data for a target data source from a plurality of data sources; generate feature matrices for the target data source and the plurality of data sources, wherein each feature matrix of the feature matrices includes respective features of the target data source and the plurality of data sources; identify at least one candidate data source that is similar to the target data source from the plurality of data sources, wherein the at least one candidate data source is identified based on corresponding feature matrices of the plurality of data sources and the feature matrix of the target data source, wherein the identification of the at least one candidate data source provides for preliminary filtering to select a subset of the plurality of data sources; identify columns of the at least one candidate data source that are similar to columns of the target data source by matching columns of the target data source with columns of the subset of the plurality of data sources and by applying tree-based similarity calculations to a feature matrix of the at least one candidate data source and the feature matrix of the target data source, wherein the similarities are determined based at least on feature matrices of the target data source and the features of the at least one candidate data sources; enable a display of one or more of the columns of the at least one candidate data source that are similar columns to the columns of the target data source; generate a knowledge graph that represents the similar columns of the at least one candidate data source and the target data source; and enable functioning of a downstream application to access the knowledge graph for information extraction, wherein enabling the downstream application to access the knowledge graph comprising: obtaining output that includes at least one of: a portion of the knowledge graph, the knowledge graph representing the columns as nodes, wherein the similar columns are connected by edges of the knowledge graph, and a distance between the nodes signifies a column similarity; and a ranked list of similarity mappings between the columns of the target data source and the candidate data sources along with respective similarity percentages for the similarity mappings; and executing the functions of the downstream application based on the obtained output.
2 . The data matching and alignment system of claim 1 , wherein the features include character level features, semantic level features, and data dependency features.
3 . The data matching and alignment system of claim 1 , wherein the character level features include at least column data type features and character distribution features.
4 . The data matching and alignment system of claim 1 , wherein the semantic level features include at least semantic text features and numeric distribution comparison features.
5 . The data matching and alignment system of claim 1 , wherein to identify the at least one candidate data source similar to the target data source, the processor is to: generate a K Nearest Neighbor (KNN) graph on implementing a distance metric to the corresponding feature matrices of the plurality of data sources and the feature matrix of the target data source.
6 . The data matching and alignment system of claim 5 , wherein the distance metric includes Mahalanobis distance.
7 . The data matching and alignment system of claim 1 , wherein to identify the at least one candidate data source similar to the target data source, the processor is to: select nearest N neighbors from the KNN graph as a plurality of candidate data sources, wherein N is a natural number and the at least one candidate data source includes the plurality of candidate data sources.
8 . The data matching and alignment system of claim 1 , wherein to identify the columns of the at least one candidate data source that are similar to the columns of the target data source, the processor is to: generate K Nearest Neighbor (KNN) graphs from the tree-based similarity calculations; and identify the columns of the at least one candidate data source that are similar to the columns of the target data source from the KNN graphs.
9 . A method of generating similarity mappings between data sources comprising: receiving a request for identifying matching data for a target data source of a plurality of data sources from a data corpus; extracting column features of the target data source and the plurality of data sources, wherein the column features are stored as corresponding feature matrices; identifying one or more candidate data sources from the plurality of data sources that are similar to the target data source, wherein the candidate data sources are identified based on a distance measure obtained for the feature matrix of the target data source and the corresponding feature matrices of the plurality of data sources, wherein the identification of the one or more candidate data source provides for preliminary filtering to select a subset of the plurality of data sources; identifying columns of the one or more candidate data sources that are similar to columns of the target data source by matching columns of the target data source with columns of the subset of the plurality of data sources and by applying tree-based similarity calculations to a feature matrix of the one or more candidate data source and the feature matrix of the target data source, wherein the similarities between the columns are determined based at least on feature matrices of the target data source and the features of one of the one or more candidate data sources; generating a knowledge graph representing the similarities of the columns of the one or more candidate data sources and the columns of the target data source; and enabling functioning of a downstream application by enabling the downstream application to access the knowledge graph, wherein enabling the downstream application to access the knowledge graph comprising: obtaining output that includes at least one of: a portion of the knowledge graph, the knowledge graph representing the columns as nodes, wherein the similar columns are connected by edges of the knowledge graph, and a distance between the nodes signifies a column similarity; and a ranked list of similarity mappings between the columns of the target data source and the candidate data sources along with respective similarity percentages for the similarity mappings; and executing the functions of the downstream application based on the obtained output.
10 . The method of claim 9 , further comprising: providing a display of the columns of the one or more candidate data sources that are similar to the columns of the target data source.
11 . The method of claim 10 , further comprising: providing via the display, a percentage of similarity between each of the similar columns of the one or more candidate data sources and the columns of the target data source.
12 . The method of claim 10 , further comprising: receiving user input selecting one or more of the similar columns for generating the knowledge graph, wherein the user input is received via the display.
13 . The method of claim 9 , wherein the column features include at least character level features, semantic level features, and dependency level features.
14 . The method of claim 13 , further comprising: generating the feature matrices for the plurality of data sources including the target data source, by stacking the character level features, semantic level features, and dependency level features of each of the columns adjacent to each other.
15 . The method of claim 9 , further comprising: outputting reasons for identifying the columns of the one or more candidate data sources as being similar to the columns of the target data source.
16 . A non-transitory processor-readable storage medium comprising machine-readable instructions that cause a processor to: receiving a request for identifying matching data for a target data source of a plurality of data sources from a data corpus; extracting column features of the target data source and the plurality of data sources, wherein the column features are stored as corresponding feature matrices; identifying one or more candidate data sources from the plurality of data sources that are similar to the target data source, wherein the candidate data sources are identified based on a distance measure obtained for the feature matrix of the target data source and the corresponding feature matrices of the plurality of data sources, wherein the identification of the one or more candidate data source provides for preliminary filtering to select a subset of the plurality of data sources; identifying columns of the one or more candidate data sources that are similar to columns of the target data source by matching columns of the target data source with columns of the subset of the plurality of data sources and by applying tree-based similarity calculations to a feature matrix of the one or more candidate data source and the feature matrix of the target data source, wherein the similarities between the columns are determined based at least on feature matrices of the target data source and the one or more candidate data sources; generating a knowledge graph representing the similarities of the columns of the one or more candidate data sources and the columns of the target data source; and enabling functioning of a downstream application by enabling the downstream application to access the knowledge graph, wherein enabling the downstream application to access the knowledge graph comprising: obtaining output that includes at least one of: a portion of the knowledge graph, the knowledge graph representing the columns as nodes, wherein the similar columns are connected by edges of the knowledge graph, and a distance between the nodes signifies a column similarity; and a ranked list of similarity mappings between the columns of a target data source and the candidate data sources along with respective similarity percentages for the similarity mappings; and executing the functions of the downstream application based on the obtained output.
17 . The non-transitory processor-readable storage medium of claim 16 , wherein the instructions to identify the at least one candidate data source as similar to the target data source, further cause the processor to: apply K Nearest Neighbor (KNN) methodology on implementing the distance measure to the corresponding feature matrices of the plurality of data sources and the feature matrix of the target data source.
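The candidate filtering recited in claims 5 to 7 and 17 (a nearest-neighbor search under a distance metric such as Mahalanobis distance, keeping the nearest N sources) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: each data source is assumed to be already summarized as one fixed-length vector, the covariance is estimated from the corpus vectors plus the target, and a small ridge term keeps it invertible. The names `nearest_sources`, `corpus`, and the sample values are hypothetical.

```python
def covariance(rows, ridge=1e-6):
    """Sample covariance of `rows`; `ridge` (an assumption) keeps it invertible."""
    n = len(rows)
    mu = [sum(col) / n for col in zip(*rows)]
    d = len(mu)
    cov = [
        [sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    for i in range(d):
        cov[i][i] += ridge
    return cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis(x, y, cov):
    """sqrt((x - y)^T cov^{-1} (x - y)), computed via a linear solve."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    z = solve(cov, diff)
    return max(sum(d * zi for d, zi in zip(diff, z)), 0.0) ** 0.5


def nearest_sources(target_vec, source_vecs, n_neighbors):
    """Return the names of the `n_neighbors` sources closest to the target."""
    cov = covariance(list(source_vecs.values()) + [target_vec])
    ranked = sorted(
        source_vecs, key=lambda name: mahalanobis(target_vec, source_vecs[name], cov)
    )
    return ranked[:n_neighbors]


# Hypothetical source-level vectors: [share of numeric cells, mean cell length].
corpus = {
    "orders": [0.9, 4.0],
    "customers": [0.5, 6.0],
    "logs": [0.1, 30.0],
    "invoices": [0.8, 5.0],
}
candidates = nearest_sources([0.88, 4.2], corpus, n_neighbors=2)
```

Only the sources returned in `candidates` would go on to the more expensive column-level matching, which is the purpose of the preliminary filtering step.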

Description

PRIORITY

The present application claims priority under 35 U.S.C. 119(a)-(d) to the Indian Provisional Patent Application Serial No. 202111020371, having a filing date of May 4, 2021, the disclosure of which is hereby incorporated by reference in its entirety.

BACKGROUND

Complex computing systems that are used today across various domains including manufacturing, energy, finance, healthcare, etc., employ numerous data sources that receive and store data in different formats. These may include structured data sources such as relational database management systems (RDBMS) or unstructured data sources such as those storing data from sensors, scanners, etc. Real-world data has therefore become increasingly complex and as a result, may be prone to errors. For example, real-world data may be incomplete as it may lack certain attributes of interest, attribute values, or contain only aggregate data. Furthermore, real-world data may be noisy and inconsistent as it may contain errors, outliers, discrepancies in codes or names, etc. Such data issues render mapping from raw data into data files a difficult technical problem where approximately 80% of the data science efforts are dedicated to preparing the data. These data preparation tasks are often carried out by data experts and data engineers who have the domain knowledge and the knowledge regarding the data sources so that the data is correctly connected to other data and labeled accurately. Such relationship processing and mapping is a time-consuming process that requires expert knowledge.

BRIEF DESCRIPTION OF DRAWINGS

Features of the present disclosure are illustrated by way of examples shown in the following figures. In the following figures, like numerals indicate like elements, in which:

FIG. 1A shows a block diagram of an Artificial Intelligence (AI)-based data matching and alignment system in accordance with the examples disclosed herein.
FIG. 1B shows the generation of feature matrices in accordance with the examples disclosed herein.
FIG. 1C shows a block diagram of a data source filter according to the examples disclosed herein.
FIG. 1D shows a block diagram of an unsupervised recommender in accordance with the examples disclosed herein.
FIG. 2A includes a flowchart for a process of filtering data sources to identify the candidate data sources in accordance with the examples disclosed herein.
FIG. 2B includes a flowchart that shows a method of identifying similar columns for a target data source from candidate data sources in accordance with the examples disclosed herein.
FIG. 3 shows different outputs generated in accordance with the examples disclosed herein.
FIG. 4 shows an example of a knowledge graph-enabled data mesh architecture implemented in accordance with the examples disclosed herein.
FIG. 5 shows Input/Output (I/O) User Interfaces (UIs) including a data matching table and a column matchings table generated according to some examples disclosed herein.
FIG. 6 shows data from an example use case that can be processed by the system according to some examples disclosed herein.
FIG. 7 illustrates a computer system that may be used to implement the AI-based data matching and alignment system according to some examples disclosed herein.

DETAILED DESCRIPTION

For simplicity and illustrative purposes, the present disclosure is described by referring to examples thereof. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent, however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure. Throughout the present disclosure, the terms “a” and “an” are intended to denote at least one of a particular element. As used herein, the term “includes” means includes but not limited to, the term “including” means including but not limited to. The term “based on” means based at least in part on.

An AI-based data matching and alignment system that generates similarity mappings for a target data source from a plurality of data sources in a data corpus is disclosed. In an example, the plurality of data sources can be columnar data sources with data arranged in arrays of rows and columns, e.g., spreadsheets, database tables, database views, etc. When a request for identifying similar data sources with a reference to a target data source is received, the plurality of data sources from the data corpus are initially filtered to identify candidate data sources that are similar to the target data source. The candidate data sources are further analyzed to identify columns from the candidate data sources that are similar to the columns of the target data source. A knowledge graph representing similar columns is generated. The knowledge graph provides structured, well-defined data to downstream