US-20260127196-A1 - MAPPING DISPARATE DATASETS
Abstract
Disclosed are systems and methods of mapping data entries originating in different systems. A plurality of data entries from different systems are normalized such that they can be compared to each other and mapped, even though the data entries are defined by data fields with differing phrases, descriptive details, and lengths of detail. Data entries may be filtered according to data fields before a mapping operation is applied. The mapping operation evaluates similarity scores based on the data fields using a combination of exact matching algorithms, dictionary matching algorithms, and text mining algorithms. The mapped data entries and data fields are displayed to a user.
Inventors
- Kamila Rywelska
- Carleton J. Lindgren
- Manesh Saini
- Hasan Adem Yilmaz
Assignees
- Wells Fargo Bank, N.A.
Dates
- Publication Date
- 20260507
- Application Date
- 20251229
Claims (20)
- 1. A computer system comprising one or more processors and a non-transitory memory storing instructions executable by the one or more processors to cause the computer system to: obtain, from at least two distinct data sources, a first set of data entries and a second set of data entries, each data entry comprising a plurality of data fields including textual content; determine candidate mappings between data entries in the first set and data entries in the second set by: generating multiple similarity measures for corresponding data fields using two or more different types of matching techniques; and evaluating the similarity measures, including combining the similarity measures into an aggregate similarity value, and identifying candidate mappings whose evaluation satisfies a mapping criterion; verify at least one aspect of the candidate mappings using a verification analysis configured to evaluate whether one or more relationship parameters in the mappings satisfy at least one verification criterion; generate a mapping report identifying the candidate mappings determined to satisfy both the mapping criterion and the verification criterion; present the mapping report to a user and receive modifications or feedback from at least one user, system process, or external data source relating to at least one mapping; update at least one of the matching techniques, similarity measures, thresholds, or decision criteria based on the received modifications or feedback; and reevaluate or update at least part of the candidate mappings using the updates, and update the mapping report accordingly.
- 2. The computer system of claim 1, wherein at least one of the matching techniques is exact matching, fuzzy matching, semantic or meaning-based matching, statistical matching, or machine-learning-based similarity assessment.
- 3. The computer system of claim 1, wherein the matching techniques comprise at least one exact matching technique that compares one or more strings in a data field of a data entry in the first set to one or more strings in a corresponding data field of a data entry in the second set, and wherein the exact matching technique comprises normalizing at least one of whitespace differences, punctuation differences, delimiter-character differences, or common misspellings prior to comparing the one or more strings.
- 4. The computer system of claim 1, wherein the matching techniques comprise a dictionary matching technique performed by comparing one or more strings in a data field of a data entry in the first set and one or more strings in a data field of a data entry in the second set to a dictionary data store, the dictionary data store indicating that the compared strings correspond in meaning.
- 5. The computer system of claim 1, wherein receiving the modifications or feedback comprises receiving a user input that modifies at least one entry of a dictionary data store used by a dictionary matching technique to produce a modified dictionary, and wherein reevaluating or updating at least part of the candidate mappings comprises reperforming at least part of the candidate mapping determination using the modified dictionary.
- 6. The computer system of claim 1, wherein the relationship parameters are topical or contextual relationship parameters.
- 7. The computer system of claim 1, wherein the plurality of data fields comprises at least one of: (i) an identifier data field, (ii) a title data field, or (iii) one or more descriptor data fields.
- 8. The computer system of claim 1, wherein the textual content comprises user-entered free-form phrases.
- 9. The computer system of claim 1, wherein determining the candidate mappings further comprises filtering at least one of the first set of data entries or the second set of data entries based on at least one filter parameter derived from at least one data field, the filtering being performed prior to generating the multiple similarity measures, and the generating of the multiple similarity measures being performed using the filtered first set of data entries and the filtered second set of data entries.
- 10. The computer system of claim 1, wherein generating the multiple similarity measures comprises generating: (i) a first similarity score based on similarity of an entirety of a string in a data field, and (ii) a second similarity score based on similarity of a portion of the string.
- 11. The computer system of claim 1, wherein generating the multiple similarity measures comprises generating a token-reordered similarity measure by forming tokens from a string, ordering the tokens, and evaluating similarity based on the ordered tokens.
- 12. The computer system of claim 1, wherein combining the similarity measures into the aggregate similarity value comprises computing a weighted combination of: (i) a precision-matching similarity measure, (ii) a dictionary-matching similarity measure, and (iii) a text-analytics similarity measure.
- 13. The computer system of claim 1, wherein identifying candidate mappings whose evaluation satisfies the mapping criterion comprises determining that at least a predetermined number of similarity measures associated with different data fields for a given pair of data entries exceed one or more thresholds.
- 14. The computer system of claim 1, wherein generating the multiple similarity measures comprises applying a chained set of text-analytics techniques including two or more of: Levenshtein distance, latent semantic index (LSI), cosine similarity, latent Dirichlet allocation, Jensen-Shannon divergence, or Word Mover's Distance.
- 15. The computer system of claim 1, wherein the verification analysis comprises verifying at least one topical relationship parameter by: estimating a set of topics from at least one of the first set of data entries or the second set of data entries; and validating a number of topics in the set of topics based on a comparison of the number of topics to a result of an n-gram analysis.
- 16. The computer system of claim 1, wherein obtaining the first set of data entries and the second set of data entries is initiated in response to a trigger comprising at least one of: (i) a user input, (ii) a periodic schedule, or (iii) an external message indicating an update to at least one of the distinct data sources.
- 17. The computer system of claim 1, wherein presenting the mapping report comprises rendering the mapping report via a graphical user interface that includes: (i) a review control configured to present at least one mapping assumption comprising at least one of: a selected matching technique, a threshold, a weighting used to combine similarity measures, a dictionary entry, or a topic-related parameter used by the verification analysis; and (ii) an edit control configured to receive at least one correction to the mapping assumption as the modifications or feedback.
- 18. The computer system of claim 1, wherein reevaluating or updating at least part of the candidate mappings comprises reperforming the determining of candidate mappings and the verification analysis for a subset of the candidate mappings associated with at least one modified data field or at least one modified dictionary entry, without reprocessing all candidate mappings.
- 19. A computer-implemented method comprising: obtaining, from at least two distinct data sources, a first set of data entries and a second set of data entries, each data entry comprising a plurality of data fields including textual content; determining candidate mappings between data entries in the first set and data entries in the second set by: generating multiple similarity measures for corresponding data fields using two or more different types of matching techniques; and evaluating the similarity measures, including combining the similarity measures into an aggregate similarity value, and identifying candidate mappings whose evaluation satisfies a mapping criterion; verifying at least one aspect of the candidate mappings using a verification analysis configured to evaluate whether one or more relationship parameters in the mappings satisfy at least one verification criterion; generating a mapping report identifying the candidate mappings determined to satisfy both the mapping criterion and the verification criterion; presenting the mapping report to a user and receiving modifications or feedback from at least one user, system process, or external data source relating to at least one mapping; updating at least one of the matching techniques, similarity measures, thresholds, or decision criteria based on the received modifications or feedback; and reevaluating or updating at least part of the candidate mappings using the updates, and updating the mapping report accordingly.
- 20. A non-transitory computer-readable storage medium storing instructions executable by one or more processors to cause a computer system to: obtain, from at least two distinct data sources, a first set of data entries and a second set of data entries, each data entry comprising a plurality of data fields including textual content; determine candidate mappings between data entries in the first set and data entries in the second set by: generating multiple similarity measures for corresponding data fields using two or more different types of matching techniques; and evaluating the similarity measures, including combining the similarity measures into an aggregate similarity value, and identifying candidate mappings whose evaluation satisfies a mapping criterion; verify at least one aspect of the candidate mappings using a verification analysis configured to evaluate whether one or more relationship parameters in the mappings satisfy at least one verification criterion; generate a mapping report identifying the candidate mappings determined to satisfy both the mapping criterion and the verification criterion; present the mapping report to a user and receive modifications or feedback from at least one user, system process, or external data source relating to at least one mapping; update at least one of the matching techniques, similarity measures, thresholds, or decision criteria based on the received modifications or feedback; and reevaluate or update at least part of the candidate mappings using the updates, and update the mapping report accordingly.
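The string-level measures recited in claims 3, 10, 11, and 13 can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the claimed implementation: `difflib.SequenceMatcher` from the standard library stands in for whatever similarity measure an embodiment uses, and the function names, the field names, the 0.8 threshold, and the two-field minimum are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, turn punctuation and delimiters into spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]|_", " ", text.lower())
    return " ".join(text.split())


def whole_ratio(a: str, b: str) -> float:
    """Similarity of the entire normalized strings (first score of claim 10)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def partial_ratio(a: str, b: str) -> float:
    """Best similarity of the shorter string against any equally long
    window of the longer string (second score of claim 10)."""
    short, long_ = sorted((normalize(a), normalize(b)), key=len)
    if not short:
        return 0.0
    return max(
        SequenceMatcher(None, short, long_[i:i + len(short)]).ratio()
        for i in range(len(long_) - len(short) + 1)
    )


def _sorted_tokens(text: str) -> str:
    """Split into tokens, sort them, and rejoin with single spaces."""
    return " ".join(sorted(normalize(text).split()))


def token_sort_ratio(a: str, b: str) -> float:
    """Similarity after ordering the tokens of each string (claim 11)."""
    return SequenceMatcher(None, _sorted_tokens(a), _sorted_tokens(b)).ratio()


def satisfies_criterion(entry_a: dict, entry_b: dict,
                        threshold: float = 0.8, min_fields: int = 2) -> bool:
    """Claim 13 style check: at least min_fields shared fields have a
    best similarity measure above the threshold."""
    shared = entry_a.keys() & entry_b.keys()
    hits = sum(
        max(whole_ratio(entry_a[f], entry_b[f]),
            partial_ratio(entry_a[f], entry_b[f]),
            token_sort_ratio(entry_a[f], entry_b[f])) > threshold
        for f in shared
    )
    return hits >= min_fields
```

In this sketch, `token_sort_ratio("Wire Transfer Limits", "limits: wire transfer")` treats the two titles as identical because only token order and punctuation differ, while `whole_ratio` on the same pair scores lower. Taking the maximum of the three measures per field is one possible design choice; an embodiment could equally evaluate each measure against its own threshold.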
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a continuation of U.S. patent application Ser. No. 18/136,577 filed Apr. 19, 2023, which is a continuation of U.S. patent application Ser. No. 17/109,027 filed Dec. 1, 2020, which claims priority to U.S. Provisional Patent Application No. 62/989,479 titled “MAJOR REQUIREMENT EVALUATION TO PROCESS MAPPING,” filed Mar. 13, 2020, each of which is hereby incorporated by reference in its entirety.
TECHNICAL FIELD
The present disclosure relates to normalizing datasets and mapping of disparate datasets having varying fields and field types, formats, and/or content.
BACKGROUND
Data may be managed, used, and stored in separate systems. However, there may be cases where data in a first dataset, stored and managed in a first system, may supplement or modify data in a second dataset, stored and managed in a second system. For example, two different users (groups of users, entities, and the like) may independently manage data. While the users manage the data differently (different format, varying language, varying length, varying descriptions of detail), the data from the first dataset may be applied to supplement (benefit or constrain) the data in the second dataset. The data in the first dataset may also supplement data in one or more other datasets. Further, the data in the second dataset may be supplemented by one or more other datasets in addition to the data in the first dataset. Accordingly, an effective mechanism for identifying the interrelatedness of vast amounts of data of varying types is necessary. Manual attempts at determining interrelations between data may be inconsistent, unreliable, and/or infeasible given the volumes of data and the inability of reviewers to spend time tracking changes in the data and re-identifying relationships.
SUMMARY
In one aspect, various embodiments of the disclosure relate to a computing system comprising: a memory storing instructions; and a processor configured to execute the instructions to perform operations comprising: retrieving, from a first system, a plurality of first data entries, each of the first data entries comprising a first plurality of data fields; retrieving, from a second system, a plurality of second data entries, each of the second data entries comprising a second plurality of data fields; performing a mapping operation on the first and second data entries, the mapping operation comprising an application of a combination of (i) one or more precision matching algorithms, (ii) one or more concordance matching algorithms, and (iii) one or more text analytics algorithms to the first plurality of data fields and the second plurality of data fields, the mapping operation comprising generating similarity scores for the first and second pluralities of data fields; generating, based on the similarity scores and one or more thresholds, a map connecting first data entries to second data entries; and displaying the map indicating which ones of the first data entries are connected to which ones of the second data entries.
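The mapping operation summarized above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: each entry is reduced to a single text field, the `SYNONYMS` dictionary, the `WEIGHTS`, the 0.5 threshold, and all function names are hypothetical, and bag-of-words cosine similarity stands in for the text analytics algorithms.

```python
import math
import re
from collections import Counter

# Hypothetical concordance dictionary: tokens mapped to a canonical phrase.
SYNONYMS = {"kyc": "know your customer", "aml": "anti money laundering"}

# Hypothetical weights for the three algorithm families.
WEIGHTS = {"precision": 0.4, "concordance": 0.3, "text": 0.3}


def normalize(text: str) -> str:
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]|_", " ", text.lower())
    return " ".join(text.split())


def expand(text: str) -> str:
    """Replace each token found in the dictionary with its canonical phrase."""
    return " ".join(SYNONYMS.get(tok, tok) for tok in normalize(text).split())


def precision_score(a: str, b: str) -> float:
    """1.0 when the normalized strings are identical, else 0.0."""
    return 1.0 if normalize(a) == normalize(b) else 0.0


def concordance_score(a: str, b: str) -> float:
    """1.0 when the strings agree after dictionary expansion, else 0.0."""
    return 1.0 if expand(a) == expand(b) else 0.0


def text_score(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words token counts."""
    ca, cb = Counter(expand(a).split()), Counter(expand(b).split())
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def similarity(a: str, b: str) -> float:
    """Weighted combination of the three family scores."""
    return (WEIGHTS["precision"] * precision_score(a, b)
            + WEIGHTS["concordance"] * concordance_score(a, b)
            + WEIGHTS["text"] * text_score(a, b))


def build_map(first: dict, second: dict, threshold: float = 0.5) -> dict:
    """Connect each first-entry id to the second-entry ids whose
    similarity score meets the threshold."""
    return {
        fid: [sid for sid, stext in second.items()
              if similarity(ftext, stext) >= threshold]
        for fid, ftext in first.items()
    }
```

For example, calling `build_map` with a first dataset `{"R1": "KYC review"}` and a second dataset `{"P1": "Know Your Customer review"}` connects `R1` to `P1`: the precision score is zero, but the dictionary expansion makes the concordance and text scores high enough to clear the threshold.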
Various embodiments of the disclosed inventions relate to a computer-implemented method, comprising: obtaining, from a first system, a set of requirements defined by a first set of data fields comprising a first plurality of user-entered free-form phrases; obtaining, from a second system, a set of processes defined by a second set of data fields comprising a second plurality of user-entered free-form phrases; generating, for each process in the set of processes, a subset of the set of requirements impacting the process by performing a mapping operation configured to map processes to requirements by evaluating one or more similarity scores based on the first and second sets of data fields, wherein the mapping operation comprises an application of a combination of (i) one or more exact matching algorithms, (ii) one or more dictionary matching algorithms, and (iii) one or more text mining algorithms to the first set of data fields and the second set of data fields; and displaying a map linking the set of requirements to the set of processes, the map indicating which requirements are connected to which processes.
Various embodiments of the disclosed inventions relate to a computer-implemented method, comprising: retrieving, from a first system, a plurality of first data entries, each of the first data entries comprising a first plurality of data fields; retrieving, from a second system, a plurality of second data entries, each of the second data entries comprising a second plurality of data fields; performing a mapping operation on the first and second data entries, the mapping operation comprising an application of a combination of (i) one or more precision matching algorithms, (ii) one or more concordance matching algorithms, and (iii) one or more text analytics algorithms to the first plurality of data fields and the second plurality of data fields, the mapping operati