US-12619575-B2 - Systems and methods for performant data matching
Abstract
The present disclosure is directed to systems and methods for performant data matching. Entities maintain large amounts of data and desire to reconcile duplicative records. One way to solve this problem is through data matching; however, standard data matching at the record level can be laborious and inefficient. To remedy these inefficiencies, the present disclosure describes a system in which token records are tokenized a second time into token sets when they satisfy at least one token set rule. A token set rule may be based on the common presence of multiple tokens in a token record. If multiple token records contain the tokens required by the token set rule, those token records can be hashed and rolled up into the token set (i.e., tokenized a second time into the token set). The token set allows for more efficient data matching.
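The two-stage tokenization described above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the patented implementation: the SHA-256 hashing, the salt handling, and the field names (`name`, `dob`, `email`, `phone`) are all hypothetical choices made for the example.

```python
import hashlib

SALT = "example-salt"  # hypothetical; a real system would manage salts securely


def tokenize(value: str) -> str:
    """First-stage tokenization: hash an individual attribute value."""
    return hashlib.sha256((SALT + value.strip().lower()).encode()).hexdigest()


def roll_up(token_record, required_types):
    """Second-stage tokenization: if the record has every token type the
    token set rule requires, hash those tokens together into one token set."""
    if not set(required_types).issubset(token_record):
        return None  # record does not satisfy the token set rule
    joined = "|".join(token_record[t] for t in sorted(required_types))
    return hashlib.sha256(joined.encode()).hexdigest()


record = {
    "name": tokenize("Jane Doe"),
    "dob": tokenize("1990-01-01"),
    "email": tokenize("jane@example.com"),
}
token_set = roll_up(record, {"name", "dob"})
```

Because the attribute values are normalized before hashing, two records that agree on the rule's required attributes roll up to the same token set, so a single token-set comparison can stand in for many pairwise token comparisons.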
Inventors
- Curtiss W. Schuler
- Brett A. Norris
- Satyender Goel
Assignees
- COLLIBRA BELGIUM BV
Dates
- Publication Date
- 20260505
- Application Date
- 20240916
Claims (20)
- 1. A method comprising: increasing efficiency of token comparison by determining, using at least one hashing roll-up function including at least one token set rule, that a first record and a second record satisfy the at least one token set rule, wherein the first record includes a first token having a first token type and the second record includes a second token having a second token type, wherein the token set rule indicates one or more token types that a token needs to have to be associated with the token set rule, and wherein determining includes determining that the first record and the second record include the one or more token types associated with the token set rule; merging the first record and the second record into a token set; determining the token set is not present in at least one token repository; and in response to determining the token set is not present in the at least one token repository, storing the token set in the at least one token repository.
- 2. The method of claim 1, further comprising: retokenizing the token set into a single token; and comparing the single token to at least one token record from a third source.
- 3. The method of claim 1, further comprising: comparing the token set to at least one token record from a third source; and retokenizing, by the at least one hashing roll-up function, the at least one token record from the third source into the token set.
- 4. The method of claim 1, further comprising: identifying at least one overlapping token record in a third source based on the at least one hashing roll-up function.
- 5. The method of claim 1, further comprising: identifying at least one duplicate token record in a third source based on the at least one hashing roll-up function.
- 6. The method of claim 1, wherein each of the first record and the second record comprise at least one common token.
- 7. The method of claim 1, wherein the at least one token set rule is based on a presence of at least one common token, and wherein the at least one token repository is comprised of a plurality of token records from at least one of: a customer source or a reference source.
- 8. A system comprising: one or more processors; and one or more memories storing instructions that, when executed by the one or more processors, cause the system to perform a process comprising: increasing efficiency of token comparison by determining, using at least one hashing roll-up function including at least one token set rule, that a first record and a second record satisfy the at least one token set rule, wherein the first record includes a first token having a first token type and the second record includes a second token having a second token type, wherein the token set rule indicates one or more token types that a token needs to have to be associated with the token set rule, and wherein determining includes determining that the first record and the second record include the one or more token types associated with the token set rule; merging the first record and the second record into a token set; determining the token set is not present in at least one token repository; and in response to determining the token set is not present in the at least one token repository, storing the token set in the at least one token repository.
- 9. The system of claim 8, wherein the process further comprises: retokenizing the token set into a single token; and comparing the single token to at least one token record from a third source.
- 10. The system of claim 8, wherein the process further comprises: comparing the token set to at least one token record from a third source; and retokenizing, by the at least one hashing roll-up function, the at least one token record from the third source into the token set.
- 11. The system of claim 8, wherein the process further comprises: identifying at least one overlapping token record in a third source based on the at least one hashing roll-up function.
- 12. The system of claim 8, wherein the process further comprises: identifying at least one duplicate token record in a third source based on the at least one hashing roll-up function.
- 13. The system of claim 8, wherein each of the first record and the second record comprise at least one common token.
- 14. The system of claim 8, wherein the at least one token set rule is based on a presence of at least one common token, and wherein the at least one token repository is comprised of a plurality of token records from at least one of: a customer source or a reference source.
- 15. A non-transitory computer-readable medium storing instructions that, when executed by a computing system, cause the computing system to perform operations comprising: increasing efficiency of token comparison by determining, using at least one hashing roll-up function including at least one token set rule, that a first record and a second record satisfy the at least one token set rule, wherein the first record includes a first token having a first token type and the second record includes a second token having a second token type, wherein the token set rule indicates one or more token types that a token needs to have to be associated with the token set rule, and wherein determining includes determining that the first record and the second record include the one or more token types associated with the token set rule; merging the first record and the second record into a token set; determining the token set is not present in at least one token repository; and in response to determining the token set is not present in the at least one token repository, storing the token set in the at least one token repository.
- 16. The non-transitory computer-readable medium of claim 15, wherein the operations further comprise: retokenizing the token set into a single token; and comparing the single token to at least one token record from a third source.
- 17. The non-transitory computer-readable medium of claim 15, wherein the operations further comprise: comparing the token set to at least one token record from a third source; and retokenizing, by the at least one hashing roll-up function, the at least one token record from the third source into the token set.
- 18. The non-transitory computer-readable medium of claim 15, wherein the operations further comprise: identifying at least one overlapping token record in a third source based on the at least one hashing roll-up function.
- 19. The non-transitory computer-readable medium of claim 15, wherein the operations further comprise: identifying at least one duplicate token record in a third source based on the at least one hashing roll-up function.
- 20. The non-transitory computer-readable medium of claim 15, wherein the at least one token set rule is based on a presence of at least one common token, and wherein the at least one token repository is comprised of a plurality of token records from at least one of: a customer source or a reference source.
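The method of claim 1 can be sketched end to end: check that both records satisfy the token set rule, merge them into a token set via a hashing roll-up, and store the token set only if it is absent from the token repository. This is a minimal sketch under assumptions, not the claimed implementation: the in-memory `set` repository, the SHA-256 roll-up, and the placeholder token values are all hypothetical.

```python
import hashlib


def satisfies_rule(token_record, rule_types):
    """A record satisfies the token set rule if it includes every token
    type the rule indicates."""
    return set(rule_types).issubset(token_record)


def merge_into_token_set(first, second, rule_types):
    """Hashing roll-up: combine the rule's tokens from both records into
    a single deterministic token-set value."""
    tokens = sorted({record[t] for record in (first, second) for t in rule_types})
    return hashlib.sha256("|".join(tokens).encode()).hexdigest()


def store_if_absent(token_set, repository):
    """Store the token set only when it is not already in the repository."""
    if token_set not in repository:
        repository.add(token_set)
        return True
    return False


# Hypothetical token values standing in for previously tokenized attributes.
repository = set()
rule_types = ("name", "dob")
first = {"name": "tok_a1", "dob": "tok_b1", "email": "tok_c1"}
second = {"name": "tok_a1", "dob": "tok_b1", "phone": "tok_d1"}
if satisfies_rule(first, rule_types) and satisfies_rule(second, rule_types):
    token_set = merge_into_token_set(first, second, rule_types)
    store_if_absent(token_set, repository)
```

Because the roll-up sorts and deduplicates the tokens before hashing, the same pair of records always yields the same token set, and the repository check keeps the repository free of duplicate token sets.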
Description
CROSS-REFERENCE TO RELATED APPLICATION(S)
This application is a continuation of U.S. patent application Ser. No. 18/334,965, filed Jun. 14, 2023, entitled “SYSTEMS AND METHODS FOR PERFORMANT DATA MATCHING,” which is a continuation of U.S. patent application Ser. No. 17/369,798, filed Jul. 7, 2021, entitled “SYSTEMS AND METHODS FOR PERFORMANT DATA MATCHING,” which is related to U.S. patent application Ser. No. 16/776,293, titled “SYSTEMS AND METHOD OF CONTEXTUAL DATA MASKING FOR PRIVATE AND SECURE DATA LINKAGE”; U.S. patent application Ser. No. 17/103,751, titled “SYSTEMS AND METHODS FOR UNIVERSAL REFERENCE SOURCE CREATION AND ACCURATE SECURE MATCHING”; U.S. patent application Ser. No. 17/103,720, titled “SYSTEMS AND METHODS FOR DATA ENRICHMENT”; and U.S. patent application Ser. No. 17/219,340, titled “SYSTEMS AND METHODS FOR AN ON-DEMAND, SECURE, AND PREDICTIVE VALUE-ADDED DATA MARKETPLACE,” all of which are hereby incorporated by reference in their entirety.
TECHNICAL FIELD
The present disclosure relates to data matching techniques.
BACKGROUND
Entities maintain large amounts of data that may be disorganized and/or incomplete. For example, an entity may maintain more than one incomplete record related to a subject, such as an individual, product, or organization. One record may contain a subject's address, email, gender, and geographic location, while another may contain the subject's name, address, phone number, date of birth, and credit card information. Each of these records may be incomplete, and similar incomplete records may exist for products and organizations. Currently, entities desiring to reconcile these disjointed records typically must combine them manually, which results in inefficient, time-consuming processes as well as potential exposure of personally identifiable information. Linking, de-duplicating, or finding matches between data assets remains one of the biggest problems for most organizations due to the breadth of data sources.
Multi-channel data collection and in-house replication to serve the needs of customers and business stakeholders pose reconciliation challenges. Other problems, such as incompleteness, data entry mistakes, and changes over time, make it difficult to establish a master (i.e., “golden”) record of the data. Another issue entities face is ensuring the integrity of the records they possess. For example, an entity may have two incomplete records seemingly associated with the same data subject, yet one record may list a different email address or phone number than the other. Such discrepancies decrease the integrity of data records and make it more difficult for an entity to reconcile multiple incomplete records, because the entity may be unsure which record is actually correct, or to what extent a given record is correct. Modern-day enterprises are hampered when it comes to accurate data collection and reconciliation. Further, data matching and data linkage are computationally expensive when applied to data subjects with multiple features, characteristics, and attributes, such as people. The underlying complexity of data matching for these subjects is multiplied by the permutations that must be processed to achieve higher match accuracy. As such, there is an increased need for systems and methods that can address the challenges of modern-day data collection and reconciliation, including the inefficiency of matching multiple incomplete records to the same data subject, the loss of integrity when records are inconsistent, and the potential exposure of personally identifiable information (PII), sensitive business information, and/or any other form of confidential information when attempts are made to match and reconcile data records. There is an increased need for tokenization of attributes and sets of attributes to increase the efficiency of data matching.
It is with respect to these and other general considerations that the aspects disclosed herein have been made. Also, although relatively specific problems may be discussed, it should be understood that the examples should not be limited to solving the specific problems identified in the background or elsewhere in the disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
Non-limiting and non-exhaustive examples are described with reference to the following figures.
FIG. 1 illustrates an example of a distributed system for performant data matching, as described herein.
FIG. 2 illustrates an example input processor for implementing systems and methods for performant data matching, as described herein.
FIG. 3 illustrates an example method for matching token sets using performant data matching techniques.
FIG. 4 illustrates an example of a distributed system that includes a Customer environment and a Consolidation Platform environment for performant data matching.
FIG. 5 illustrates example data sources and tokens.
FIG. 6 illustrat