US-20260127624-A1 - ENTITY UNIQUE IDENTIFIER GENERATION

Abstract

This disclosure describes approaches that can be used to identify records in data sources that relate to the same entity, for example the same individual. The approaches herein can be used to identify entities in large datasets efficiently using a combination of blocking, prediction, and clustering. The approach herein can include assigning unique identifiers to entities. New or updated records can be processed and assigned either existing unique identifiers or new unique identifiers.
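
The handling of new or updated records mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name, the dictionary-based record layout, and the `same_entity` predicate are all hypothetical, and the rule used here for deciding that a new cluster "corresponds" to an existing one (any cross-cluster pair judged to be the same entity) is an assumption, since the text does not specify the test.

```python
import uuid


def assign_uids_to_new_clusters(new_clusters, existing_clusters, same_entity):
    """Reuse an existing UID when a new cluster corresponds to a known one.

    new_clusters: list of clusters, each a list of records.
    existing_clusters: dict mapping UID -> list of records already assigned.
    same_entity: hypothetical predicate(record, record) -> bool standing in
        for a trained pairwise model plus a threshold.
    """
    assigned = {}
    for cluster in new_clusters:
        # Assumption: one matching pair is enough to treat clusters as the same.
        uid = next(
            (known_uid for known_uid, members in existing_clusters.items()
             if any(same_entity(new, old) for new in cluster for old in members)),
            None,
        )
        if uid is None:  # no corresponding cluster: mint a new identifier
            uid = str(uuid.uuid4())
        assigned.setdefault(uid, []).extend(cluster)
    return assigned
```

A caller would pass the clusters produced from the new records together with the previously assigned clusters; clusters that match nothing receive a fresh random UID.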

Inventors

Amina Noor
Mohammad Mustafa Bari

Assignees

T-MOBILE USA, INC.

Dates

Publication Date: 20260507
Application Date: 20241107

Claims (20)

1. A method for generating unique identifiers for entities in datasets, the method comprising: accessing a first dataset and a second dataset comprising a plurality of records; pre-processing the first dataset and the second dataset, wherein pre-processing comprises removing duplicate records from the first dataset and the second dataset, and wherein pre-processing comprises converting dates to a standardized format; applying one or more blocking rules to the plurality of records to determine a plurality of blocks each comprising a subset of the plurality of records, wherein the one or more blocking rules comprise equal address double metaphone and equal first name Soundex, wherein records are grouped into a same block if either rule is satisfied; applying, for each block of the plurality of blocks, one or more pairwise comparisons of fields in the records in the block to determine one or more match weights, wherein each pairwise comparison comprises a comparison field and at least one comparison level, wherein each comparison level is associated with a match weight, wherein the match weight is determined at least in part by training a model using an expectation maximization algorithm; determining, for at least one field, a term frequency adjustment, wherein the term frequency adjustment is used to adjust a match weight based on a frequency of a value of the at least one field in the first dataset and the second dataset; clustering the records of each block based on the determined match weights in a first set of clusters, wherein records are clustered if the match weights are above a threshold value; assigning a unique identifier (UID) to each cluster; accessing a set of new records; applying the one or more blocking rules to the new records to group the new records into a second plurality of blocks; applying the one or more pairwise comparisons to the new records in each of the second plurality of blocks; clustering the new records based on the pairwise comparisons to generate a second set of clusters; determining, for each cluster of the second set of clusters, if the cluster corresponds to a cluster of the first set of clusters; when the cluster of the second set of clusters corresponds to a cluster of the first set of clusters, assigning the UID of the cluster of the first set of clusters to the cluster of the second set of clusters; and when the cluster of the second set of clusters does not correspond to a cluster of the first set of clusters, assigning a new UID to the cluster of the second set of clusters.
2. A method for generating unique identifiers for entities in datasets, the method comprising: accessing a first dataset and a second dataset comprising a plurality of records; pre-processing the first dataset and the second dataset, wherein pre-processing comprises removing duplicate records from the first dataset and the second dataset; applying one or more blocking rules to the plurality of records to determine a plurality of blocks each comprising a subset of the plurality of records; applying, for each block of the plurality of blocks, one or more pairwise comparisons of fields in the records in the block to determine one or more match weights; clustering the records of each block based on the determined match weights in a first set of clusters, wherein records are clustered if the match weights are above a threshold value; and assigning a unique identifier (UID) to each cluster.
3. The method of claim 2, further comprising: accessing a set of new records; applying the one or more blocking rules to the new records to group the new records into a second plurality of blocks; applying the one or more pairwise comparisons to the new records in each of the second plurality of blocks; clustering the new records based on the pairwise comparisons to generate a second set of clusters; determining, for each cluster of the second set of clusters, if the cluster corresponds to a cluster of the first set of clusters; when the cluster of the second set of clusters corresponds to a cluster of the first set of clusters, assigning the UID of the cluster of the first set of clusters to the cluster of the second set of clusters; and when the cluster of the second set of clusters does not correspond to a cluster of the first set of clusters, assigning a new UID to the cluster of the second set of clusters.
4. The method of claim 2, wherein the one or more blocking rules comprise a phonetic matching rule.
5. The method of claim 2, wherein the one or more blocking rules comprise equal address double metaphone and equal first name Soundex, wherein the records are grouped into a same block if either rule is satisfied.
6. The method of claim 2, wherein each pairwise comparison comprises a comparison field and at least one comparison level, wherein each comparison level is associated with a match weight, wherein the match weight is determined at least in part by training a model using an expectation maximization algorithm.
7. The method of claim 6, wherein the comparison field comprises first name and the comparison levels comprise Jaro-Winkler similarity and Levenshtein distance.
8. The method of claim 2, further comprising: determining, for at least one field, a term frequency adjustment, wherein the term frequency adjustment is used to adjust a match weight based on a frequency of a value of the at least one field in the first dataset and the second dataset.
9. The method of claim 5, wherein numbers in address values are not considered.
10. The method of claim 2, wherein the first dataset and the second dataset comprise a customer dataset and a marketing dataset, wherein the customer dataset includes account numbers associated with customers.
11. The method of claim 2, wherein preprocessing further comprises at least one of: converting dates to a standardized format, removing leading whitespace, removing trailing whitespace, removing prefixes, or removing suffixes.
12. A system for generating unique identifiers for entities in datasets, the system comprising: at least one hardware processor; and a non-transitory medium having instructions stored thereon that, when executed by the at least one hardware processor, cause the system to: access a first dataset and a second dataset comprising a plurality of records; pre-process the first dataset and the second dataset, wherein pre-processing comprises removing duplicate records from the first dataset and the second dataset; apply one or more blocking rules to the plurality of records to determine a plurality of blocks each comprising a subset of the plurality of records; apply, for each block of the plurality of blocks, one or more pairwise comparisons of fields in the records in the block to determine one or more match weights; cluster the records of each block based on the determined match weights in a first set of clusters, wherein records are clustered if the match weights are above a threshold value; and assign a unique identifier (UID) to each cluster.
13. The system of claim 12, wherein the instructions are further configured to cause the system to: access a set of new records; apply the one or more blocking rules to the new records to group the new records into a second plurality of blocks; apply the one or more pairwise comparisons to the new records in each of the second plurality of blocks; cluster the new records based on the pairwise comparisons to generate a second set of clusters; determine, for each cluster of the second set of clusters, if the cluster corresponds to a cluster of the first set of clusters; when the cluster of the second set of clusters corresponds to a cluster of the first set of clusters, assign the UID of the cluster of the first set of clusters to the cluster of the second set of clusters; and when the cluster of the second set of clusters does not correspond to a cluster of the first set of clusters, assign a new UID to the cluster of the second set of clusters.
14. The system of claim 12, wherein the one or more blocking rules comprise a phonetic matching rule.
15. The system of claim 12, wherein the one or more blocking rules comprise equal address double metaphone and equal first name Soundex, wherein the records are grouped into a same block if either rule is satisfied.
16. The system of claim 12, wherein each pairwise comparison comprises a comparison field and at least one comparison level, wherein each comparison level is associated with a match weight, wherein the match weight is determined at least in part by training a model using an expectation maximization algorithm.
17. The system of claim 16, wherein the comparison field comprises first name and the comparison levels comprise Jaro-Winkler similarity and Levenshtein distance.
18. The system of claim 12, wherein the instructions are further configured to cause the system to: determine, for at least one field, a term frequency adjustment, wherein the term frequency adjustment is used to adjust a match weight based on a frequency of a value of the at least one field in the first dataset and the second dataset.
19. The system of claim 12, wherein the first dataset and the second dataset comprise a customer dataset and a marketing dataset, wherein the customer dataset includes account numbers associated with customers.
20. The system of claim 12, wherein preprocessing further comprises at least one of: converting dates to a standardized format, removing leading whitespace, removing trailing whitespace, removing prefixes, or removing suffixes.
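
The blocking, pairwise comparison, clustering, and UID assignment steps recited in claims 2, 5, 6, 7, and 9 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the claimed implementation. The record layout (`first_name`, `address`), the values in `WEIGHTS`, the threshold, and every function name are hypothetical. The per-word Soundex address key is a simplified stand-in for double metaphone, and only the Levenshtein comparison level is shown (Jaro-Winkler is omitted). Treating "either rule is satisfied" as a union of candidate pairs, and clustering by connected components, are one plausible reading of the claims. In the claims the weights come from a model trained with expectation maximization rather than from constants.

```python
import itertools
import re
import uuid
from collections import defaultdict

# Hypothetical match weights per comparison level. In the claims these come
# from a model trained with expectation maximization, not from constants.
WEIGHTS = {
    "first_name": {"exact": 6.0, "close": 3.0, "else": -4.0},
    "address": {"same_key": 5.0, "else": -3.0},
}
_SOUNDEX = {c: d for d, cs in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                               "4": "L", "5": "MN", "6": "R"}.items() for c in cs}


def soundex(name):
    """American Soundex: first letter plus three digits, zero padded."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out, prev = letters[0], _SOUNDEX.get(letters[0], "")
    for c in letters[1:]:
        code = _SOUNDEX.get(c, "")
        if code and code != prev:
            out += code
        if c not in "HW":  # H and W do not break a run of equal codes
            prev = code
    return (out + "000")[:4]


def address_key(address):
    """Phonetic key over address words; digits are ignored (claim 9).
    Simplified stand-in for double metaphone."""
    return " ".join(soundex(w) for w in re.findall(r"[A-Za-z]+", address))


def levenshtein(a, b):
    """Edit distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def candidate_pairs(records):
    """Index pairs sharing a first-name Soundex OR an address key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks["first_name", soundex(rec["first_name"])].append(idx)
        blocks["address", address_key(rec["address"])].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(itertools.combinations(members, 2))
    return pairs


def match_weight(a, b):
    """Sum the weight of the comparison level reached for each field."""
    fa, fb = a["first_name"].lower(), b["first_name"].lower()
    level = "exact" if fa == fb else "close" if levenshtein(fa, fb) <= 2 else "else"
    same = address_key(a["address"]) == address_key(b["address"])
    return WEIGHTS["first_name"][level] + WEIGHTS["address"]["same_key" if same else "else"]


def cluster_and_assign_uids(records, threshold=5.0):
    """Union records whose pair weight exceeds the threshold; one UID per cluster."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in candidate_pairs(records):
        if match_weight(records[i], records[j]) > threshold:
            parent[find(i)] = find(j)
    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)
    return {str(uuid.uuid4()): members for members in clusters.values()}
```

Because blocking limits comparisons to records that share a phonetic key, the number of scored pairs stays far below the all-pairs count on large datasets. The returned dictionary maps each randomly generated UID to the indices of the records in that cluster.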

Description

BACKGROUND

Identifying data corresponding to the same entity across data sources is important but has proven challenging. This can be especially true for large datasets.

BRIEF DESCRIPTION OF THE DRAWINGS

Detailed descriptions of implementations of the present invention will be described and explained through the use of the accompanying drawings.

FIG. 1 is a block diagram that illustrates a wireless communications system that can implement aspects of the present technology.
FIG. 2 is a block diagram that illustrates 5G core network functions (NFs) that can implement aspects of the present technology.
FIG. 3 is a drawing that schematically illustrates blocking according to some implementations.
FIG. 4 is a drawing that schematically illustrates clustering according to some embodiments.
FIG. 5 is a flowchart that illustrates an example process for training and deploying a matching model according to some embodiments.
FIG. 6 is a flowchart that illustrates an example process for assigning unique identifiers according to some implementations.
FIG. 7 shows a process for processing new or updated records according to some implementations.
FIG. 8 is a flowchart that illustrates an example process for initial and incremental linking according to some implementations.
FIG. 9 is a block diagram that illustrates an example of a computer system in which at least some operations described herein can be implemented.

The technologies described herein will become more apparent to those skilled in the art from studying the Detailed Description in conjunction with the drawings. Embodiments or implementations describing aspects of the invention are illustrated by way of example, and the same references can indicate similar elements. While the drawings depict various implementations for the purpose of illustration, those skilled in the art will recognize that alternative implementations can be employed without departing from the principles of the present technologies. Accordingly, while specific implementations are shown in the drawings, the technology is amenable to various modifications.

DETAILED DESCRIPTION

The description and associated drawings are illustrative examples and are not to be construed as limiting. This disclosure provides certain details for a thorough understanding and enabling description of these examples. One skilled in the relevant technology will understand, however, that the invention can be practiced without many of these details. Likewise, one skilled in the relevant technology will understand that the invention can include well-known structures or features that are not shown or described in detail, to avoid unnecessarily obscuring the descriptions of examples.

Organizations often have a large amount of data related to entities, such as individual customers, business customers, leads, and so forth. This data can come from many different sources and can be stored in a variety of formats such as databases, spreadsheets, and so forth. Entity data can include internally-sourced data (e.g., account information), data provided by third parties (e.g., marketing firms, credit reporting firms), or both. Entity data can include a wide variety of identifying information, such as name, address, phone number, e-mail address, account number, and so forth. However, different sources of data can include different identifying information and, even when the same fields are included (e.g., name), errors, variations in data entry rules, etc., can mean that the same entity (e.g., the same person) has different identifying information in different data sources.

Organizations can make use of entity data for many purposes, such as upselling, cross-selling, retention, finding new customers, and so forth. It can be important to identify data records across sources that correspond to the same entity. For example, a database of leads data can include existing customers. Sending a new customer offer to an existing customer can result in a poor customer experience and wasted marketing money.

Typically, there is no unique identifier that spans datasets. For example, an internal database of customers can include account numbers, but marketing data from an external source would generally not have account numbers. The lack of a unique identifier can present significant challenges. For example, consider a wireless telecommunications company that offers both cellular service and wireless high speed internet (HSI) service. If the customers for the cellular phone service and the HSI service are stored in different databases and/or have different account numbers, it can be difficult to determine which cellular customers already have HSI. As a result, it can be difficult to target promotions or offers at customers who already have cellular service but do not have HSI. As another example, contact information for marketing campaigns can be purchased from third parties, but some of the individuals included in the contact information may alr