US-12626155-B2 - Machine learning model for entity resolution

Abstract

In some implementations, a system may define common attributes of a first dataset and a second dataset. The system may generate a candidate set of mappings between one or more entities in the first dataset and one or more entities in the second dataset based on candidate generation criteria associated with a related pair of common attributes. The system may generate feature sets for the candidate set of mappings based on the common attributes and a featurization configuration. The system may train a machine learning model for performing entity resolution between the first dataset and the second dataset. The system may perform entity resolution between the first dataset and the second dataset based on the feature sets for the candidate set of mappings using the trained machine learning model.
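The candidate generation and featurization steps summarized above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the toy records, the attribute names, the exact-match candidate generation criterion, and the two featurization functions are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy datasets; attribute names differ between the two sources.
first = [
    {"id": "a1", "name": "Acme Corp", "zip": "10001"},
    {"id": "a2", "name": "Globex LLC", "zip": "60601"},
]
second = [
    {"key": "b1", "company": "ACME Corporation", "postal": "10001"},
    {"key": "b2", "company": "Initech", "postal": "10001"},
    {"key": "b3", "company": "Globex", "postal": "60601"},
]

# Related pairs of common attributes: (attribute in first, attribute in second).
COMMON = [("name", "company"), ("zip", "postal")]


def generate_candidates(first, second, pair):
    """Assumed criterion: keep pairs whose values match exactly on one related pair."""
    a, b = pair
    return [(x, y) for x, y in product(first, second) if x[a] == y[b]]


def string_similarity(u, v):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, u.lower(), v.lower()).ratio()


def exact_match(u, v):
    """1.0 when the two values are identical, otherwise 0.0."""
    return 1.0 if u == v else 0.0


# Assumed featurization configuration: one feature function per related pair.
FEATURIZATION = {
    ("name", "company"): string_similarity,
    ("zip", "postal"): exact_match,
}


def featurize(candidates):
    """Build one feature set (a list of floats) per candidate mapping."""
    return [
        [FEATURIZATION[pair](x[pair[0]], y[pair[1]]) for pair in COMMON]
        for x, y in candidates
    ]


candidates = generate_candidates(first, second, ("zip", "postal"))
features = featurize(candidates)
```

Restricting comparisons to records that share a postal code keeps the candidate set much smaller than the full cross product of the two datasets, which is the usual reason for applying candidate generation criteria before featurization.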

Inventors

Fan Feng
Allison Fenichel
Illiana Reed

Assignees

CAPITAL ONE SERVICES, LLC

Dates

Publication Date: 2026-05-12
Application Date: 2021-02-10

Claims (20)

1. A system for generating a trained machine learning model for performing entity resolution, the system comprising:
  one or more memories; and
  one or more processors, communicatively coupled to the one or more memories, configured to:
    receive, from a client device, information identifying a first dataset and a second dataset;
    estimate and reserve, based on the first dataset and the second dataset, hardware device memory resources for training a machine learning model, by determining a first size of the first dataset and a second size of the second dataset and utilizing a trained machine learning memory prediction model that inputs the first size and the second size and outputs a prediction of hardware device memory resources usage;
    define a first set of common attributes of the first dataset and a second set of common attributes of the second dataset;
    identify unshared attributes of the first dataset and the second dataset;
    define candidate generation criteria relating to at least one pair of related common attributes from the first set of common attributes and the second set of common attributes;
    generate a candidate set of mappings between one or more entities in the first dataset and one or more entities in the second dataset based on the candidate generation criteria;
    determine a featurization configuration for the first set of common attributes and the second set of common attributes;
    generate feature sets for the candidate set of mappings based on the first set of common attributes, the second set of common attributes, and the featurization configuration and the candidate set;
    receive, from the client device, model configuration information;
    train, based on the reserved memory resources, the machine learning model for performing entity resolution between the first dataset and the second dataset based on the model configuration information, resulting in the trained machine learning model, wherein training the machine learning model includes a cross-validation procedure that comprises:
      splitting a training set into a training group and a hold-out group;
      generating a cross-validation score based on testing the machine learning model on the hold-out group after training the machine learning model on the training group and based on not using a test set; and
      combining generated cross-validation scores of training procedures to generate an overall cross-validation score for the trained machine learning model; and
    wherein a plurality of trained machine learning models are generated based on performing the cross-validation for a plurality of machine learning algorithms;
    perform entity resolution between the first dataset and the second dataset based on the feature sets for the candidate set of mappings using the trained machine learning model, selected from the plurality of trained machine learning models, wherein the selection is based on a highest overall cross-validation score based on selecting the trained machine learning model corresponding to the highest overall cross-validation score;
    generate, based on using information identifying the unshared attributes and based on results from performing the entity resolution, a merged dataset from the first dataset and the second dataset, wherein the merged dataset combines common entries from the first dataset and the second dataset and includes the unshared attributes, and wherein the results cause an automated action to be performed that includes combining or not combining entities into a same entry in the merged dataset; and
    process the merged dataset that includes common entries from the first and second datasets and unshared attributes from the first and the second data sets, wherein the merged dataset is stored for further data processing operations associated with the merged dataset.
2. The system of claim 1, wherein the one or more processors, when defining the first set of common attributes and the second set of common attributes, are configured to: receive, from the client device, information identifying one or more common attributes of the first set of common attributes and one or more common attributes of the second set of common attributes.
3. The system of claim 2, wherein the one or more processors, when receiving the information identifying the one or more common attributes of the first set of common attributes and the one or more common attributes of the second set of common attributes, are further configured to: determine a first list of attribute names associated with the first dataset and a second list of attribute names associated with the second dataset; transmit, to the client device, the first list of attribute names and the second list of attribute names; and receive, from the client device, a selection of the one or more common attributes of the first set of common attributes from the first list of attribute names and a selection of the one or more common attributes of the second set of common attributes from the second list of attribute names.
4. The system of claim 1, wherein the one or more processors, when defining the first set of common attributes and the second set of common attributes, are configured to: determine a first list of attribute names associated with the first dataset and a second list of attribute names associated with the second dataset; and generate the first set of common attributes and the second set of common attributes based on the first list of attribute names and the second list of attribute names using a trained machine learning attribute selection model.
5. The system of claim 1, wherein the one or more processors, when defining the candidate generation criteria, are configured to: transmit, to the client device, one or more candidate generation criteria options for one or more pairs of related common attributes from the first set of common attributes and the second set of common attributes; and receive, from the client device, a selection of the at least one pair of related common attributes from the one or more pairs of related common attributes, and a selection of the candidate generation criteria from the one or more candidate generation criteria options for the at least one pair of related common attributes.
6. The system of claim 1, wherein the one or more processors, when determining the featurization configuration, are configured to: determine, for each common attribute of the first set of common attributes and the second set of common attributes, one or more featurization options based on a type of attribute value associated with that common attribute; transmit, to the client device, the one or more featurization options determined for each common attribute of the first set of common attributes and the second set of common attributes; and receive, from the client device, a selection of a featurization option for each common attribute of the first set of common attributes and the second set of common attributes.
7. The system of claim 1, wherein the one or more processors, when generating the trained machine learning model, are configured to: train the machine learning model based on ground truth mappings for a subset of entities in the first dataset and the second dataset.
8. The system of claim 7, wherein the model configuration information includes a hyperparameter relating to a complexity of the trained machine learning model, and the one or more processors, when training the machine learning model, are configured to: train the machine learning model with the complexity based on the hyperparameter included in the model configuration information.
9. The system of claim 1, wherein the model configuration information includes an indication of a type of machine learning model, and the one or more processors, when training the machine learning model, are configured to: train the type of machine learning model indicated in the model configuration information.
10. The system of claim 1, wherein the one or more processors, when training the machine learning model, are configured to: train multiple types of machine learning models based on ground truth mappings; test the multiple types of machine learning models; and select the trained machine learning model from the multiple types of machine learning models based on testing the multiple types of machine learning models.
11. The system of claim 1, wherein the model configuration information includes a precision threshold, and the one or more processors, when performing entity resolution between the first dataset and the second dataset, are configured to: calculate, using the trained machine learning model and the first dataset and the second dataset, a respective probability score for each mapping in the candidate set of mappings; and compare the respective probability score for each mapping in the candidate set of mappings with the precision threshold to determine a set of resolved mappings between related entities in the first dataset and the second dataset.
12. The system of claim 1, wherein the one or more processors, when estimating and reserving the hardware device memory resources, are configured to: estimate memory resources for generating the trained machine learning model and performing entity resolution between the first dataset and the second dataset using the trained machine learning model based on a size of the first dataset and a size of the second dataset; and reserve the memory resources estimated for generating the trained machine learning model and performing entity resolution.
13. The system of claim 1, wherein the one or more processors are further configured to: perform cross-validation when training the machine learning model, wherein the cross-validation comprises: splitting a training set into a number of groups; marking the number of groups as a hold-out group or a training group; and training the machine learning model on the training group and testing the trained machine learning model based on the hold-out group to generate a cross-validation score; and select the machine learning model from a plurality of machine learning models as the trained machine learning model to be used for subsequent processes.
14. The system of claim 1, wherein the further data processing operations include tasks associated with financial transactions that include operations associated with entity resolution.
15. A method for performing entity resolution between a first dataset and a second dataset using a trained machine learning model, comprising:
  estimating and reserving, by a system and based on the first dataset and the second dataset, hardware device memory resources for training a machine learning model, by determining a first size of the first dataset and a second size of the second dataset and utilizing a trained machine learning memory prediction model that inputs the first size and the second size and outputs a prediction of hardware device memory resources usage;
  defining, by the system, a first set of common attributes of the first dataset and a second set of common attributes of the second dataset;
  identifying, by the system, unshared attributes of the first dataset and the second dataset;
  generating, by the system, a candidate set of mappings between one or more entities in the first dataset and one or more entities in the second dataset based on candidate generation criteria associated with a related pair of common attributes in the first set of common attributes and the second set of common attributes;
  generating, by the system, feature sets for the candidate set of mappings based on the first set of common attributes, the second set of common attributes, and a featurization configuration;
  training or selecting, by the system and based on the reserved memory resources, the trained machine learning model for performing entity resolution between the first dataset and the second dataset, wherein the training includes a cross-validation procedure that comprises:
    splitting a training set into a training group and a hold-out group;
    generating a cross-validation score based on testing the machine learning model on the hold-out group after training the machine learning model on the training group and based on not using a test set; and
    combining generated cross-validation scores of training procedures to generate an overall cross-validation score; and
  wherein a plurality of trained machine learning models are generated based on performing the cross-validation for a plurality of machine learning algorithms;
  performing, by the system, entity resolution between the first dataset and the second dataset based on the feature sets for the candidate set of mappings using the trained machine learning model, selected from the plurality of trained machine learning models, wherein the selection is based on a highest overall cross-validation score based on selecting the trained machine learning model corresponding to the highest overall cross-validation score;
  generating, by the system and based on using information identifying the unshared attributes and based on results from performing the entity resolution, a merged dataset from the first dataset and the second dataset, wherein the merged dataset combines related entries from the first dataset and the second dataset and includes the unshared attributes, and wherein the results cause an automated action to be performed that includes combining or not combining entities into a same entry in the merged dataset; and
  processing, by the system, the merged dataset that includes common entries from the first and second datasets and unshared attributes from the first and the second data sets, wherein the merged dataset is stored for further data processing operations.
16. The method of claim 15, wherein generating or selecting the trained machine learning model comprises: training the machine learning model based on ground truth data associated with a subset of entities of the first dataset and the second dataset, resulting in the trained machine learning model.
17. The method of claim 15, wherein performing entity resolution between the first dataset and the second dataset comprises: calculating probability scores for the candidate set of mappings using the trained machine learning model and based on the feature sets; and determining a set of resolved mappings from the candidate set of mappings based on the probability scores for the candidate set of mappings, wherein each resolved mapping in the set of resolved mappings is a mapping between related entities in the first dataset and the second dataset.
18. A non-transitory computer-readable medium storing a set of instructions, the set of instructions comprising:
  one or more instructions that, when executed by one or more processors of a device, cause the device to:
    receive information identifying a first dataset and a second dataset;
    estimate and reserve, based on the first dataset and the second dataset, hardware device memory resources for training a machine learning model, by determining a first size of the first dataset and a second size of the second dataset and utilizing a trained machine learning memory prediction model that inputs the first size and the second size and outputs a prediction of hardware device memory resources usage;
    define a first set of common attributes of a first dataset and a second set of common attributes of a second dataset;
    identify unshared attributes of the first dataset and the second dataset;
    generate a candidate set of mappings between one or more entities in the first dataset and one or more entities in the second dataset based on candidate generation criteria associated with a related pair of common attributes in the first set of common attributes and the second set of common attributes;
    generate feature sets for the candidate set of mappings based on the first set of common attributes, the second set of common attributes, and a featurization configuration;
    train or select, based on the reserved memory resources, a trained machine learning model for performing entity resolution between the first dataset and the second dataset based on model configuration information, wherein the training or selecting includes a cross-validation procedure that comprises:
      splitting a training set into a training group and a hold-out group;
      generating a cross-validation score based on testing the machine learning model on the hold-out group after training the machine learning model on the training group and based on not using a test set; and
      combining generated cross-validation scores of training procedures to generate an overall cross-validation score for the trained machine learning model; and
    wherein a plurality of trained machine learning models are generated based on performing the cross-validation for a plurality of machine learning algorithms;
    perform entity resolution between the first dataset and the second dataset based on the feature sets for the candidate set of mappings using the trained machine learning model, selected from the plurality of trained machine learning models, wherein the selection is based on a highest overall cross-validation score based on selecting the trained machine learning model corresponding to the highest overall cross-validation score;
    generate, based on using information identifying the unshared attributes and based on results from performing the entity resolution, a merged dataset from the first dataset and the second dataset, wherein the merged dataset combines common entries from the first dataset and the second dataset and includes the unshared attributes, and wherein the results cause an automated action to be performed that includes combining or not combining entities into a same entry in the merged dataset; and
    process the merged dataset that includes common entries from the first and second datasets and unshared attributes from the first and the second data sets, wherein the merged dataset is stored for further data processing operations associated with the merged dataset.
19. The non-transitory computer-readable medium of claim 18, wherein the one or more instructions, when executed by the one or more processors, further cause the device to: transmit, to a client device, one or more options for at least one of the first set of common attributes, the second set of common attributes, the candidate generation criteria, the featurization configuration, or the model configuration information; and receive, from the client device, a selection from the one or more options for the at least one of the first set of common attributes, the second set of common attributes, the candidate generation criteria, the featurization configuration, or the model configuration information.
20. The non-transitory computer-readable medium of claim 18, wherein the one or more instructions, that cause the device to train or select the trained machine learning model for performing entity resolution, cause the device to: transmit, to the model training system, the model configuration information and a set of ground truth data for a subset of entities in the first dataset and the second dataset; and receive, from the model training system, the trained machine learning model, wherein the trained machine learning model is trained based on the ground truth data and at least one of a type of the trained machine learning model, a complexity of the trained machine learning model, or a precision parameter of the trained machine learning model is based on the model configuration information.
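The cross-validation and model selection procedure recited in the independent claims can be illustrated with a small sketch. Everything below is an assumption made for illustration: the two toy algorithms (`MajorityClass`, `MeanThreshold`), the use of accuracy as the cross-validation score, the interleaved assignment of rows to groups, and the labeled feature sets. The claims do not specify any of these choices.

```python
from statistics import mean


class MajorityClass:
    """Toy baseline: always predicts the most common training label."""

    def fit(self, X, y):
        self.label = int(sum(y) * 2 >= len(y))
        return self

    def predict(self, X):
        return [self.label for _ in X]


class MeanThreshold:
    """Toy model: predicts a match when the mean feature value reaches a learned cutoff."""

    def fit(self, X, y):
        scores = [mean(row) for row in X]
        # Try each observed score as the cutoff and keep the most accurate one.
        self.cutoff = max(
            sorted(set(scores)),
            key=lambda c: sum((s >= c) == bool(t) for s, t in zip(scores, y)),
        )
        return self

    def predict(self, X):
        return [int(mean(row) >= self.cutoff) for row in X]


def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def cross_validation_score(make_model, X, y, k=3):
    """Each of k groups serves once as the hold-out group; fold scores are averaged."""
    fold_scores = []
    for i in range(k):
        hold = [j for j in range(len(X)) if j % k == i]
        train = [j for j in range(len(X)) if j % k != i]
        model = make_model().fit([X[j] for j in train], [y[j] for j in train])
        pred = model.predict([X[j] for j in hold])
        fold_scores.append(accuracy(pred, [y[j] for j in hold]))
    return mean(fold_scores)  # overall cross-validation score


# Hypothetical labeled feature sets from ground truth mappings: 1 = same entity.
X = [[0.9, 1.0], [0.2, 1.0], [0.8, 1.0], [0.1, 0.0], [0.95, 1.0], [0.3, 0.0]]
y = [1, 0, 1, 0, 1, 0]

algorithms = {"majority": MajorityClass, "mean_threshold": MeanThreshold}
overall = {name: cross_validation_score(cls, X, y) for name, cls in algorithms.items()}

# Select the algorithm with the highest overall score, then train it on all data.
best_name = max(overall, key=overall.get)
best_model = algorithms[best_name]().fit(X, y)
```

No separate test set appears anywhere in the sketch: every score comes from a hold-out group carved out of the training set, which mirrors the "not using a test set" language of the claims.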

Description

BACKGROUND

Entity resolution tasks involve disambiguating records that correspond to manifestations of real world entities across different datasets or within the same dataset. Entity resolution tasks may include eliminating duplicate copies of repeated data, clustering or grouping records that correspond to the same entity, identifying records that reference the same entity across different datasets, and/or converting data that represents entities with multiple representations into a standard form, among other examples.

SUMMARY

In some implementations, a system for generating a trained machine learning model for performing entity resolution includes one or more memories and one or more processors, communicatively coupled to the one or more memories, configured to: receive, from a client device, information identifying a first dataset and a second dataset; define a first set of common attributes of the first dataset and a second set of common attributes of the second dataset; define candidate generation criteria relating to at least one pair of related common attributes from the first set of common attributes and the second set of common attributes; generate a candidate set of mappings between one or more entities in the first dataset and one or more entities in the second dataset based on the candidate generation criteria; determine a featurization configuration for the first set of common attributes and the second set of common attributes; generate feature sets for the candidate set of mappings based on the first set of common attributes, the second set of common attributes, and the featurization configuration and the candidate set; receive, from the client device, model configuration information; train a machine learning model for performing entity resolution between the first dataset and the second dataset based on the model configuration information, resulting in a trained machine learning model; and perform entity resolution between the first dataset and the second dataset based on the feature sets for the candidate set of mappings using the trained machine learning model.

In some implementations, a method for performing entity resolution between a first dataset and a second dataset using a trained machine learning model includes defining, by a system, a first set of common attributes of the first dataset and a second set of common attributes of the second dataset; generating, by the system, a candidate set of mappings between one or more entities in the first dataset and one or more entities in the second dataset based on candidate generation criteria associated with a related pair of common attributes in the first set of common attributes and the second set of common attributes; generating, by the system, feature sets for the candidate set of mappings based on the first set of common attributes, the second set of common attributes, and a featurization configuration; training or selecting, by the system, a trained machine learning model for performing entity resolution between the first dataset and the second dataset; and performing, by the system, entity resolution between the first dataset and the second dataset based on the feature sets for the candidate set of mappings using the trained machine learning model.

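As a rough illustration of the final entity resolution step, the sketch below compares probability scores for candidate mappings against a precision threshold (as in claim 11) and then builds a merged dataset that keeps unshared attributes (as in claim 1). The records, the scores, the threshold value, and the rule that values from the first dataset take priority on common attributes are all hypothetical; in the described system the scores would come from the trained machine learning model.

```python
# Hypothetical datasets keyed by entity identifier. "phone" and "industry"
# are unshared attributes; "name" and "zip" are common to both datasets.
first = {
    "a1": {"name": "Acme Corp", "zip": "10001", "phone": "555-0100"},
    "a2": {"name": "Globex LLC", "zip": "60601", "phone": "555-0199"},
}
second = {
    "b1": {"name": "ACME Corporation", "zip": "10001", "industry": "Manufacturing"},
    "b2": {"name": "Initech", "zip": "10001", "industry": "Software"},
}

# Candidate mappings with made-up probability scores standing in for model output.
candidates = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
scores = [0.93, 0.08, 0.02]
PRECISION_THRESHOLD = 0.8  # assumed value from the model configuration information


def resolve(candidates, scores, threshold):
    """Keep candidate mappings whose probability score meets the threshold."""
    return [pair for pair, score in zip(candidates, scores) if score >= threshold]


def merge_datasets(first, second, resolved):
    """Combine resolved entities into one entry; carry unresolved entities over as-is."""
    matched_a = {a for a, _ in resolved}
    matched_b = {b for _, b in resolved}
    merged = []
    for a, b in resolved:
        # Assumed rule: the first dataset wins on common attributes, and the
        # unshared attributes of both datasets are kept in the combined entry.
        merged.append({**second[b], **first[a]})
    merged += [dict(rec) for key, rec in first.items() if key not in matched_a]
    merged += [dict(rec) for key, rec in second.items() if key not in matched_b]
    return merged


resolved = resolve(candidates, scores, PRECISION_THRESHOLD)
merged = merge_datasets(first, second, resolved)
```

Raising the threshold trades recall for precision: fewer mappings are resolved, but those that are resolved are less likely to combine distinct entities into the same entry.
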
In some implementations, a non-transitory computer-readable medium storing a set of instructions includes one or more instructions that, when executed by one or more processors of a device, cause the device to: define a first set of common attributes of a first dataset and a second set of common attributes of a second dataset; generate a candidate set of mappings between one or more entities in the first dataset and one or more entities in the second dataset based on candidate generation criteria associated with a related pair of common attributes in the first set of common attributes and the second set of common attributes; generate feature sets for the candidate set of mappings based on the first set of common attributes, the second set of common attributes, and a featurization configuration; train or select a trained machine learning model for performing entity resolution between the first dataset and the second dataset based on model configuration information; and perform entity resolution between the first dataset and the second dataset based on the feature sets for the candidate set of mappings using the trained machine learning model.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1E are diagrams of an example implementation relating to entity resolution using a trained machine learning model.
FIG. 2 is a diagram illustrating an example of training a machine learning model in connection with entity resolution.
FIG. 3 is a diagram illustrating an example of applying a trained machine learning model to a new observation associated with entity resolution.
FIG. 4 is a diagram of an example environment in which systems and/or methods described herein may be implemented.
FIG. 5 is a diagram of