US-12619823-B2 - Computer-implemented system and method to perform natural language processing entity research and resolution


Abstract

A computer-implemented system to perform group entity identification, resolution, and knowledge extraction is provided. The system receives an indication of one or more potentially related entities and their basic attributes. The system then collects a plurality of content pages comprising candidate attribute data related to one or more candidate entities. Based on an entity resolution configuration and an entity-resolution module that employs deep-learning models, the system obtains initial additional confirmed entity attribute data or relevant attribute data. With the additional knowledge acquired, the system iterates over the same content again and classifies entities identified in the content pages as at least confirmed, relevant, or irrelevant entities, until an iteration yields no further confident knowledge about the target entities. After the entity resolution iterations, the system extracts entity knowledge based on a predefined knowledge map for individuals and business entities, summarizes the knowledge for the entities, and displays the results.

Inventors

Wei Sha
Yulia Kiryutina

Assignees

Wei Sha

Dates

Publication Date: 2026-05-05
Application Date: 2023-05-09

Claims (20)

1. A computer-implemented method performed by a system comprising one or more processors, comprising: receiving an indication of at least one target entity and attribute data of the target entity, wherein the at least one target entity comprises at least one of a target individual entity and a target business entity, wherein the attribute data of target individual entity comprises at least one of name, age, address, contact information, and family members or relatives of the target individual entity, and negative news or crime activity related to the target individual entity, wherein the attribute data of target business entity comprises at least one of name of the target business entity, owner or manager names, registration information, office address, industry type, product or service type, affiliates, negative news or lawsuit related to the target business entity; obtaining, from the indication of the at least one target entity, attribute data of the target entity as confirmed attribute data, wherein obtaining attribute data of the target entity as the confirmed attribute data comprises: generating a confirmed database configured to store the confirmed attribute data; collecting a plurality of content pages comprising candidate attribute data related to one or more candidate entities; obtaining, from the plurality of content pages, candidate attribute data related to one or more candidate entities, wherein obtaining the candidate attribute data comprises: generating a candidate database configured to store candidate attribute data; after collecting the plurality of content pages and obtaining confirmed attributed data of the target entity and candidate attribute data related to one or more candidate entities, iteratively classifying, based on one or more machine-learning models, the one or more candidate entities identified in the plurality of content pages, by performing, during one iteration of a plurality of iterations, steps of: analyzing similarities of the candidate attribute data with respect to the confirmed attribute data, wherein analyzing similarities of the candidate attribute data with respect to the confirmed attribute data comprising calculating similarity scores based on the one or more machine-learning models, classifying, based on the calculated similarity scores and a plurality of threshold values for the one or more machine-learning models, the one or more candidate entities to be at least confirmed, relevant, or irrelevant entities based on the similarities of the candidate attribute data, classifying corresponding candidate attribute data related to the candidate entities to be confirmed, relevant, or irrelevant attribute data, and moving candidate attribute data classified as confirmed attribute data from the candidate database to the confirmed database, discarding content pages associated with candidate entities classified as irrelevant, wherein the steps performed during a current iteration form a basis of a more accurate classification in a next iteration; displaying the plurality of content pages with one or more confirmed entities highlighted in each of the plurality of content pages, the one or more confirmed entities being identified as corresponding to the at least one target entity; and triggering at least one action based on the plurality of content pages, the one or more confirmed entities, the classification of candidate entities, or a combination thereof.
2. The method according to claim 1, wherein obtaining the confirmed attribute data and the candidate attribute data comprises using a natural entity recognition (NER) deep learning model, machine question-answering (QA) deep learning model or a pattern recognition model to digest the indication of the at least one target entity and the plurality of content pages.
3. The method according to claim 1, wherein obtaining the confirmed attribute data and the candidate attribute data comprises obtaining confirmed geographic data and candidate geographic data.
4. The method according to claim 3, wherein obtaining the confirmed geographic data and the candidate geographic data comprises obtaining latitude-longitude coordinate data of the confirmed geographic data and latitude-longitude coordinate data of the candidate geographic data.
5. The method according to claim 4, wherein analyzing similarities of the candidate geographic data with respect to the confirmed geographic data comprises calculating a distance between the latitude-longitude coordinate data of the candidate geographic data and the latitude-longitude coordinate data of the confirmed geographic data.
6. The method according to claim 1, wherein analyzing the similarities of the candidate attribute data with respect to the confirmed attribute data comprises performing a name matching process.
7. The method according to claim 6, wherein performing the name matching process is based on at least one of a pre-trained natural language processing (NLP) deep learning model or a fuzzy matching deep learning model.
8. The method according to claim 6, wherein performing the name matching process comprises identifying at least one of names comprising nick names, phonetic variations, typographical mistakes, contextual differences, reordered terms, prefixes and suffixes, abbreviations and initials, or truncated letters and missing as matched names.
9. The method according to claim 1, wherein classifying the one or more candidate entities to be at least confirmed, relevant, or irrelevant comprises: obtaining an entity resolution map for each of the candidate entities identified in the plurality of content pages using the similarities, calculating an entity resolution score based on the entity resolution map, and classifying the one or more candidate entities to be at least confirmed, relevant, or irrelevant based on the entity resolution score.
10. The method according to claim 9, wherein classifying the one or more candidate entities to be at least confirmed, relevant, or irrelevant based on the entity resolution score comprises: determining whether the entity resolution score is greater than or equal to a first threshold value; and classifying the one or more candidate entities in response to that the entity resolution score is greater than or equal to the first threshold value.
11. The method according to claim 1, wherein triggering the at least one action comprises generating, based on information clustering, a summarization map for the candidate entities using the classification of the candidate entities.
12. The method according to claim 11, wherein generating the summarization map for the candidate entities comprises: predicting relevance for the candidate attribute data related to the candidate entities based on the classification of the candidate entities; and extracting most relevant attribute data for the candidate entities based on the predicted relevance.
13. The method according to claim 12, wherein the predicting relevance for the candidate attribute data comprises performing an information similarity evaluation.
14. The method according to claim 12, wherein the extracting most relevant attribute data comprises using at least one information extraction deep learning model.
15. A computer-implemented system, comprising: one or more processors and one or more non-transitory memory storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: receiving an indication of at least one target entity and confirmed attribute data of the target entity, wherein the at least one target entity comprises at least one of a target individual entity and a target business entity, wherein the attribute data of target individual entity comprises at least one of name, age, address, contact information, and family members or relatives of the target individual entity, and negative news or crime activity related to the target individual entity, wherein the attribute data of target business entity comprises at least one of name of the target business entity, owner or manager names, registration information, office address, industry type, product or service type, affiliates, negative news or lawsuit related to the target business entity; obtaining, from the indication of the at least one target entity, attribute data of the target entity as confirmed attribute data, wherein obtaining attribute data of the target entity as the confirmed attribute data comprises: generating a confirmed database configured to store the confirmed attribute data; collecting a plurality of content pages comprising candidate attribute data related to one or more candidate entities; obtaining, from the plurality of content pages, candidate attribute data related to one or more candidate entities, wherein obtaining the candidate attribute data comprises: generating a candidate database configured to store candidate attribute data; after collecting the plurality of content pages and obtaining confirmed attributed data of the target entity and candidate attribute data related to one or more candidate entities, iteratively classifying, based on one or more machine-learning models, the one or more candidate entities identified in the plurality of content pages, by performing, during one iteration of a plurality of iterations, steps of: analyzing similarities of the candidate attribute data with respect to the confirmed attribute data, wherein analyzing similarities of the candidate attribute data with respect to the confirmed attribute data comprising calculating similarity scores based on the one or more machine-learning models, classifying, based on the calculated similarity scores and a plurality of threshold values for the one or more machine-learning models, the one or more candidate entities to be at least confirmed, relevant, or irrelevant entities based on the similarities of the candidate attribute data, classifying corresponding candidate attribute data related to the candidate entities to be confirmed, relevant, or irrelevant attribute data, and moving candidate attribute data classified as confirmed attribute data from the candidate database to the confirmed database, discarding content pages associated with candidate entities classified as irrelevant, wherein the steps performed during a current iteration form a basis of a more accurate classification in a next iteration; displaying the plurality of content pages with one or more confirmed entities highlighted in each of the plurality of content pages, the one or more confirmed entities being identified as corresponding to the at least one target entity; and triggering at least one action based on the plurality of content pages, the one or more confirmed entities, the classification of candidate entities, or a combination thereof.
16. The system according to claim 15, wherein obtaining the confirmed attribute data and the candidate attribute data comprises using a natural entity recognition (NER) deep learning model, machine question-answering (QA) deep learning model, or a pattern recognition model to digest the indication of the at least one target entity and the plurality of content pages.
17. The system according to claim 15, wherein obtaining the confirmed attribute data and the candidate attribute data comprises obtaining confirmed geographic data and candidate geographic data.
18. The system according to claim 17, wherein obtaining the confirmed geographic data and the candidate geographic data comprising obtaining latitude-longitude coordinate data of the confirmed geographic data and latitude-longitude coordinate data of the candidate geographic data.
19. The system according to claim 18, wherein analyzing similarities of the candidate geographic data with respect to the confirmed geographic data comprises calculating a distance between the latitude-longitude coordinate data of the candidate geographic data and the latitude-longitude coordinate data of the confirmed geographic data.
20. The system according to claim 15, wherein analyzing the similarities of the candidate attribute data with respect to the confirmed attribute data comprises performing a name matching process.
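
The iterative classification recited in claims 1 and 15, together with the latitude-longitude distance of claims 5 and 19, can be illustrated with a minimal sketch. It is not the patented implementation: the claims call for machine-learning models to compute similarity scores, whereas this sketch substitutes an exact lowercase name match and a haversine distance. The names `Candidate`, `similarity`, and `resolve`, the thresholds 0.7 and 0.3, the 50 km cutoff, and the sample coordinates are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


@dataclass
class Candidate:
    name: str
    location: tuple  # (latitude, longitude) in degrees
    label: str = "unresolved"


def similarity(cand, confirmed_names, confirmed_locations, max_km=50.0):
    """Hypothetical score in [0, 1]: mean of a name match and geographic proximity."""
    name_score = 1.0 if cand.name.lower() in confirmed_names else 0.0
    nearest = min(haversine_km(cand.location, loc) for loc in confirmed_locations)
    geo_score = max(0.0, 1.0 - nearest / max_km)
    return (name_score + geo_score) / 2


def resolve(target_names, target_locations, candidates,
            confirm_at=0.7, relevant_at=0.3):
    """Re-score pending candidates until a pass confirms nothing new."""
    confirmed_names = {n.lower() for n in target_names}  # stand-in "confirmed database"
    confirmed_locations = list(target_locations)
    pending, passes, changed = list(candidates), 0, True
    while changed and pending:
        passes += 1
        newly_confirmed, still_pending = [], []
        for cand in pending:
            score = similarity(cand, confirmed_names, confirmed_locations)
            if score >= confirm_at:
                cand.label = "confirmed"
                newly_confirmed.append(cand)
            elif score >= relevant_at:
                cand.label = "relevant"  # kept for re-scoring in the next pass
                still_pending.append(cand)
            else:
                cand.label = "irrelevant"  # dropped from further passes
        # Move confirmed attributes into the confirmed store after the pass,
        # so the next pass scores against the enlarged set.
        for cand in newly_confirmed:
            confirmed_names.add(cand.name.lower())
            confirmed_locations.append(cand.location)
        changed = bool(newly_confirmed)
        pending = still_pending
    return candidates, passes


# Illustrative data only: one target near New York City, three candidates.
candidates = [
    Candidate("John Smith", (40.7968, -74.4815)),   # Morristown, NJ
    Candidate("John Smith", (40.7357, -74.1724)),   # Newark, NJ
    Candidate("Jane Doe", (34.0522, -118.2437)),    # Los Angeles, CA
]
resolved, passes = resolve({"john smith"}, [(40.7128, -74.0060)], candidates)
print([c.label for c in resolved], passes)
```

In this sample, the Morristown candidate is only "relevant" on the first pass because it lies too far from the single known location. Once the Newark candidate is confirmed and its location joins the confirmed set, the second pass scores the Morristown candidate against the nearer location and confirms it, which mirrors how one iteration forms the basis for a more accurate classification in the next.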

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Application Ser. No. 63/339,930, filed May 9, 2022, entitled “A Novel Solution using AI Machine to Perform Compliance Research,” the content of which is hereby incorporated by reference in its entirety for all purposes.

FIELD

The present disclosure relates generally to technologies in artificial intelligence (AI), deep learning, and natural language processing (NLP), and more specifically to computer-implemented systems and methods to perform entity resolution using NLP technologies.

BACKGROUND

In various applications, target entities often need to be searched for across multiple databases, verified, and/or resolved using a resource-intensive and time-consuming process. This entity analysis process may be prone to errors because it involves many repetitive and tedious tasks. In addition, the processes cannot be audited.

For example, when an indication comprising target entities is received by a computing device operated by a human researcher, the human researcher uses the computing device to research and identify the target entities that may be involved in an alerted case in the indication and digests knowledge of the target entities from the indication. The human researcher then instructs the computing device to perform searches for each of the target entities, reads each of the content pages retrieved from websites, and identifies one or more target entities if they are present in the content pages. Once the one or more target entities are identified in the content pages, the human researcher obtains knowledge of the target entities.
Based on the obtained knowledge of the target entities from the content pages, the human researcher may need to instruct the computing device to perform additional searches on public or private databases, read additional content pages to identify the one or more target entities, and obtain additional knowledge of the one or more target entities, until the human researcher can assess the alerted case in the indication based on enough knowledge of the one or more target entities.

Obtaining knowledge of the target entities by humans is a time-consuming process and prone to errors, especially when a typical computing device is not configured to perform such an entity analysis process. This is because knowledge of entities is miscellaneous, can be complicated, and is normally unstructured in content pages. Therefore, it is difficult, time-consuming, and sometimes impractical to use a typical computing device to extract the knowledge from the unstructured text of each content page. In comparison, the final step of assessing the alerted case in the indication does not require much time and can be straightforward if the obtained knowledge of the target entities is sufficient and accurate. Therefore, there is a need to improve the performance (e.g., speed, accuracy, reliability, error rate, etc.) and efficiency of computing devices used for obtaining knowledge of the target entities, thereby reducing the resource intensity and overall cost of entity resolution.

Further, different human researchers using different computing devices may behave differently when obtaining knowledge of the target entity and performing an entity analysis process. For example, different researchers may determine different entities as target entities involved in an alerted case in a same indication. Different researchers may use different knowledge of entities while performing searches in public and private databases.
Different researchers may obtain different knowledge of entities in content pages, which may lead to different decisions when assessing the alerted case in the same indication. Due to individual performance differences in the processes of searching, identifying target entities in content pages, and obtaining knowledge of the target entity, these processes are not reproducible even when the same computing device is used and therefore cannot be audited. Also, it is difficult to inspect and troubleshoot errors across many repetitive processes. Therefore, there is a need to standardize the processes of entity analysis by enhancing the capabilities of the computing devices in the process, thereby improving overall performance of the computing devices in such a process.

SUMMARY

Various computer-implemented systems, methods, and articles to perform entity resolution are described herein. In some embodiments, a computer-implemented system configured to perform an entity analysis process is provided. The technologies disclosed herein improve the accuracy of the process performed by the computer-implemented system, reduce or eliminate random human or computer errors, increase the speed of performing the process, achieve consistent and accurate performance, and provide a cost-efficient solution. The system receive