US-12619665-B2 - Systems and methods for analyzing media feeds
Abstract
Aspects of the present disclosure provide systems, methods, apparatus, and computer-readable storage media that support relevance-based analysis and filtering of documents and media for one or more enterprises. Aspects disclosed herein leverage custom-built taxonomies, natural language processing (NLP), and machine learning (ML) for identifying and extracting features from highly-relevant documents. The extracted features are vectorized and then filtered based on entities (e.g., enterprises, organizations, individuals, etc.) and compliance-based risks (e.g., illegal or non-compliant activities) that are highly relevant to a particular client. The filtered feature vectors are used to identify and highlight relevant information in the corresponding documents, enabling decision making to resolve compliance-related risks. The aspects described herein generate fewer false positive or otherwise less relevant results than conventional document screening applications or manual techniques.
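As a rough illustration of the data flow summarized above, the sketch below models an extracted feature vector as an (entity, risk, relationship indicator) triple and filters a batch of such vectors against a client's entities and risks. The class name `FeatureVector`, the `doc_id` metadata field, the exact-match comparison, and the `require_relationship` flag are all assumptions made for illustration. The disclosure describes score-based matching rather than exact string membership.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureVector:
    entity: str    # first element: entity named in the document
    risk: str      # second element: compliance-related risk
    related: bool  # third element: whether the entity and the risk are related
    doc_id: str    # hypothetical metadata pointing back to the source document


def filter_for_client(vectors, client_entities, client_risks,
                      require_relationship=False):
    """Keep vectors whose entity and risk both appear in the client's lists.

    Exact, case-insensitive membership is a simplification; the
    require_relationship flag is an assumed option, not part of the source.
    """
    entities = {e.lower() for e in client_entities}
    risks = {r.lower() for r in client_risks}
    return [
        v for v in vectors
        if v.entity.lower() in entities
        and v.risk.lower() in risks
        and (v.related or not require_relationship)
    ]
```

With `require_relationship=False`, vectors whose relationship indicator is negative are retained, so a downstream consumer can still see that an entity and a risk were mentioned without being linked.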
Inventors
- Rogelio Escalona
- Andrew Petrosie
- Matthew Lawrence
- Joseph Santoru
- Bob Rhodes
- John Kennedy
- Spencer Torene
- Xiao Xiao
- Nathan Harris
- Mahesh Ramachandran
- Paul Cifarelli
- Chad Longo
- Katherine Kent
- Yelena Altman Shapiro
- Laura McCurdy
Assignees
- THOMSON REUTERS ENTERPRISE CENTRE GMBH
Dates
- Publication Date
- 20260505
- Application Date
- 20211123
Claims (19)
- 1 . A method for relevance-based document analysis and filtering, the method comprising: performing, by one or more processors, natural language processing (NLP) operations on a labeled corpus of documents to generate labeled NLP data, wherein, for one or more documents of the labeled corpus of documents, an entity label corresponds to an entity within the document, a risk label corresponds to a risk within the document, and a relationship label indicates whether the entity and the risk are related, and wherein at least a portion of the labeled corpus of documents includes information indicating a plurality of compliance-related risks for a particular industry or sector; providing, by the one or more processors, the labeled NLP data to one or more machine learning (ML) models to train the one or more ML models to extract feature vectors of particular features from received documents, wherein the particular features include an entity, a risk, and a relationship indicator representing whether a relationship between the entity and the risk exists within a corresponding document, and wherein at least a subset of the feature vectors comprise at least one relationship indicator indicating that a relationship between the entity and the risk does not exist within the corresponding document; performing, by one or more processors, one or more NLP operations on one or more input documents to generate document text data; generating, by the one or more processors, one or more feature vectors for the document text data by providing the document text data as input to the ML models, wherein the one or more ML models are trained to extract feature vectors of particular features from input documents, and wherein the one or more feature vectors include a vector with: a first vector element corresponding to an entity, a second vector element corresponding to a risk, and a third vector element corresponding to a relationship indicator, the relationship indicator representing whether 
a relationship between the entity and the risk exists within a corresponding document, and wherein at least a subset of the feature vectors comprise at least one feature vector having the third vector element indicating that a relationship between the entity and the risk does not exist within the corresponding document; performing, by the one or more processors, entity matching between the one or more feature vectors and an entity list to generate scores corresponding to the one or more feature vectors, wherein each of the scores indicates a likelihood that a respective feature vector includes an entity from the entity list; filtering, by the one or more processors, the one or more feature vectors based on the scores to generate a subset of feature vectors; storing, by the one or more processors, the subset of feature vectors; generating, by the one or more processors, an output based on the subset of feature vectors; and displaying, by the one or more processors, portions of at least a subset of the input documents based on the output via graphical user interface (GUI).
- 2 . The method of claim 1 , further comprising associating, by the one or more processors, metadata with the one or more feature vectors, the metadata identifying, for each of the one or more feature vectors, an input document of the one or more input documents that corresponds to the feature vector, wherein the GUI displays portions of input documents indicated by metadata, and wherein the metadata is generated based on an entity and a risk identified from a common portion of the corresponding input document.
- 3 . The method of claim 1 , wherein the output comprises an instruction to automatically perform a compliance-related activity based on one or more entities, one or more risks, or both, included in the subset of feature vectors.
- 4 . The method of claim 1 , further comprising performing, by the one or more processors and prior to performing the entity matching, entity resolution to differentiate one or more entities included in the one or more feature vectors based on corresponding risks included in the one or more feature vectors, other information in the one or more input documents corresponding to the one or more feature vectors, or a combination thereof.
- 5 . A system for relevance-based document analysis and filtering, the system comprising: a memory; and one or more processors communicatively coupled to the memory, the one or more processors configured to: perform, by the one or more processors, natural language processing (NLP) operations on a labeled corpus of documents to generate labeled NLP data, wherein, for one or more documents of the labeled corpus of documents, an entity label corresponds to an entity within the document, a risk label corresponds to a risk within the document, and a relationship label indicates whether the entity and the risk are related, and wherein at least a portion of the labeled corpus of documents includes information indicating a plurality of compliance-related risks for a particular industry or sector; provide, by the one or more processors, the labeled NLP data to one or more machine learning (ML) models to train the one or more ML models to extract feature vectors of the particular features from received documents, wherein the particular features include an entity, a risk, and a relationship indicator representing whether a relationship between the entity and the risk exists within a corresponding document, and wherein at least a subset of the feature vectors comprise at least one relationship indicator indicating that a relationship between the entity and the risk does not exist within the corresponding document; perform one or more NLP operations on one or more input documents to generate document text data; generate one or more feature vectors by providing the document text data as input to the one or more ML models, wherein the one or more ML models are trained to extract feature vectors of particular features from input documents, and wherein individual feature vectors of the one or more feature vectors include: a first vector element corresponding to an entity, a second vector element corresponding to a risk, and a third vector element corresponding to a relationship 
indicator, the relationship indicator representing whether a relationship between the entity and the risk exists within a corresponding document, and wherein at least a subset of the feature vectors comprise at least one feature vector having the third vector element indicating that a relationship between the entity and the risk does not exist within the corresponding document; perform entity matching between the one or more feature vectors and an entity list to generate scores corresponding to the one or more feature vectors, wherein each of the scores indicates a likelihood that a respective feature vector includes an entity from the entity list; filter the one or more feature vectors based on the scores to generate a subset of feature vectors; store the subset of feature vectors; generate an output based on the subset of feature vectors; and display portions of at least a subset of the input documents based on the output via graphical user interface (GUI).
- 6 . The system of claim 5 , wherein the one or more processors are further configured to perform the entity matching on a first feature vector of the one or more feature vectors by comparing each field of a first entity in the first feature vector to each field of entities in the entity list, and wherein fields of entities included in the one or more feature vectors comprise a first name, a middle name, a last name, a title, an enterprise name, a business name, or a combination thereof.
- 7 . The system of claim 6 , wherein a score corresponding to the first feature vector is based on a number of fields that match between the first entity and an entry in the entity list, a degree of similarity between the fields of the first entity and the fields of the entity in the entity list, or a combination thereof.
- 8 . The system of claim 6 , wherein the one or more processors are further configured to: access one or more data sources to determine one or more additional entity names or field entries; and augment the entity list to include the one or more additional entity names or field entries prior to performance of the entity matching between the one or more feature vectors and the entity list.
- 9 . The system of claim 5 , wherein the one or more processors are further configured to perform risk matching between the one or more feature vectors and a risk list to generate risk scores corresponding to the one or more feature vectors, wherein each of the risk scores indicates a likelihood that a respective feature vector includes a risk from the risk list, and wherein the one or more feature vectors are filtered based further on the risk scores.
- 10 . A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations for relevance-based document analysis and filtering, the operations comprising: performing, by the one or more processors, natural language processing (NLP) operations on a labeled corpus of documents to generate labeled NLP data, wherein, for one or more documents of the labeled corpus of documents, an entity label corresponds to an entity within the document, a risk label corresponds to a risk within the document, and a relationship label indicates whether the entity and the risk are related, and wherein at least a portion of the labeled corpus of documents includes information indicating a plurality of compliance-related risks for a particular industry or sector; providing, by the one or more processors, the labeled NLP data to one or more machine learning (ML) models to train the one or more ML models to extract feature vectors of the particular features from received documents, wherein the particular features include an entity, a risk, and a relationship indicator representing whether a relationship between the entity and the risk exists within a corresponding document, and wherein at least a subset of the feature vectors comprise at least one relationship indicator indicating that a relationship between the entity and the risk does not exist within the corresponding document; performing, by the one or more processors, one or more NLP operations on one or more input documents to generate document text data; generating one or more feature vectors for the document text data by providing, by the one or more processers, the document text data as input to the one or more ML models, wherein the one or more ML models are trained to extract feature vectors of particular features from input documents, and wherein the at least one feature vector of the one or more feature vectors includes a 
vector having: a first vector element corresponding to an entity, a second vector element corresponding to a risk, and a third vector element corresponding to a relationship indicator, the relationship indicator representing whether a relationship between the entity and the risk exists within a corresponding document, and wherein at least a subset of the feature vectors comprise at least one vector element indicating that a relationship between the entity and the risk does not exist within the corresponding document; performing, by the one or more processors, entity matching between the one or more feature vectors and an entity list to generate scores corresponding to the one or more feature vectors, wherein each of the scores indicates a likelihood that a respective feature vector includes an entity from the entity list; filtering, by the one or more processors, the one or more feature vectors based on the scores to generate a subset of feature vectors; storing, by the one or more processors, the subset of feature vectors; generating, by the one or more processors, an output based on the subset of feature vectors; and displaying, by the one or more processors, portions of at least a subset of the input documents based on the output via graphical user interface (GUI).
- 11 . The non-transitory computer-readable storage medium of claim 10 , wherein the filtering comprises discarding feature vectors corresponding to scores that fail to satisfy a threshold, discarding a threshold number of lowest scoring feature vectors, or a combination thereof.
- 12 . The non-transitory computer-readable storage medium of claim 10 , wherein the operations further comprise providing the subset of feature vectors to one or more client devices associated with document experts, and wherein the subset of feature vectors is further filtered based on user input received from the one or more client devices.
- 13 . The non-transitory computer-readable storage medium of claim 10 , wherein the operations further comprise generating relevance ratings based on the scores of the subset of feature vectors, and wherein the output indicates the relevance ratings.
- 14 . The method of claim 1 , wherein the NLP operations comprise tokenization, lemmatization, stemming, phrasing, sentencization, part-of-speech tagging, dependency parsing, stop-character parsing, named entity recognition, or a combination thereof.
- 15 . The method of claim 1 , further comprising identifying, by the one or more processors, an entity and a risk as co-occurring within a same portion of an input document, and wherein the relationship indicator represents whether a semantic relationship exists or does not exist between the co-occurring entity and risk within that portion of the input document.
- 16 . The method of claim 1 , further comprising associating, by the one or more processors, metadata with the one or more feature vectors, the metadata including a word count corresponding to a location of an extracted vector element in the one or more input documents.
- 17 . The method of claim 1 , wherein the third vector element includes a binary value.
- 18 . The method of claim 1 , wherein the third vector element indicates an active status or an expired status.
- 19 . The method of claim 1 , wherein the third vector element indicates a predicate of a sentence connecting a subject and an object.
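The field-based entity matching of claims 6 and 7 and the score-based filtering of claim 11 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the field keys in `FIELDS`, the use of `difflib.SequenceMatcher` as the similarity measure, the averaging of per-field similarities, and the comparison of same-named fields only (rather than every field against every field) are all choices made here for brevity.

```python
from difflib import SequenceMatcher

# Hypothetical field keys mirroring the fields listed in claim 6.
FIELDS = ("first_name", "middle_name", "last_name",
          "title", "enterprise_name", "business_name")


def field_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(entity, entry):
    """Average similarity over fields present in both records (assumed scoring rule)."""
    shared = [f for f in FIELDS if entity.get(f) and entry.get(f)]
    if not shared:
        return 0.0
    return sum(field_similarity(entity[f], entry[f]) for f in shared) / len(shared)


def score_entity(entity, entity_list):
    """Score an extracted entity by its best match against the entity list."""
    return max((match_score(entity, e) for e in entity_list), default=0.0)


def filter_by_score(scored, threshold=None, drop_lowest=0):
    """Discard items below a threshold, drop the N lowest scoring items, or both."""
    kept = [(item, s) for item, s in scored if threshold is None or s >= threshold]
    if drop_lowest > 0:
        kept = sorted(kept, key=lambda pair: pair[1], reverse=True)
        kept = kept[:max(len(kept) - drop_lowest, 0)]
    return kept
```

A caller would compute `score_entity` for the entity in each feature vector, pair each vector with its score, and pass the pairs to `filter_by_score`. The `threshold` and `drop_lowest` arguments correspond to the two discard options recited in claim 11 and can be combined.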
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
The present application claims the benefit of priority from U.S. Provisional Patent Application No. 63/117,811 filed Nov. 24, 2020 and entitled “SYSTEMS AND METHODS OF ANALYZING MEDIA FEEDS,” the disclosure of which is incorporated by reference herein in its entirety.
TECHNICAL FIELD
The present disclosure relates generally to analyzing and filtering a plurality of documents to improve the relevance of filtered documents. More particularly, compliance-based filtering is described to filter documents that are highly relevant toward compliance with government or industry sector rules and regulations.
BACKGROUND
Changes in government regulations and industry rules increase the cost and complexity of compliance for enterprises or organizations. As a particular example, the regulatory compliance environment for financial institutions is increasingly strict, with unprecedented fines and attention paid to both intent and action by executives and compliance officers. To illustrate, it is estimated that $36 billion in compliance fines have been assessed on financial institutions since 2008 in an attempt to curb the estimated $3 trillion in money laundering transactions, with only 1% of those illicit financial flows being seized. The explosive growth of online news and world-wide news articles has resulted in a data overload for financial institutions attempting to comply with these regulations. For example, there has been an estimated 1000% increase in false positive news reports and false suspicious activity alerts in recent years. This increase in false positive articles and in complexity of regulations and rules poses challenges for financial institutions from a compliance perspective.
To respond to the risks associated with failing to comply with regulations and rules, some financial institutions are investing in implementing costly compliance departments staffed with an increasing number of workers, who are typically unable to keep up with the increased volume of news articles to be reviewed for possible compliance issues. Additionally, results generated by these analysts are frequently subject to analyst bias and inconsistencies, and are typically based almost exclusively on web-searched content. Some other financial institutions have resorted to outsourcing compliance review and reporting, which is often incomplete, inaccurate, and costly. Although some automated solutions exist, these automated systems provide un-vetted reporting (e.g., that places the burden on the enterprise receiving the reporting) and high volumes of false positives. Additionally, automated alerting systems are designed to search for generalized risks and not for specific, highly relevant risks to an individual enterprise or industry sector. For these reasons, compliance officers are overloaded, resulting in actual suspicious activity going unobserved and high risk entities, such as financial institutions, not being reviewed with sufficient frequency.
SUMMARY
Aspects of the present disclosure provide systems, methods, apparatus, and computer-readable storage media that support relevance-based analysis and filtering of documents and media for one or more enterprises, particularly in industry sectors associated with high risk of non-compliance with government regulations or industry rules. Aspects disclosed herein leverage custom-built taxonomies, natural language processing (NLP), and machine learning (ML) for identifying and extracting features from highly-relevant documents. The extracted features are then filtered based on entities (e.g., enterprises, organizations, executives, board members, employees, customers, etc.) and risks (e.g., non-compliance activities) that are highly relevant to a particular client. The filtered features are used to identify and highlight relevant information in the corresponding documents, thereby outputting relevant documents and highlighting for compliance officers of an organization to use in making important decisions. Optionally, the filtered articles may be provided to one or more document experts for additional filtering prior to being output. As such, the aspects disclosed herein perform adverse media screening to provide highly relevant documents with fewer false positives than conventional screening systems and techniques, decreasing the risk of non-compliance to an enterprise. Additionally, the systems and techniques described herein include automated (or semi-automated) systems that are lower cost and faster than conventional manual compliance review techniques or imprecise and inaccurate automated review systems or applications.
To illustrate relevance-based analysis and filtering according to one or more aspects of the present disclosure, a server may include or have access to one or more ML models that are trained to extract particular features, such as entities, risks, and relationships between the entities and the risks, from documents. For exam