US-12625886-B2 - Method and system for automatic data clustering


Abstract

A method for using machine learning models to automatically cluster data into meaningful populations and groups and to identify hidden patterns and structures in such data is provided. The method includes: receiving first information that relates to a group of entities; analyzing the first information with respect to a predetermined set of parameters and a predetermined set of data types; selecting, based on a result of the analysis, a first machine learning (ML) model from among a predetermined set of ML models; and using the selected first ML model to generate a report that includes second information that relates to at least one data cluster identified by the first ML model from the first information.

Inventors

Maria BELTRAN
Evelyn DELPH
Karen SOMES
Trupti JADHAV
Naman SETHI
Ian COLDREN
Danielle WANG
Sparsh SRIVASTAVA

Assignees

JPMORGAN CHASE BANK, N.A.

Dates

Publication Date: 2026-05-12
Application Date: 2023-11-27
Priority Date: 2023-10-13

Claims (15)

1. A method for clustering data, the method comprising:
    receiving, by at least one processor, first information that relates to a first plurality of entities;
    analyzing, by the at least one processor, the first information with respect to a predetermined set of parameters and a predetermined set of data types, comprising:
        identifying textual data within the first information;
        performing, in response to at least the identified textual data exceeding 50 characters within the first information, a topic modeling operation by which at least one keyword is identified;
        performing, for the identified textual data, a sentiment analysis that assigns, for each analyzed portion of the textual data, a determined sentiment including at least one from among a positive sentiment, a negative sentiment, and a neutral sentiment;
    selecting, by the at least one processor based on results of the analyzing, a particular machine learning (ML) model from among a predetermined plurality of ML models;
    generating, by the at least one processor by using the particular ML model, a first report that includes second information that relates to at least one data cluster identified by the particular ML model from the first information;
    wherein the results of the analyzing as used by the selecting includes the determined sentiment and the identified at least one keyword; and
    wherein the predetermined plurality of ML models comprises a first model that implements a hierarchical clustering algorithm, a second model that implements a database scan (DBSCAN) algorithm, a third model that implements a K-means algorithm, and a fourth model that implements a K-prototypes algorithm.
2. The method of claim 1, wherein the first information comprises a first dataset that is in a comma-separated values (CSV) format.
3. The method of claim 2, wherein the first dataset is received as an input Microsoft Excel file.
4. The method of claim 1, wherein the predetermined set of data types comprises at least one from among a numerical data type, an ordinal data type, a nominal data type, a categorical data type, and a textual data type.
5. The method of claim 1, wherein the at least one data cluster includes at least one from among a first data cluster that relates to a first plurality of clients that is associated with a first line of business (LOB), a first industry sector, and relatively low revenue values; a second data cluster that relates to a second plurality of clients that is associated with a second LOB, a second industry sector, and mid-range revenue values; and a third data cluster that relates to a third plurality of clients that is associated with a third LOB, a third industry sector, and relatively high revenue values.
6. The method of claim 1, wherein the selecting comprises determining, for each respective one from among the predetermined plurality of ML models, a corresponding silhouette score that relates to a respective cluster quality for clusters generated by the respective ML model.
7. The method of claim 1, the analyzing further comprising: preprocessing the first information by: removing sparsely populated and/or insignificant columns from any identified spreadsheet; removing extraneous blank spaces and/or special characters; filling in missing values; scaling data included in an input file; and detecting outlier data.
8. A computing apparatus for clustering data, the computing apparatus comprising:
    a processor;
    a memory; and
    a communication interface coupled to each of the processor and the memory,
    wherein the processor is programmed to cooperate with instructions in memory to perform operations including:
        receive, via the communication interface, first information that relates to a first plurality of entities;
        analyze the first information with respect to a predetermined set of parameters and a predetermined set of data types, comprising:
            identify textual data within the first information;
            perform, in response to at least the identified textual data exceeding 50 characters within the first information, a topic modeling operation by which at least one keyword is identified;
            perform, for the identified textual data, a sentiment analysis that assigns, for each analyzed portion of the textual data, a determined sentiment including at least one from among a positive sentiment, a negative sentiment, and a neutral sentiment;
        select, based on results of the analyze, a particular machine learning (ML) model from among a predetermined plurality of ML models;
        generate, by using the particular ML model, a first report that includes second information that relates to at least one data cluster identified by the particular ML model from the first information;
        wherein the results of the analyzing as used by the selecting includes the determined sentiment and the identified at least one keyword; and
        wherein the predetermined plurality of ML models comprises a first model that implements a hierarchical clustering algorithm, a second model that implements a database scan (DBSCAN) algorithm, a third model that implements a K-means algorithm, and a fourth model that implements a K-prototypes algorithm.
9. The computing apparatus of claim 8, wherein the first information comprises a first dataset that is in a comma-separated values (CSV) format.
10. The computing apparatus of claim 9, wherein the first dataset is received as an input Microsoft Excel file.
11. The computing apparatus of claim 8, wherein the predetermined set of data types comprises at least one from among a numerical data type, an ordinal data type, a nominal data type, a categorical data type, and a textual data type.
12. The computing apparatus of claim 8, wherein the at least one data cluster includes at least one from among a first data cluster that relates to a first plurality of clients that is associated with a first line of business (LOB), a first industry sector, and relatively low revenue values; a second data cluster that relates to a second plurality of clients that is associated with a second LOB, a second industry sector, and mid-range revenue values; and a third data cluster that relates to a third plurality of clients that is associated with a third LOB, a third industry sector, and relatively high revenue values.
13. The computing apparatus of claim 8, wherein the processor is further configured to make the selection by determining, for each respective one from among the predetermined plurality of ML models, a corresponding silhouette score that relates to a respective cluster quality for clusters generated by the respective ML model.
14. The computing apparatus of claim 8, the analyze further comprising: preprocess the first information by: remove sparsely populated and/or insignificant columns from any identified spreadsheet; remove extraneous blank spaces and/or special characters; fill in missing values; scaling data included in an input file; and detect outlier data.
15. A non-transitory computer readable storage medium storing instructions for clustering data, the storage medium comprising a second set of executable code which, when executed by a processor, causes the processor to perform operations comprising:
    receive first information that relates to a first plurality of entities;
    analyze the first information with respect to a predetermined set of parameters and a predetermined set of data types, comprising:
        identify textual data within the first information;
        perform, in response to at least the identified textual data exceeding 50 characters within the first information, a topic modeling operation by which at least one keyword is identified;
        perform, for the identified textual data, a sentiment analysis that assigns, for each analyzed portion of the textual data, a determined sentiment including at least one from among a positive sentiment, a negative sentiment, and a neutral sentiment;
    select, based on results of the analysis, a particular machine learning (ML) model from among a predetermined plurality of ML models;
    generate, by using the particular ML model, a first report that includes second information that relates to at least one data cluster identified by the particular ML model from the first information;
    wherein the results of the analyzing as used by the selecting includes the determined sentiment and the identified at least one keyword; and
    wherein the predetermined plurality of ML models comprises a first model that implements a hierarchical clustering algorithm, a second model that implements a database scan (DBSCAN) algorithm, a third model that implements a K-means algorithm, and a fourth model that implements a K-prototypes algorithm.
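
Illustrative sketch (not part of the claims): claims 6 and 13 recite determining a silhouette score for each candidate ML model as part of the selecting. The following minimal Python example shows one way such a comparison could work. It assumes Euclidean distance and assumes that the candidate with the highest mean silhouette score is chosen; the claims do not state either detail. The names `silhouette_score` and `select_best` are hypothetical, and each candidate is represented by a ready-made cluster labeling rather than by an actual hierarchical, DBSCAN, K-means, or K-prototypes model.

```python
import math
from typing import Dict, Hashable, List, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Straight-line distance between two points (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points: List[Point], labels: List[Hashable]) -> float:
    """Mean silhouette coefficient over all points.

    For point i: a = mean distance to the other members of its own cluster,
    b = smallest mean distance to the members of any other cluster,
    s = (b - a) / max(a, b). Points in singleton clusters contribute 0.
    """
    clusters: Dict[Hashable, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # conventional value of 0 for a singleton cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def select_best(
    points: List[Point], candidate_labelings: Dict[str, List[Hashable]]
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate labeling and return the highest-scoring name.

    Assumption: "best" means the highest mean silhouette score.
    """
    scores = {
        name: silhouette_score(points, labels)
        for name, labels in candidate_labelings.items()
    }
    best = max(scores, key=lambda name: scores[name])
    return best, scores


if __name__ == "__main__":
    # Two well-separated pairs of points; one labeling respects them, one does not.
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    candidates = {
        "model_a": [0, 0, 1, 1],  # hypothetical output of one clustering model
        "model_b": [0, 1, 0, 1],  # hypothetical output of another model
    }
    print(select_best(pts, candidates))
```

In this toy input, the labeling that keeps nearby points together receives a score close to 1, while the labeling that mixes the two groups receives a negative score, so the first candidate would be selected.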

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority benefit from Indian Application No. 202311069096, filed Oct. 13, 2023 in the India Patent Office, which is hereby incorporated by reference in its entirety.

BACKGROUND

1. Field of the Disclosure

This technology generally relates to methods and systems for organizing and clustering data, and more particularly to methods and systems for using machine learning models to automatically cluster data into meaningful populations and groups and to identify hidden patterns and structures in such data.

2. Background Information

In a large institution that serves many clients, such as a financial institution or a bank, large amounts of data are generated and received on a daily basis. The sheer volume of the data often leads to questions about the make-up of the data, such as what groups and/or sub-groups and behaviors are present within the data. To answer such questions, the institution may rely on data scientists to conduct advanced analysis. A typical data science workflow would include steps of data preparation, feature engineering, model tuning, model selection, and model evaluation. However, such analysis generally requires significant expertise, and may also be quite time-consuming. As a result, there is a relatively high cost with respect to both time and resources to obtain important insights into the make-up of the data.

Accordingly, there is a need for a mechanism for using machine learning models to automatically cluster data into meaningful populations and groups and to identify hidden patterns and structures in such data.
SUMMARY

The present disclosure, through one or more of its various aspects, embodiments, and/or specific features or sub-components, provides, inter alia, various systems, servers, devices, methods, media, programs, and platforms for methods and systems using machine learning models to automatically cluster data into meaningful populations and groups and to identify hidden patterns and structures in such data.

According to an aspect of the present disclosure, a method for clustering data is provided. The method is implemented by at least one processor. The method includes: receiving, by the at least one processor, first information that relates to a first plurality of entities; analyzing, by the at least one processor, the first information with respect to a predetermined set of parameters and a predetermined set of data types; selecting, by the at least one processor based on a result of the analyzing, a particular machine learning (ML) model from among a predetermined plurality of ML models; and generating, by the at least one processor by using the particular ML model, a first report that includes second information that relates to at least one data cluster identified by the particular ML model from the first information.

The predetermined plurality of ML models may include at least one from among a first model that implements a hierarchical clustering algorithm, a second model that implements a database scan (DBSCAN) algorithm, a third model that implements a K-means algorithm, and a fourth model that implements a K-prototypes algorithm.

The first information may include a first dataset that is in a comma-separated values (CSV) format. The first dataset may be received as an input Microsoft Excel file.

The predetermined set of data types may include at least one from among a numerical data type, an ordinal data type, a nominal data type, a categorical data type, and a textual data type.
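
Illustrative sketch (not part of the disclosure): the analysis with respect to a predetermined set of data types could begin by assigning one of the listed types to each input column. The following minimal Python example uses simple heuristics that are assumptions made for illustration only; the disclosure does not describe how a type is determined. The function names, the `ordered_levels` argument, and the `max_categories` cut-off are hypothetical. The 50-character constant reflects the threshold that the claims recite for topic modeling; applying it per cell value is likewise an assumption.

```python
from typing import List, Optional, Sequence

TOPIC_MODELING_MIN_CHARS = 50  # threshold recited in the claims


def _is_number(text: str) -> bool:
    """True if the text parses as a float (note: also accepts 'nan' and 'inf')."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_column_type(
    values: Sequence[Optional[str]],
    ordered_levels: Optional[List[str]] = None,
    max_categories: int = 10,
) -> str:
    """Guess a data type for one column of raw string cells.

    Heuristics (all assumed for illustration):
      - every non-empty cell parses as a number          -> "numerical"
      - all cells belong to a caller-supplied ordering   -> "ordinal"
      - few distinct values, at least one of them repeated -> "nominal"
      - anything else                                    -> "textual"
    """
    present = [v.strip() for v in values if v is not None and v.strip()]
    if not present:
        return "empty"
    if all(_is_number(v) for v in present):
        return "numerical"
    distinct = set(present)
    if ordered_levels is not None and distinct <= set(ordered_levels):
        return "ordinal"
    if len(distinct) <= max_categories and len(distinct) < len(present):
        return "nominal"
    return "textual"


def needs_topic_modeling(values: Sequence[Optional[str]]) -> bool:
    """True if any cell is longer than the 50-character threshold."""
    return any(v is not None and len(v) > TOPIC_MODELING_MIN_CHARS for v in values)


if __name__ == "__main__":
    print(infer_column_type(["120.5", "98", "4300"]))
    print(infer_column_type(["low", "high", "low"], ordered_levels=["low", "mid", "high"]))
    print(infer_column_type(["retail", "retail", "energy"]))
    print(infer_column_type(["great service", "slow onboarding process", "ok"]))
```

Ordinal data cannot be told apart from nominal data by inspecting values alone, which is why this sketch requires the caller to supply the ordering explicitly.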
For textual data within the first information, the analyzing may include performing a sentiment analysis that assigns, for each analyzed portion of the textual data, at least one from among a positive sentiment, a negative sentiment, and a neutral sentiment. For textual data exceeding 50 characters within the first information, the analyzing may include performing a topic modeling operation by which at least one keyword is identified.

The at least one data cluster may include at least one from among a first data cluster that relates to a first plurality of clients that is associated with a first line of business (LOB), a first industry sector, and relatively low revenue values; a second data cluster that relates to a second plurality of clients that is associated with a second LOB, a second industry sector, and mid-range revenue values; and a third data cluster that relates to a third plurality of clients that is associated with a third LOB, a third industry sector, and relatively high revenue values.

The selecting may include determining, for each respective one from among the predetermined plurality of ML models, a corresponding silhouette score that relates to a respective cluster quality for clusters generated by the respective ML model.

According to another exemplary embodiment, a computing apparatus for clustering data is provided. The computing apparatus includes a processor; a memory; and a communication interface couple