US-12625885-B1 - Unsupervised information-based hierarchical clustering of big data
Abstract
A hierarchical cluster analyzer identifies clusters in a big data set by identifying topological structure without distance-based metrics. The hierarchical cluster analyzer stochastically partitions the big data set to create pseudo-partitions of the big data set. The stochastic partitioning may be implemented with a random forest classifier that uses ensemble techniques to reduce variance and prevent overfitting. The hierarchical cluster analyzer implements random intersection leaves (RIL), a data mining technique that grows an intersection tree by intersecting candidate sets generated from the pseudo-partitions. The hierarchical cluster analyzer updates an association matrix according to co-occurrences of data points within each leaf node of the intersection tree. These co-occurring data points exhibit a high degree of similarity, which is recorded in the association matrix. A hierarchy of clusters may then be formed by finding community structure in the association matrix.
Inventors
- Marc Jaffrey
- Michael Dushkoff
Assignees
- Pattern Computer, Inc.
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2022-07-18
Claims (18)
- 1. A method, comprising: stochastically partitioning a data set to create a plurality of pseudo-partitions of the data set, the data set comprising a plurality of data points, each of the plurality of pseudo-partitions comprising a plurality of subsets whose union is a subset of the data set; forming each of a plurality of candidate sets based on one or more subsets selected from the plurality of subsets of each pseudo-partition of two or more pseudo-partitions of the plurality of pseudo-partitions; creating each of a plurality of intersection sets based on two or more candidate sets selected from the plurality of candidate sets; identifying, within each of the plurality of intersection sets, one or more co-occurrences of a plurality of co-occurrences of the plurality of data points; constructing a graph having a plurality of vertices connected by a plurality of edges, each vertex of the plurality of vertices representing a respective one of the plurality of data points, each edge of the plurality of edges connecting two vertices of the plurality of vertices, said each edge representing a number of the co-occurrences for the two data points represented by the two vertices; identifying a community in the graph, the community comprising member vertices of the plurality of vertices; and outputting a cluster of the data set, the cluster comprising member data points, of the plurality of data points, that are represented by the member vertices of the community.
- 2. The method of claim 1, wherein said creating the plurality of pseudo-partitions comprises partitioning the data set.
- 3. The method of claim 1, wherein said forming each of the plurality of candidate sets comprises: selecting the one or more subsets from the plurality of subsets of each pseudo-partition of the two or more pseudo-partitions; and taking a union of the selected one or more subsets.
- 4. The method of claim 3, wherein said selecting comprises randomly selecting the one or more subsets.
- 5. The method of claim 1, wherein said creating each of the plurality of intersection sets comprises: selecting the two or more candidate sets from the plurality of candidate sets; and taking an intersection of the selected two or more candidate sets.
- 6. The method of claim 5, wherein said selecting comprises randomly selecting the two or more candidate sets.
- 7. The method of claim 1, wherein said identifying the one or more co-occurrences comprises generating an association matrix of the plurality of co-occurrences.
- 8. The method of claim 7, wherein said identifying the community comprises clustering the association matrix.
- 9. The method of claim 1, wherein said creating each of the plurality of intersection sets comprises growing an intersection tree with at least some of the plurality of candidate sets such that each of a plurality of leaf nodes of the intersection tree stores an intersection of several of the plurality of candidate sets.
- 10. A system, comprising: storage configured to store a data set comprising a plurality of data points; a processor; and a hierarchical clustering engine, implemented as machine-readable instructions stored in a memory in electronic communication with the processor, that, when executed by the processor, control the system to: stochastically partition the data set to create a plurality of pseudo-partitions of the data set, each of the plurality of pseudo-partitions comprising a plurality of subsets whose union is a subset of the data set, form each of a plurality of candidate sets based on one or more subsets selected from the plurality of subsets of each pseudo-partition of two or more pseudo-partitions of the plurality of pseudo-partitions, create each of a plurality of intersection sets based on two or more candidate sets selected from the plurality of candidate sets, identify, within each of the plurality of intersection sets, one or more co-occurrences of a plurality of co-occurrences of the plurality of data points, construct a graph having a plurality of vertices connected by a plurality of edges, each vertex of the plurality of vertices representing a respective one of the plurality of data points, each edge of the plurality of edges connecting two vertices of the plurality of vertices, said each edge representing a number of the co-occurrences for the two data points represented by the two vertices, identify a community in the graph, the community comprising member vertices of the plurality of vertices, and output a cluster of the data set, the cluster comprising member data points, of the plurality of data points, that are represented by the member vertices of the community.
- 11. The system of claim 10, wherein the machine-readable instructions that, when executed by the processor, control the system to create the plurality of pseudo-partitions comprise machine-readable instructions that, when executed by the processor, control the system to partition the data set.
- 12. The system of claim 10, wherein the machine-readable instructions that, when executed by the processor, control the system to form each of the plurality of candidate sets comprise machine-readable instructions that, when executed by the processor, control the system to: select the one or more subsets from the plurality of subsets of each pseudo-partition of the two or more pseudo-partitions, and take a union of the selected one or more subsets.
- 13. The system of claim 12, wherein the machine-readable instructions that, when executed by the processor, control the system to select the one or more subsets comprise machine-readable instructions that, when executed by the processor, control the system to randomly select the one or more subsets.
- 14. The system of claim 10, wherein the machine-readable instructions that, when executed by the processor, control the system to create each of the plurality of intersection sets comprise machine-readable instructions that, when executed by the processor, control the system to: select the two or more candidate sets from the plurality of candidate sets, and take an intersection of the selected two or more candidate sets.
- 15. The system of claim 14, wherein the machine-readable instructions that, when executed by the processor, control the system to select the two or more candidate sets comprise machine-readable instructions that, when executed by the processor, control the system to randomly select the two or more candidate sets.
- 16. The system of claim 10, wherein the machine-readable instructions that, when executed by the processor, control the system to identify the one or more co-occurrences comprise machine-readable instructions that, when executed by the processor, control the system to generate an association matrix of the plurality of co-occurrences.
- 17. The system of claim 16, wherein the machine-readable instructions that, when executed by the processor, control the system to identify the community comprise machine-readable instructions that, when executed by the processor, control the system to cluster the association matrix.
- 18. The system of claim 10, wherein the machine-readable instructions that, when executed by the processor, control the system to create each of the plurality of intersection sets comprise machine-readable instructions that, when executed by the processor, control the system to grow an intersection tree with at least some of the plurality of candidate sets such that each of a plurality of leaf nodes of the intersection tree stores an intersection of several of the plurality of candidate sets.
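The graph-construction and community-identification steps recited in claims 1, 7, and 8 can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: the function names are hypothetical, and the thresholded connected-components step is a deliberately simple stand-in for community detection, which the claims do not limit to any particular method.

```python
import numpy as np

def association_matrix(n_points, co_occurrences):
    """Weighted-graph adjacency matrix: entry (i, j) counts how often
    data points i and j co-occurred in an intersection set."""
    A = np.zeros((n_points, n_points), dtype=int)
    for i, j in co_occurrences:
        A[i, j] += 1
        A[j, i] += 1
    return A

def communities(A, min_weight=1):
    """Connected components of the graph after dropping edges lighter
    than min_weight -- a simple stand-in for community detection."""
    n = A.shape[0]
    unseen, comps = set(range(n)), []
    while unseen:
        stack, comp = [unseen.pop()], set()
        while stack:
            v = stack.pop()
            comp.add(v)
            for u in list(unseen):
                if A[v, u] >= min_weight:   # keep only sufficiently heavy edges
                    unseen.remove(u)
                    stack.append(u)
        comps.append(sorted(comp))          # one output cluster per community
    return comps
```

Raising `min_weight` discards weaker associations and splits communities apart, giving a crude knob for the coarseness of the recovered clusters.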
Description
RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/417,945, filed May 21, 2019, which claims priority to U.S. Provisional Patent Application No. 62/674,373, filed May 21, 2018. Each of these related applications is incorporated herein by reference in its entirety.

BACKGROUND

The proliferation of low-cost sensors and information-sensing devices, combined with steep drops in the cost of storing data, has led to the generation of data sets so large that traditional techniques for managing, analyzing, storing, and processing these data sets are challenged. Data sets meeting this criterion, referred to in the art as “big data,” arise in many fields, including genomics, meteorology, high-energy physics, predictive analytics, and business informatics. Whether a data set is “big” or not depends less on its numerical size and more on whether the quantity is so large that it requires new approaches to data management. For some fields, a big data set may mean a few terabytes; for others, such as particle accelerators and internet search engines (e.g., Google), “big” could mean exabytes.

SUMMARY

Cluster analysis seeks to group a set of data points into clusters so that the data points within a cluster are more similar to each other than to all the other data points in the set. Traditional cluster-analysis techniques (e.g., k-means clustering, density-based clustering, single-linkage clustering, and complete-linkage clustering) are increasingly challenged by the growing dimensionality of big data sets, particularly in the field of genomics, where studies using technologies like DNA microarrays and RNA-Seq produce big data sets containing millions of data points, or more, that lie in a mathematical space with up to tens of thousands of dimensions, or more.
Many prior-art clustering techniques quantify the similarity of data points by using a distance-based metric (e.g., Euclidean distance or Manhattan distance) to determine how “close” data points are to each other in the mathematical space. However, points in a mathematical space become equidistant from each other in the limit of infinite dimensions. As a result, as the dimensionality of a data set grows, distance between data points becomes an increasingly meaningless measure of similarity.

Described herein is a hierarchical cluster analyzer that improves on prior-art clustering algorithms by using an information-based metric, rather than a distance-based metric, to identify topological structure in a data set. Advantageously, the hierarchical cluster analyzer works identically regardless of the dimensionality of the mathematical space in which the data points lie, thereby overcoming the limitations of the prior-art techniques described above. The hierarchical cluster analyzer features a data mining technique referred to herein as random intersection leaves (RIL), which cooperates with a stochastic partitioning function to generate an association matrix that represents the data points as a weighted graph. The data points may then be globally clustered by identifying community structure in the weighted graph.

The stochastic partitioning function may be any machine-learning algorithm that uses stochastic methods to partition a set of inputted data points. For clarity in the following discussion, the stochastic partitioning function is presented as a random forest classifier that uses bootstrapping (i.e., random selection of data points with replacement) and feature bagging to produce a multitude of decision trees. In this case, the stochastic partitioning function outputs leaf nodes of the decision trees, also referred to herein as subsets γi.
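As one concrete illustration of stochastic partitioning, the sketch below substitutes purely random axis-aligned splits for a trained random forest; the function names, the `min_leaf` parameter, and the split rule are assumptions made for illustration, not the patented implementation. Bootstrapping each sample with replacement mirrors the claim language: the union of each pseudo-partition's subsets is then a subset of the data set.

```python
import numpy as np

def random_partition(indices, X, min_leaf=5, rng=None):
    """Recursively split a list of data-point indices on random feature
    thresholds; the returned leaf subsets form one pseudo-partition."""
    rng = rng if rng is not None else np.random.default_rng()
    if len(indices) <= min_leaf:
        return [frozenset(indices)]
    f = int(rng.integers(X.shape[1]))        # random feature (feature-bagging analogue)
    lo, hi = X[indices, f].min(), X[indices, f].max()
    t = rng.uniform(lo, hi)                  # random split threshold
    left = [i for i in indices if X[i, f] <= t]
    right = [i for i in indices if X[i, f] > t]
    if not left or not right:                # degenerate split: stop growing
        return [frozenset(indices)]
    return (random_partition(left, X, min_leaf, rng)
            + random_partition(right, X, min_leaf, rng))

def pseudo_partitions(X, n_partitions=10, rng=None):
    """Bootstrap the data (sampling with replacement), then partition each
    bootstrap sample into disjoint leaf subsets."""
    rng = rng if rng is not None else np.random.default_rng()
    n = X.shape[0]
    parts = []
    for _ in range(n_partitions):
        boot = sorted(set(rng.integers(0, n, size=n).tolist()))
        parts.append(random_partition(boot, X, rng=rng))
    return parts
```

Because each bootstrap sample omits roughly a third of the data points, each pseudo-partition covers only part of the data set, which is what makes the later intersections informative.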
However, another stochastic partitioning function (e.g., an artificial neural network, a convolutional neural network, or a restricted Boltzmann machine) may be used without departing from the scope hereof.

To reveal the underlying topological structure of the data set, the hierarchical cluster analyzer uses RIL to create a plurality of candidate sets, each formed from the union of randomly selected subsets γi. These candidate sets are then randomly selected and intersected in a tree structure referred to herein as an intersection tree. Each leaf node of the intersection tree stores the intersection of several candidate sets. Data points in a leaf node are therefore elements of all the candidate sets used to grow that leaf node; because their presence in the same leaf node reveals a tendency to be repeatedly grouped together by the random forest, these data points exhibit a high measure of similarity.

For clarity in the following description, the term “leaf node” refers to the end nodes of an intersection tree, while leaf nodes of a decision tree (e.g., as produced by a random forest classifier) are referred to as subsets γi. Advantageously, varying the parameters of RIL changes the coarseness of the revealed structure, creating a full hierarchy of clusters.
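Under the same caveat, the RIL stage described above can be sketched as follows. The intersection tree is represented implicitly by its leaf nodes, since each leaf is fully determined by the candidate sets intersected along its root-to-leaf path; the helper names and parameters (`n_unions`, `depth`, `n_leaves`) are hypothetical choices for this sketch.

```python
import itertools
import random
from collections import Counter

def form_candidate_set(partitions, n_unions=2, rng=random):
    """Union of one randomly chosen subset (decision-tree leaf) from each
    of n_unions randomly chosen pseudo-partitions."""
    candidate = set()
    for partition in rng.sample(partitions, n_unions):
        candidate |= set(rng.choice(partition))
    return candidate

def intersection_leaves(candidates, depth=3, n_leaves=50, rng=random):
    """Leaf nodes of the intersection tree: each leaf stores the
    intersection of depth randomly selected candidate sets."""
    leaves = []
    for _ in range(n_leaves):
        chosen = rng.sample(candidates, depth)
        leaf = set.intersection(*chosen)
        if leaf:                      # prune empty leaves
            leaves.append(leaf)
    return leaves

def association_counts(leaves):
    """Sparse association matrix: co-occurrence counts for each unordered
    pair of data points appearing together in a leaf node."""
    counts = Counter()
    for leaf in leaves:
        counts.update(itertools.combinations(sorted(leaf), 2))
    return counts
```

Increasing `depth` (more intersections per leaf) tightens the similarity requirement and yields finer structure; decreasing it yields coarser structure, which is how the sketch mirrors the coarseness control described above.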