CN-114298124-B - Clustering method, clustering device, electronic equipment and computer readable storage medium


Abstract

The application discloses a clustering method, a clustering device, electronic equipment and a computer readable storage medium, and belongs to the technical field of data processing. The method comprises: obtaining a plurality of first clusters obtained by clustering a plurality of object data based on a first clustering algorithm, wherein one first cluster comprises at least one object data; obtaining a plurality of second clusters obtained by clustering the plurality of object data based on a second clustering algorithm, wherein one second cluster comprises at least one object data; determining a first cross table based on the plurality of first clusters and the plurality of second clusters, wherein one row of data of the first cross table represents each object data in one first cluster and one column of data of the first cross table represents each object data in one second cluster; and determining a plurality of third clusters based on the first cross table, wherein one third cluster comprises at least one object data. Because the clustering result of the plurality of object data is determined using two clustering algorithms, the accuracy of the clustering result is improved, which in turn improves the accuracy of subsequent data processing.
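
As an illustration of the first cross table described in the abstract, the following minimal Python sketch counts, for every pair of one first cluster and one second cluster, the object data that fall into both. It is a sketch under stated assumptions, not the implementation of the application: the function name `build_cross_table`, the representation of each clustering result as a list of cluster labels indexed by object, and the sample labels are all hypothetical.

```python
from collections import Counter


def build_cross_table(first_labels, second_labels):
    """Count the objects shared by each (first cluster, second cluster) pair.

    first_labels[i] and second_labels[i] are the cluster labels that the two
    clustering algorithms assigned to object i (an assumed representation).
    """
    if len(first_labels) != len(second_labels):
        raise ValueError("both label lists must cover the same objects")
    rows = sorted(set(first_labels))    # one row per first cluster
    cols = sorted(set(second_labels))   # one column per second cluster
    counts = Counter(zip(first_labels, second_labels))
    # Counter returns 0 for pairs that never occur, which yields the zero cells.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


# Hypothetical labels for six objects under two clustering algorithms.
first = ["A", "A", "A", "B", "B", "B"]
second = ["x", "x", "y", "y", "y", "y"]
rows, cols, table = build_cross_table(first, second)
for name, row in zip(rows, table):
    print(name, row)
```

In this toy input, first cluster A shares two objects with second cluster x and one with y, while first cluster B shares all three of its objects with y.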

Inventors

WANG LIANG
YAO JIANHUA

Assignees

Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技（深圳）有限公司)

Dates

Publication Date: 20260505
Application Date: 20211022

Claims (20)

1. A clustering method, the method comprising: acquiring a plurality of first clusters obtained by clustering a plurality of object data based on a first clustering algorithm, wherein one first cluster comprises at least one object data, the object data comprises a gene expression matrix of cells, the gene expression matrix comprises a plurality of rows and a plurality of columns of data, the rows of the gene expression matrix represent the expression of a gene under different environmental conditions or at different time points, the columns of the gene expression matrix represent the expression of the gene under different conditions or in a sample, and the data in any row and any column represents the expression level of the gene in one cell; acquiring a plurality of second clusters obtained by clustering the plurality of object data based on a second clustering algorithm, wherein one second cluster comprises at least one object data; determining a first cross table based on the plurality of first clusters and the plurality of second clusters, wherein a row of data of the first cross table characterizes each object data in one first cluster and a column of data of the first cross table characterizes each object data in one second cluster; and determining a plurality of third clusters based on the first cross table, wherein one third cluster comprises at least one object data.
2. The method of claim 1, wherein one first cluster corresponds to one first cluster identifier, one second cluster corresponds to one second cluster identifier, and the object data comprises an object identifier; the determining a first cross table based on the plurality of first clusters and the plurality of second clusters comprises: determining the object identifiers corresponding to each cluster identifier set based on the first cluster identifier of each first cluster, the object identifiers of the object data contained in each first cluster, the second cluster identifier of each second cluster and the object identifiers of the object data contained in each second cluster, wherein one cluster identifier set comprises one first cluster identifier and one second cluster identifier; and determining the first cross table based on the number of object identifiers corresponding to each cluster identifier set.
3. The method of claim 1, wherein the first cross table includes a plurality of non-zero data, each non-zero data representing the number of identical object data contained in both the first cluster corresponding to the row in which the non-zero data is located and the second cluster corresponding to the column in which the non-zero data is located; the determining a plurality of third clusters based on the first cross table comprises: updating the first cross table based on the respective non-zero data in the first cross table; and determining the plurality of third clusters based on the updated first cross table.
4. The method of claim 3, wherein the updating the first cross table based on the respective non-zero data in the first cross table comprises: determining non-zero data meeting a condition in each column based on the respective non-zero data contained in each column of the first cross table; and updating the first cross table based on the non-zero data contained in each column and the non-zero data meeting the condition in each column.
5. The method of claim 4, wherein the non-zero data meeting the condition in each column comprises the largest non-zero data of each column and the next largest non-zero data of each column; the updating the first cross table based on the non-zero data contained in each column and the non-zero data meeting the condition in each column comprises: in response to a first column being present among the columns, the largest non-zero data of the first column being greater than N times the next largest non-zero data of the first column, where N is a positive number greater than 1, determining first non-zero data, updating each non-zero data of the first column other than the largest non-zero data of the first column to a target character, and modifying the largest non-zero data of the first column to the first non-zero data, wherein the first non-zero data is the sum of the non-zero data of the first column.
6. The method of claim 4, wherein the non-zero data meeting the condition in each column comprises the largest non-zero data of each column and the next largest non-zero data of each column; the updating the first cross table based on the non-zero data contained in each column and the non-zero data meeting the condition in each column comprises: in response to a second column being present among the columns, the largest non-zero data of the second column being not greater than N times the next largest non-zero data of the second column, where N is a positive number greater than 1, determining second non-zero data, updating each non-zero data of the second column other than the largest non-zero data of the second column and the next largest non-zero data of the second column to a target character, and modifying the largest non-zero data of the second column to the second non-zero data, wherein the second non-zero data is the sum of the non-zero data of the second column other than the next largest non-zero data of the second column.
7. The method of claim 4, wherein the non-zero data meeting the condition in each column comprises at least two non-zero data of each column that are greater than a reference value; the updating the first cross table based on the non-zero data contained in each column and the non-zero data meeting the condition in each column comprises: determining an average value of each column based on the at least two non-zero data of each column that are greater than the reference value; and updating the first cross table based on the non-zero data contained in each column and the average value of each column.
8. The method of claim 7, wherein the updating the first cross table based on the non-zero data contained in each column and the average value of each column comprises: in response to a third column being present among the columns, the maximum non-zero data of the third column being greater than M times the average value of the third column, where M is a positive number greater than 1, determining third non-zero data, updating each non-zero data of the third column other than the maximum non-zero data of the third column to a target character, and modifying the maximum non-zero data of the third column to the third non-zero data, wherein the third non-zero data is the sum of the non-zero data of the third column.
9. The method of claim 7, wherein the updating the first cross table based on the non-zero data contained in each column and the average value of each column comprises: in response to a fourth column being present among the columns, the maximum non-zero data of the fourth column being not greater than M times the average value of the fourth column, where M is a positive number greater than 1, determining fourth non-zero data, updating each non-zero data of the fourth column other than the at least two non-zero data of the fourth column that are greater than the reference value to a target character, and modifying the maximum non-zero data of the fourth column to the fourth non-zero data, wherein the fourth non-zero data is the sum of the maximum non-zero data of the fourth column and each non-zero data of the fourth column other than the at least two non-zero data that are greater than the reference value.
10. The method of claim 3, wherein the updating the first cross table based on the respective non-zero data in the first cross table comprises: determining a coefficient of variation of each column based on the respective non-zero data contained in each column of the first cross table, the coefficient of variation of any column characterizing the degree of dispersion of the non-zero data contained in that column; and updating the first cross table based on the coefficient of variation of each column.
11. The method of claim 10, wherein the updating the first cross table based on the coefficient of variation of each column comprises: in response to a fifth column being present among the columns, the coefficient of variation of the fifth column being greater than a target coefficient of variation, determining fifth non-zero data, updating each non-zero data of the fifth column other than the maximum non-zero data of the fifth column to a target character, and modifying the maximum non-zero data of the fifth column to the fifth non-zero data, wherein the fifth non-zero data is the sum of the non-zero data contained in the fifth column; and in response to a sixth column being present among the columns, the coefficient of variation of the sixth column being not greater than the target coefficient of variation, determining that the sixth column remains unchanged.
12. The method of claim 3, wherein the updating the first cross table based on the respective non-zero data in the first cross table comprises: determining non-zero data meeting a condition in each row based on the respective non-zero data contained in each row of the first cross table; and updating the first cross table based on the non-zero data contained in each row and the non-zero data meeting the condition in each row.
13. The method of claim 3, wherein the updating the first cross table based on the respective non-zero data in the first cross table comprises: determining a coefficient of variation of each row based on the respective non-zero data contained in each row of the first cross table, the coefficient of variation of any row characterizing the degree of dispersion of the non-zero data contained in that row; and updating the first cross table based on the coefficient of variation of each row.
14. The method of claim 3, wherein the number of updated first cross tables is at least two; the determining a plurality of third clusters based on the updated first cross table comprises: determining an evaluation index of each updated first cross table, the evaluation index representing the accuracy of the updated first cross table; determining, from the updated first cross tables and based on their evaluation indexes, a second cross table corresponding to the maximum evaluation index; and determining the plurality of third clusters based on the second cross table corresponding to the maximum evaluation index.
15. The method according to any one of claims 1 to 14, wherein the acquiring a plurality of first clusters obtained by clustering a plurality of object data based on a first clustering algorithm comprises: clustering the gene expression matrices of a plurality of cells using the Leiden algorithm to obtain the plurality of first clusters; and the acquiring a plurality of second clusters obtained by clustering the plurality of object data based on a second clustering algorithm comprises: clustering the gene expression matrices of the plurality of cells using a state-of-the-art (SOTA) algorithm based on deep learning feature expression to obtain the plurality of second clusters.
16. The method according to any one of claims 1 to 14, further comprising, after the determining a plurality of third clusters based on the first cross table: acquiring a plurality of fourth clusters obtained by clustering the plurality of object data based on a third clustering algorithm, wherein one fourth cluster comprises at least one object data; determining a third cross table based on the plurality of third clusters and the plurality of fourth clusters, wherein a row of data of the third cross table characterizes each object data in one third cluster and a column of data of the third cross table characterizes each object data in one fourth cluster; and determining a plurality of fifth clusters based on the third cross table, wherein one fifth cluster comprises at least one object data.
17. A cluster processing apparatus, the apparatus comprising: an acquisition module, configured to acquire a plurality of first clusters obtained by clustering a plurality of object data based on a first clustering algorithm, wherein one first cluster comprises at least one object data, the object data comprises a gene expression matrix of cells, the gene expression matrix comprises a plurality of rows and a plurality of columns of data, the rows of the gene expression matrix represent the expression of a gene under different environmental conditions or at different time points, the columns of the gene expression matrix represent the expression of the gene under different conditions or in a sample, and the data in any row and any column represents the expression level of the gene in one cell; the acquisition module being further configured to acquire a plurality of second clusters obtained by clustering the plurality of object data based on a second clustering algorithm, wherein one second cluster comprises at least one object data; and a determining module, configured to determine a first cross table based on the plurality of first clusters and the plurality of second clusters, wherein a row of data of the first cross table characterizes each object data in one first cluster and a column of data of the first cross table characterizes each object data in one second cluster; the determining module being further configured to determine a plurality of third clusters based on the first cross table, wherein one third cluster comprises at least one object data.
18. The apparatus of claim 17, wherein one first cluster corresponds to one first cluster identifier, one second cluster corresponds to one second cluster identifier, and the object data comprises an object identifier; the determining module is configured to determine the object identifiers corresponding to each cluster identifier set based on the first cluster identifier of each first cluster, the object identifiers of the object data contained in each first cluster, the second cluster identifier of each second cluster and the object identifiers of the object data contained in each second cluster, wherein one cluster identifier set comprises one first cluster identifier and one second cluster identifier, and to determine the first cross table based on the number of object identifiers corresponding to each cluster identifier set.
19. The apparatus of claim 17, wherein the first cross table comprises a plurality of non-zero data, each non-zero data characterizing the number of identical object data contained in both the first cluster corresponding to the row in which the non-zero data is located and the second cluster corresponding to the column in which the non-zero data is located; the determining module is configured to update the first cross table based on the respective non-zero data in the first cross table, and to determine the plurality of third clusters based on the updated first cross table.
20. The apparatus of claim 19, wherein the determining module is configured to determine non-zero data meeting a condition in each column based on the respective non-zero data contained in each column of the first cross table, and to update the first cross table based on the non-zero data contained in each column and the non-zero data meeting the condition in each column.
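
The column update rules recited in claims 5, 6, 10 and 11 can be illustrated with the following minimal Python sketch. It is a reading of the claims under stated assumptions, not the claimed implementation: the function names, the default thresholds `n` and `target_cv`, the use of `0` as the "target character", and the use of the population standard deviation for the coefficient of variation are all assumptions. The claim 6 branch follows one plausible reading of that claim, in which the next largest value is kept and the largest value absorbs the remaining counts.

```python
from statistics import mean, pstdev

TARGET = 0  # assumed stand-in for the "target character" of the claims


def update_column_by_ratio(column, n=2.0, target=TARGET):
    """Update one cross-table column by comparing its two largest counts."""
    if n <= 1:
        raise ValueError("n must be greater than 1")
    nonzero = [i for i, v in enumerate(column) if v != 0]
    if len(nonzero) < 2:
        return list(column)  # a single non-zero cell leaves nothing to merge
    ordered = sorted(nonzero, key=lambda i: column[i], reverse=True)
    top, second = ordered[0], ordered[1]
    updated = [target if v != 0 else v for v in column]
    if column[top] > n * column[second]:
        # Claim 5: the dominant cell takes the sum of the whole column.
        updated[top] = sum(column)
    else:
        # Claim 6 (as read here): keep the runner-up, merge the rest into the top.
        updated[second] = column[second]
        updated[top] = sum(column) - column[second]
    return updated


def update_column_by_cv(column, target_cv=0.5, target=TARGET):
    """Update one column based on the coefficient of variation of its non-zero data."""
    nonzero = [v for v in column if v != 0]
    if len(nonzero) < 2:
        return list(column)
    cv = pstdev(nonzero) / mean(nonzero)  # dispersion relative to the mean
    if cv <= target_cv:
        return list(column)  # claim 11: the column remains unchanged
    top = max(range(len(column)), key=lambda i: column[i])
    updated = [target if v != 0 else v for v in column]
    updated[top] = sum(column)  # claim 11: the maximum takes the column sum
    return updated


# Hypothetical columns of a cross table (counts per first cluster).
print(update_column_by_ratio([10, 2, 1], n=2.0))
print(update_column_by_ratio([5, 4, 1], n=2.0))
print(update_column_by_cv([10, 2, 1], target_cv=0.5))
print(update_column_by_cv([5, 4, 5], target_cv=0.5))
```

Both functions return a new list and leave the input column untouched, so several candidate updated tables can be produced from the same first cross table and compared afterwards.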

Description

Clustering method, clustering device, electronic equipment and computer readable storage medium

Technical Field

The embodiments of the application relate to the technical field of data processing, and in particular to a clustering method, a clustering device, electronic equipment and a computer readable storage medium.

Background

In the technical field of data processing, clustering is a basic data processing mode. Clustering groups a plurality of object data into a plurality of clusters, each cluster including at least one object data, which facilitates further data processing by analyzing the object data in each cluster.

In the related art, various clustering algorithms can be used to cluster a plurality of object data. For example, clustering algorithms for clustering a plurality of cell data include, but are not limited to, the Leiden algorithm, the Louvain algorithm, and state-of-the-art (SOTA) algorithms based on deep learning feature expression. In general, only one clustering algorithm is used to cluster a plurality of object data, so the accuracy of the clustering result is low, which affects the accuracy of subsequent data processing.

Disclosure of Invention

The embodiments of the application provide a clustering method, a clustering device, electronic equipment and a computer readable storage medium, which can be used to solve the problem in the related art that the accuracy of the clustering result is low and subsequent data processing is affected.
In one aspect, an embodiment of the present application provides a clustering method, the method including: acquiring a plurality of first clusters obtained by clustering a plurality of object data based on a first clustering algorithm, wherein one first cluster comprises at least one object data; acquiring a plurality of second clusters obtained by clustering the plurality of object data based on a second clustering algorithm, wherein one second cluster comprises at least one object data; determining a first cross table based on the plurality of first clusters and the plurality of second clusters, wherein a row of data of the first cross table characterizes each object data in one first cluster and a column of data of the first cross table characterizes each object data in one second cluster; and determining a plurality of third clusters based on the first cross table, wherein one third cluster comprises at least one object data.

In another aspect, an embodiment of the present application provides a cluster processing apparatus, the apparatus including: an acquisition module, configured to acquire a plurality of first clusters obtained by clustering a plurality of object data based on a first clustering algorithm, wherein one first cluster comprises at least one object data; the acquisition module being further configured to acquire a plurality of second clusters obtained by clustering the plurality of object data based on a second clustering algorithm, wherein one second cluster comprises at least one object data; and a determining module, configured to determine a first cross table based on the plurality of first clusters and the plurality of second clusters, wherein a row of data of the first cross table characterizes each object data in one first cluster and a column of data of the first cross table characterizes each object data in one second cluster; the determining module being further configured to determine a plurality of third clusters based on the first cross table, wherein one third cluster comprises at least one object data.

In one possible implementation, one first cluster corresponds to one first cluster identifier, one second cluster corresponds to one second cluster identifier, and the object data includes an object identifier; the determining module is configured to determine the object identifiers corresponding to each cluster identifier set based on the first cluster identifier of each first cluster, the object identifiers of the object data contained in each first cluster, the second cluster identifier of each second cluster and the object identifiers of the object data contained in each second cluster, wherein one cluster identifier set comprises one first cluster identifier and one second cluster identifier, and to determine the first cross table based on the number of object identifiers corresponding to each cluster identifier set.

In one possible implementation, the first cross table includes a plurality of non-zero data, each non-zero data characterizing the number of identical object data contained in both the first cluster corresponding to the row in which the non-zero data is located and the second cluster corresponding to the column in which the non-zero data is located; the determining module is used for updating the first cross table based on the non-zero data in the first cross table, and determining a plurality of third cl