CN-117349287-B - Mixed storage-based maximum binary cluster enumeration method

CN117349287BCN 117349287 BCN117349287 BCN 117349287BCN-117349287-B

Abstract

The invention discloses a maximum bipartite group enumeration method based on mixed storage, which comprises the steps of obtaining bipartite graphs G (U, V, E), wherein U and V are two disjoint vertex sets in the bipartite graphs, E is an edge set, setting a threshold t for determining a storage mode of a running sub-graph, storing G 0 by adopting an adjacency list if the number of the vertices in the U is greater than t, otherwise storing G 0 by adopting a bitmap and storing U, V, G 0 is the initial parameter of the function, recursively calls BicliqueFind (L, R, C, G) function, and outputs all the counting results of the maximum bipartite groups or the maximum bipartite groups to the user. The invention adopts a hybrid storage method, not only reserves the low storage space overhead characteristic of the adjacent table storage, but also combines the characteristic of high efficiency of bitmap storage and calculation. Aiming at a large number of tasks which only need to be calculated on the small graph in the process of the binary cluster enumeration, the method greatly improves the execution speed of the binary cluster enumeration task by dynamically constructing the sub-bitmap.

Inventors

HE SHUIBING
PAN ZHE
LI XU

Assignees

浙江大学

Dates

Publication Date: 20260505
Application Date: 20231024

Claims (2)

1. The maximum binary cluster enumeration method based on the mixed storage is characterized in that a graph structure and a sub-graph structure in operation are stored in a mode of combining an adjacency list and a bitmap, so that the maximum binary cluster enumeration efficiency is improved, and the method specifically comprises the following steps: (1) Obtaining a bipartite graph G 0 (U, V, E), wherein U and V are two disjoint vertex sets in the bipartite graph, E is an edge set, and the bipartite graph comprises a bipartite graph of genes and characters in a gene analysis scene, a bipartite graph formed by users and commodities in an electronic commerce scene or a bipartite graph formed by users and interests in a social network scene; (2) Setting a threshold t for determining a storage mode of the running sub-graph, wherein if the number of the vertexes in the U is greater than t, G 0 adopts an adjacency list for storage, otherwise, G 0 adopts a bitmap for storage; (3) Recursively calling BicliqueFind (L, R, C, G) functions by taking U, ∅, V, G 0 as initial parameters of the functions, and specifically comprises the following sub-steps: 3-1, obtaining four parameters L, R, C and G corresponding to the current enumeration tree node, wherein L=U, R= ∅, C=V and G=G 0 ; 3-2, judging whether the set C is an empty set, if so, exiting the function, otherwise, executing the step 3-3; 3-3 creating a copy (L ', R', C ', G') of the current enumeration tree node (L, R, C, G); 3-4, selecting one vertex v from the set C, calculating an intersection of the set L and a neighbor set of the vertex v, and assigning a calculation result to L'; 3-5, judging whether the number of the vertexes in the L is larger than t and the number of the vertexes in the L 'is smaller than or equal to t, if so, constructing a sub-bitmap according to neighbors of all the vertexes in the L' and assigning the sub-bitmap to G ', otherwise, assigning the original image G stored according to the adjacency list to G'; 3-6, assigning the vertex in the R set to R'; 3-7. Traversing each vertex v c in set C, adding v c to R 'if the L' set is a subset of the neighbor set of v c , adding v c to C 'if the L' set and the neighbor set of v c have intersections; 3-8, judging whether vertexes V x except R U C exist in the V set so that the L' set is a subset of a V x neighbor set, if yes, executing the step 3-10, otherwise, executing the step 3-9; 3-9. Outputting the extremely large dyadic (L ', R'). Recursively executing the function BicliqueFind (L ', R', C ', G'); 3-10, deleting the vertex v from the C set, and executing the step 3-2; (4) And outputting all the maximum bipartite groups or the counting results of the maximum bipartite groups to a user.
2. The hybrid storage based maximum two-tuple enumeration method of claim 1, wherein t is set to 32.

Description

Mixed storage-based maximum binary cluster enumeration method Technical Field The invention belongs to the field of computer graph calculation and data mining, and particularly relates to a method for extremely large binary cluster enumeration based on hybrid storage. Background In the field of data mining, extremely large binary cluster enumeration is an important research topic. It has wide application in a number of fields such as genetic analysis, electronic commerce, overlapping community detection, GNN information aggregation, social relationship recommendation, and the like. Taking gene analysis as an example, we can consider genes and traits as two vertex sets of bipartite graph, and the genes correspond to traits to form the edges of bipartite graph. One group of genes corresponds to one group of characters to form two groups. The maximum dyads are dyads which are not completely contained by any other dyads in the dyads, and the corresponding relation of a group of genes and a group of characters can be described to the greatest extent. By enumerating the largest bipartite clusters in the bipartite graph, biologists can be assisted in understanding the connection relationship between genes and traits. By analyzing the pattern of linkage between different genes, the interactions between genes and their combined effect on trait performance can be revealed. At present, the maximum bipartite enumeration method mainly adopts a recursive algorithm and is realized based on an enumeration tree. Specifically, for a given bipartite graph G (U, V, E), the existing method recursively generates a power set of the V set as the right vertex set R first, and then expands the right vertex set R into the corresponding bipartite group (L, R). The algorithm enumerates all the biggest dyads and prunes the non-biggest dyads therein. In recent years, the optimization of the extremely large binary cluster enumeration method mainly comprises the aspects of vertex ordering, pruning invalid branches, parallelism improvement and the like. With respect to the storage format of the extremely large bipartite group in the bipartite graph, there are common adjacency list storage and bitmap storage. The adjacency list uses arrays of non-fixed length to store the structure of the graph, each array element storing the neighbor information of the vertex. Through the adjacency list, the neighbors of a certain vertex can be efficiently searched. The current main stream of the method for extremely dividing the enumeration of groups usually adopts the form of an adjacency list for storage, because only the actually existing edges need to be stored, thereby saving the storage space. However, in the adjacency list, since the number of neighbors of a vertex may be large, the time required to calculate co-adjacency between vertices may be relatively long. The bitmap maps the set of vertices of the graph into a sequence of bits, each bit representing the connection of the existence of one vertex to the other vertices. By using bit operation, the connection relation between the vertexes can be calculated quickly. However, because bitmaps require a large amount of memory, the bitmap-based maximum two-tuple enumeration algorithm is only applicable to small-scale graphs. Disclosure of Invention Aiming at the problem of low calculation efficiency caused by a single storage format in the prior art, the invention provides a mixed storage-based extremely large binary cluster enumeration method. Aiming at the situation that the number of vertices is small when the maximum bipartite enumeration algorithm is operated, the method adopts a strategy of dynamically generating sub bipartite graphs based on bitmaps and completing enumeration tasks. By means of the mixed storage mode, the method can control storage cost brought by the bitmap at the same time, and the enumeration calculation process of the maximum bipartite group is accelerated by utilizing bit operation based on the bitmap, so that the enumeration efficiency of the maximum bipartite group is remarkably improved. The technical scheme adopted by the invention is as follows: The maximum binary cluster enumeration method based on the mixed storage stores a graph structure and a sub-graph structure in the running process in a mode of combining an adjacency list and a bitmap, improves the maximum binary cluster enumeration efficiency, and specifically comprises the following steps: (1) Obtaining a bipartite graph G 0 (U, V, E), wherein U and V are two disjoint vertex sets in the bipartite graph, E is an edge set, and the bipartite graph comprises a bipartite graph of genes and characters in a gene analysis scene, a bipartite graph formed by users and commodities in an electronic commerce scene or a bipartite graph formed by users and interests in a social network scene; (2) Setting a threshold t for determining a storage mode of the running sub-graph, wherein if the number of the vertexes in the U is