CN-120561857-B - Hypergraph network-based multi-omics data fusion method and system

Abstract

The invention belongs to the technical field of biological data processing and discloses a hypergraph network-based multi-omics data fusion method and system. The invention models omics data with a hypergraph structure, which mines high-order interaction information more efficiently and reveals complex association patterns among biological entities more comprehensively. The hypergraph fusion method integrates the omics-specific hypergraphs into a unified representation of cross-omics data, strengthens the model's capacity to represent complex multi-relational data, and provides more comprehensive technical support for various downstream tasks.

Inventors

  • QIU HANG
  • ZHANG WEN

Assignees

  • University of Electronic Science and Technology of China (电子科技大学)

Dates

Publication Date
20260505
Application Date
20250523

Claims (9)

  1. A hypergraph network-based multi-omics data fusion method, characterized by comprising the following steps: acquiring multi-omics data of a plurality of cell line samples and standardizing the data; constructing a specific hypergraph for each omics data type from the standardized multi-omics data; performing hypergraph aggregation independently on the specific hypergraph of each omics data type to obtain, for each omics data type, updated node features, updated hyperedge features and a corresponding weighted incidence matrix; performing multi-omics fusion on the updated node features of all omics data types based on a graph-level attention mechanism to obtain fused node features; identifying shared hyperedges and specific hyperedges among the updated hyperedges of all omics data types, performing cross-omics fusion on the shared hyperedge features by applying the graph-level attention mechanism, and retaining the specific hyperedge features, wherein the shared hyperedges are hyperedges that have the same node connection relationship in the specific hypergraphs of all omics data types, and the specific hyperedges are the remaining hyperedges, which are not common to all omics data types; merging the fused shared hyperedge features, the retained specific hyperedge features and the corresponding incidence matrices to obtain a fused hypergraph together with its incidence matrix and hyperedge features; and performing interactive updating of nodes and hyperedges using the fused node features, the hyperedge features of the fused hypergraph and the incidence matrix of the fused hypergraph, and obtaining cell line features through hypergraph convolution.
  2. The multi-omics data fusion method of claim 1, wherein the standardization comprises at least one of the following applied to the multi-omics data: noise filtering, missing value imputation, feature normalization and biological identifier mapping, cross-platform batch correction, dimensionality reduction, data alignment, Gaussian regularization, and the like.
  3. The multi-omics data fusion method of claim 1, wherein the multi-omics data comprises gene expression, somatic mutation, and copy number variation.
  4. The multi-omics data fusion method of claim 1, wherein constructing the specific hypergraph for each omics data type from the standardized multi-omics data comprises: calculating a corresponding cell line similarity matrix for each omics data type in the standardized multi-omics data, and constructing the specific hypergraph of each omics data type with a KNN algorithm based on the cell line similarity matrix.
  5. The multi-omics data fusion method of claim 4, wherein index similarity is used to calculate the corresponding cell line similarity matrix for each omics data type in the standardized multi-omics data.
  6. The multi-omics data fusion method of claim 1, wherein performing hypergraph aggregation independently on the specific hypergraph of each omics data type comprises: learning from the initial hyperedge features of the specific hypergraph with a hyperedge attention mechanism to obtain updated hyperedge features; letting the updated hyperedge features interact with the nodes of the specific hypergraph and quantifying the interaction weights between each node and its adjacent hyperedges; and performing multi-layer graph convolution on the interaction weights and the node features of the specific hypergraph to obtain the updated node features and the corresponding weighted incidence matrix.
  7. The multi-omics data fusion method of claim 1, wherein the initial feature of a hyperedge is generated by taking the arithmetic mean of the features of all nodes connected by that hyperedge.
  8. The multi-omics data fusion method of claim 1, wherein the graph-level attention coefficients of the graph-level attention mechanism are calculated by: concatenating the updated node features of all omics data types into a global matrix; and, based on the global matrix, calculating a raw weight for each omics data type using a shared parameter vector and normalizing the raw weights to obtain the graph-level attention coefficients.
  9. A hypergraph network-based multi-omics data fusion system, comprising: a data processing module, configured to acquire multi-omics data of a plurality of cell line samples and standardize the data; a construction module, configured to construct a specific hypergraph for each omics data type from the standardized multi-omics data; an aggregation module, configured to perform hypergraph aggregation independently on the specific hypergraph of each omics data type to obtain, for each omics data type, updated node features, updated hyperedge features and a corresponding weighted incidence matrix; a fusion module, configured to perform multi-omics fusion on the updated node features of all omics data types based on a graph-level attention mechanism to obtain fused node features; a hyperedge processing module, configured to identify shared hyperedges and specific hyperedges among the updated hyperedges of all omics data types, perform cross-omics fusion on the shared hyperedge features by applying the graph-level attention mechanism, and retain the specific hyperedge features; a merging module, configured to merge the fused shared hyperedge features, the retained specific hyperedge features and the corresponding incidence matrices to obtain a fused hypergraph together with its incidence matrix and hyperedge features; and a convolution module, configured to perform interactive updating of nodes and hyperedges using the fused node features, the hyperedge features of the fused hypergraph and the incidence matrix of the fused hypergraph, and to obtain cell line features through hypergraph convolution.
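
Illustrative sketch (not part of the claims). The hypergraph construction of claims 4 and 7 can be pictured with a small Python example. It assumes that each cell line spawns one hyperedge containing itself and its k most similar cell lines, which is one common way to build a KNN hypergraph; the claims do not fix this detail. The function names knn_incidence and mean_hyperedge_features, the toy similarity matrix S and the toy features X are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def knn_incidence(similarity: Matrix, k: int) -> List[List[int]]:
    """Build an n x n incidence matrix H from a cell line similarity matrix.

    Assumption: hyperedge j groups node j with its k most similar nodes.
    Rows index nodes (cell lines), columns index hyperedges.
    """
    n = len(similarity)
    H = [[0] * n for _ in range(n)]
    for j in range(n):
        # Rank every other node by its similarity to node j, highest first.
        others = sorted(
            (i for i in range(n) if i != j),
            key=lambda i: similarity[j][i],
            reverse=True,
        )
        for i in [j] + others[:k]:
            H[i][j] = 1
    return H


def mean_hyperedge_features(H: List[List[int]], X: Matrix) -> Matrix:
    """Initial hyperedge feature = arithmetic mean of its member node features."""
    n_nodes, n_edges, dim = len(H), len(H[0]), len(X[0])
    features = []
    for j in range(n_edges):
        members = [i for i in range(n_nodes) if H[i][j]]
        features.append(
            [sum(X[i][d] for i in members) / len(members) for d in range(dim)]
        )
    return features


# Toy data: four cell lines forming two obvious pairs, two features each.
S = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
X = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 6.0]]

H = knn_incidence(S, k=1)
E0 = mean_hyperedge_features(H, X)
```

Rows of H index cell lines and columns index hyperedges, so H plays the role of the incidence matrix that the later aggregation and convolution steps operate on, and E0 holds the initial hyperedge features.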

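Illustrative sketch (not part of the claims). The graph-level attention of claim 8 and the shared/specific hyperedge split of claim 1 can be pictured as follows. The claims state only that a raw weight per omics data type is computed with a shared parameter vector and then normalized; the scoring rule used here (mean dot product between the shared vector q and each node feature, followed by softmax) is an assumption, as are the function names and toy inputs.

```python
import math
from typing import FrozenSet, List, Set, Tuple

Matrix = List[List[float]]


def graph_level_attention(features: List[Matrix], q: List[float]) -> List[float]:
    """Return one attention coefficient per omics data type.

    Assumption: raw weight = mean over nodes of (q . x_i); normalization = softmax.
    """
    scores = []
    for X in features:  # one block of the stacked global matrix per omics type
        score = sum(sum(qd * xd for qd, xd in zip(q, row)) for row in X) / len(X)
        scores.append(score)
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_node_features(features: List[Matrix], alphas: List[float]) -> Matrix:
    """Fused node features = attention-weighted sum over omics data types."""
    n, dim = len(features[0]), len(features[0][0])
    return [
        [sum(a * X[i][d] for a, X in zip(alphas, features)) for d in range(dim)]
        for i in range(n)
    ]


def split_hyperedges(
    hypergraphs: List[List[Set[int]]],
) -> Tuple[Set[FrozenSet[int]], List[Set[FrozenSet[int]]]]:
    """Separate hyperedges shared by every omics hypergraph from the rest.

    Each hypergraph is a list of hyperedges given as sets of node indices.
    A hyperedge is shared when the same node set occurs in all hypergraphs.
    """
    edge_sets = [{frozenset(edge) for edge in hg} for hg in hypergraphs]
    shared = set.intersection(*edge_sets)
    specific = [edges - shared for edges in edge_sets]
    return shared, specific


# Toy data: two omics data types, two cell lines, two features each.
omics_a = [[1.0, 0.0], [0.0, 1.0]]
omics_b = [[3.0, 0.0], [0.0, 3.0]]
q = [1.0, 1.0]  # hypothetical shared parameter vector (learned in practice)

alphas = graph_level_attention([omics_a, omics_b], q)
fused = fuse_node_features([omics_a, omics_b], alphas)
shared, specific = split_hyperedges([[{0, 1}, {2, 3}], [{0, 1}, {1, 2}]])
```

In this toy case the hyperedge {0, 1} appears in both hypergraphs and is therefore shared, while {2, 3} and {1, 2} remain specific to their respective omics data types.
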
Description

Hypergraph network-based multi-omics data fusion method and system

Technical Field

The invention relates to the technical field of biological data processing, and in particular to a hypergraph network-based multi-omics data fusion method and system.

Background

With the rapid development of high-throughput sequencing technology, researchers can acquire massive amounts of different types of omics data, such as genomic, transcriptomic and epigenomic data, which provide an important basis for revealing the molecular mechanisms of life activities. Although single-omics analysis can reflect the biological regularities of a specific molecular layer, it struggles to describe the complex regulatory mechanisms of a biological system comprehensively. Multi-omics analysis integrates information from different molecular layers and can therefore analyze the intrinsic regularities of biological processes systematically, which is of great significance in fields such as prognosis and diagnosis, biomarker discovery and drug response prediction. However, because different omics data are high-dimensional, strongly heterogeneous and noisy, and because the relationships among them are complex and nonlinear, how to fuse these multidimensional data effectively and mine their underlying information in depth remains a significant challenge.

In recent years, graph network-based deep learning methods, which represent complex interactions among biological entities as graph structures, have shown excellent performance in feature extraction, complex biological data processing and large-scale data mining, and provide a new technical approach for multi-omics data fusion. However, current graph structure-based multi-omics data fusion methods still have the following disadvantages.

First, traditional graph neural network learning methods focus only on binary associations between pairs of nodes and ignore the complex high-order relationships among different omics data. Second, most existing fusion strategies integrate multi-omics features by simple feature concatenation, so that deep feature associations within each omics are not fully mined and nonlinear interactions across omics are left unexplored.

Disclosure of Invention

The invention aims to solve at least one of the above technical problems and provides a hypergraph network-based multi-omics data fusion method and system.

To achieve the above purpose, the first technical scheme adopted by the invention is as follows: a hypergraph network-based multi-omics data fusion method, comprising the following steps: acquiring multi-omics data of a plurality of cell line samples and standardizing the data; constructing a specific hypergraph for each omics data type from the standardized multi-omics data; performing hypergraph aggregation independently on the specific hypergraph of each omics data type to obtain, for each omics data type, updated node features, updated hyperedge features and a corresponding weighted incidence matrix; performing multi-omics fusion on the updated node features of all omics data types based on a graph-level attention mechanism to obtain fused node features; identifying shared hyperedges and specific hyperedges among the updated hyperedges of all omics data types, performing cross-omics fusion on the shared hyperedge features by applying the graph-level attention mechanism, and retaining the specific hyperedge features; merging the fused shared hyperedge features, the retained specific hyperedge features and the corresponding incidence matrices to obtain a fused hypergraph together with its incidence matrix and hyperedge features; and performing interactive updating of nodes and hyperedges using the fused node features, the hyperedge features of the fused hypergraph and the incidence matrix of the fused hypergraph, and obtaining cell line features through hypergraph convolution.

Preferably, the standardization comprises at least one of the following applied to the multi-omics data: noise filtering, missing value imputation, feature normalization and biological identifier mapping, cross-platform batch correction, dimensionality reduction, data alignment, Gaussian regularization, and the like.

Preferably, the multi-omics data comprises gene expression, somatic mutation, and copy number variation.

Preferably, constructing the specific hypergraph for each omics data type from the standardized multi-omics data comprises respectively calculating a corresponding cell line similarity matrix for each omics data type in the standardized multi-omics data, and respectively constructing the