CN-121999866-A - Spatial transcriptome cell composition inference method based on graph contrast learning
Abstract
The invention provides a method for inferring the cell composition of spatial transcriptome data based on graph contrast learning. The method comprises three parts. First, a cross-modal low-dimensional feature space data simulation module maps single-cell data and spatial transcriptome data to latent embedding spaces and aligns their distributions, which reduces the difference between feature spaces and enhances the structural consistency of the cross-modal data. Second, a dual heterogeneous graph construction and latent relation learning module builds two heterogeneous graphs of cells and spatial points based on gene expression similarity, and a meta-path latent relation inference strategy captures the high-order latent relations of the spatial data in the two graphs. Third, a graph feature optimization mechanism based on structural contrast learning maximizes the consistency of cross-modal nodes in the embedding space and yields the final cell composition inference result. The invention infers the cell composition of spatial transcriptome data, can be used for spatial localization of cells and tissue microenvironment analysis, and provides a reliable computational basis for related biomedical research.
Inventors
- DONG YAO
- Ai Hanzhen
- ZHANG ZHIYU
- Qi Haojia
- HAN SONGNIAN
- HUANG JINLONG
Assignees
- Hebei University of Technology (河北工业大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-26
Claims (5)
- 1. A spatial transcriptome cell composition inference method based on graph contrast learning, comprising the following steps:
  Step one: preprocess a single-cell multi-omics dataset to obtain scRNA-seq data $X_{sc}$ and spatial transcriptome data $X_{st}$.
  Step two: based on the cell type annotation $Y_{sc}$ of the scRNA-seq data and the spatial coordinate matrix $C_{st}$ of the spatial transcriptome data, apply a cross-modal pseudo-spatial-point generation strategy to generate a pseudo-spatial-point expression matrix $X_{pst}$ and the pseudo-spatial-point cell composition $Y_{pst}$.
  Step three: concatenate the single-cell data $X_{sc}$ with the real spatial points $X_{st}$, and the single-cell data $X_{sc}$ with the pseudo-spatial points $X_{pst}$, to construct two joint input matrices, and feed them into two variational autoencoders for latent representation learning. The encoder maps the input data to latent variables in the form of a Gaussian distribution, and the latent representation $z$ is generated by reparameterization:
  $$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
  where $\mu$ and $\sigma$ are the mean vector and standard deviation vector of the latent distribution output by the encoder, $\odot$ denotes element-wise multiplication, $\epsilon$ is a random noise vector following the standard normal distribution, and $I$ is the identity matrix. This yields the low-dimensional latent representations of the single-cell/real-spatial-point joint input and the single-cell/pseudo-spatial-point joint input, respectively:
  $$Z_r = \mathrm{Enc}(\mathrm{concat}(X_{sc}, X_{st})), \quad Z_p = \mathrm{Enc}(\mathrm{concat}(X_{sc}, X_{pst}))$$
  where $\mathrm{Enc}(\cdot)$ denotes the encoding function of the variational autoencoder, $Z_r$ is the joint latent representation obtained by co-encoding single cells with real spatial points, and $Z_p$ is the joint latent representation obtained by co-encoding single cells with pseudo-spatial points.
  Step four: using a cross-modal consensus connection strategy, construct in the latent representation spaces $Z_r$ and $Z_p$ the real-spatial-point/cell heterogeneous graph $\mathcal{G}^{het}_r$ and the pseudo-spatial-point/cell heterogeneous graph $\mathcal{G}^{het}_p$.
  Step five: design a meta-path latent relation inference strategy that learns, from the heterogeneous graphs $\mathcal{G}^{het}_r$ and $\mathcal{G}^{het}_p$, the homogeneous graph of real spatial points $\mathcal{G}_r$ and the homogeneous graph of pseudo-spatial points $\mathcal{G}_p$, and obtain the weighted adjacency matrix of real spatial points $A_r$ and the weighted adjacency matrix of pseudo-spatial points $A_p$.
  Step six: design a shared graph convolutional encoder for the weighted adjacency matrices $A_r$ and $A_p$ and feed each into the encoder to extract the real-spatial-point node embeddings $H_r$ and the pseudo-spatial-point node embeddings $H_p$. Aggregating each node's own features with those of its neighboring nodes realizes representation learning based on graph structure:
  $$H_r = f_\theta(\tilde{A}_r), \quad H_p = f_\theta(\tilde{A}_p)$$
  where $\tilde{A}_r = A_r + I$, $\tilde{A}_p = A_p + I$, $I$ is the identity matrix, and $f_\theta$ denotes the shared graph convolutional encoder.
  Step seven: based on the real-spatial-point node embeddings $H_r$ and the pseudo-spatial-point node embeddings $H_p$, construct a joint optimization objective, optimize the model parameters through a structural contrast learning mechanism and semantic supervised learning, and output the spatial-point cell composition inference result $\hat{Y}_{st}$.
- 2. The spatial transcriptome cell composition inference method based on graph contrast learning according to claim 1, wherein step two adopts a cross-modal pseudo-spatial-point generation strategy that, according to the cell type annotation information $Y_{sc}$ of the scRNA-seq data and the spatial coordinate matrix $C_{st}$ of the spatial transcriptome data, generates a pseudo-spatial-point expression matrix $X_{pst}$ and the corresponding true cell composition proportions $Y_{pst}$ of the pseudo-spatial points. By neighborhood matching between single cells and real spatial points, the strategy preserves the local topology of the spatial transcriptome while generating an aligned pseudo-spatial-point expression matrix $X_{pst}$. It alleviates the sparsity of single-cell data, enhances the structural consistency of pseudo-spatial points and real spatial points in the latent space, and provides a more accurate structural prior for cell composition inference. The specific implementation process is as follows:
  Step b1: map the scRNA-seq data matrix $X_{sc}$ and the spatial transcriptomics data matrix $X_{st}$ to a shared low-dimensional latent space. In this space, use a K-nearest-neighbor algorithm based on a similarity measure to find, for each cell node, its two nearest real spatial points, whose real coordinates are $(x_1, y_1)$ and $(x_2, y_2)$. Compute the midpoint of the line connecting them, $\left(\frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2}\right)$, and assign it to the cell as its simulated coordinates. Traversing all cells yields the simulated coordinate matrix of the cells $C_{sc}$.
  Step b2: construct a hexagonal grid over the region covered by the spatial transcriptomics data and assign each cell to the corresponding grid unit according to the simulated coordinate matrix $C_{sc}$. Constrain and balance the number of cells in each grid unit so that it stays within a preset range. Sum the gene expression vectors of all cells in each non-empty grid unit to generate the expression vector of a pseudo-spatial point, and stack all pseudo-spatial-point expression vectors by rows to form the pseudo-spatial-point expression matrix $X_{pst}$.
  Step b3: based on the cell type annotation information of the single cells, count and normalize the number of single cells of each cell type assigned to the same grid unit to obtain the cell composition of that pseudo-spatial point, and stack all composition vectors by rows to form the true cell composition matrix of the pseudo-spatial points $Y_{pst}$.
- 3. The spatial transcriptome cell composition inference method based on graph contrast learning according to claim 1, wherein in step four a cross-modal consensus connection strategy is used to construct, in the latent representation spaces $Z_r$ and $Z_p$, the real-spatial-point/cell heterogeneous graph $\mathcal{G}^{het}_r$ and the pseudo-spatial-point/cell heterogeneous graph $\mathcal{G}^{het}_p$. Specifically, the strategy follows the principle of bidirectional neighborhood consistency to screen cross-modal connections between spatial point nodes and cell nodes: a cross-modal node pair is confirmed as an edge in the graph structure, and included in the edge set $E_r$ or $E_p$ respectively, only when each node of the pair appears in the other's nearest-neighbor search. This produces heterogeneous graph structures $\mathcal{G}^{het}_r$ and $\mathcal{G}^{het}_p$ that are biologically consistent, low in noise, and structurally stable. The specific implementation process is as follows:
  Step c1: in the latent representation space $Z_r$, for each real spatial point node, search for its $k$ nearest cell nodes to obtain the candidate set of cross-modal connections from spatial points to cells. In the same latent representation space, for each cell node, search for its $k$ nearest real spatial point nodes to obtain the candidate set of cross-modal connections from cells to spatial points.
  Step c2: for any real spatial point node $s_i$ and any cell node $c_j$, if in step c1 node $s_i$ lists node $c_j$ as one of its nearest cell nodes and node $c_j$ also lists node $s_i$ as one of its nearest spatial point nodes, the spatial-point/cell node pair is judged to satisfy the bidirectional neighborhood consistency condition and is marked as a valid cross-modal consensus connection.
  Step c3: retain only the cross-modal consensus connections that pass the bidirectional neighborhood consistency check and formally construct them as edges of the heterogeneous graph, forming the real-spatial-point/cell heterogeneous graph $\mathcal{G}^{het}_r = (V_r, E_r)$, where $V_r$ includes all real spatial point nodes and cell nodes and $E_r$ is the set of edges generated by the cross-modal consensus connection strategy.
  Step c4: similarly, repeat steps c1 to c3 in the latent representation space $Z_p$ to construct the pseudo-spatial-point/cell heterogeneous graph $\mathcal{G}^{het}_p = (V_p, E_p)$, where $V_p$ includes all pseudo-spatial-point nodes and cell nodes and $E_p$ is the set of edges generated by the cross-modal consensus connection strategy.
- 4. The spatial transcriptome cell composition inference method based on graph contrast learning according to claim 1, wherein the meta-path latent relation inference strategy designed in step five learns, from the heterogeneous graphs $\mathcal{G}^{het}_r$ and $\mathcal{G}^{het}_p$, the homogeneous graph of real spatial points $\mathcal{G}_r$ and the homogeneous graph of pseudo-spatial points $\mathcal{G}_p$, and obtains the weighted adjacency matrices $A_r$ and $A_p$. Specifically, the strategy characterizes, in the homogeneous graphs $\mathcal{G}_r$ and $\mathcal{G}_p$, the high-order latent structural relations between spatial points that are transmitted via cell nodes, and the weighted adjacency matrices $A_r$ and $A_p$ serve as the structural input of the subsequent graph convolutional encoder. The specific implementation process is as follows:
  Step d1: in $\mathcal{G}^{het}_r$, define the meta-path spatial point-cell-spatial point, abbreviated S-C-S, meaning that two spatial points are considered to have a latent structural association if they share at least one adjacent cell node. On this basis, construct a homogeneous graph $\mathcal{G}_r$ containing only real spatial points. Its node set is the set of all real spatial points, and its edge set comprises two parts: the original direct spot-spot connections and the new edges obtained by S-C-S meta-path inference.
  Step d2: for the pseudo-spatial-point heterogeneous graph $\mathcal{G}^{het}_p$, similarly construct the meta-path pseudo-spatial point-cell-pseudo-spatial point, abbreviated P-C-P. To keep the pseudo-spatial-point structure semantically consistent with the real-spatial-point structure, retain only the cell nodes that appear in the real S-C-S meta-paths, and finally obtain the homogeneous graph of pseudo-spatial points $\mathcal{G}_p$.
  Step d3: compute the Euclidean distances $d^r_{ij}$ and $d^p_{ij}$ between connected nodes in the latent representation space and normalize them by the maximum distance within each graph, where $d^r_{ij}$ and $d^p_{ij}$ are the latent-space distances and $w^r_{ij}$ and $w^p_{ij}$ denote the normalized edge weights between nodes in the real-spatial-point graph and the pseudo-spatial-point graph, respectively. This yields the real-spatial-point weighted adjacency matrix $A_r$ and the pseudo-spatial-point weighted adjacency matrix $A_p$, which represent the topological connections between nodes in the real-spatial-point graph and the pseudo-spatial-point graph, respectively, and serve as the structural input for subsequent graph convolutional learning.
- 5. The spatial transcriptome cell composition inference method based on graph contrast learning according to claim 1, wherein step seven designs a structural contrast learning mechanism that, from the node embeddings $H_r$ of the real spatial points and the node embeddings $H_p$ of the pseudo-spatial points, constructs a joint optimization objective combining a structural consistency constraint and a semantic supervision constraint, optimizes the model parameters, and outputs the spatial-point cell composition inference result $\hat{Y}_{st}$. The specific implementation process is as follows:
  Step e1: minimize the structural distribution difference constructed from the node embeddings $H_r$ and $H_p$ so that the two graph structures form a consistent distribution pattern in the latent space, and compute the structural contrast loss $\mathcal{L}_{str}$ as the Kullback-Leibler (KL) divergence between $P_r$ and $P_p$, where $P_r$ and $P_p$ denote the empirical distributions of the node embeddings of the real-spatial-point graph and the pseudo-spatial-point graph, respectively.
  Step e2: based on the pseudo-spatial-point node embeddings $H_p$, predict the cell composition of the pseudo-spatial points $\hat{Y}_{pst}$ with a shared decoder:
  $$\hat{Y}_{pst} = \mathrm{softmax}(H_p W + b)$$
  where $W$ is a learnable projection matrix, $b$ is the bias, and the $\mathrm{softmax}$ function converts an input vector into a probability distribution vector.
  Step e3: compute the KL divergence between the predicted cell composition of the pseudo-spatial points $\hat{Y}_{pst}$ and the true cell composition of the pseudo-spatial points $Y_{pst}$ to obtain the semantic supervision loss $\mathcal{L}_{sem}$.
  Step e4: combine the structural contrast loss and the semantic supervision loss by weighting to construct the overall loss function:
  $$\mathcal{L} = \lambda_1 \mathcal{L}_{str} + \lambda_2 \mathcal{L}_{sem}$$
  where $\lambda_1$ is the weight hyperparameter of the structural contrast loss and $\lambda_2$ is the weight hyperparameter of the semantic supervision loss. Minimize $\mathcal{L}$ by the back-propagation algorithm to optimize all parameters of the model.
  Step e5: use the optimized shared decoder to decode the real-spatial-point node embeddings $H_r$ and obtain the predicted cell composition of the real spatial points $\hat{Y}_{st}$:
  $$\hat{Y}_{st} = \mathrm{softmax}(H_r W + b)$$
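For illustration only, steps b1 to b3 of claim 2 could be sketched as below. This is a minimal sketch under stated assumptions, not the claimed implementation: all function and variable names are hypothetical, a square grid stands in for the hexagonal grid of step b2, the balancing of cell counts per grid unit is omitted, and plain Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import defaultdict


def two_nearest(point, candidates):
    """Indices of the two candidates closest to `point` (Euclidean distance)."""
    order = sorted(range(len(candidates)),
                   key=lambda j: math.dist(point, candidates[j]))
    return order[0], order[1]


def simulate_cell_coords(cell_latent, spot_latent, spot_coords):
    """Step b1: place each cell at the midpoint of its two nearest real spots."""
    coords = []
    for z in cell_latent:
        a, b = two_nearest(z, spot_latent)
        (xa, ya), (xb, yb) = spot_coords[a], spot_coords[b]
        coords.append(((xa + xb) / 2, (ya + yb) / 2))
    return coords


def build_pseudo_spots(cell_coords, cell_expr, cell_types, type_names, bin_size=1.0):
    """Steps b2-b3: bin cells on a grid, sum expression, normalise type counts.

    Assumption: a square grid of side `bin_size` replaces the hexagonal grid.
    """
    bins = defaultdict(list)
    for i, (x, y) in enumerate(cell_coords):
        bins[(math.floor(x / bin_size), math.floor(y / bin_size))].append(i)

    n_genes = len(cell_expr[0])
    expr_rows, comp_rows = [], []
    for key in sorted(bins):  # only non-empty grid units exist in `bins`
        members = bins[key]
        # Pseudo-spot expression: sum of member cells' expression vectors.
        expr_rows.append([sum(cell_expr[i][g] for i in members)
                          for g in range(n_genes)])
        # Pseudo-spot composition: normalised count of each cell type.
        counts = [sum(1 for i in members if cell_types[i] == t)
                  for t in type_names]
        comp_rows.append([c / len(members) for c in counts])
    return expr_rows, comp_rows
```

Each returned row corresponds to one non-empty grid unit, so the two lists play the roles of the pseudo-spatial-point expression matrix and its true cell composition matrix.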
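The bidirectional neighborhood check of claim 3 (steps c1 to c3) and the S-C-S meta-path of claim 4 (steps d1 and d3) could be sketched as follows. This is a hedged illustration: the names are hypothetical, the edge weight is assumed to be the latent distance divided by the maximum distance in the graph (the exact normalization formula is not reproduced here), and the original direct spot-spot connections of step d1 are left out.

```python
import math


def k_nearest(query, targets, k):
    """Set of indices of the k targets closest to `query`."""
    order = sorted(range(len(targets)),
                   key=lambda j: math.dist(query, targets[j]))
    return set(order[:k])


def consensus_edges(spot_z, cell_z, k):
    """Steps c1-c3: keep (spot, cell) pairs that are mutual k-nearest neighbours."""
    spot_to_cell = [k_nearest(z, cell_z, k) for z in spot_z]
    cell_to_spot = [k_nearest(z, spot_z, k) for z in cell_z]
    return {(s, c)
            for s, cells in enumerate(spot_to_cell)
            for c in cells
            if s in cell_to_spot[c]}


def metapath_spot_graph(edges, spot_z):
    """Steps d1/d3: link spots that share a cell; weight edges by distance.

    Assumption: weight = latent Euclidean distance / maximum distance in graph.
    """
    cells_of = {}
    for s, c in edges:
        cells_of.setdefault(s, set()).add(c)

    # S-C-S meta-path: two spots are linked if they share at least one cell.
    pairs = [(i, j) for i in cells_of for j in cells_of
             if i < j and cells_of[i] & cells_of[j]]
    dists = {p: math.dist(spot_z[p[0]], spot_z[p[1]]) for p in pairs}
    d_max = max(dists.values(), default=1.0) or 1.0  # guard against zero
    return {p: d / d_max for p, d in dists.items()}
```

The same two functions would be applied once to the real spatial points and once to the pseudo-spatial points; the additional filtering of step d2 is not shown.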
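The joint objective of claim 5 (steps e1 to e5) could look roughly like the sketch below. Several choices here are assumptions made for illustration and are not confirmed by the claim text: the "empirical distribution" of the node embeddings is approximated by a softmax over the column means, the KL divergence is taken from the true composition to the predicted one, and all names and default weights are hypothetical.

```python
import math


def softmax(v):
    """Numerically stable softmax of a list of floats."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def decode(h, W, b):
    """Shared decoder (steps e2/e5): softmax(h W + b) for one node embedding."""
    logits = [sum(h[i] * W[i][k] for i in range(len(h))) + b[k]
              for k in range(len(b))]
    return softmax(logits)


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def embedding_distribution(H):
    """Assumed stand-in for the empirical distribution: softmax of column means."""
    n = len(H)
    return softmax([sum(row[d] for row in H) / n for d in range(len(H[0]))])


def total_loss(H_real, H_pseudo, Y_pseudo, W, b, lam_str=1.0, lam_sem=1.0):
    """Step e4: weighted sum of structural and semantic losses."""
    # Step e1: structural loss between the two embedding distributions.
    l_str = kl(embedding_distribution(H_real), embedding_distribution(H_pseudo))
    # Steps e2-e3: semantic loss on pseudo-spots, whose true composition is known.
    preds = [decode(h, W, b) for h in H_pseudo]
    l_sem = sum(kl(y, p) for y, p in zip(Y_pseudo, preds)) / len(preds)
    return lam_str * l_str + lam_sem * l_sem
```

After training, calling `decode` on each row of the real-spatial-point embeddings gives one composition vector per spot, which corresponds to step e5.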
Description
Spatial transcriptome cell composition inference method based on graph contrast learning
Technical Field
The invention relates to the technical fields of bioinformatics and machine learning, and in particular to a spatial transcriptome cell composition inference method based on graph contrast learning. It is mainly applied to spatial transcriptome data analysis, fine characterization of cell types and their spatial distribution in complex tissue microenvironments, analysis of tumor microenvironments and tissue heterogeneity, research on disease mechanisms, precision medicine, and biomarker discovery.
Background
With the continuous development of high-throughput sequencing technology, spatial transcriptomics and single-cell RNA sequencing are widely applied to tissue structure analysis and cell heterogeneity research. Spatial transcriptome technology measures gene expression levels while preserving the spatial position information of the tissue, and provides an important means for studying the tissue microenvironment and spatial regulation mechanisms. However, limited by sequencing resolution and experimental cost, existing spatial transcriptome techniques often fail to reach single-cell resolution: one spatial point often contains mixed signals from multiple cell types, which makes it difficult to resolve the cell composition at a spatial location directly. In contrast, single-cell RNA sequencing obtains high-resolution transcriptome expression profiles at the single-cell scale, which helps in studying cell types, states, and differentiation trajectories, but it loses the spatial position information of the cells during data acquisition and cannot reflect the real spatial distribution of cells in the tissue.
Therefore, integrating spatial transcriptome data with single-cell RNA sequencing data and computationally inferring the proportions of different cell types within each spatial point has become a key research direction in spatial transcriptome analysis.
To address this problem, researchers have proposed a variety of spatial transcriptome cell composition inference methods based on scRNA-seq annotation information. Existing methods fall into three broad categories: statistical model-based methods, regression-based methods, and deep learning-based methods.
Statistical model-based methods typically rely on Poisson or negative binomial distribution assumptions to construct probabilistic models that describe spatial-point gene expression and infer cell type abundance. Such methods depend strongly on distribution assumptions and prior parameters, and their stability and generalization ability tend to be limited in the presence of technical noise, outliers, or differences between sequencing platforms.
Regression-based methods typically represent the expression of a spatial point as a linear combination of single-cell expression profiles and solve it by non-negative least squares, matrix factorization, or similar means. These methods relax the constraints imposed by distribution assumptions to some extent and improve computational efficiency, but the linear modeling assumption makes it difficult to learn the nonlinear interactions among cells in complex tissues, which limits them in highly heterogeneous tissue environments.
In recent years, with the development of graph neural networks and deep learning techniques, some studies have begun to infer cell composition by constructing graph structures that jointly learn latent nonlinear features of spatial transcriptome data and single-cell data. These methods improve the accuracy and robustness of spatial cell composition inference to a certain extent.
However, existing deep learning methods still face several challenges in practical applications. First, existing methods construct pseudo-spatial points by random or simple rule-based aggregation of single cells, but because of factors such as differences in sequencing depth, batch effects, and the absence of spatial constraints, the generated pseudo-spatial points often deviate systematically from real spatial data. Second, existing methods adopt a simple feature-based connection strategy when constructing a cross-modal graph structure, which makes it difficult to fully capture the high-order semantic relations between spatial points and cells. Third, in cross-modal representation learning, existing methods often learn directly from a unified latent embedding and fail to make full use of the consistency information between modalities. In general, the core challenge for current methods that integrate spatial and single-cell transcriptomes to infer cell composition is to effectively bridge the inherent distribution differences between cross-modal data and, on this basis, to build a robust model capable of adequately capturing the higher-order se