CN-120748500-B - Method for spatial domain identification based on data interpolation and cell type deconvolution
Abstract
The invention provides a method for spatial domain identification based on data interpolation and cell type deconvolution, belonging to the technical field of bioinformatics. It aims to solve the problem that traditional methods cannot make use of the gap information between adjacent spots in low-resolution spatial transcriptome data and cannot fully integrate prior cell type information at the level of tissue spatial structure. Data interpolation is performed on the preprocessed spatial transcriptome data, and cell type deconvolution is performed in combination with single-cell RNA sequencing data. A deep learning model based on a graph convolutional network is constructed. The model is trained with a self-supervised contrastive learning strategy on the gene expression information, spatial position information and cell type information of the spatial transcriptome data after cell type deconvolution. Spatial domain identification is then performed on the data to be tested using the trained model.
Inventors
- ZHANG TIANJIAO
- LI SHENGHE
- WANG GUOHUA
Assignees
- Northeast Forestry University (东北林业大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20250606
Claims (8)
- 1. A method for spatial domain identification based on data interpolation and cell type deconvolution, comprising: Step 1, acquiring a spatial transcriptome dataset and a single-cell RNA sequencing dataset, and preprocessing the acquired datasets; Step 2, performing data interpolation on the preprocessed spatial transcriptome data and performing cell type deconvolution in combination with the single-cell RNA sequencing data; wherein Step 2 specifically comprises: Step 2.1, dividing the original gene expression data in the preprocessed spatial transcriptome data row by row in the horizontal direction, dividing the original gene expression data in each row along diagonal lines, and thereby delimiting a complete gap region between a center point, a left vertex and a right vertex; Step 2.2, calculating the centroid coordinates $(x_c, y_c)$ of each gap region from the original spatial coordinates of the gene expression data points of that divided region, and taking the centroid coordinates as the spatial coordinates of the gap region; Step 2.3, interpolating gene expression data for the divided gap regions by a neighborhood interpolation method, based on the original gene expression data and the spatial position information of the cells corresponding to the original gene expression data; Step 2.4, performing cell type deconvolution on the interpolated gene expression data in combination with the preprocessed single-cell RNA sequencing data, and inferring the cell composition of each gene expression sampling point; the centroid coordinates $(x_c, y_c)$ of a gap region are calculated as: $x_c = \frac{1}{N}\sum_{i=1}^{N} x_i$ (1), $y_c = \frac{1}{N}\sum_{i=1}^{N} y_i$ (2); in formulas (1) and (2), $(x_i, y_i)$ are the coordinates of the points in each region after the division, and N is the total number of points in the region; the interpolation of the gene expression data is expressed as: $g(x_c, y_c) = \sum_{j} w_j\, g_j$ (3); in formula (3), $g(x_c, y_c)$ is the gene expression at spatial coordinates $(x_c, y_c)$, $w_j$ are coefficients calculated from weighting factors such as spatial distance, and $g_j$ are the gene expression values of the other points in the neighborhood; Step 3, constructing a deep learning model based on a graph convolutional network; the deep learning model based on the graph convolutional network constructed in Step 3 comprises: a spatial adjacency graph construction module, a gene expression random graph construction module, an encoder-decoder module and a self-supervised contrastive learning module; the spatial adjacency graph construction module is used for constructing a weighted adjacency graph from the gene expression data after cell type deconvolution, and for obtaining the spatial distance information and the cell type similarity measure between the gene expression data points; the gene expression random graph construction module generates a gene expression random graph G' = (V', E') by adding perturbation to an original neighborhood graph G = (V, E) generated from the weighted adjacency graph; the GCN-based encoder module is used for modeling the gene expression data in the original neighborhood graph G = (V, E) and the gene expression random graph G' = (V', E'); the self-supervised contrastive learning module is used for generating positive and negative sample pairs from the original neighborhood graph G = (V, E) and the gene expression random graph G' = (V', E') to optimize the node representations and train the deep learning model; Step 4, inputting the gene expression data of the spatial transcriptome data after cell type deconvolution into the deep learning model, and training the deep learning model with a self-supervised contrastive learning strategy, combining the gene expression information and spatial position information of the spatial transcriptome data with the cell type information obtained by cell type deconvolution; Step 5, performing spatial domain identification on the data to be tested based on the trained model.
- 2. The method for spatial domain identification based on data interpolation and cell type deconvolution according to claim 1, wherein the spatial transcriptome dataset of Step 1 includes gene expression data of spots and spatial location information of the spots, and the single-cell RNA sequencing dataset includes gene expression data of cells and cell type information, wherein the gene expression data matrix in the spatial transcriptome dataset has dimensions spot × gene, the spots being arranged in a regular honeycomb pattern on the tissue section, and the gene expression data matrix in the single-cell RNA sequencing dataset has dimensions cell × gene.
- 3. The method for spatial domain identification based on data interpolation and cell type deconvolution according to claim 1, wherein the preprocessing of the acquired datasets in Step 1 specifically comprises: Step 1.1, applying logarithmic transformation and normalization to the gene expression data of the spatial transcriptome dataset; Step 1.2, selecting a preset number of highly variable genes from the normalized spatial transcriptome data and the single-cell RNA sequencing data, thereby completing the preprocessing of the spatial transcriptome dataset and the single-cell RNA sequencing dataset.
- 4. The method for spatial domain identification based on data interpolation and cell type deconvolution according to claim 1, wherein training the deep learning model in Step 4 comprises: Step 4.1, inputting the gene expression data after cell type deconvolution into the spatial adjacency graph construction module to obtain the spatial distance information and the cell type similarity measure between the gene expression data points; Step 4.2, inputting the original neighborhood graph G = (V, E) generated from the weighted adjacency graph into the gene expression random graph construction module, and adding perturbation to generate the gene expression random graph G' = (V', E'); Step 4.3, inputting the original neighborhood graph G = (V, E) and the gene expression random graph G' = (V', E') into the GCN-based encoder module, and modeling the gene expression data in the input graphs; Step 4.4, inputting the original neighborhood graph G = (V, E) and the gene expression random graph G' = (V', E') into the self-supervised contrastive learning module, and training the deep learning model with a self-supervised contrastive learning strategy.
- 5. The method for spatial domain identification based on data interpolation and cell type deconvolution according to claim 4, wherein Step 4.1 specifically comprises: Step 4.1.1, converting the gene expression data after cell type deconvolution into an undirected graph comprising a plurality of nodes V and edges E, wherein the nodes V are all the sampling points and the edges E are the connections between the points; Step 4.1.2, calculating the Euclidean distance between sampling points from the spatial coordinates of each point in the undirected graph to obtain a distance matrix; Step 4.1.3, adopting a K-nearest-neighbor strategy, selecting for each point the K sampling points with the smallest Euclidean distance as its neighbor nodes to obtain a preliminary adjacency matrix, wherein the entry of the preliminary adjacency matrix is 1 if point j is a neighbor of point i and 0 otherwise, adding the preliminary adjacency matrix to its transpose, and truncating values greater than 1 so that all sampling point connections are binarized; Step 4.1.4, introducing the cell type information, calculating the cell type similarity for every connected pair of points in the preliminary adjacency matrix, and calculating the edge weight from the cell type similarity and the base weight of the spatial information.
- 6. The method for spatial domain identification based on data interpolation and cell type deconvolution according to claim 4, wherein Step 4.3 specifically comprises: Step 4.3.1, constructing the gene expression data, in combination with the weighted adjacency graph, into a graph $G = (V, E)$ suitable for the graph convolutional network, wherein V is the node set, E is the edge set, the node set V comprises all sampling points, and the edge set E represents the connections between nodes; Step 4.3.2, setting, for the graph $G$, $X \in \mathbb{R}^{N \times d}$ as the node feature matrix, wherein N is the number of nodes and d is the feature dimension of each node, and A as the binarized adjacency matrix between nodes; Step 4.3.3, normalizing the adjacency matrix A to obtain a normalized adjacency matrix $\tilde{A} = D^{-1/2} A D^{-1/2}$, wherein D is a diagonal matrix whose diagonal elements are $D_{ii} = \sum_{j} A_{ij}$, the degree of node i; Step 4.3.4, inputting the normalized adjacency matrix $\tilde{A}$, together with the trainable weight matrix and bias term of the corresponding layer of the graph convolutional network, into the encoder to obtain the latent representation $Z$ of the sampling points; Step 4.3.5, inputting the latent representation $Z$ of the sampling points into a decoder, mapping it back into the original gene expression space through the decoding process, and optimizing the parameters of the deep learning model by minimizing the gene expression reconstruction loss function.
- 7. The method for spatial domain identification based on data interpolation and cell type deconvolution according to claim 4, wherein Step 4.4 specifically comprises: Step 4.4.1, taking the original gene expression graph G and the random gene expression graph G' as inputs, generating two representation matrices, one for each graph, by means of the graph convolutional network-based encoder, and generating an embedded representation of each node by aggregating the neighbor information of each node in the matrix, yielding a local context vector for each node; Step 4.4.2, constructing positive and negative sample pairs, wherein the representation of sampling point i in the original graph G and its local context vector form a positive sample pair, and the representation of sampling point i in the random graph G' and its context vector form a negative sample pair; Step 4.4.3, constructing a positive-sample contrastive loss function and a negative-sample loss function, and optimizing the quality of the node representations by maximizing the mutual information between positive sample pairs while minimizing the mutual information between negative sample pairs; Step 4.4.4, constructing the overall training loss function of the deep learning model from the positive-sample contrastive loss function, the negative-sample loss function and the self-reconstruction loss function, and training the deep learning model according to the overall training loss function to obtain the latent representations of all sampling points.
- 8. The method for spatial domain identification based on data interpolation and cell type deconvolution according to claim 1, wherein Step 5 specifically comprises: performing cluster analysis, using the mclust method, on the latent representations of all points produced by training, to obtain the spatial domain identification results for the different cell tissues.
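Steps 2.2 and 2.3 of claim 1 (centroid of a gap region, then neighborhood interpolation) can be illustrated with a minimal Python sketch. The claim only states that the coefficients are calculated from weighting factors such as spatial distance, so the normalized inverse-distance weighting used here is an assumption, and the function names `centroid` and `interpolate_expression` and the toy coordinates are hypothetical.

```python
import math

def centroid(points):
    """Centroid of a gap region: the mean of its bounding point coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def interpolate_expression(target, neighbor_coords, neighbor_expr, eps=1e-9):
    """Weighted average of the neighbours' expression vectors.

    ASSUMPTION: weights are inverse distances normalised to sum to 1;
    the claim does not fix a specific weighting scheme.
    """
    raw = [1.0 / (math.dist(target, c) + eps) for c in neighbor_coords]
    total = sum(raw)
    weights = [w / total for w in raw]
    n_genes = len(neighbor_expr[0])
    return [
        sum(w * expr[g] for w, expr in zip(weights, neighbor_expr))
        for g in range(n_genes)
    ]

# Toy example: three measured spots bounding one triangular gap region,
# each carrying a two-gene expression vector.
spots = [(0.0, 0.0), (100.0, 0.0), (50.0, 86.6)]
expr = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]

gap_xy = centroid(spots)                                  # coordinates of the virtual spot
gap_expr = interpolate_expression(gap_xy, spots, expr)    # its interpolated expression
```

Because the centroid of this toy triangle is almost equidistant from the three spots, the interpolated vector is close to their plain average; with unevenly spaced neighbors the nearer spots would dominate.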
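The graph construction in claim 5 (distance matrix, K-nearest-neighbor selection, symmetrization with truncation, cell-type weighting) can be sketched with NumPy as follows. The claim does not give the similarity measure or the formula combining it with the base spatial weight, so the cosine similarity of cell type proportion vectors and the simple product used here are assumptions, as are the function names and toy data.

```python
import numpy as np

def knn_adjacency(coords: np.ndarray, k: int) -> np.ndarray:
    """Binary, symmetric K-nearest-neighbour adjacency from spot coordinates."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))       # Euclidean distance matrix
    np.fill_diagonal(dist, np.inf)                 # a spot is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]      # k closest spots per row
    adj = np.zeros(dist.shape)
    rows = np.repeat(np.arange(coords.shape[0]), k)
    adj[rows, nearest.ravel()] = 1.0               # preliminary (directed) matrix
    return np.minimum(adj + adj.T, 1.0)            # add transpose, truncate values > 1

def cell_type_edge_weights(adj, proportions, base_weight=1.0):
    """Edge weights for connected pairs only.

    ASSUMPTION: similarity is the cosine similarity of the deconvolved
    cell type proportions, multiplied by a constant base spatial weight.
    """
    norms = np.linalg.norm(proportions, axis=1, keepdims=True)
    unit = proportions / np.maximum(norms, 1e-12)
    similarity = unit @ unit.T
    return adj * base_weight * similarity

# Toy example: four spots on a line, two cell types.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
proportions = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])

adj = knn_adjacency(coords, k=1)
weights = cell_type_edge_weights(adj, proportions)
```

In this toy case spots 1 and 2 are spatial neighbors but have no cell types in common, so their edge weight drops to zero even though the binary adjacency connects them.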
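The encoder-decoder pass of claim 6 can be sketched in NumPy under a few stated assumptions: the adjacency matrix is normalized symmetrically by the degree matrix, the encoder is a single graph convolution with a ReLU activation, the decoder is a second graph convolution back to gene space, and the reconstruction loss is a summed squared error. None of these specifics is fixed by the claim text as extracted here, and all names, shapes and random weights are illustrative only.

```python
import numpy as np

def normalize_adjacency(adj, add_self_loops=False):
    """Symmetric normalisation D^{-1/2} A D^{-1/2}, with D the degree matrix."""
    a = adj + np.eye(adj.shape[0]) if add_self_loops else adj
    deg = a.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    return a * inv_sqrt[:, None] * inv_sqrt[None, :]

def gcn_layer(a_norm, x, weight, bias, activation=True):
    """One graph convolution: aggregate neighbours, then apply a linear map."""
    h = a_norm @ x @ weight + bias
    return np.maximum(h, 0.0) if activation else h   # ReLU is an assumed choice

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
x = rng.normal(size=(4, 5))                    # N = 4 spots, d = 5 genes (toy data)

w_enc, b_enc = rng.normal(size=(5, 2)), np.zeros(2)   # untrained encoder parameters
w_dec, b_dec = rng.normal(size=(2, 5)), np.zeros(5)   # untrained decoder parameters

a_norm = normalize_adjacency(adj)
z = gcn_layer(a_norm, x, w_enc, b_enc)                          # latent representation
x_rec = gcn_layer(a_norm, z, w_dec, b_dec, activation=False)    # back to gene space
recon_loss = float(((x - x_rec) ** 2).sum())   # ASSUMPTION: squared-error reconstruction loss
```

In a real training loop the weights would be updated by gradient descent on `recon_loss` together with the contrastive terms; the sketch only shows one forward pass.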
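The positive and negative sample pairs of claim 7 can be illustrated with the following NumPy sketch. It assumes a Deep-Graph-Infomax-style setup: the local context vector is a sigmoid of the mean neighbor embedding, a bilinear discriminator scores each (representation, context) pair, and the two losses are binary cross-entropy terms. The claim text as extracted here does not give these formulas, so every one of these choices, and the loss weights in `overall_loss`, are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_context(h, adj):
    """Context vector per node: sigmoid of the mean neighbour embedding (assumed readout)."""
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    return sigmoid(adj @ h / deg)

def contrastive_losses(h, h_corrupt, adj, w_disc, eps=1e-9):
    """Positive- and negative-pair losses with a bilinear discriminator (assumed form)."""
    g = local_context(h, adj)
    pos_score = sigmoid(np.einsum("id,de,ie->i", h, w_disc, g))          # (h_i, g_i)
    neg_score = sigmoid(np.einsum("id,de,ie->i", h_corrupt, w_disc, g))  # (h'_i, g_i)
    loss_pos = -np.mean(np.log(pos_score + eps))
    loss_neg = -np.mean(np.log(1.0 - neg_score + eps))
    return float(loss_pos), float(loss_neg)

def overall_loss(recon, loss_pos, loss_neg, recon_weight=1.0, contrast_weight=1.0):
    """Hypothetical weighted sum of the three loss terms."""
    return recon_weight * recon + contrast_weight * (loss_pos + loss_neg)

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
h = rng.normal(size=(4, 3))              # embeddings from the original graph G (toy values)
h_corrupt = h[rng.permutation(4)]        # stand-in for embeddings from the random graph G'
w_disc = rng.normal(size=(3, 3))         # untrained discriminator weights

loss_pos, loss_neg = contrastive_losses(h, h_corrupt, adj, w_disc)
total = overall_loss(recon=0.0, loss_pos=loss_pos, loss_neg=loss_neg)
```

Minimizing `loss_pos` pushes the discriminator score of true pairs towards 1, while minimizing `loss_neg` pushes the score of perturbed pairs towards 0, which is one common way of realizing the "maximize for positive pairs, minimize for negative pairs" objective described in Step 4.4.3.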
Description
Method for spatial domain identification based on data interpolation and cell type deconvolution
Technical Field
The invention relates to a method for spatial domain identification based on data interpolation and cell type deconvolution, belonging to the technical field of bioinformatics.
Background
In recent years, spatial transcriptomics technology has developed rapidly. It not only allows cell heterogeneity to be characterized systematically while preserving the spatial context of the tissue, but also provides powerful tools for precisely delineating spatial domains within the tissue, thereby deepening the understanding of cell-environment interaction mechanisms.
Currently, the dominant spatial transcriptome technologies fall into two main categories. (i) In situ hybridization or in situ sequencing methods, e.g., seqFISH, MERFISH, STARmap and FISSEQ. These methods achieve resolution at the cellular and even subcellular level by directly detecting preset RNA targets in situ, but their multiplexing capability is limited, typically to hundreds to thousands of genes. (ii) In situ capture techniques, such as Spatial Transcriptomics (ST), Slide-seq, ZipSeq and HDST. Such methods allow unbiased analysis of the whole transcriptome by capturing transcripts in situ in the tissue and sequencing them ex situ. However, this spot-based capture strategy has a limitation: a platform represented by 10X Visium has 4992 spots of 55 μm diameter printed on the slide with a center-to-center spacing of about 100 μm, so that gene expression in about 70% of the area is not measured, leaving a large information gap.
Traditional spatial domain identification methods rely mainly on the spatial proximity assumption, i.e., they divide the spatial domain by comparing the overall expression patterns of adjacent spots.
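The figure of roughly 70% unmeasured area quoted above for the 10X Visium layout can be checked with a short calculation. The sketch assumes an ideal honeycomb (hexagonal) lattice in which every spot owns one lattice cell of area (√3/2)·pitch²; the variable names are illustrative.

```python
import math

spot_diameter_um = 55.0   # spot diameter from the text
pitch_um = 100.0          # center-to-center spacing from the text

spot_area = math.pi * (spot_diameter_um / 2) ** 2     # area measured by one spot
cell_area = (math.sqrt(3) / 2) * pitch_um ** 2        # lattice area per spot (assumed ideal hexagonal packing)

covered_fraction = spot_area / cell_area
unmeasured_fraction = 1.0 - covered_fraction

print(f"covered: {covered_fraction:.1%}, unmeasured: {unmeasured_fraction:.1%}")
```

Under this idealization a little over a quarter of the tissue area lies under a spot, which is consistent with the "about 70%" gap stated in the text.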
Non-spatial clustering methods (e.g., K-means, Louvain, Seurat) rely on gene expression information only and neglect the spatial relationships between spots, so they often fail to capture the continuity of the tissue, to reflect its actual spatial pattern, or to identify its intrinsic structural features. To overcome this limitation, various improved methods that incorporate spatial information have been proposed in recent years. For example, Giotto adopts a hidden Markov random field (HMRF) model and identifies spatial domains with coherent gene expression patterns by making full use of the spatial dependencies between spots; BayesSpace adopts a Bayesian statistical method and integrates spatial neighborhood information to optimize the clustering result; STAGATE and GAADE incorporate a graph attention auto-encoder framework to further integrate spatial information with gene expression patterns. In addition, studies have shown that histopathological images can effectively predict gene expression, and methods such as SpaGCN and stLearn therefore improve spatial domain identification by integrating neighborhood information and morphological features, while DeepST further combines image features, gene expression and spatial location. Despite the significant advances made by these methods, it remains difficult to fully mine the topological features of the spatial relationships between cells or spots in unlabeled data. To this end, some approaches introduce contrastive learning, which alleviates the challenge posed by limited labeled data by learning a low-dimensional representation from the inherent structure and properties of the data.
For example, conST performs contrastive learning at the spot, subgroup and global levels while integrating multi-modal data; ConGI devises a joint learning strategy comprising gene expression, images and contrastive loss functions between them; GraphST integrates spatial information with gene expression through graph self-supervised contrastive learning; MuCoST strengthens the dependence between spots by fusing gene expression correlation and spatial proximity; and, more recently, STAIG integrates gene expression, spatial coordinates and histological images by combining graph contrastive learning with high-performance feature extraction. However, although these approaches have made breakthroughs in combining spatial information, histological images and gene expression data, they still rely primarily on the spatial proximity assumption, i.e., the assumption that adjacent spots have similar gene expression patterns. Although this assumption reveals, to some extent, the spatial distribution of gene expression in tissues and the regional nature of cell types, it does not fully exploit the cellular organization information in the gene expression data. Thus, when identifying functionally related cell populations driven by a particular biological process, these methods may fail to capture their spatial distribution and functional characteristics accurately, thereby limiting the precise de