CN-121983145-A - Spatial transcriptome domain identification method based on double-decoding encoder and neighborhood contrast


Abstract

The invention discloses a spatial transcriptome domain identification method based on a double-decoding encoder and neighborhood contrast. The method comprises the following steps: preprocessing spatial transcriptome data, constructing a spatial adjacency graph, learning encoder features based on graph attention, jointly reconstructing with dual decoders, performing neighborhood-aware contrastive learning, optimizing a joint loss function, and clustering spatial domains. The method avoids the destruction of spatial structure caused by random perturbation, which improves the reliability of contrastive learning; it enables the latent representation to reflect spatial information more comprehensively; it dynamically assigns weights according to the importance of different neighbors, which improves modeling capability under complex tissue structures; and it achieves high accuracy and strong stability in spatial domain identification.

Inventors

ZHANG LONG
WU JINGLI
XU HEJUN

Assignees

Guangxi Normal University (广西师范大学)

Dates

Publication Date: 20260505
Application Date: 20260121

Claims (6)

1. A spatial transcriptome domain identification method based on a double-decoding encoder and neighborhood contrast, characterized by comprising the following steps:
Step one, preprocessing spatial transcriptome data: obtaining spatial transcriptome raw data comprising an original gene expression matrix $X^{\mathrm{raw}} \in \mathbb{R}^{N \times G}$, wherein $\mathbb{R}$ represents the set of real numbers, $N$ represents the total number of spatial sample points in the set of spatial sample points, and $G$ represents the total number of original genes obtained by measurement, together with the spatial coordinate information corresponding to each spatial sample point; then performing logarithmic transformation, normalization and standardization on the original gene expression matrix, and screening highly variable genes to obtain a preprocessed gene expression feature matrix $X \in \mathbb{R}^{N \times M}$, wherein $M$ represents the number of highly variable genes after preprocessing;
Step two, constructing a spatial adjacency graph: based on the spatial coordinate information of each spatial sample point, calculating the Euclidean distances between spatial sample points, constructing a spatial adjacency graph $\mathcal{G}$ by the K-nearest-neighbor method, taking each spatial sample point as a graph node to obtain a node set $V$, and establishing edges between spatially adjacent sample points to form a spatial adjacency matrix $A$;
Step three, encoder feature learning based on graph attention: under the constraint of the spatial adjacency graph, for each target node, adaptively calculating the neighbor weight of each neighbor node according to the feature similarity between the target node and the neighbor node, performing weighted aggregation of the feature information of the neighbor nodes, and updating the features of the target node through a multi-layer graph attention network to obtain the low-dimensional latent representation corresponding to each spatial sample point;
Step four, dual-decoder joint reconstruction, comprising: (1) reconstructing the original gene expression matrix, and calculating a gene expression reconstruction loss based on the difference between the reconstruction result and the original gene expression matrix, so as to constrain the low-dimensional latent representation to preserve the gene expression features; and (2) reconstructing the spatial adjacency by an inner product operation between the low-dimensional latent representations, and calculating a graph structure reconstruction loss based on the difference between the reconstructed spatial adjacency and the original spatial adjacency, so as to constrain the low-dimensional latent representation to preserve the spatial topology; thereby jointly constraining gene expression feature information and spatial topology information in the same latent representation space;
Step five, neighborhood-aware contrastive learning: for the current anchor sample point, selecting the K neighbor sample points closest to the anchor sample point in the spatial adjacency graph as positive samples, and taking the remaining sample points that are not adjacent to the anchor sample point in the spatial adjacency graph as negative samples; calculating a neighborhood-aware contrast loss, without damaging the original spatial structure, based on the low-dimensional latent representations corresponding to the anchor sample point, the positive samples and the negative samples, wherein the neighborhood-aware contrast loss is centered on the low-dimensional latent representation of the anchor sample point and measures the relative similarity between the anchor sample point and the low-dimensional latent representations of its positive samples and negative samples, so as to reduce the distance between the anchor sample point and the positive samples in the latent representation space and increase the distance between the anchor sample point and the negative samples, thereby enhancing the discriminative capability of the latent representation;
Step six, optimizing a joint loss function: constructing a joint loss function based on the gene expression reconstruction loss and the graph structure reconstruction loss obtained in step four and the neighborhood-aware contrast loss obtained in step five, minimizing the joint loss function, and iteratively optimizing the model parameters of the graph attention encoder, the gene expression decoder and the graph structure decoder by back propagation to obtain the final low-dimensional latent representation;
Step seven, spatial domain clustering: based on the low-dimensional latent representation output by the graph attention encoder once the model parameters have converged under the joint loss function of step six, clustering the spatial sample points with an unsupervised clustering algorithm to obtain the spatial domain division result of the spatial transcriptome data, wherein the unsupervised clustering algorithm is a clustering method based on a Gaussian mixture model, and the final spatial domain division result is obtained by means of the Mclust() function in the R package mclust.
2. The spatial transcriptome domain identification method based on a double-decoding encoder and neighborhood contrast according to claim 1, wherein the specific steps of constructing the spatial adjacency graph in step two are as follows: based on the spatial coordinate information of each spatial sample point, the Euclidean distances between spatial sample points are calculated, and a spatial adjacency graph $\mathcal{G} = (V, E)$ is constructed by the K-nearest-neighbor method, wherein the node set $V$ represents all spatial sample points and the edge set $E$ represents the adjacency relations between spatial sample points; a spatial adjacency matrix $A$ is further defined to represent the adjacency between spatial sample points, wherein, when spatial sample point $j$ is in the K-nearest-neighbor set of spatial sample point $i$, $A_{ij} = 1$, representing that an undirected edge connects spatial sample point $i$ and spatial sample point $j$; otherwise, $A_{ij} = 0$, wherein $i, j \in \{1, \dots, N\}$ and $N$ is the total number of spatial sample points.
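By way of non-limiting illustration, the K-nearest-neighbor construction recited in claim 2 can be sketched in NumPy as follows. The function name `knn_adjacency`, the toy coordinates, and the decision to symmetrize the matrix so that every edge is undirected are assumptions made for this example; the claim does not prescribe an implementation.

```python
import numpy as np

def knn_adjacency(coords: np.ndarray, k: int) -> np.ndarray:
    """Binary spatial adjacency from K nearest neighbors (hypothetical helper)."""
    n = coords.shape[0]
    # Pairwise Euclidean distances between all spatial sample points.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    # Column indices of the k closest points for every row.
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=int)
    rows = np.repeat(np.arange(n), k)
    adj[rows, nearest.ravel()] = 1
    # Assumption: symmetrize so that each edge is undirected.
    return np.maximum(adj, adj.T)

# Toy example: four points on a line, one nearest neighbor each.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
A = knn_adjacency(coords, k=1)
```

For large datasets a tree-based neighbor search would normally replace the dense distance matrix used here.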
3. The spatial transcriptome domain identification method based on a double-decoding encoder and neighborhood contrast according to claim 1, wherein the specific steps of the encoder feature learning based on graph attention in step three are as follows: the gene expression features in the preprocessed gene expression feature matrix of step one and the spatial adjacency graph constructed in step two are input into a graph attention encoder, and the low-dimensional latent representation of each spatial sample point is obtained through a multi-layer graph attention network, wherein the attention mechanism of the graph attention encoder dynamically calculates, for a target spatial sample point and its spatial neighbor sample points, the attention weights of the different neighbor sample points based on node feature similarity, and uses the attention weights to perform weighted aggregation of the feature information of the neighbor sample points;
given the constructed spatial adjacency graph $\mathcal{G}$, each spatial sample point $i$ is regarded as a node of $\mathcal{G}$ and is assigned an initial embedding vector $h_i^{(0)} = x_i$, wherein $x_i$ is the $i$-th row of the gene expression feature matrix $X$ and represents the gene expression features of the corresponding spatial sample point; the multi-layer graph attention network processes the features of the $N$ spatial sample points, wherein the updated feature representation $h_i^{(k)}$ of spatial sample point $i$ at the $k$-th layer, with $k \in \{1, \dots, L\}$ and $L$ representing the number of layers of the multi-layer graph attention network, is calculated as follows:

$$h_i^{(k)} = \sigma\Big(\sum_{j \in S_i} \mathrm{att}_{ij}^{(k)}\, W_k\, h_j^{(k-1)}\Big),$$

wherein $\sigma$ represents the ELU activation function, $S_i$ represents the neighbor set of spatial sample point $i$ in the spatial adjacency graph $\mathcal{G}$, and $W_k$ is the trainable weight matrix of the $k$-th encoder layer; the normalized edge weight $\mathrm{att}_{ij}^{(k)}$ between spatial sample points $i$ and $j$ in the $k$-th graph attention layer is calculated as follows:

$$\mathrm{att}_{ij}^{(k)} = \frac{\exp\big(e_{ij}^{(k)}\big)}{\sum_{l \in S_i} \exp\big(e_{il}^{(k)}\big)},$$

wherein $e_{ij}^{(k)}$ represents the attention coefficient that quantifies the correlation between spatial sample point $i$ and its neighbor spatial sample point $j$ in the $k$-th encoder layer, calculated as follows:

$$e_{ij}^{(k)} = \mathrm{Sigmoid}\Big(\big(v_s^{(k)}\big)^{\top} W_k\, h_i^{(k-1)} + \big(v_r^{(k)}\big)^{\top} W_k\, h_j^{(k-1)}\Big),$$

wherein $\mathrm{Sigmoid}$ represents the sigmoid activation function, $v_s^{(k)}$ is the attention weight parameter of the $k$-th graph attention encoder layer corresponding to the features of the target sample point itself, and $v_r^{(k)}$ is the attention weight parameter of the $k$-th graph attention encoder layer corresponding to the features of the neighbor sample points; the graph attention encoder finally generates a low-dimensional latent representation $Z = [z_1, \dots, z_N]^{\top}$, wherein $z_i = h_i^{(L)}$ represents the features of the $i$-th spatial sample point output by the $L$-th layer of the graph attention encoder.
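By way of non-limiting illustration, a single encoder layer of the kind described in claim 3 can be sketched in NumPy as follows. The sketch follows the wording of the claim: a sigmoid-activated attention coefficient built from a target-point term and a neighbor-point term, normalization over each point's neighbors, and ELU applied to the weighted sum. The names `gat_layer`, `W`, `v_self` and `v_nbr`, the random toy inputs, the softmax form of the normalization, and the self-loops in the toy adjacency matrix are assumptions made for the example.

```python
import numpy as np

def elu(x):
    # ELU activation: x for x > 0, exp(x) - 1 otherwise.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gat_layer(H, A, W, v_self, v_nbr):
    """One graph attention encoder layer (illustrative sketch).

    H: (N, d_in) input features; A: (N, N) binary adjacency;
    W: (d_in, d_out) weight matrix; v_self, v_nbr: (d_out,) attention vectors.
    """
    HW = H @ W  # linear transform of every node
    # e[i, j] = sigmoid(v_self . (W h_i) + v_nbr . (W h_j))
    e = sigmoid((HW @ v_self)[:, None] + (HW @ v_nbr)[None, :])
    # Softmax restricted to each node's neighbors (assumed normalization).
    w = np.exp(e) * A
    att = w / w.sum(axis=1, keepdims=True)
    # Attention-weighted aggregation followed by ELU.
    return elu(att @ HW), att

rng = np.random.default_rng(0)
N, d_in, d_out = 4, 5, 3
H = rng.normal(size=(N, d_in))
A = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]])
W = rng.normal(size=(d_in, d_out))
v_self = rng.normal(size=d_out)
v_nbr = rng.normal(size=d_out)
H_next, att = gat_layer(H, A, W, v_self, v_nbr)
```

Stacking several such layers, each with its own parameters, gives a multi-layer encoder; every node needs at least one neighbor so that the normalization is well defined.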
4. The spatial transcriptome domain identification method based on a double-decoding encoder and neighborhood contrast according to claim 1, wherein the specific steps of the dual-decoder joint reconstruction in step four are as follows: a dual-decoder architecture is adopted to reconstruct the gene expression feature matrix and the spatial adjacency graph simultaneously, so as to obtain a more informative representation of the spatial sample points, wherein the dual-decoder architecture comprises (1) a gene expression decoder for reconstructing the original gene expression data, and (2) a graph structure decoder, which is an inner-product decoder for recovering the topology of the spatial graph, as follows:
the gene expression decoder takes the low-dimensional latent representation $Z$ as input, and its initial layer input is defined as $\hat{h}_i^{(L)} = z_i$, i.e., the final output of the graph attention encoder is taken as the input of the gene expression decoder; the updated feature representation $\hat{h}_i^{(k-1)}$ of spatial sample point $i$ at the $(k-1)$-th layer is calculated as follows:

$$\hat{h}_i^{(k-1)} = \sigma\big(\hat{W}_k\, \hat{h}_i^{(k)}\big),$$

wherein $\sigma$ represents the ELU activation function and $\hat{W}_k$ represents the trainable weight matrix of the $k$-th decoder layer; the last decoder layer generates the reconstructed gene expression matrix $\hat{X}$, wherein $\hat{X} \in \mathbb{R}^{N \times M}$; furthermore, the reconstructed adjacency matrix $\hat{A}$ is obtained from the latent representation $Z$:

$$\hat{A} = \mathrm{Sigmoid}\big(Z Z^{\top}\big),$$

wherein $\mathrm{Sigmoid}$ represents the sigmoid activation function;
the gene expression reconstruction loss function $\mathcal{L}_{\mathrm{rec}}$ quantifies the difference between the preprocessed gene expression matrix $X$ and the reconstructed gene expression matrix $\hat{X}$: [formula not preserved in this text]; and the graph structure reconstruction loss $\mathcal{L}_{\mathrm{graph}}$ is defined as follows: [formula not preserved in this text].
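By way of non-limiting illustration, the inner-product graph decoder and the two reconstruction objectives of claim 4 can be sketched in NumPy as follows. The exact loss formulas are not legible in this text, so the mean squared error used for gene expression and the binary cross-entropy used for the graph structure are assumptions chosen as common defaults, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct_adjacency(Z):
    """Inner-product decoder: sigmoid of pairwise latent inner products."""
    return sigmoid(Z @ Z.T)

def expression_loss(X, X_hat):
    """Assumed form: mean squared error between input and reconstruction."""
    return float(np.mean((X - X_hat) ** 2))

def graph_loss(A, A_hat, eps=1e-7):
    """Assumed form: mean binary cross-entropy between A and its reconstruction."""
    A_hat = np.clip(A_hat, eps, 1.0 - eps)
    return float(-np.mean(A * np.log(A_hat) + (1 - A) * np.log(1 - A_hat)))

# Toy latent space: points 0 and 1 are aligned, point 2 points the other way.
Z = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
A = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]])
A_hat = reconstruct_adjacency(Z)
X = np.array([[0.5, 1.0], [0.0, 2.0], [1.5, 0.0]])
```

In this toy case the reconstructed adjacency exceeds 0.5 exactly where `A` has an edge, so the graph loss is lower for `A` than for its complement.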
5. The spatial transcriptome domain identification method based on a double-decoding encoder and neighborhood contrast according to claim 1, wherein the specific steps of the neighborhood-aware contrastive learning in step five are as follows: in order to obtain a higher-quality low-dimensional latent representation, a neighborhood-aware contrastive learning module is introduced, in which each spatial sample point is taken as an anchor point, positive and negative sample pairs are constructed based on the spatial neighborhood relation, and a contrast loss is calculated;
in the neighborhood-aware contrastive learning module, the input is the low-dimensional latent representation $Z$ obtained from the graph attention encoder; for a given anchor point $i$, positive and negative sample pairs are constructed based on the spatial adjacency graph $\mathcal{G}$: the first $K$ neighbor points of the anchor in the spatial adjacency graph $\mathcal{G}$ form positive sample pairs with it, and the remaining points are taken as negative samples; the neighborhood contrast loss $\ell_i$ of spatial sample point $i$ is calculated as follows: [formula not preserved in this text], wherein $K$ is the number of neighbor spatial sample points in the spatial adjacency graph $\mathcal{G}$, $N$ is the total number of spatial sample points, $\tau$ is the temperature parameter, $z_i$ represents the feature representation of spatial sample point $i$, $z_k$ represents the feature representations of all spatial sample points other than spatial sample point $i$, and $\mathrm{sim}(\cdot, \cdot)$ represents the similarity of a pair of spatial sample points, wherein the similarity is measured by cosine similarity and calculated as follows:

$$\mathrm{sim}(z_i, z_j) = \frac{z_i^{\top} z_j}{\lVert z_i \rVert\, \lVert z_j \rVert};$$

the neighborhood contrast loss function $\mathcal{L}_{\mathrm{con}}$ is defined as follows: [formula not preserved in this text].
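By way of non-limiting illustration, a neighborhood contrast loss of the kind described in claim 5 can be sketched in NumPy as follows. Because the loss formula itself is not legible in this text, the sketch assumes an InfoNCE-style form: for each anchor, the temperature-scaled cosine similarity to each spatial neighbor is contrasted against the similarities to all other points, averaged over the neighbors and then over the anchors. The function names, the default temperature and the toy data are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(Z):
    """Pairwise cosine similarity between the rows of Z."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return Zn @ Zn.T

def neighborhood_contrast_loss(Z, A, tau=0.5):
    """InfoNCE-style loss with spatial neighbors as positives (assumed form).

    Z: (N, d) latent representation; A: (N, N) binary adjacency, no self-loops.
    """
    n = Z.shape[0]
    logits = cosine_similarity_matrix(Z) / tau
    not_self = ~np.eye(n, dtype=bool)
    # Denominator: every point other than the anchor itself.
    denom = (np.exp(logits) * not_self).sum(axis=1, keepdims=True)
    log_prob = logits - np.log(denom)
    # Positives: spatial neighbors of each anchor.
    pos = (A > 0) & not_self
    per_anchor = -(log_prob * pos).sum(axis=1) / pos.sum(axis=1)
    return float(per_anchor.mean())

# Two latent clusters: points {0, 1} and points {2, 3}.
Z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
A_within = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])
A_across = np.array([[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]])
loss_within = neighborhood_contrast_loss(Z, A_within)
loss_across = neighborhood_contrast_loss(Z, A_across)
```

The loss is lower when spatial neighbors are also close in the latent space (`loss_within`) than when they are far apart (`loss_across`), which is the behavior the claim describes. No random perturbation of the expression matrix is involved.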
6. The spatial transcriptome domain identification method based on a double-decoding encoder and neighborhood contrast according to claim 1, wherein the specific steps of the joint loss function optimization in step six are as follows: the gene expression reconstruction loss $\mathcal{L}_{\mathrm{rec}}$, the graph structure reconstruction loss $\mathcal{L}_{\mathrm{graph}}$ and the neighborhood contrast loss function $\mathcal{L}_{\mathrm{con}}$ are jointly optimized, and the joint optimization constitutes the final training objective, specifically as follows: [formula not preserved in this text]; that is, the model parameters are trained by jointly optimizing a weighted combination of the gene expression reconstruction loss $\mathcal{L}_{\mathrm{rec}}$, the graph structure reconstruction loss $\mathcal{L}_{\mathrm{graph}}$ and the neighborhood contrast loss function $\mathcal{L}_{\mathrm{con}}$.
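By way of non-limiting illustration, the weighted combination of the three losses in claim 6 can be sketched as follows. The claim states only that the losses are weighted and jointly optimized; the weight names `w_rec`, `w_graph` and `w_con` and their default values are hypothetical and are not taken from the patent.

```python
def joint_loss(l_rec: float, l_graph: float, l_con: float,
               w_rec: float = 1.0, w_graph: float = 1.0, w_con: float = 1.0) -> float:
    """Weighted sum of the three training losses (weights are hypothetical)."""
    return w_rec * l_rec + w_graph * l_graph + w_con * l_con

# Example: down-weight the graph and contrast terms relative to reconstruction.
total = joint_loss(1.0, 2.0, 3.0, w_graph=0.5, w_con=0.1)
```

In training, this scalar would be minimized by back propagation with respect to the encoder and decoder parameters.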

Description

Spatial transcriptome domain identification method based on double-decoding encoder and neighborhood contrast

Technical Field

The invention relates to the intersection of bioinformatics and artificial intelligence, and in particular to a spatial transcriptome domain identification method based on a double-decoding encoder and neighborhood contrast.

Background

Complex tissues are composed of a variety of cells, and the relative location of their transcriptional expression in tissues is critical for elucidating their biological functions and for deciphering intercellular communication mechanisms [1]. The advent of spatial transcriptomics (ST) technology has fundamentally changed biological research by allowing gene expression profiles to be measured while preserving the spatial context of cells [2]. This technological breakthrough provides deeper insight into cellular microenvironments and regulatory mechanisms across species and diseases, and has been applied to research on the brain, complex diseases and other fields. Current ST techniques fall into two broad categories [3]: imaging-based methods and sequencing-based methods. These techniques have generated a large amount of spatial transcriptomics data that helps to better understand the complex tissue functions of biological systems [4]. The identification of spatial domains, i.e., regions that exhibit similar gene expression profiles and histological features while maintaining spatial adjacency, is the most fundamental and critical step in spatial transcriptome data analysis and remains a significant challenge in this field [5]. By analyzing regional gene expression patterns, this process helps to elucidate cell type localization and intercellular interactions within tissue structures [6]. Accurate identification of spatial domains plays a critical role in understanding the occurrence and progression of disease [7].
To address this challenge economically and efficiently, computational methods have been proposed that model both gene expression data and its inherent spatial dependence. Existing methods can be broadly divided into two categories: the first identifies spatial domains using probabilistic techniques, and the second is based on deep learning. The present invention is a deep learning-based method. In 2021, Hu et al. [8] proposed SpaGCN, which integrates gene expression, spatial proximity and histological similarity, and aggregates the gene expression information of adjacent spots through a graph convolutional network. In 2022, Dong et al. [5] proposed STAGATE, which uses a graph attention autoencoder framework to adaptively learn the similarity of neighboring spots. In 2023, Long et al. [9] proposed GraphST, which combines a graph neural network with self-supervised contrastive learning to derive discriminative spatial embeddings by minimizing the embedding distance between spatially adjacent spots. In 2024, Xu et al. [10] proposed SEDR, which uses a deep autoencoder combined with a masked self-supervised learning mechanism to extract a low-dimensional representation of gene expression and integrates it with spatial information through a variational graph autoencoder. Zhang et al. [11] proposed MuCoST, a multi-view graph contrastive learning framework that strengthens spot dependencies by fusing gene expression dependencies with spatial adjacency, thereby modeling both non-local spatial co-expression dependencies and spatially adjacent dependencies to decode complex tissue structures.
Liang et al. [12] proposed SpaGCAC, which dynamically balances local spatial structure and spot self-features using an adaptive feature-space balanced graph convolutional network, and combines local topology and probability distribution contrastive learning to enhance spatial domain identification. In 2025, Zhang et al. [13] proposed STMHCG, which combines spatial expression enhancement with high-confidence cluster guidance to improve spatial domain identification. Although these methods achieve satisfactory performance, most existing spatial domain identification methods still have certain limitations:

1. Unreasonable negative sample construction
Some existing contrastive learning methods generate negative samples by randomly perturbing the gene expression matrix. This destroys the original neighborhood structure among spatial sample points, so the negative samples deviate from the true spatial topology, the learned feature representation cannot faithfully reflect spatial structure constraints, and the model's ability to discriminate spatial structure is impaired.

2. Focusing only on node feature reconstruction while ignoring graph structure optimization
Most existing methods only reconstruct or encode gene expression features, but do not synchronou