CN-121997037-A - Ultra-high-resolution spatial transcriptome subcellular expression feature extraction method and application thereof

Abstract

The invention relates to a method for extracting subcellular expression features from an ultra-high-resolution spatial transcriptome, and to an application thereof. The feature extraction method comprises: loading spatial transcriptome data comprising a gene expression matrix and a spatial coordinate file; performing quality control, normalization, and logarithmic transformation on the spatial transcriptome data; packaging the data into a plurality of overlapping image data tiles using a Tiling strategy; and, taking the overlapping image data tiles as input, using a feature extraction model based on the ViT-L architecture to output a bin1-level feature representation, namely the CLS Token and Patch Tokens of each image data tile. Compared with the prior art, the method offers strong representational capability, requires no annotation, and is computationally efficient.
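
As an illustration of the normalization, logarithmic transformation, and Tiling steps summarized above, the following Python sketch may help. It is a minimal example under stated assumptions, not the patented implementation: the function names `normalize_log1p` and `tile_grid`, the parameters `target_sum`, `tile_size`, and `overlap`, the dictionary-based grid layout, per-spot total-count scaling, and the scaling target of 10,000 are all hypothetical choices that the source does not specify.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[int, int]   # (row, col) of a bin1 sequencing spot on the grid
Vector = List[float]      # one expression value per gene


def normalize_log1p(counts: Vector, target_sum: float = 1e4) -> Vector:
    """Scale one spot to `target_sum` total counts, then apply log(1 + x).

    Assumption: the source only says "normalization and logarithmic
    transformation"; total-count scaling is one common choice.
    """
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.log1p(c * target_sum / total) for c in counts]


def tile_grid(
    spots: Dict[Coord, Vector],
    n_genes: int,
    grid_shape: Tuple[int, int],
    tile_size: int,
    overlap: int,
) -> Dict[Coord, List[List[Vector]]]:
    """Cut the grid into square tiles that share `overlap` spots with neighbours.

    Grid positions without a measured spot (no tissue, or past the grid edge)
    are filled with an all-zero expression vector.
    """
    if not 0 <= overlap < tile_size:
        raise ValueError("overlap must satisfy 0 <= overlap < tile_size")
    n_rows, n_cols = grid_shape
    stride = tile_size - overlap
    zero = [0.0] * n_genes
    tiles: Dict[Coord, List[List[Vector]]] = {}
    for top in range(0, max(n_rows - overlap, 1), stride):
        for left in range(0, max(n_cols - overlap, 1), stride):
            tiles[(top, left)] = [
                [list(spots.get((top + r, left + c), zero)) for c in range(tile_size)]
                for r in range(tile_size)
            ]
    return tiles


# Toy data: a 6 x 6 grid, 2 genes, only two spots covered by tissue.
raw = {(0, 0): [1.0, 3.0], (1, 2): [0.0, 2.0]}
spots = {rc: normalize_log1p(v) for rc, v in raw.items()}
tiles = tile_grid(spots, n_genes=2, grid_shape=(6, 6), tile_size=4, overlap=2)
```

Each tile is keyed by the grid coordinate of its top-left spot and holds a `tile_size` x `tile_size` x `n_genes` nested list. With a stride of `tile_size - overlap`, the spot at (1, 2) appears in both the tile starting at (0, 0) and the tile starting at (0, 2), which is the contextual continuity the overlap is meant to preserve.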

Inventors

LV HUI
Zhong Bingxu
KONG YAN
REN YONGYONG
ZHANG YUJIA
LI HAONAN
CHEN JIA

Assignees

Shanghai Jiao Tong University (上海交通大学)

Dates

Publication Date: 20260508
Application Date: 20260122

Claims (10)

1. A method for extracting subcellular expression features from an ultra-high-resolution spatial transcriptome, characterized by comprising the following steps: S1, loading spatial transcriptome data comprising a gene expression matrix and a spatial coordinate file; S2, performing quality control on the spatial transcriptome data; S3, performing normalization and logarithmic transformation on the gene expression matrix of the quality-controlled spatial transcriptome data; S4, packaging the spatial transcriptome data processed in step S3 into a plurality of overlapping image data tiles using a Tiling strategy; S5, taking the overlapping image data tiles as input and using a feature extraction model based on the ViT-L architecture to output a bin1-level feature representation, namely the CLS Token and Patch Tokens of each image data tile.
2. The method for extracting subcellular expression features of an ultra-high-resolution spatial transcriptome according to claim 1, wherein the quality control specifically comprises filtering sequencing spots according to the total count and the total number of genes detected at each sequencing spot, and filtering genes according to the number of sequencing spots in which each gene is detected, so as to remove low-quality data points and low-abundance genes.
3. The method for extracting subcellular expression features of an ultra-high-resolution spatial transcriptome according to claim 1, wherein step S4 specifically comprises the following steps: S41, mapping the normalized and log-transformed gene expression matrix back onto the corresponding spatial coordinate grid; S42, slicing the spatial coordinate grid to generate a plurality of image data tiles [tile-size expression missing from the source text]; S43, to preserve the continuity of contextual information, providing an overlap region of a preset number of sequencing spots between adjacent tiles; S44, for sequencing spots in the spatial coordinate grid that are not covered by tissue, filling the corresponding gene expression vector with all zeros, thereby converting the spatial transcriptome data into a dataset of image data tiles [dataset and tensor-shape expressions missing from the source text], wherein each tile is a tensor whose gene dimension equals the number of genes remaining after quality control.
4. The method for extracting subcellular expression features of an ultra-high-resolution spatial transcriptome according to claim 1, wherein data augmentation is performed on the overlapping image data tiles packaged by the Tiling strategy to generate a training set, on which the feature extraction model based on the ViT-L architecture is trained, the data augmentation specifically comprising the following operations: random cropping, namely generating global views and local views of an image data tile by random cropping; random flipping, namely flipping the image data tile horizontally with a preset probability and then vertically with a preset probability; and numerical perturbation, namely multiplying the gene expression values in the image data tile by a random factor sampled from a uniform distribution.
5. The method of claim 1, wherein the feature extraction model uses the ViT-L architecture as its backbone network and comprises an encoder formed by stacking multiple Transformer blocks, wherein the absolute position encoding of the conventional ViT is replaced with rotary position encoding with perturbation, and the multi-layer perceptron in the feed-forward network of the conventional ViT Transformer block is replaced with a SwiGLUFFN module.
6. The method for extracting subcellular expression features of an ultra-high-resolution spatial transcriptome according to claim 1, wherein the feature extraction model is trained using a self-distillation self-supervised learning paradigm, comprising the following steps: defining a student network and a teacher network having the same ViT-L architecture; updating the student network weights; updating the teacher network weights as an exponential moving average of the student network weights, governed by a momentum coefficient (the weight symbols and the update formula are missing from the source text); inputting the augmented global views into the teacher network, inputting the augmented global views and local views into the student network, and training the model with a composite loss function, wherein the composite loss function is a weighted sum of a global distillation loss, a masked expression modeling loss, and a feature regularization loss; the global distillation loss is the cross entropy between the local-view representation output by the student network and the global-view representation output by the teacher network; the masked expression modeling loss is obtained by randomly masking the image data tiles input to the student network and then computing the cross entropy between the student network's predicted representation of each masked patch and the teacher network's representation of the same patch without masking; and the feature regularization loss is computed from the KL divergence among the feature vectors within a batch.
7. The method for extracting subcellular expression features of an ultra-high-resolution spatial transcriptome according to claim 6, wherein the global distillation loss is expressed by a formula whose expression and symbols are missing from the source text; the accompanying definitions identify the following terms: the global distillation loss; an indexed image data tile; the dataset of image data tiles; an indexed local data tile view cropped from that tile; a partial data tile view cropped from that tile; the output probabilities of the teacher network and of the student network after softmax normalization; the prototype probability distribution output by the teacher model for the indexed image data tile; and the prototype probability distribution output by the student model for an indexed local data tile view of that image data tile.
8. The method for extracting subcellular expression features of an ultra-high-resolution spatial transcriptome according to claim 1, wherein the masked expression modeling loss is expressed by a formula whose expression and symbols are missing from the source text; the accompanying definitions identify the following terms: the masked expression modeling loss; an indexed image data tile; the dataset of image data tiles; the m-th masked patch of that image data tile; the mask pattern of that tile; the output probabilities of the teacher network and of the student network after softmax normalization; the prototype probability distribution output by the teacher model for the m-th masked patch of the image data tile; and the prototype probability distribution output by the student model for the m-th masked patch of the image data tile.
9. An application of the method for extracting subcellular expression features of an ultra-high-resolution spatial transcriptome, characterized by comprising the following steps: obtaining a bin1-level feature representation using the method of any one of claims 1-8; reducing the dimensionality of the bin1-level feature representation by principal component analysis; clustering the dimension-reduced bin1-level feature representation using a graph clustering algorithm or the K-Means algorithm; and, based on the clustering result, identifying the highly specific, highly expressed marker genes of each cluster by differential expression analysis, so as to reveal the biological functions and the subcellular or cell-type composition of different spatial regions.
10. The application of the method for extracting subcellular expression features of an ultra-high-resolution spatial transcriptome according to claim 9, further comprising: further reducing the dimensionality of the bin1-level feature representation to obtain a 3-dimensional feature representation; applying min-max normalization to each of the three principal components of the 3-dimensional feature representation, scaling its range to the interval [0, 1]; and mapping the three normalized dimensions directly to the RGB color channels to generate a pseudo-color spatial image that displays the biological structures defined by gene expression patterns within the tissue, without any prior knowledge or cell annotation.
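
The teacher update and distillation loss recited in claim 6 can be illustrated with a short Python sketch. Because the update formula and its symbols do not survive in this text, the sketch assumes the conventional exponential-moving-average form, teacher = momentum * teacher + (1 - momentum) * student, and a plain cross entropy between softmax-normalized outputs. The names `ema_update`, `softmax`, and `cross_entropy`, and the flat dictionary of weights, are hypothetical and chosen only for this example.

```python
import math
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def ema_update(teacher: Weights, student: Weights, momentum: float) -> Weights:
    """Return teacher weights moved toward the student by an EMA step.

    Assumed form: new_teacher = momentum * teacher + (1 - momentum) * student.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    return {
        name: [
            momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over prototype logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(teacher_probs: List[float], student_probs: List[float]) -> float:
    """Cross entropy H(teacher, student) for one teacher/student view pair."""
    return -sum(
        p * math.log(q) for p, q in zip(teacher_probs, student_probs) if p > 0.0
    )


# One EMA step with a momentum of 0.9 on a toy two-value parameter.
teacher = {"w": [1.0, 0.0]}
student = {"w": [0.0, 1.0]}
new_teacher = ema_update(teacher, student, momentum=0.9)

# Loss for one (global view -> teacher, local view -> student) pair.
agree = cross_entropy(softmax([2.0, 0.0]), softmax([2.0, 0.0]))
disagree = cross_entropy(softmax([2.0, 0.0]), softmax([0.0, 2.0]))
```

With a momentum close to 1 the teacher changes slowly, which is the usual reason for using a moving average as the distillation target. The cross entropy is smallest when the student's distribution matches the teacher's, so `disagree` exceeds `agree` in the toy example.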
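
The pseudo-color mapping recited in claim 10 can be sketched as follows. This is a minimal illustration that assumes the three principal components have already been computed for each bin1 spot; the names `minmax_scale` and `components_to_rgb`, the handling of a constant component, and the final quantization to 8-bit channel values are assumptions of this example, since the claim only requires scaling to [0, 1] and mapping to RGB channels.

```python
from typing import List, Tuple


def minmax_scale(values: List[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant component carries no contrast; map it to 0 (assumption).
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def components_to_rgb(
    components: List[Tuple[float, float, float]]
) -> List[Tuple[int, int, int]]:
    """Map per-spot 3-D features to RGB, normalizing each component separately."""
    channels = [minmax_scale([row[k] for row in components]) for k in range(3)]
    return [
        (
            round(255 * channels[0][i]),
            round(255 * channels[1][i]),
            round(255 * channels[2][i]),
        )
        for i in range(len(components))
    ]


# Three spots; the third component is constant across spots.
components = [(0.0, 0.0, 5.0), (1.0, 1.0, 5.0), (5.0, 4.0, 5.0)]
rgb = components_to_rgb(components)
```

Normalizing each component independently, as the claim describes, means every channel uses its full range, so spots with similar learned features receive similar colors regardless of the absolute scale of each component.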

Description

Ultra-high-resolution spatial transcriptome subcellular expression feature extraction method and application thereof

Technical Field

The invention relates to a spatial transcriptome data processing method, and in particular to a method for extracting subcellular expression features from an ultra-high-resolution spatial transcriptome and an application thereof.

Background

In recent years, spatial transcriptomics technology has advanced dramatically, allowing researchers to measure gene expression at high throughput in the in situ context of tissue. This provides an unprecedented view for understanding tissue heterogeneity, the cellular microenvironment, and disease progression. Recently, the resolution of spatial transcriptome techniques has increased to the subcellular level; with the advent of Visium HD, for example, it has become possible to capture millions of sequencing spots at subcellular resolution on a single tissue section. This technological advance greatly expands the scale of subcellular spatial data. However, despite the breakthrough in hardware, the corresponding computational analysis methods lag behind.

Chinese patent CN120220805A discloses a method and system for subcellular spatial transcriptome functional analysis, comprising automated processing of sample information, optimization of spatial transcriptome expression matrix data, gene ID conversion and selection of cell unit type, statistics and visualization of spatial data features, normalization and cluster analysis, and subcellular spatial transcriptome functional analysis. By means of a multi-module automated workflow, it significantly improves the analysis efficiency, accuracy, and biological interpretability of spatial transcriptome and single-cell data.

However, prior art of this kind suffers from the following drawbacks when processing ultra-high-resolution spatial transcriptome data:

1. Data sparsity and high dimensionality.
At subcellular resolution, the number of RNA molecules captured per sequencing spot (bin) is very small, so the gene expression matrix is extremely sparse. At the same time, the huge number of sequencing spots and the high-dimensional gene features pose great computational challenges.

2. Mismatch of analysis dimensions.
Existing spatial transcriptome analysis algorithms (e.g. cell-based or spatial-domain-based algorithms) have difficulty processing data directly at the level of a single sequencing spot (i.e. the bin1 level). To overcome sparsity and computational complexity, they often adopt data aggregation strategies, such as aggregating multiple adjacent bins into one unit (e.g. the cellbin or binX strategies).

3. Resolution loss.
The analysis and sequencing-spot aggregation strategies described above reduce the difficulty of data analysis but sacrifice much of the ultra-high resolution that the original technology can achieve. Fine subcellular spatial structures and gene expression patterns are consequently smoothed out or masked, so the information gain of the new technology is not fully exploited.

In view of the foregoing, there is a strong need in the art for a method that can analyze ultra-high-resolution spatial transcriptome data directly at the bin1 level, so as to fully exploit biological information at the subcellular spatial scale.

Disclosure of Invention

The invention aims to provide a method for extracting subcellular expression features from an ultra-high-resolution spatial transcriptome and an application thereof, so as to enable analysis of ultra-high-resolution spatial transcriptome data at the bin1 level.
The aim of the invention can be achieved by the following technical scheme:

A method for extracting subcellular expression features from an ultra-high-resolution spatial transcriptome comprises the following steps:
S1, loading spatial transcriptome data comprising a gene expression matrix and a spatial coordinate file;
S2, performing quality control on the spatial transcriptome data;
S3, performing normalization and logarithmic transformation on the gene expression matrix of the quality-controlled spatial transcriptome data;
S4, packaging the spatial transcriptome data processed in step S3 into a plurality of overlapping image data tiles using a Tiling strategy;
S5, taking the overlapping image data tiles as input and using a feature extraction model based on the ViT-L architecture to output a bin1-level feature representation, namely the CLS Token and Patch Tokens of each image data tile.

The quality control specifically comprises filtering sequencing spots according to the total count and the total number of genes detected at each sequencing spot, and filtering genes according to the number of sequencing spots in which each gene is detected, so as to remove low-quality data points and low-abundance genes.

Step S4 specifically comprises the following steps: S41, mapping the