CN-122024857-A - Single-cell multi-omics dimensionality reduction method based on a Gaussian process latent variable model


Abstract

The invention discloses a single-cell multi-omics dimensionality reduction method based on a Gaussian process latent variable model. It addresses two shortcomings of the prior art: existing methods struggle to serve both single-omics and multi-omics data analysis, and they cope poorly with the high sparsity and technical noise of single-cell data. The invention constructs a probabilistic generative framework with a shared latent space, combines variational inference with a sparse Gaussian process regression strategy, and nonlinearly maps data from different modalities into a unified low-dimensional space. Adjustable modality weights are introduced so that the model can efficiently reduce the dimensionality of both single-cell transcriptome data and multi-omics data. The invention can be used to reduce the dimensionality of single-cell sequencing data, and the low-dimensional representation learned by the model can be used for downstream tasks such as cell clustering, visualization, and differential expression analysis. It accurately reveals cell heterogeneity and potential biological regulatory signals and has good application prospects.

Inventors

ZHANG JIE
GUAN LE

Assignees

Nanjing University

Dates

Publication Date: 20260512
Application Date: 20260331

Claims (6)

1. A single-cell multi-omics dimensionality reduction method based on a Gaussian process latent variable model, characterized by comprising the following steps: Step 1, acquiring single-cell transcriptome or multi-omics data and preprocessing the data; Step 2, constructing a single-cell multi-omics Gaussian process latent variable dimensionality reduction model and deriving its objective function; Step 3, initializing the parameters of the model constructed in Step 2; and Step 4, maximizing the objective function with a gradient-based optimization algorithm to train the model and obtain the dimension-reduced single-cell sequencing data.
2. The single-cell multi-omics dimensionality reduction method based on a Gaussian process latent variable model according to claim 1, wherein Step 1 comprises: Step 1.1, feature screening of the omics data: if the input multi-omics data include chromatin accessibility data, the reads are mapped to gene regions to construct a gene activity matrix; for proteome data, all antibody tags are retained; for the transcriptome and for the gene activity matrix converted from chromatin accessibility data, a number of highly variable genes are selected as representative features on the basis of each gene's zero-expression rate and average expression level, where a selected highly variable gene must have a zero-expression rate greater than an exponentially decaying threshold determined by its average expression level; Step 1.2, normalization and logarithmic transformation: genes or antibodies with no counts in any cell are filtered out to remove invalid features; the library size $l_i$ of each cell is calculated and divided by the median of all cell library sizes, $\operatorname{median}(l)$, to obtain the scale factor $s_i = l_i / \operatorname{median}(l)$ of each cell; the raw counts are divided by $s_i$ to remove differences in sequencing depth between cells; and the normalized data are then logarithmically transformed; Step 1.3, standardization: the log-transformed data from Step 1.2 are further scaled to zero mean and unit variance, giving the data matrix of each modality $Y^{(m)} \in \mathbb{R}^{N \times D_m}$, where $N$ is the number of cells and $D_m$ is the number of features of the $m$-th modality.
3. The single-cell multi-omics dimensionality reduction method based on a Gaussian process latent variable model according to claim 2, wherein the condition for selecting highly variable genes in Step 1.1 compares the zero-expression rate of each gene with an exponentially decaying function of its average expression level, where $d_g = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}(y_{ig} = 0)$ denotes the zero-expression rate of gene $g$ across the $N$ cells, $y_{ig}$ is the expression value in row $i$, column $g$ of the modality data matrix, and $\mathbb{I}(\cdot)$ is the indicator function; $\mu_g$ denotes the average expression level of gene $g$ in the cells in which its expression is detected; $a$ is the decay parameter and $b$ is the intercept parameter; and $b$ is adjusted iteratively by binary search so that the number of selected genes reaches a preset value.
4. The single-cell multi-omics dimensionality reduction method based on a Gaussian process latent variable model according to claim 1, wherein Step 2 comprises: Step 2.1, all $M$ modalities share a low-dimensional latent variable $X \in \mathbb{R}^{N \times Q}$, where $Q$ is the dimension of the latent space; the observations $Y^{(m)}$ of each modality are regarded as generated from the shared latent variable through a nonlinear mapping with additive Gaussian noise, expressed as $Y^{(m)} = F^{(m)}(X) + E^{(m)}$, where the element $\epsilon^{(m)}_{ij}$ in row $i$, column $j$ of the noise term $E^{(m)}$ follows an independent and identically distributed Gaussian distribution $\mathcal{N}(0, \beta_m^{-1})$, and $\beta_m$ is the noise precision of the corresponding modality; $F^{(m)}$ is a mapping function with a Gaussian process prior, whose $d$-th dimension satisfies $f^{(m)}_d \sim \mathcal{GP}(0, k(x, x'))$, where $k$ is the kernel function; from this, the marginal likelihood of the $d$-th dimension of the observed data of the $m$-th modality, $y^{(m)}_d$, is $p(y^{(m)}_d \mid X) = \mathcal{N}(y^{(m)}_d \mid 0, K_{XX} + \beta_m^{-1} I_N)$, where $K_{XX}$ is the kernel matrix evaluated on $X$ and $I_N$ is the identity matrix of order $N$; because all features within a modality are mutually independent, the likelihood function of modality $m$ is $p(Y^{(m)} \mid X) = \prod_{d=1}^{D_m} p(y^{(m)}_d \mid X)$; the prior on the latent variable is the standard Gaussian distribution $p(X) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid 0, I_Q)$, where $x_i$ is the latent variable of the $i$-th cell and $I_Q$ is the identity matrix of order $Q$; Step 2.2, a variational distribution $q(X) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid \mu_i, S_i)$ is introduced to approximate the true posterior $p(X \mid Y^{(1)}, \dots, Y^{(M)})$, where the means $\mu_i$ and the diagonal covariance matrices $S_i$ are variational parameters to be optimized; modality weights $w_m$ are introduced to construct the weighted likelihood function $\prod_{m=1}^{M} p(Y^{(m)} \mid X)^{w_m}$; the optimization objective of the single-cell multi-omics Gaussian process latent variable dimensionality reduction model is then to maximize the variational lower bound $\mathcal{L} = \sum_{m=1}^{M} w_m \, \mathbb{E}_{q(X)}[\log p(Y^{(m)} \mid X)] - \mathrm{KL}(q(X) \,\|\, p(X))$, where $\mathrm{KL}(q(X) \,\|\, p(X))$ is the KL divergence between the variational distribution $q(X)$ and the prior distribution $p(X)$; Step 2.3, the variational lower bound is evaluated using sparse Gaussian process regression: inducing points $Z$ and the corresponding inducing variables $U$ are introduced to approximate the full Gaussian process and reduce computational complexity, and a variational distribution $q(U)$ over the inducing variables is introduced to derive a closed-form lower bound on the expected log-likelihood of each modality, in which $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, $\beta_m$ is the noise precision of the $m$-th modality, $K_{ZZ}$ is the covariance matrix between the inducing points, and $K_{XZ}$ and $K_{ZX}$ are the cross-covariance matrices between the latent variables and the inducing points; substituting this bound into the variational lower bound of Step 2.2 gives the final lower bound on the objective function to be optimized, in which the kernel matrices enter through their expectations under the variational distribution $q(X)$.
5. The single-cell multi-omics dimensionality reduction method based on a Gaussian process latent variable model according to claim 1, wherein Step 3 comprises: Step 3.1, initializing hyperparameters: the hyperparameters to be set for the single-cell multi-omics Gaussian process latent variable dimensionality reduction model comprise the number of inducing points, the latent space dimension $Q$, the noise precision $\beta_m$ of each modality, and the modality weights $w_m$; if the data acquired in Step 1 are single-omics data, the modality weights of the other omics are set to zero; if the acquired data are multi-omics data, each modality weight is given an initial value, and the specific values can be adjusted according to data scale and prior knowledge; Step 3.2, initializing the latent variables: the preprocessed multi-omics data are concatenated along the feature dimension, weighted by the modality weights $w_m$; principal component analysis is performed on the concatenated matrix, and the first $Q$ principal components are taken as the initial value of the variational mean $\mu$; the initial values of the variational covariance $S$ are sampled randomly from a uniform distribution; Step 3.3, initializing the kernel function parameters: initial parameter values are set according to the kernel function actually used; if a kernel with automatic relevance determination is used, the proportion of variance explained by each principal component in the principal component analysis of Step 3.2 is computed and its reciprocal is taken as the initial value of the kernel length scale; Step 3.4, initializing the inducing points: K-means clustering is performed on the initial variational means obtained in Step 3.2, and the resulting cluster centers are used as the initial values of the inducing points $Z$.
6. The single-cell multi-omics dimensionality reduction method based on a Gaussian process latent variable model according to claim 1, wherein Step 4 comprises: taking the parameters set in Step 3 as the starting point, computing the objective function of Step 2.3 and iteratively updating the model parameters with a gradient-based optimization algorithm, the model parameters comprising the variational parameters $\mu$ and $S$, the kernel parameters, the inducing points, and the noise precisions $\beta_m$, until the objective function value satisfies a preset convergence condition or the maximum number of iterations is reached; after training, the mean $\mu$ of the variational distribution $q(X)$ is taken as the final low-dimensional representation of each cell, completing the dimensionality reduction of the single-cell multi-omics data.
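For illustration only, the preprocessing of Step 1 (claims 2 and 3) can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the claimed implementation. The function names `select_features` and `normalize_log_scale` are hypothetical. The threshold form `exp(-decay * (mean_expr - intercept))`, the use of `log(1 + x)` as the logarithmic transformation, and the log scale of the mean expression are assumptions, because the corresponding formulas are not legible in this text.

```python
import numpy as np


def select_features(counts, n_genes, decay=1.0, n_iter=60):
    """Select roughly n_genes columns by zero rate versus mean expression.

    ASSUMED threshold form (the exact formula is not given in the text):
        zero_rate > exp(-decay * (mean_expr - intercept))
    The intercept is tuned by binary search, as described in claim 3.
    """
    counts = np.asarray(counts, dtype=float)
    detected = counts > 0
    n_detected = detected.sum(axis=0)
    zero_rate = 1.0 - n_detected / counts.shape[0]
    # Mean log-expression over the cells in which the gene is detected
    # (the log scale is an assumption).
    mean_expr = (np.log1p(counts) * detected).sum(axis=0) / np.maximum(n_detected, 1)

    lo, hi = -10.0, mean_expr.max() + 10.0  # search bracket for the intercept
    selected = np.zeros(counts.shape[1], dtype=bool)
    intercept = lo
    for _ in range(n_iter):
        intercept = 0.5 * (lo + hi)
        threshold = np.exp(-decay * (mean_expr - intercept))
        # Genes never detected are excluded from selection.
        selected = (zero_rate > threshold) & (n_detected > 0)
        if selected.sum() == n_genes:
            break
        if selected.sum() > n_genes:
            lo = intercept  # too many genes: raise the threshold curve
        else:
            hi = intercept  # too few genes: lower the threshold curve
    return np.flatnonzero(selected), intercept


def normalize_log_scale(counts):
    """Steps 1.2 and 1.3: depth normalization, log transform, standardization.

    Assumes every cell has at least one count, so no scale factor is zero.
    """
    counts = np.asarray(counts, dtype=float)
    counts = counts[:, counts.sum(axis=0) > 0]   # drop features with no counts
    library_size = counts.sum(axis=1)            # one value per cell
    scale = library_size / np.median(library_size)
    logged = np.log1p(counts / scale[:, None])   # assumption: log(1 + x)
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    std[std == 0] = 1.0                          # guard constant features
    return (logged - mean) / std
```

A typical call would first pick feature indices with `select_features(counts, n_genes)`, subset the count matrix to those columns, and then pass the result to `normalize_log_scale` to obtain one standardized matrix per modality. If several genes share the same zero rate and mean expression, the binary search may not reach the requested number exactly and returns the selection from its last iteration.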

Description

Single-cell multi-omics dimensionality reduction method based on a Gaussian process latent variable model

Technical Field

The invention relates to the technical field of single-cell sequencing data analysis, in particular to a single-cell multi-omics dimensionality reduction method based on a Gaussian process latent variable model, and belongs to the intersection of bioinformatics and machine learning.

Background

Single-cell transcriptome sequencing (scRNA-seq) raises the resolution of transcriptome analysis to the single-cell level. It can quantify gene expression in thousands of cells and provides important technical support for understanding cell behavior, revealing cell differentiation trajectories, and discovering novel cell subtypes. However, because the RNA content of an individual cell is extremely low and detection is limited by sequencing depth and reverse transcription efficiency, some lowly expressed genes are difficult to detect, which produces a large number of zeros in the count matrix. In addition, unavoidable technical noise in sequencing experiments further masks the real biological signal. As a result, single-cell transcriptome sequencing data exhibit complex statistical characteristics such as high dimensionality, high sparsity, high noise, and overdispersion, which markedly increases the difficulty of downstream tasks such as dimensionality reduction, clustering, and differential expression analysis. With the continued iteration of single-cell sequencing technology, the research perspective has extended from the transcriptome alone to multi-modal joint analysis. Cell state is regulated jointly at multiple levels, including the transcriptome, the epigenome, and the proteome.
It is difficult to fully resolve the molecular regulatory mechanisms of cells using single-cell transcriptome sequencing data alone, so measuring multiple layers of omics information simultaneously in the same cell is an important direction for single-cell research. In recent years, researchers have developed a number of multi-modal sequencing technologies. For example, CITE-seq uses antibody-derived tags to detect mRNA expression and cell-surface protein abundance simultaneously in the same cell, and SMAGE-seq simultaneously acquires the gene expression profile and chromatin accessibility information of a single cell, providing a unique view of the association between epigenetic regulation and transcriptional output. Data of different modalities each have their own strengths and are intrinsically complementary. The protein modality generally has a lower dropout rate and can partly compensate for the information loss in transcriptome data, but its detection throughput is limited and its ability to discover rare cell subtypes is insufficient; transcriptome data cover high-dimensional, genome-wide features and help detect fine cell subtypes; and chromatin accessibility information reflects cell-type-specific regulatory activity. Integrating multi-omics data can therefore, in theory, improve the reliability and accuracy of analysis results and is expected to achieve higher cell-typing resolution than single-omics analysis. Multi-modal sequencing provides more comprehensive molecular information for single-cell research, but the complexity and specificity of multi-omics data also pose new challenges for the field.
How to efficiently capture the intrinsic relationships among different omics data and achieve effective integration and dimensionality reduction is a core difficulty of multi-omics analysis. Existing methods have different emphases but share common shortcomings. Seurat is a multi-omics integration method based on similarity network fusion; it uses a weighted nearest neighbor (WNN) algorithm to assign modality weights to each cell and builds a cell-cell similarity graph through weighted combination to integrate multi-omics information. Its design does not fully account for the high dropout rate of the data, and its similarity calculation is easily disturbed by noise. scMDC is a deep learning method that jointly optimizes embedding learning and a clustering objective; it uses a single encoder to process the concatenated multi-modal data and multiple decoders to reconstruct each modality on the basis of the zero-inflated negative binomial distribution. The method achieves end-to-end feature learning and clustering optimization and attains leading performance on downstream clustering tasks, but its model architecture and optimization objective depend on multi-modal input and are difficult to reconcile with the analysis of single-omics data. Di