CN-122024858-A - Spatial multi-omics analysis method based on a probabilistic framework and the principle of generalized principal component analysis


Abstract

The invention discloses a spatial multi-omics analysis method based on a probabilistic framework and the principle of generalized principal component analysis. It addresses the core challenges of spatial multi-omics data (high dimensionality, modal heterogeneity and spatial complexity) and achieves efficient integration and accurate spatial analysis of multi-modal data. The method comprises: preprocessing the data, including feature screening, normalization and conversion to matrix format; during model training, assigning an independent spatial Gaussian process prior to each latent factor and using a weighted joint likelihood function for multi-modal collaborative modeling; completing efficient parameter inference through stepwise iterative optimization; outputting a low-dimensional latent factor matrix that integrates spatial information and multi-modal features; and performing standardized biological analyses such as spatial domain detection, quantitative evaluation of clustering, spatial visualization, differential feature screening, functional enrichment and cell interaction analysis.

Inventors

ZHANG JIE
HE JIANUO

Assignees

Nanjing University (南京大学)

Dates

Publication Date: 20260512
Application Date: 20260402

Claims (6)

1. A spatial multi-omics analysis method based on a probabilistic framework and the principle of generalized principal component analysis, characterized by comprising the following steps: Step 1, data preprocessing: acquiring the data of each modality, removing unqualified samples, performing targeted feature screening for the different omics modalities, and normalizing and format-converting the multi-modal data; Step 2, model construction: constructing a spatial multi-omics probabilistic model with the measurement matrices of the preprocessed data and the spatial coordinates as inputs; Step 3, parameter inference: based on the spatial multi-omics probabilistic model constructed in step 2, sequentially inferring the factor loading matrices, noise variances and spatial covariance parameters using a stepwise iteration and block optimization strategy, and finally outputting a stable, converged low-dimensional latent factor representation; Step 4, downstream biological analysis: taking the low-dimensional latent factors output by step 3 as input to complete standardized spatial multi-omics biological analysis.
2. The spatial multi-omics analysis method based on a probabilistic framework and the principle of generalized principal component analysis according to claim 1, wherein step 1 comprises the following specific steps: Step 1.1, data acquisition and sample quality control: obtaining any spatial multi-omics dataset containing transcriptome, proteome or epigenome data together with the two-dimensional or three-dimensional spatial coordinates of all measured spots; removing invalid samples; removing samples with missing or abnormal spatial coordinates; and detecting and removing, through outlier detection, samples that deviate from the overall data distribution; Step 1.2, modality-specific feature screening and cross-modal alignment: performing targeted feature screening for the different omics modalities: for the transcriptome modality, preferentially selecting spatially variable genes, or supplementing them with highly variable genes, and removing genes identified as lowly expressed or unexpressed; for the proteome and epigenome modalities, removing only the designated low-detection-rate features; after screening, performing a cross-modal consistency check and aligning the sample dimension of each modality so that samples correspond one-to-one in multi-modal joint modeling; Step 1.3, multi-modal data normalization and format conversion: selecting a normalization method according to the characteristics of each modality's data to eliminate differences in scale and technical bias: for conventional transcriptome data, log-normalization; for highly sparse, lowly expressed transcriptome data, regularized negative binomial transformation; for analysis scenarios requiring variance stabilization, a variance-stabilizing transformation; on completion of the whole preprocessing flow, the normalized molecular feature data and the spatial coordinate information serve as the input of the spatial multi-omics probabilistic model.
3. The spatial multi-omics analysis method based on a probabilistic framework and the principle of generalized principal component analysis according to claim 1, wherein in step 2 the measurement matrices preprocessed in step 1 and the spatial coordinates are taken as input to construct the spatial multi-omics probabilistic model, in which the observation vector of the k-th modality is modeled through a modality-specific factor loading matrix acting on a D-dimensional latent factor vector shared by all modalities; step 2 comprises the following steps: Step 2.1, setting the spatial Gaussian process prior: in the spatial multi-omics probabilistic model, each latent factor is assigned its own independent spatial Gaussian process prior, with an independent spatial covariance matrix corresponding to each latent factor; Step 2.2, constructing the covariance matrix: the spatial covariance matrix is constructed with a Matern kernel function, which balances smoothness and numerical stability; in the covariance matrix element expression for each latent factor, an independently learnable length-scale parameter for that latent factor controls the decay rate of spatial correlation; Step 2.3, weighted joint likelihood function: a weighted joint log-likelihood is constructed, with the weight of each modality defaulting to 1; its quantities are the weight of the k-th modality, the spatial covariance length-scale parameters, the noise variance of each modality, and the factor loading matrices.
4. The spatial multi-omics analysis method based on a probabilistic framework and the principle of generalized principal component analysis according to claim 1, wherein in step 3, based on the spatial multi-omics probabilistic model constructed in step 2, a stepwise iteration and block optimization strategy is used to sequentially infer the factor loading matrices, noise variances and spatial covariance parameters and finally output a stable, converged low-dimensional latent factor representation, comprising the following steps: Step 3.1, parameter initialization: initializing the spatial covariance length-scale parameters and the noise variance of each modality; Step 3.2, intra-modality parameter update: with the spatial covariance length-scale parameters fixed, performing iterative optimization separately for each modality k: Step 3.2.1, updating the factor loading matrix: under an orthogonality constraint, maximizing the marginal log-likelihood to obtain the estimate of the factor loading matrix, where each column vector of the loading matrix is the loading vector of the corresponding latent factor in the k-th modality; the eigenvector matrix is obtained by eigendecomposition of the spatial covariance matrix corresponding to each latent factor; the i-th eigenvalue of that covariance matrix represents the spatial variance contribution of the latent factor on sample i; and a diagonal matrix is formed with these quantities as its diagonal elements; Step 3.2.2, updating the noise variance: once the factor loading matrix has been obtained, minimizing the negative marginal log-likelihood by differentiating the objective function and solving for the zero of the derivative with Brent's method, yielding the optimal noise variance of each modality; Step 3.3, updating the covariance parameters: with the factor loading matrices and noise variances fixed, optimizing the length-scale parameter of each latent factor independently and in parallel, based on the weighted joint log-likelihood; Step 3.4, global iteration and convergence decision: cyclically executing the intra-modality parameter update and the spatial covariance parameter update until the change in log-likelihood is smaller than a preset threshold, at which point the model is judged to have converged; Step 3.5, latent factor posterior inference: after convergence, using all parameters to derive the posterior distribution of the latent factors and taking the posterior expectation as the final output, to obtain a unified low-dimensional latent factor matrix that integrates both multi-modal information and spatial information.
5. The spatial multi-omics analysis method based on a probabilistic framework and the principle of generalized principal component analysis according to claim 1, characterized in that step 4 comprises the following steps: Step 4.1, spatial domain detection: taking as input the low-dimensional latent factor matrix output in step 3, which integrates spatial information and multi-modal features, performing unsupervised clustering of all spatial spots with the K-means clustering algorithm, dividing the tissue into spatially continuous and functionally homogeneous spatial domains, and obtaining the spatial domain label of each spot, the number of clusters K being set directly to the number of true regions or of cell types; Step 4.2, quantitative evaluation of clustering performance and spatial structure: quantifying the agreement between the clustering result and the true cell types or tissue regions with the ARI, NMI and AMI indices, evaluating the global aggregation of the spatial domains with Moran's I, measuring the homogeneity of local regions with the LISI index, and thereby comprehensively evaluating the model's accuracy in capturing spatial structure and the reliability of the clustering; Step 4.3, multi-dimensional spatial visualization: mapping the spatial domain labels, latent factor distributions and feature expression levels onto the original tissue spatial coordinates to generate spatial domain maps and feature spatial heat maps, while reducing the low-dimensional latent factors to two dimensions with the UMAP algorithm and drawing a scatter plot of the spatial domain distribution, to visually present the tissue spatial architecture and the distribution patterns of molecular features; Step 4.4, screening of spatially specific differential features: with the spatial domains as groups, using the Wilcoxon rank-sum test to screen for genes, proteins or epigenetic modification sites that are significantly differentially expressed in each spatial domain, setting the test confidence level to 95%, corresponding to a corrected P-value threshold of P < 0.05, and applying a dual filter by additionally requiring a fold change greater than 0.25, so as to strictly control the false positive rate and obtain a list of spatially specific molecular markers for subsequent functional analysis and biological validation; Step 4.5, functional enrichment analysis and biological annotation: performing functional enrichment analysis of the screened spatially specific markers based on the GO, KEGG and MSigDB databases, identifying the biological processes, signaling pathways and cell-type signatures significantly enriched in each spatial domain, and completing the biological functional annotation in combination with the tissue anatomy to reveal the physiological and pathological significance of the spatial domains.
6. A spatial multi-omics analysis system based on a probabilistic framework and the principle of generalized principal component analysis, characterized by comprising a data preprocessing module, a model training and inference module, and a downstream analysis module; the data preprocessing module acquires spatial multi-omics data; removes invalid samples and samples with missing or abnormal coordinate information; removes samples deviating from the overall distribution through outlier detection; performs modality-specific, stratified feature screening for the different omics modalities, namely, for the transcriptome, preferentially selecting spatially variable genes or supplementing them with highly variable genes and removing invalid genes, and, for the proteome and epigenome modalities, removing low-detection-rate features that do not meet a preset threshold while otherwise retaining the complete feature set; performs a cross-modal consistency check and aligns the screened sample sets of all modalities to ensure consistent sample dimensions; normalizes the transcriptome, proteome and epigenome data according to their respective data characteristics, including by regularized negative binomial transformation; and fuses the molecular feature data with the spatial coordinate information and converts them into matrix format; the model training and inference module takes the preprocessed multi-modal matrices as input, constructs the spatial multi-omics probabilistic model, sets an independent spatial Gaussian process prior for each latent factor to capture heterogeneous spatial correlation, achieves multi-modal collaborative modeling through the weighted joint likelihood, completes parameter inference through stepwise iterative optimization, and finally outputs a low-dimensional latent factor matrix integrating spatial information and multi-modal features, providing the core representation for downstream analysis; the downstream analysis module takes as its core input the low-dimensional latent factor matrix output by the model training and inference module, which integrates spatial information and multi-modal features, completes in sequence spatial domain detection, quantitative evaluation of clustering, spatial visualization, differential feature screening, functional enrichment analysis and cell interaction analysis, and finally outputs standardized biological results that can be used directly in scientific papers and research projects.
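As an illustration of the log-normalization named in step 1.3 of claim 2, the following is a minimal sketch only. The function name `log_normalize`, the scaling constant of 10,000 and the toy count matrix are assumptions made for the example and do not come from the claims.

```python
import math


def log_normalize(counts, scale_factor=10_000.0):
    """Log-normalize a spots-by-features count matrix given as a list of rows.

    Each row is rescaled to `scale_factor` total counts and passed through
    log(1 + x). `scale_factor` is an assumed default, not a value from the text.
    """
    normalized = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # An all-zero spot stays all-zero instead of dividing by zero.
            normalized.append([0.0] * len(row))
            continue
        normalized.append([math.log1p(x / total * scale_factor) for x in row])
    return normalized


# Toy counts: 3 spots x 3 features (made-up numbers for illustration).
counts = [[1, 3, 0], [0, 0, 0], [5, 5, 10]]
norm = log_normalize(counts)
```

Zero counts remain zero after the transform, so the sparsity pattern of the input matrix is preserved.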
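The per-factor covariance construction of steps 2.1 and 2.2 of claim 3 can be sketched as follows. The claims name a Matern kernel but the text does not state its smoothness parameter, so the smoothness 3/2 used here is an assumption, as are the function name `matern32_covariance`, the unit variance and the example coordinates and length scales.

```python
import math


def matern32_covariance(coords, length_scale, variance=1.0):
    """Matern covariance matrix with smoothness 3/2 (an assumed choice).

    k(r) = variance * (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l),
    where r is the Euclidean distance between two spots and l the length scale.
    """
    n = len(coords)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            r = math.dist(coords[i], coords[j])
            a = math.sqrt(3.0) * r / length_scale
            cov[i][j] = cov[j][i] = variance * (1.0 + a) * math.exp(-a)
    return cov


# Three spots in 2-D; each latent factor gets its own length scale,
# hence its own covariance matrix (values are illustrative only).
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
length_scales = [0.5, 2.0]
covariances = [matern32_covariance(coords, ell) for ell in length_scales]
```

A larger length scale makes the correlation between two spots decay more slowly with distance, which is the role the claim assigns to this parameter.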
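The global iteration and convergence decision of step 3.4 of claim 4 follows a generic block-wise pattern, sketched below under stated assumptions. The driver `fit_blockwise` and its arguments are hypothetical names, and the two-parameter concave function is only a toy stand-in for the weighted joint log-likelihood, not the model of the invention.

```python
def fit_blockwise(params, update_steps, log_likelihood, tol=1e-10, max_iter=200):
    """Cycle through block updates until the objective change falls below tol."""
    previous = log_likelihood(params)
    current = previous
    for iteration in range(1, max_iter + 1):
        for step in update_steps:
            # Each step updates one block while the others stay fixed.
            params = step(params)
        current = log_likelihood(params)
        if abs(current - previous) < tol:
            return params, current, iteration
        previous = current
    return params, current, max_iter


# Toy concave objective standing in for the log-likelihood (assumption).
def toy_objective(p):
    return -(p["a"] - 1.0) ** 2 - (p["b"] + 2.0) ** 2 - p["a"] * p["b"]


def update_a(p):
    # Exact maximizer over a with b fixed: -2(a - 1) - b = 0.
    return {**p, "a": 1.0 - p["b"] / 2.0}


def update_b(p):
    # Exact maximizer over b with a fixed: -2(b + 2) - a = 0.
    return {**p, "b": -2.0 - p["a"] / 2.0}


fitted, final_value, n_iter = fit_blockwise(
    {"a": 0.0, "b": 0.0}, [update_a, update_b], toy_objective
)
```

In the claimed method the blocks would be the intra-modality updates and the length-scale updates; here each block has a closed-form maximizer so the loop can be checked by hand.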
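Moran's I, used in step 4.2 of claim 5 to evaluate the global aggregation of spatial domains, can be computed as in the sketch below. The binary neighbourhood weights (1 for spots within a given radius, 0 otherwise), the function name `morans_i` and the 4 x 4 toy grid are assumptions for illustration; the claims do not specify the weighting scheme.

```python
import math


def morans_i(values, coords, radius):
    """Moran's I with binary weights: w_ij = 1 if spots i != j lie within radius."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    numerator = 0.0
    weight_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= radius:
                numerator += dev[i] * dev[j]
                weight_sum += 1.0
    denominator = sum(d * d for d in dev)
    if weight_sum == 0.0 or denominator == 0.0:
        raise ValueError("Moran's I is undefined without neighbours or variance")
    return (n / weight_sum) * numerator / denominator


# 4 x 4 grid of spots, listed column by column.
grid = [(x, y) for x in range(4) for y in range(4)]
clustered = [0.0 if x < 2 else 1.0 for x, y in grid]   # two contiguous domains
checkerboard = [float((x + y) % 2) for x, y in grid]    # alternating labels

i_clustered = morans_i(clustered, grid, radius=1.0)
i_checkerboard = morans_i(checkerboard, grid, radius=1.0)
```

Spatially contiguous labels give a clearly positive value, while the alternating pattern gives a negative one, which is why the index serves as a measure of spatial aggregation.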
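The dual filter of step 4.4 of claim 5 (corrected P < 0.05 together with fold change > 0.25) can be sketched as follows. The claim does not say which multiple-testing correction is used or on what scale the fold change is measured, so the Benjamini-Hochberg procedure below is an assumption, as are the function names, the feature names and the made-up p-values; the raw p-values are taken as given, for example from a Wilcoxon rank-sum test run elsewhere.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (assumed correction method)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_markers(names, p_values, fold_changes, alpha=0.05, min_fold_change=0.25):
    """Keep features passing both the adjusted-P and the fold-change threshold."""
    adjusted = benjamini_hochberg(p_values)
    return [
        name
        for name, q, fc in zip(names, adjusted, fold_changes)
        if q < alpha and fc > min_fold_change
    ]


# Hypothetical features with made-up statistics for one spatial domain.
names = ["g1", "g2", "g3", "g4"]
p_values = [0.001, 0.02, 0.04, 0.6]
fold_changes = [1.2, 0.1, 0.8, 2.0]
markers = select_markers(names, p_values, fold_changes)
```

In this toy input only `g1` survives: `g2` passes the significance threshold but fails the fold-change threshold, and `g3` passes the fold-change threshold but not the adjusted significance threshold.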

Description

Spatial multi-omics analysis method based on a probabilistic framework and the principle of generalized principal component analysis

Technical Field

The invention relates to a spatial multi-omics analysis method based on a probabilistic framework and the principle of generalized principal component analysis, and belongs to the interdisciplinary technical field of computational biology, bioinformatics and statistical machine learning. At its core, the method processes spatial multi-omics data (such as transcriptomics, epigenomics and proteomics) through a probabilistic framework. It addresses the computational challenges posed by high dimensionality, modal heterogeneity and spatial complexity, enables accurate analysis of biological problems such as tissue architecture, spatial domain detection and clustering, and provides technical support for tissue biology modeling, disease research and the like.

Background

In recent years, spatial multi-omics technologies have advanced markedly. They can measure several molecular layers (such as transcriptomics, epigenomics and proteomics) simultaneously while preserving the original spatial context of the sample, providing a powerful means of studying tissue architecture, cellular heterogeneity and function. By integrating the intrinsic regulatory states of cells with extrinsic spatial signals, such datasets greatly advance the understanding of development, disease progression and the tissue microenvironment. In terms of technical principle, the core value of spatial multi-omics data is that it contains both multi-modal molecular information and spatial position information.
However, its analysis faces three core challenges: high dimensionality, in that each omics modality contains a very large number of features (such as genes and proteins); modal heterogeneity, in that noise distribution, resolution and dynamic range differ significantly between omics layers; and spatial complexity, in that adjacent cells or regions exhibit complex spatial dependence owing to biological correlation. At present, integrated analysis of spatial multi-omics data still lacks an efficient and flexible method, and the prior art has the following key problems:

Spatial context is ignored or limited to a single modality. Non-spatial single-cell multi-omics integration methods (such as Seurat v, totalVI, scMDC) can process multi-modal data but do not consider spatial dependence; they treat cells as exchangeable units and lose key spatial information. Spatially aware tools (e.g., SpatialPCA) model spatial dependence through Gaussian process or graph-based priors, but they are suited only to single-modality data and cannot be directly extended to multi-omics scenarios.

Latent variables are assumed to share the same spatial correlation structure. Most existing methods (including SpatialPCA, probabilistic PCA, variational autoencoders, etc.) assume that all latent dimensions follow the same spatial covariance structure. This simplifying assumption is inconsistent with biological reality: different regulatory axes may exhibit distinct spatial patterns (e.g., different scales or anisotropies), so the assumption severely limits the expressive power of the model.
Poor multi-modal scalability. Deep learning methods (such as SpaVAE, SpatialGlue) can process bimodal data, but they achieve spatial smoothness through parallel encoder-decoder branches and modified loss functions. The architecture is rigid and hard to extend beyond two modalities, since extension requires manually rebuilding the network and retuning hyperparameters, which limits practicality.

Insufficient computational efficiency and flexibility. Although some methods (such as MEFISTO) can integrate multi-modal spatial data, their computation time is markedly longer and their performance in some scenarios is inferior to other methods, making it difficult to meet the demands of large-scale data analysis.

These shortcomings mean that the prior art cannot effectively integrate heterogeneous multi-omics data and accurately capture complex spatial patterns. This is the core motivation of the invention: to develop an integration framework that can flexibly model heterogeneous spatial correlation across latent dimensions and support any number of omics modalities.

Disclosure of Invention

In view of the problems and shortcomings of the prior art, the invention provides a spatial multi-omics analysis method based on a probabilistic framework and the principle of generalized principal component analysis, which aims to solve the problems of missing spatial dependency modeling, insufficient handling of multi-modal heterogeneity, limited ability to characterize complex spatial structures and th