CN-122024815-A - Single-cell multi-omics integration method based on prior knowledge and probabilistic reasoning

Abstract

The invention discloses a single-cell multi-omics integration method based on prior knowledge and probabilistic reasoning, and belongs to the technical field of computational biology and bioinformatics. The method comprises: preprocessing multi-omics single-cell data; constructing a cross-omics prior graph based on genomic coordinates; extracting feature embeddings with a graph variational autoencoder; learning the cell embedding representation of each modality with an omics-specific variational autoencoder; and fusing the posterior distributions of the omics layers with a product-of-experts strategy to generate a unified co-embedding matrix. By introducing biological prior knowledge to guide feature alignment and combining a probabilistic generative model with an expert fusion mechanism, the invention models the shared and omics-specific features of multi-omics data simultaneously within a unified framework, and can effectively improve the robustness of integration and the accuracy of downstream analysis for high-noise, sparse single-cell data.

Inventors

GUO YINGJIE
ZHOU JIN
CHENG CHENYANG
LIANG ZHEN

Assignees

Shanxi University (山西大学)

Dates

Publication Date: 20260512
Application Date: 20260123

Claims (4)

1. A single-cell multi-omics integration method based on prior knowledge and probabilistic reasoning, characterized by comprising the following steps:
Step 1, multi-omics data acquisition and preprocessing: acquiring multi-omics data from the same cell population, each omics layer corresponding to a data matrix [formula omitted in source], wherein N is the number of cells; and performing quality control, normalization and feature screening on each omics layer to obtain the preprocessed omics data;
Step 2, constructing a feature association graph based on biological priors: uniformly mapping the features of the different omics layers to the same biological reference space, and constructing a cross-omics feature association graph [formula omitted in source] based on biological prior rules to guide feature alignment across omics layers in the latent feature space, wherein the graph consists of a set of feature nodes and a set of edges between features; in one embodiment, the association weight between two features may be defined from genomic coordinate distance as [formula omitted in source], wherein the weight depends on the relative distance between the two features in genomic coordinates and on a distance attenuation coefficient;
Step 3, feature embedding learning guided by the feature association graph: based on the feature association graph constructed in step 2, generating a feature embedding representation by a graph-structure modeling method: [formula omitted in source], wherein the result is a feature embedding matrix, m is the embedding dimension, and the mapping is a feature mapping function based on the graph structure; in one embodiment, the feature embedding is modeled as following a probability distribution [formula omitted in source] and is optimized by minimizing a reconstruction error and a regularization term;
Step 4, omics-specific cell representation learning: for each omics layer, learning an omics-specific latent cell representation from the corresponding preprocessed data and the feature embedding representation: [formula omitted in source], wherein the result is the latent representation of the cells under the k-th omics layer; in one embodiment, the latent cell representation and the feature embedding jointly take part in data reconstruction through an inner product: [formula omitted in source], wherein the reconstruction uses a mapping function matched to the distribution characteristics of the omics layer;
Step 5, multi-omics cell representation integration based on probabilistic fusion: introducing a probabilistic fusion mechanism to generate a unified multi-omics cell representation from the omics-specific latent cell representations obtained in step 4; in one embodiment, the latent cell representations of the different omics layers are treated as multiple sources of probabilistic information whose joint posterior can be expressed as [formula omitted in source], wherein the prior is a preset prior distribution; the fused latent cell representation can be determined from the statistical characteristics of the joint posterior: [formula omitted in source];
Step 6, outputting the multi-omics co-embedding representation for downstream analysis: constructing a multi-omics co-embedding matrix from the unified latent cell representation obtained in step 5: [formula omitted in source]; the co-embedding matrix is used to support downstream analysis tasks such as cell clustering, cell type annotation and cell state analysis.
2. The single-cell multi-omics integration method based on prior knowledge and probabilistic reasoning according to claim 1, wherein in step 2 the biological prior graph, which carries the genomic associations between features, is constructed by calculating the relative genomic coordinate distance between features.
3. The single-cell multi-omics integration method based on prior knowledge and probabilistic reasoning according to claim 1, wherein in step 4 the omics-specific data decoder is defined as an inner product function of the cell embedding and the feature embedding, so as to guide the alignment of the feature spaces of the different omics layers.
4. The single-cell multi-omics integration method based on prior knowledge and probabilistic reasoning according to claim 1, wherein in step 5 the features shared across omics layers are learned with a product-of-experts strategy, which combines the posterior distributions of the individual omics layers to infer the joint posterior distribution over all omics layers and thereby learns the features shared among the omics data.
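
Editorial illustration of claims 2 to 4: the sketch below shows one way the three ingredients (distance-based edge weights, an inner-product decoder, and product-of-experts fusion) could look in NumPy. It is not the patented implementation. The exponential decay form of the weight, the diagonal Gaussian form of each posterior, the standard normal prior, and all names (`distance_weight`, `inner_product_decoder`, `poe_fuse`, `decay`) are assumptions made for the example, because the source text does not reproduce the formulas.

```python
import numpy as np


def distance_weight(dist_bp, decay=1e5):
    """Hypothetical edge weight that decays with genomic distance (in bp).

    The exponential form and the default decay value are assumptions.
    """
    dist_bp = np.asarray(dist_bp, dtype=float)
    return np.exp(-dist_bp / decay)


def inner_product_decoder(cell_emb, feat_emb):
    """Score every (cell, feature) pair as the inner product Z @ V.T.

    cell_emb: (N, m) cell embeddings; feat_emb: (d, m) feature embeddings.
    A mapping matched to the omics distribution would be applied on top of
    this score; it is left out of the sketch.
    """
    return np.asarray(cell_emb) @ np.asarray(feat_emb).T


def poe_fuse(means, variances, prior_mean=0.0, prior_var=1.0):
    """Product of diagonal Gaussian experts with a Gaussian prior.

    means, variances: arrays of shape (K, N, m), one posterior per omics
    layer. Returns the fused mean and variance, each of shape (N, m).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Precisions add under a product of Gaussians.
    precision = 1.0 / prior_var + np.sum(1.0 / variances, axis=0)
    fused_var = 1.0 / precision
    # The fused mean is the precision-weighted average of all means.
    weighted = prior_mean / prior_var + np.sum(means / variances, axis=0)
    return fused_var * weighted, fused_var
```

With Gaussian experts the fused posterior is again Gaussian, so an omics layer that is uncertain about a cell (large variance) contributes little to that cell's fused embedding.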

Description

Single-cell multi-omics integration method based on prior knowledge and probabilistic reasoning

Technical Field

The invention relates to the technical field of single-cell multi-omics data integration and analysis in computational biology, and in particular to a multi-omics data fusion method that combines a graph variational autoencoder guided by biological prior knowledge with a product-of-experts strategy.

Background

Single-cell multi-omics integration aims to jointly analyze heterogeneous omics data (such as transcriptome, epigenome and proteome data) from the same cells in order to reveal the cross-omics regulatory rules that govern cell states and functions. Breakthroughs in high-throughput sequencing have made it possible to acquire single-cell multi-omics data, but the inherent characteristics of the data challenge computational integration methods: on the one hand, the dimensionality and distribution of the different omics layers differ markedly, which makes cross-modal feature alignment difficult; on the other hand, the high noise and sparsity of the data further increase the complexity of integration. Traditional single-cell multi-omics integration methods are mostly based on linear modeling strategies. Constrained by the linearity assumption, however, such methods struggle to capture complex nonlinear relationships between omics layers and generalize poorly to large-scale, highly heterogeneous data. The introduction of deep learning has markedly strengthened the nonlinear modeling capability of integration methods and offers a new approach to multi-omics data integration.
These methods typically adopt an autoencoder or variational autoencoder architecture, integrating multi-omics data by building a unified latent space across omics features and strengthening reconstruction with omics-shared or omics-specific decoders. Although this improves integration performance, it remains difficult to capture both the commonality and the specificity of the omics layers at the same time. More importantly, most current integration methods do not fully exploit the biological prior knowledge contained in single-cell multi-omics data, such as the positional relationships between features of different omics layers.

Disclosure of Invention

In view of the above problems, the invention aims to provide a single-cell multi-omics integration method based on prior knowledge and probabilistic reasoning. The technical scheme adopted by the invention is a single-cell multi-omics integration method based on prior knowledge and probabilistic reasoning, comprising the following steps:

Step 1, multi-omics data acquisition and preprocessing: Multi-omics data are acquired from the same cell population, each omics layer corresponding to a data matrix [formula omitted in source], where N is the number of cells and the other dimension is the number of features of the k-th omics layer. Quality control, normalization and feature screening are then performed on each omics layer to obtain the preprocessed omics data.
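
Editorial illustration of Step 1 for a single count matrix: a minimal NumPy sketch under stated assumptions, not the procedure fixed by the invention. The cells-by-features orientation, the thresholds (`min_counts`, `n_top`), library-size normalization followed by a log transform, and variance-based feature screening are all choices made for the example, and `preprocess` is a hypothetical helper name.

```python
import numpy as np


def preprocess(counts, min_counts=1, target_sum=1e4, n_top=2000):
    """Quality control, normalization and feature screening for one layer.

    counts: non-negative array of shape (cells, features).
    Returns the processed matrix and the indices of kept cells and features.
    """
    counts = np.asarray(counts, dtype=float)

    # Quality control: drop cells whose total count is below the threshold.
    totals = counts.sum(axis=1)
    keep_cells = np.flatnonzero(totals >= min_counts)
    counts = counts[keep_cells]

    # Normalization: scale each cell to the same total, then log-transform.
    # The floor on the denominator avoids division by zero for empty cells.
    cell_totals = np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    logged = np.log1p(counts / cell_totals * target_sum)

    # Feature screening: keep the n_top most variable features.
    variances = logged.var(axis=0)
    n_keep = min(n_top, logged.shape[1])
    keep_features = np.sort(np.argsort(variances)[::-1][:n_keep])

    return logged[:, keep_features], keep_cells, keep_features
```

Because the omics layers come from the same cells, the `keep_cells` indices returned for each layer would need to be intersected before integration so that every retained cell is present in all layers.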

Step 2, constructing a feature association graph based on biological priors: The features of the different omics layers are uniformly mapped to the same biological reference space, and a cross-omics feature association graph [formula omitted in source] is constructed from biological prior rules to guide feature alignment across omics layers in the latent feature space; the graph consists of a set of feature nodes and a set of edges between features. In one embodiment, the association weight between two features may be defined from genomic coordinate distance as [formula omitted in source], where the weight depends on the relative distance between the two features in genomic coordinates and on a distance attenuation coefficient.

Step 3, feature embedding learning guided by the feature association graph: Based on the feature association graph constructed in step 2, a feature embedding representation is generated by a graph-structure modeling method: [formula omitted in source], where the result is a feature embedding matrix, m is the embedding dimension, and the mapping is a feature mapping function based on the graph structure. In one embodiment, the feature embedding may be modeled as following a probability distribution [formula omitted in source] and optimized by minimizing a reconstruction error and a regularization term.

Step 4, omics-specific cell representation learning: For each omics layer, an omics-specific latent cell representation is learned from the corresponding preprocessed data and the feature embedding representation: [formula omitted in source], where the result is the latent representation of the cells under the k-th omics layer. In one embodiment, the latent cell representation and the feature embedding jointly take part in data reconstruction through an inner product: [formula omitted in source] Where