CN-121999873-A - Domain-adaptive single-cell classification method with a learnable graph
Abstract
The invention discloses a domain-adaptive single-cell classification method with a learnable graph. The method uses a public single-cell transcriptome database; divides the original data set by a leave-one-out method into a reference data set (with cell type labels) and a query data set (without cell type labels); constructs a domain-adaptive single-cell classification network with a learnable graph and a unified joint loss function; inputs the divided reference and query data sets into the constructed network based on graph structure learning and graph domain adaptation; trains the model by minimizing the total loss function; and inputs a new data set from an individual with the same cancer type as the query cell set into the network to obtain the predicted cell type classification. By introducing dynamic graph structure learning and graph domain adaptation, the method effectively overcomes two core defects of the prior art, namely static graph structure and weak cross-sample generalization, and improves the accuracy and generalization capability of single-cell classification.
Inventors
- Li Yuechao
- He Jiajun
- You Hairu
- Sun Xiao
- Li Jiayuan
- Huang Yuan
- Huang Zhian
- You Zhuhong
Assignees
- Northwestern Polytechnical University (西北工业大学)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-01-12
Claims (5)
- 1. A domain-adaptive single-cell classification method with a learnable graph, comprising the steps of:
Step 1, adopting a public single-cell transcriptome database covering three cancer types, namely leukemia, breast invasive carcinoma and colorectal cancer, wherein the sample data of each patient is an independent single-cell RNA sequencing (scRNA-seq) expression matrix whose rows are cells, whose columns are genes, and whose elements are the expression counts of each gene in each cell, each matrix being accompanied by corresponding cell type labels;
Step 2, dividing the original data set into a reference data set (with cell type labels) and a query data set (without cell type labels) by a leave-one-out method: for patient samples of the same cancer type, taking the data of one individual as the reference set and the data of another individual as the query set, and repeating this process over all possible individual pairs;
Step 3, constructing a single-cell classification network based on graph structure learning and graph domain adaptation, the network comprising a reference-set cell-cell graph module, a query-set cell-cell graph module, a dynamic graph structure learning module and a graph domain adaptation module; the reference-set cell-cell graph module and the query-set cell-cell graph module have the same structure, the former taking the reference data set as input and the latter taking the query data set as input;
Step 3-1, constructing the reference-set cell-cell graph module, which transforms the original gene expression matrix of the reference data set into a reference-set initial cell-cell graph, whose node set represents the reference-set cells after data preprocessing and whose edge set, represented by an adjacency matrix, encodes the interaction relationships between reference-set cells;
Step 3-1-1, data preprocessing, which converts the original gene expression matrix of the reference data set into a reference-set standardized highly variable gene expression matrix through cell filtration, gene filtration, library-size normalization, logarithmic transformation and feature screening:
(1) cell filtration and gene filtration: removing from the original gene expression matrix the cells whose detected gene number is less than 200 or greater than 2500, and the genes expressed in fewer than 3 cells; cell filtration counts the non-zero elements of each row vector and deletes the rows outside the allowed range, and gene filtration counts the non-zero elements of each column vector and deletes the columns below the threshold;
(2) library-size normalization and logarithmic transformation: applying standard library-size normalization and logarithmic transformation to the matrix obtained after cell filtration and gene filtration, yielding a normalized gene expression matrix; this processing suppresses technical noise and stabilizes the variance;
(3) feature screening: based on analysis of variance, computing the variance of each gene's expression level over all cells in the normalized matrix, sorting the variances in descending order, and selecting the top 3000 genes with the highest variance to obtain the final reference-set standardized highly variable gene expression matrix;
Step 3-1-2, initial cell-cell graph generation: taking the reference-set standardized highly variable gene expression matrix as input, constructing the reference-set initial cell-cell graph, represented by its adjacency matrix, with a KNN algorithm, and setting the neighbor number K of the KNN algorithm;
Step 3-2, constructing the query-set cell-cell graph module, which transforms the original gene expression matrix of the query data set into a query-set initial cell-cell graph, whose node set represents the query-set cells after data preprocessing and whose edge set, represented by an adjacency matrix, encodes the interaction relationships between query-set cells; the query-set cell-cell graph module follows the same processing flow as the reference-set cell-cell graph module, and its preprocessing steps likewise yield the query-set standardized highly variable gene expression matrix;
Step 3-3, constructing the dynamic graph structure learning module, which takes the reference-set initial cell-cell graph (represented by its adjacency matrix) as input, uses the reference-set standardized highly variable gene expression matrix as the node feature matrix and the reference-set cell type label matrix as the supervision signal, constructs a two-layer graph convolutional network (GCN) oriented to the cell classification task together with an optimized reference-set cell-cell graph (represented by an optimized adjacency matrix), and outputs a predicted reference-set cell type label matrix; the module consists of a learnable soft adjacency matrix, the two-layer GCN and a joint optimizer; the reference-set cell type label matrix is the one-hot encoding of the true labels of the reference-set cells over the total number of cell types, each element being 1 if the cell belongs to the category and 0 otherwise, and each element of the predicted label matrix is the probability score that a cell belongs to a category;
Step 3-3-1, learnable soft adjacency matrix: defining a learnable soft adjacency matrix of the same dimension as the adjacency matrix of step 3-1-2, and initializing it to that adjacency matrix;
Step 3-3-2, two-layer GCN learning: constructing a two-layer GCN for the cell classification task, whose input is the learnable soft adjacency matrix defined in step 3-3-1, with the reference-set standardized highly variable gene expression matrix as the node feature matrix, and whose output is the predicted reference-set cell type label matrix; the two-layer GCN comprises an encoder and a classifier: the first layer, i.e. the encoder, outputs the hidden features, and the second layer, the classifier, outputs the predicted cell type label matrix; the learnable soft adjacency matrix is symmetrically normalized after adding self-loops, and each layer has a trainable GCN weight matrix;
Step 3-3-3, constructing the joint optimizer: the joint optimizer coordinates the whole graph structure learning process and, using the reference-set cell type label matrix as the supervision signal, jointly trains the learnable soft adjacency matrix of step 3-3-1 and the two-layer GCN parameters of step 3-3-2 by an alternating optimization strategy; the joint optimizer loss function is the weighted sum, with two hyperparameters, of a cross-entropy loss and a graph structure optimization loss; the graph structure optimization loss combines a Frobenius-norm term constraining structural consistency with the initial KNN graph, an L1-norm term promoting sparsity of the graph structure, a nuclear-norm term promoting a low-rank graph structure, and a feature smoothing loss computed with the normalized Laplacian matrix, the terms being balanced by hyperparameters;
Step 3-4, constructing the graph domain adaptation module: the module adopts a Pairwise Alignment method, iteratively estimating two groups of adaptive weights to align the reference set and the query set; it comprises query-set cell type prediction label matrix generation, conditional structure alignment, label distribution alignment, and a domain adaptation training control block; the prediction block generates the predicted query-set cell type label matrix from the query-set initial cell-cell graph; the conditional structure alignment takes the optimized reference-set cell-cell graph, the predicted query-set cell type label matrix, the query-set initial cell-cell graph and the reference-set cell type label matrix as input and computes a density ratio matrix; the label distribution alignment takes the reference-set cell type label matrix, the predicted reference-set cell type label matrix and the predicted query-set cell type label matrix as input and computes a label weight vector; the domain adaptation training control block receives these weights, performs weighted training, and updates the two-layer GCN parameters of step 3-3-2 until the model converges;
Step 3-4-1, generating the query-set cell type prediction label matrix: inputting the query-set initial cell-cell graph, represented by its adjacency matrix, into the two-layer GCN of step 3-3, with the query-set standardized highly variable gene expression matrix as the node feature matrix, and outputting the predicted query-set cell type label matrix;
Step 3-4-2, conditional structure alignment: taking the optimized reference-set cell-cell graph, the predicted query-set cell type label matrix, the query-set initial cell-cell graph and the reference-set cell type label matrix as input, computing the density ratio matrix, whose element for a given pair of cell types expresses the difference, between the reference-set and query-set data distributions, of the probability that a cell of the given type has a neighbor of the other type; the conditional probability ratio is decomposed into edge-type joint distribution weights and node-type marginal distribution weights, estimated by solving a system with an iterative algorithm: the edge-type weight measures the ratio between the occurrence rate, in the query-set edge set and in the reference-set edge set, of edges connecting nodes of the two classes, and the node-type weight measures the ratio between the occurrence rates of nodes of a class in the query set and in the reference set; the edge-type weights are obtained by solving a constrained least-squares problem over statistics estimated from the edge-level predictions of the reference set and the query set, these statistics being computed from the predicted and true cell type labels of the start and end points of reference-set edges and from the classifier's prediction probabilities that each endpoint node belongs to each category; given the edge-type joint distribution weights, the node-type marginal weights are computed from them;
Step 3-4-3, label distribution alignment: taking the reference-set cell type label matrix, the predicted reference-set cell type label matrix and the predicted query-set cell type label matrix as input, computing the label distribution weight vector by solving a constrained linear system involving the empirical confusion matrix of the classifier on the reference-set data and the classifier's average prediction output over the whole target-domain data, the latter being averaged over all query-set cells, where the totals of reference-set and query-set cells are as defined above;
Step 3-4-4, domain adaptation training control block: this block coordinates the whole domain adaptation training process, integrates the results of conditional structure alignment and label distribution alignment, and realizes effective domain adaptation through weighted training; it uses the current density ratio weights to re-weight the reference-set cell-cell graph, obtaining a new adjacency matrix in which each edge weight is scaled according to the predicted categories of its two endpoint nodes, the predicted category being the one with the highest probability; the new adjacency matrix and the reference-set standardized highly variable gene expression matrix are input into the two-layer GCN of step 3-3 to obtain a new predicted reference-set cell type label matrix; the label distribution weights are then used to weight the loss of the reference-set training samples, and the two-layer GCN parameters are updated by gradient descent; repeating step 3-4, the query-set initial cell-cell graph and the query-set standardized highly variable gene expression matrix go through the same GCN forward pass and parameter update to obtain a new predicted query-set cell type label matrix; finally, the new adjacency matrix and the new predicted reference-set and query-set cell type label matrices are used to update the two groups of adaptive weights, and the iterative training continues until the model converges; the weighted cross-entropy loss weights each reference-set training sample by the label distribution weight of its true category and uses the predicted probability, taken from the corresponding row vector of the new predicted reference-set cell type label matrix, that the node belongs to its true category; the model parameters are updated by gradient descent, performing end-to-end domain adaptation training of the two-layer GCN parameters of step 3-3-2;
Step 4, constructing a total loss function to cooperatively optimize the two learning tasks defined in step 3, namely graph structure learning and domain adaptation training, the total loss function being the unified joint loss combining the loss terms already defined in step 3;
Step 5, inputting the reference data set and the query data set divided in step 2 into the single-cell classification network based on graph structure learning and graph domain adaptation constructed in step 3, and performing model training by minimizing the total loss function of step 4;
Step 6, inputting a new query cell set into the model trained in step 5 to obtain the predicted cell type classification.
- 2. An electronic device comprising a processor and a memory, the memory for storing a computer program, the processor for executing the computer program stored in the memory to cause the electronic device to perform the method of claim 1.
- 3. A computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the method according to claim 1.
- 4. A chip comprising a processor for calling and running a computer program from a memory, causing a device on which the chip is mounted to perform the method of claim 1.
- 5. A computer program product comprising a computer storage medium storing a computer program comprising instructions executable by at least one processor, the instructions when executed by the at least one processor implementing the method of claim 1.
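The preprocessing of claim step 3-1-1 (cell/gene filtration, library-size normalization, log transformation, variance-based feature screening) can be sketched as follows. This is a minimal plain-Python illustration on a toy count matrix, not the patented implementation; the thresholds are parameters (the claim uses 200/2500 genes per cell, 3 cells per gene, and the top 3000 genes), and the function name is illustrative.

```python
import math

def preprocess(counts, min_genes=2, max_genes=10, min_cells=2, n_top=3):
    """Toy sketch of claim step 3-1-1.
    `counts` is a list of rows (cells) of per-gene expression counts."""
    # (1) cell filtration: keep cells whose detected-gene count lies in range
    cells = [r for r in counts
             if min_genes <= sum(1 for v in r if v > 0) <= max_genes]
    # gene filtration: keep genes expressed (non-zero) in >= min_cells cells
    keep = [j for j in range(len(cells[0]))
            if sum(1 for r in cells if r[j] > 0) >= min_cells]
    cells = [[r[j] for j in keep] for r in cells]
    # (2) library-size normalization to the mean depth, then log(1 + x)
    target = sum(map(sum, cells)) / len(cells)
    norm = [[math.log1p(v * target / sum(r)) for v in r] for r in cells]
    # (3) feature screening: keep the n_top genes with highest variance
    def var(col):
        m = sum(col) / len(col)
        return sum((v - m) ** 2 for v in col) / len(col)
    variances = [var([r[j] for r in norm]) for j in range(len(norm[0]))]
    top = sorted(range(len(variances)), key=lambda j: -variances[j])[:n_top]
    top.sort()
    return [[r[j] for j in top] for r in norm]
```

On a 4-cell, 5-gene toy matrix with the small thresholds above, a cell expressing only one gene is removed and the result is a cells-by-selected-genes matrix of log-normalized values.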
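The two-layer GCN of claim step 3-3-2 can be sketched as a forward pass, assuming the standard graph-convolution formulation: the adjacency matrix gets self-loops and symmetric degree normalization, the first layer applies ReLU (the encoder), and the second layer applies a row-wise softmax (the classifier). This plain-Python sketch omits training and the learnable-adjacency updates; all function names are illustrative.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def normalize_adj(S):
    """A_hat = D^{-1/2} (S + I) D^{-1/2}: self-loops + symmetric normalization."""
    n = len(S)
    A = [[S[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d = [1.0 / math.sqrt(sum(row)) for row in A]
    return [[d[i] * A[i][j] * d[j] for j in range(n)] for i in range(n)]

def gcn_forward(S, X, W0, W1):
    """Two-layer GCN: encoder (ReLU) then classifier (row-wise softmax)."""
    A_hat = normalize_adj(S)
    H = [[max(0.0, v) for v in row] for row in matmul(matmul(A_hat, X), W0)]
    Z = matmul(matmul(A_hat, H), W1)
    out = []
    for row in Z:                     # numerically stable softmax per node
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out
```

Each output row is a probability distribution over cell types for one node, matching the predicted label matrix described in the claim.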
Description
Domain-adaptive single-cell classification method with a learnable graph

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a domain-adaptive single-cell classification method with a learnable graph.

Background

In the field of single-cell RNA sequencing data analysis, single-cell classification is a key step in understanding cell heterogeneity and function. In recent years, methods based on graph neural networks (GNNs) have markedly improved cell classification performance by exploiting the association relationships among cells. However, when processing cross-sample, cross-patient, non-spatial scRNA-seq data, existing GNN methods have significant limitations in both the way the graph structure is constructed and the cross-domain generalization capability of the model. Cell-cell graph construction in the prior art relies mainly on two paradigms, neither of which enables end-to-end optimization of the graph structure: (1) static graph construction based on similarity. scGCN is a representative technical scheme of this class. It first computes a similarity matrix of the gene expression profiles among cells (e.g., the Pearson correlation coefficient), then sparsifies the graph by thresholding or by retaining the K nearest neighbors to form a fixed, static adjacency matrix A, which is used as the GNN input for subsequent cell type prediction. However, the graphs constructed by such methods depend entirely on gene expression similarity, whose quality is limited by the high-dimensional noise and dropout events of the raw data. High expression similarity is not exactly equivalent to a biologically meaningful interaction, so the resulting graph structure contains a large number of irrelevant and even misleading links.
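The similarity-based static graph construction described above (Pearson correlation followed by K-nearest-neighbor sparsification) can be sketched as follows. This is a minimal plain-Python illustration of the general paradigm, not scGCN's actual code; function names are illustrative, and the OR-symmetrization is one common design choice for KNN graphs.

```python
import math

def pearson(x, y):
    """Pearson correlation of two expression profiles (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def knn_graph(X, k=2):
    """Static adjacency: each cell keeps edges to its k most similar cells,
    symmetrized by OR. Once built, this graph never changes during training."""
    n = len(X)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = sorted(((pearson(X[i], X[j]), j) for j in range(n) if j != i),
                      reverse=True)
        for _, j in sims[:k]:
            A[i][j] = A[j][i] = 1.0
    return A
```

The fixed matrix returned here is exactly the kind of static structure the invention replaces with a learnable soft adjacency matrix.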
More importantly, once built, the graph is fixed and cannot be adaptively adjusted and optimized according to the needs of the downstream classification task during model training, which caps the achievable model performance. (2) Static graph construction weighted by external knowledge. scGNN and scPML are partial improvements over the similarity-based static graph construction of (1). scGNN introduces ligand-receptor interaction probability scores obtained from the CellChat database as prior knowledge to weight the edges between cells, while scPML uses the AUCell algorithm to score gene signaling pathways in the KEGG database and uses the scores to compute the strength of cell-cell associations as edge weights. Although biological knowledge is introduced, the external databases relied upon (e.g., ligand-receptor interactions, signaling pathways) are incomplete and not tumor-specific, and may miss critical interactions or introduce irrelevant background noise in a particular tumor microenvironment. Meanwhile, as with the first class of methods, the graph structure (including edges and weights) is determined before model training; it remains a static, fixed graph in nature that cannot be learned or corrected for a specific data set and task. These methods decouple graph construction from model learning and fail to achieve integrated optimization. When a trained model is applied to query data sets from different patients or different experimental batches, it faces a serious domain shift problem, i.e., inconsistent data distributions between the source domain (reference set) and the target domain (query set). The aforementioned scGCN, scGNN and similar methods, as well as other non-graph classifiers such as SingleR and scLearn, perform well when the training and testing data distributions are consistent, but their model design does not explicitly account for domain shift.
These methods generally assume that the reference set and the query set follow the same data distribution. When applied to new samples, they predict directly with a model trained on the reference set, without any compensation or alignment for the differences between the two domains in cell composition, gene expression level, technical batch, and so on. Ignoring domain shift in this way leaves the model with insufficient generalization capability: when there is a large biological or technical difference between the query sample and the reference sample, model performance may degrade significantly. A model that performs excellently on the source domain, when applied directly to target-domain patient data, can suffer a sharp drop in metrics such as the accuracy and recall of cell type annotation, seriously affecting its practicality and robustness on real, heterogeneous clinical data sets. In general, the prior-art methods suffer from the following drawbacks: (1) the graph structure is static - existing GNN methods rely on pred