CN-121983137-A - Accurate identification method for benign and malignant cells based on multi-dimensional characteristics of single cell transcriptome
Abstract
The invention discloses a method for accurately identifying benign and malignant cells based on multi-dimensional characteristics of a single-cell transcriptome, and belongs to the technical fields of bioinformatics, tumor molecular biology and cell identification. Using single-cell transcriptome sequencing data of tumor tissues and paracancerous tissues, the method combines three types of information (copy number variation, allele-specific copy number variation and tumor-related transcriptional signatures) and makes the final benign or malignant determination for each EpCAM-positive epithelial cell through a multi-evidence voting strategy. The method offers higher stability and accuracy in multi-patient, multi-sample and early-tumor scenarios, in particular improving the identification of malignant cells in early lesions and of cells in transitional states at the benign-malignant boundary, and provides a reliable technical means for accurate diagnosis and personalized treatment of tumors.
Inventors
- XING XUDONG
- ZHANG YUZHUO
- LIU YUKUAN
- LI XIZE
Assignees
- Beijing Institute of Genomics, Chinese Academy of Sciences (China National Center for Bioinformation)
Dates
- Publication Date
- 20260505
- Application Date
- 20260121
Claims (10)
- 1. A method for accurately identifying benign and malignant cells based on multi-dimensional characteristics of a single-cell transcriptome, comprising the following steps: 1) Single-cell transcriptome data preprocessing: performing normalization and highly variable gene selection on single-cell transcriptome data of a target tumor sample, combining principal component analysis and UMAP dimensionality reduction, dividing cells into different clusters using a clustering algorithm, annotating cell types based on known marker genes so that the cells are mainly divided into EpCAM-positive epithelial cells, mesenchymal cells and immune cells, selecting the EpCAM-positive epithelial cells as the candidate cell set for benign/malignant determination, and taking the immune cells and the mesenchymal cells as normal reference cell populations for subsequent copy number variation inference; 2) Performing copy number variation analysis on the candidate cells of the target tumor sample with the inferCNV algorithm, using immune cells as the normal reference cell population, inferring the copy number variation of the candidate cells, determining a sample-specific CNV determination threshold based on the average CNV level of mesenchymal cell clusters, and labeling cells whose average CNV level exceeds the CNV determination threshold as malignant cells; 3) Malignant cell inference based on the tumor-related transcriptional signature score: constructing malignant and non-malignant signature gene sets based on bulk tissue transcriptome expression differences between tumor tissue and paracancerous normal tissue, calculating the malignancy score of each cell using the Seurat tool, and obtaining the final malignant/non-malignant determination by iteratively updating the signature gene sets over multiple rounds, wherein the iterative updating terminates when the consistency ratio between the current-round and previous-round signature gene sets reaches or exceeds a preset threshold or the number of iteration rounds reaches a preset upper limit; 4) Extracting allele information of the target tumor sample with the Numbat tool to perform copy number variation inference, setting key modeling parameters, and determining malignant cells according to tumor probability, genotype state and clone label information; 5) Comprehensive determination of the final benign or malignant state: making the final benign/malignant determination for the EpCAM-positive epithelial cells with a voting rule that combines the inference results based on copy number variation, allele-specific copy number variation and the tumor-related transcriptional signature score.
- 2. The method of claim 1, wherein in step 1) single-cell transcriptome sequencing data of tumor tissue and/or paracancerous normal tissue samples are used to construct a Seurat object containing a UMI count matrix; normalization and highly variable gene selection are performed; cells are divided into a plurality of cell clusters by combining principal component analysis, UMAP dimensionality reduction and a clustering algorithm; each cell cluster is annotated by type based on known marker genes, so that cells are divided into at least EpCAM-positive epithelial cells, mesenchymal cells and immune cells, the immune cells including B lymphocytes, T lymphocytes, natural killer cells and myeloid cells; according to the cell type annotation results, all cells are screened and only EpCAM-positive epithelial cells are retained to construct the candidate cell set for benign/malignant determination; and the immune cells are used as the normal reference cell population in subsequent copy number variation inference.
- 3. The method of claim 2, wherein step 2) extracts the whole-gene UMI count matrix and the cell type annotation information from the Seurat object; constructs, in combination with a reference genome annotation file, a gene order file containing the chromosomal position and order of each gene; organizes the whole-gene UMI count matrix and the cell type annotation information into the expression matrix file and the cell annotation file required by the inferCNV algorithm; designates immune cells in advance as the normal reference cells in the cell type annotation; constructs an inferCNV object, excluding the sex chromosomes and the mitochondrial chromosome; runs inferCNV with preset parameters to perform low-expression gene filtering, hidden Markov model CNV state prediction and Leiden graph clustering; estimates the CNV level of each candidate cell on each chromosome segment; writes the inferCNV output into the Seurat object metadata; extracts the CNV index columns whose names start with pro_scaled_cnv_chr; calculates the average CNV level of each cell across chromosomes; determines a sample-specific CNV determination threshold on the basis of a global initial threshold and the average CNV level of the mesenchymal cell clusters; and determines cells whose average CNV level exceeds the CNV determination threshold to be malignant cells.
- 4. The method of claim 3, wherein in step 2) the global initial threshold is an average CNV level of 0.10; the sample-specific CNV determination threshold is adjusted by observing the distribution of average CNV levels of the mesenchymal cell clusters in the same sample; and, for a sample with paired paracancerous normal tissue, the inferCNV analysis procedure is repeated on the single-cell transcriptome data of the paracancerous normal tissue, and EpCAM-positive cell clusters whose average CNV level in that analysis is higher than the CNV determination threshold are regarded as false-positive malignant cell clusters and removed from the malignant cell clusters, yielding malignant epithelial cell clusters corrected by the paracancerous normal tissue.
- 5. The method for accurately identifying benign and malignant cells according to claim 2, wherein in step 3) differential expression analysis is performed on bulk tissue transcriptome data of tumor tissue and paracancerous normal tissue; on the premise that the adjusted P value is smaller than a preset threshold and the absolute value of the fold change exceeds a preset threshold, the differential genes are ranked by fold change from high to low; a malignant signature gene set is constructed by selecting a plurality of genes highly expressed in tumor tissue, and a non-malignant signature gene set is constructed by selecting a plurality of genes highly expressed in normal tissue; in the single-cell transcriptome data of step 1), a malignant score and a non-malignant score are calculated for each EpCAM-positive epithelial cell with the AddModuleScore function of Seurat, based on the malignant signature gene set and the non-malignant signature gene set respectively; a benign/malignant difference score is defined by subtracting the malignant score from the non-malignant score, and the difference score is normalized to the interval 0-1; based on the normalized difference score, the EpCAM-positive epithelial cells are classified into two groups with the unsupervised k-means clustering algorithm to obtain initial sets of malignant and non-malignant epithelial cells; taking the malignant and non-malignant epithelial cells obtained in the current round of classification as two groups, single-cell-level differential expression analysis is performed and the malignant and non-malignant signature gene sets are updated; the AddModuleScore calculation and the k-means classification are repeated; and after each round the consistency ratio between the classification results of the current round and the previous round is calculated, the iteration terminating when the consistency ratio reaches a preset threshold or the number of iteration rounds reaches a preset upper limit.
- 6. The method for accurately identifying benign and malignant cells according to claim 5, wherein the parameter settings of the differential expression analysis in step 3) comprise: performing differential expression analysis on the bulk tissue transcriptome data with DESeq2, using an adjusted P value of less than 0.001 after Benjamini-Hochberg correction and a fold change greater than 1 as the screening thresholds, and ranking the genes by fold change from high to low; selecting the top 50 genes highly expressed in tumor tissue to construct the malignant signature gene set, and the top 50 genes highly expressed in normal tissue to construct the non-malignant signature gene set; when the signature gene sets are iteratively updated, using single-cell-level differential expression analysis to select the top 50 genes up-regulated in malignant epithelial cells as the new round's malignant signature gene set and the top 50 genes up-regulated in non-malignant epithelial cells as the new round's non-malignant signature gene set; and, after each round, calculating the consistency ratio between the benign/malignant labels of the current round and those of the previous round, the iteration terminating when the consistency ratio reaches or exceeds 95% or the number of iteration rounds reaches 10.
- 7. The method for accurately identifying benign and malignant cells according to claim 2, wherein in step 4), for the target tumor sample, the Numbat pileup_and_phase workflow is run with the single-cell transcriptome alignment BAM file, the cell barcode file, a population SNP reference variant file and a genetic map file as inputs, so as to perform allele read extraction and haplotype phasing at heterozygous SNP sites genome-wide and generate an allele count file containing the reference-allele and variant-allele read counts of each cell; the run_numbat function of Numbat is called, with the UMI count matrix of the candidate cells to be determined in the tumor sample as the count_mat input and the UMI count matrix of the immune cells as the ref_internal input; key modeling parameters including t, gamma, min_cells, multi_allelic, min_LLR, max_entropy, call_clonal_loh and init_k are set; a hidden Markov model that jointly evaluates expression-level and allele-level signals is constructed to perform allele-specific copy number variation inference on the tumor sample, obtaining for each cell the expression-level tumor probability p_cnv_x, the allele-level tumor probability p_cnv_y and the joint posterior tumor probability p_cnv, which are written into the Seurat object metadata; and, on the basis of a set p_cnv threshold and taking the distributions of p_cnv_x and p_cnv_y into account, the EpCAM-positive epithelial cells are divided into malignant cells and non-malignant cells.
- 8. The method of claim 7, wherein the key modeling parameters in step 4) include: a parameter t, for balancing CNV segment resolution against noise robustness; a parameter gamma set to 20, for describing the overdispersion of allele counts; a parameter min_cells set to 50, for defining the minimum number of cells involved in pseudobulk modeling; a parameter multi_allelic set to TRUE, for allowing identification of multi-allelic copy number variation events; a parameter min_LLR set to 5, for filtering low-confidence CNV events according to the log-likelihood ratio; a parameter max_entropy initially set to 0.5, for suppressing CNV states that are overly complex or highly uncertain; a parameter call_clonal_loh set to TRUE, for enabling identification of clonal loss of heterozygosity; and a parameter init_k initially set to 3, for specifying the initial number of clones; the parameters are adjusted within preset ranges according to tumor stage, estimated copy number variation burden and clonal complexity, so as to balance the sensitivity and specificity of CNV detection; and the criteria for determining malignant cells include setting a threshold of 0.90 on the joint posterior tumor probability p_cnv, an EpCAM-positive epithelial cell being determined to be a malignant cell supported by the allele-specific copy number variation module when it satisfies p_cnv > 0.90 and shows a high tumor probability for both the expression-level tumor probability p_cnv_x and the allele-level tumor probability p_cnv_y.
- 9. The method of claim 1, wherein in step 5), for each EpCAM-positive epithelial cell, the inference results based on copy number variation in step 2), on the tumor-related transcriptional signature score in step 3) and on allele-specific copy number variation in step 4) are combined into a benign/malignant decision vector containing three pieces of evidence; when at least two pieces of evidence support the cell being malignant, the cell is finally determined to be a malignant cell, and otherwise the cell is determined to be a benign or non-malignant cell.
- 10. A computer program product or a computer readable storage medium, the computer program product comprising a computer program, the computer program being stored on the computer readable storage medium, characterized in that the computer program when executed by a processor implements the steps of the method for accurately identifying benign and malignant cells according to any one of claims 1 to 9.
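The per-cell decision rules recited in claims 4, 6, 8 and 9 can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the applicant's implementation: all function names and the input layout (one value per cell per evidence module) are hypothetical, and the cutoff used for "high" p_cnv_x and p_cnv_y is an assumption, since the claims give no number for it. The 0.10, 0.90, 95% and 10-round values are taken from the claims.

```python
"""Illustrative decision rules for claims 4, 6, 8 and 9 (hypothetical helper names)."""
from typing import Sequence

GLOBAL_CNV_THRESHOLD = 0.10  # claim 4: global initial threshold on the average CNV level
P_CNV_THRESHOLD = 0.90       # claim 8: cutoff on the joint posterior tumor probability
MARGINAL_THRESHOLD = 0.50    # ASSUMPTION: claims only say "high" for p_cnv_x and p_cnv_y
AGREEMENT_THRESHOLD = 0.95   # claim 6: stop when label agreement reaches 95%
MAX_ROUNDS = 10              # claim 6: or when 10 rounds have been completed


def cnv_evidence(mean_cnv: float, threshold: float = GLOBAL_CNV_THRESHOLD) -> bool:
    """Step 2 evidence: the cell's average CNV level exceeds the (sample-specific) threshold."""
    return mean_cnv > threshold


def allelic_evidence(p_cnv: float, p_cnv_x: float, p_cnv_y: float,
                     marginal: float = MARGINAL_THRESHOLD) -> bool:
    """Step 4 evidence: p_cnv > 0.90 and both marginal probabilities above an assumed cutoff."""
    return p_cnv > P_CNV_THRESHOLD and p_cnv_x > marginal and p_cnv_y > marginal


def agreement_ratio(previous: Sequence[bool], current: Sequence[bool]) -> float:
    """Fraction of cells whose malignant/non-malignant label is unchanged between rounds."""
    if len(previous) != len(current) or not previous:
        raise ValueError("label vectors must be non-empty and of equal length")
    unchanged = sum(p == c for p, c in zip(previous, current))
    return unchanged / len(previous)


def should_stop(previous: Sequence[bool], current: Sequence[bool], round_index: int) -> bool:
    """Step 3 termination: agreement reached, or the round limit hit (rounds counted from 1)."""
    return (agreement_ratio(previous, current) >= AGREEMENT_THRESHOLD
            or round_index >= MAX_ROUNDS)


def vote_malignant(cnv: bool, signature: bool, allelic: bool) -> bool:
    """Step 5 vote: malignant when at least two of the three pieces of evidence agree."""
    return int(cnv) + int(signature) + int(allelic) >= 2


if __name__ == "__main__":
    # One hypothetical cell: CNV and allelic modules support malignancy, signature does not.
    call = vote_malignant(
        cnv=cnv_evidence(0.14),
        signature=False,
        allelic=allelic_evidence(p_cnv=0.97, p_cnv_x=0.88, p_cnv_y=0.93),
    )
    print("malignant" if call else "benign or non-malignant")
```

Keeping each evidence module as a separate boolean mirrors the decision vector of claim 9, so a sample-specific CNV threshold or a different marginal cutoff can be swapped in without touching the vote itself.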
Description
Accurate identification method for benign and malignant cells based on multi-dimensional characteristics of single cell transcriptome
Technical Field
The invention relates to the technical fields of single-cell transcriptomics, bioinformatics, tumor molecular diagnostics and cell identification, and in particular to a method for accurately identifying benign and malignant cells from single-cell transcriptome data by integrating multi-dimensional characteristics such as alleles, copy number variation and gene expression.
Background
Tumor occurrence and progression are driven by multiple factors such as genetic mutation, genomic instability, abnormal epigenetic regulation and microenvironmental pressure, and tumor tissue is usually composed of multiple cell types, including tumor cells, reactive epithelial cells, mesenchymal cells and various immune cells. Different cell populations differ significantly in proliferative activity, degree of genomic instability, metabolic state, invasive and metastatic capacity, and so on, while tumor cells and the normal or reactive cells of their tissue of origin often form a continuum in morphology and molecular characteristics. In the early stages of tumorigenesis in particular, the genomic abnormality burden of malignant cells and the deviation of their transcriptional characteristics are relatively limited, and their expression patterns are highly similar to those of normal or reactive cells, which makes it harder to distinguish early malignant cells from benign cells using conventional indicators alone. Accurately distinguishing benign cells from malignant tumor cells against a complex tissue background is a key basis for tumor typing, efficacy evaluation and the formulation of targeted therapy strategies.
Traditional benign/malignant determination relies mainly on histopathological morphological observation and the detection of a limited number of immunohistochemical markers, combined with molecular test results such as certain driver gene mutations or copy number variations. This approach is limited by factors such as observer experience, differences in sampling site and the limited coverage of the markers, and it is prone to inconsistent judgments or inconclusive results for cell populations that have blurred boundaries, similar morphology or treatment-induced changes. Somatic mutation and copy number variation analysis based on whole tissue samples reflects the overall genomic abnormality burden of a tumor but cannot resolve the benign or malignant identity of different cell populations at the single-cell level. With the rapid development of single-cell transcriptome sequencing (single-cell RNA sequencing, scRNA-seq), gene expression profiles of the various cells in tumor tissue can be obtained at single-cell resolution, which makes it possible to identify malignant tumor cells directly at the expression level and to distinguish them from normal or reactive cells. Existing single-cell-transcriptome-based strategies for tumor cell identification mainly include scoring expression with known tumor-associated marker genes or gene sets, identifying cells with significant chromosomal instability according to inferred copy number variation (CNV) patterns, and, in some studies, attempting to infer mutations or allelic imbalance indirectly from single-cell transcriptome data.
In samples from early-stage tumorigenesis or with a low proportion of tumor cells, the copy number variation burden of malignant clones is usually low and the magnitude of change is limited; even when copy number variation is inferred at the single-cell transcriptome level, the signal is weak, and a small number of early malignant cells are difficult to identify accurately from copy number variation burden alone. Overall, the above methods still rely mainly on one or a few feature dimensions, and their stability and generalization ability remain limited in multi-patient, multi-sample, multi-platform data environments. On the one hand, strategies that rely on marker genes or expression signatures are easily affected by the functional state, degree of differentiation and microenvironmental stimulation of tumor cells: benign cells in a proliferative or stressed state may show similar expression patterns and be called false positives, while malignant cells with few genomic changes or in a resting state may be misjudged as benign, a problem that is more common in early lesions or mild atypical hyperplasia. On the other hand, analysis methods that infer copy number variation from transcriptome data are highly sensitive to sequencing depth, gene coverage and reference cell selection, and the inferred results are often not stable enough for tumor types with low copy number variation l