CN-121545583-B - Integrated analysis method and system for multi-source single cell transcriptome data


Abstract

The invention relates to the technical field of data analysis, and in particular to an integrated analysis method and system for multi-source single-cell transcriptome data. The method comprises: converting all gene names in single-cell transcriptome data from different sources into a unified version and filtering out source-specific genes; constructing a specific background gene pool, identifying contaminating genes, and eliminating background RNA contamination; calculating the average sequencing depth of the cells of each sample and correcting the sequencing depth of each sample's cells; identifying cell types through cell grouping and characteristic gene expression analysis based on the single-cell transcriptome data from the different sources; and constructing anchor cells for each cell type and correcting the depth-balanced cell expression profiles of the same cell type to obtain an integrated cell expression profile. The method improves the accuracy of integrated analysis of single-cell transcriptome data and reduces the influence of multi-source batch effects on downstream analyses.

Inventors

SONG KANGLI
CHEN JIANXIANG
DONG HENG
CHEN ZHEMING

Assignees

Hangzhou Normal University (杭州师范大学)

Dates

Publication Date: 20260508
Application Date: 20260119

Claims (9)

1. A method for integrated analysis of multi-source single-cell transcriptome data, the method comprising: compiling all gene names in single-cell transcriptome data from different sources, converting each gene name into a unified version, marking genes expressed in only a single source as source-specific genes, and filtering them out to obtain an initial cell expression profile over the resulting gene set; constructing a specific background gene pool, identifying contaminating genes for the cells of each sample in the initial cell expression profile by means of an expression proportion threshold and an average expression quantity threshold, and eliminating background RNA contamination according to the identification result to obtain a purified cell expression profile; calculating the average sequencing depth of the cells of each sample based on the purified cell expression profile, and correcting the sequencing depth of each sample's cells to obtain a depth-balanced cell expression profile; identifying cell types through cell grouping and characteristic gene expression analysis based on the single-cell transcriptome data from the different sources, to obtain each cell type; and constructing anchor cells for each cell type, comprising: clustering cells of the same source and the same cell type using a k-means algorithm, randomly selecting seed cells from the clustered cells, determining the similar cells corresponding to each seed cell, and averaging the expression profiles of the similar cells to generate an anchor cell; and calculating, based on the anchor cells, the centroid of the depth-balanced cell expression profile and a source correction factor, and correcting the depth-balanced cell expression profile of the same cell type to obtain an integrated cell expression profile.
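The anchor-cell construction recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `build_anchors` and the parameters `seeds_per_cluster` and `n_similar` are hypothetical, the k-means cluster labels are assumed to have been computed beforehand for cells of one source and one cell type, and "similar cells" is assumed to mean the seed plus its nearest neighbours (Euclidean distance) within the same cluster, which the claim does not specify.

```python
import math
import random


def build_anchors(profiles, labels, seeds_per_cluster=1, n_similar=3, seed=0):
    """Build anchor cells for one source and one cell type.

    profiles: {cell_id: [expression values]} (same gene order for every cell)
    labels:   {cell_id: cluster label}, assumed to come from a prior k-means run
    """
    rng = random.Random(seed)

    # Group cell ids by their k-means cluster label.
    clusters = {}
    for cell_id, label in labels.items():
        clusters.setdefault(label, []).append(cell_id)

    anchors = []
    for label in sorted(clusters):
        ids = sorted(clusters[label])
        # Randomly draw seed cells from the cluster.
        for seed_id in rng.sample(ids, min(seeds_per_cluster, len(ids))):
            # Assumption: similar cells = the n_similar cells closest to the
            # seed (the seed itself included) within the same cluster.
            nearest = sorted(
                ids, key=lambda c: math.dist(profiles[c], profiles[seed_id])
            )[:n_similar]
            group = [profiles[c] for c in nearest]
            # The anchor is the gene-wise mean of the similar cells.
            anchors.append([sum(col) / len(group) for col in zip(*group)])
    return anchors
```

Averaging several similar cells into one anchor smooths out the sparsity of individual single-cell profiles, which is presumably why anchors rather than raw cells are used to estimate the later correction.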
2. The method of claim 1, wherein converting each gene name into a unified version, marking genes expressed in only a single source as source-specific genes, and filtering them out comprises: converting all gene names of the single-cell transcriptome data from different sources into a unified version based on the stable IDs and gene names of the NCBI and Ensembl reference genomes; and combining the gene expression quantities of all cells within each source, determining the expressed genes of each source, and marking genes expressed in only a single source as source-specific genes and deleting them.
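A minimal Python sketch of the name unification and source-specific gene filtering in claim 2 follows. The function `unify_and_filter` and the mapping `alias_to_symbol` are hypothetical; in practice the mapping would be derived from NCBI and Ensembl annotation, which is not shown here. Names missing from the mapping are assumed to be already unified.

```python
from collections import defaultdict


def unify_and_filter(sources, alias_to_symbol):
    """Unify gene names, then drop genes expressed in only one source.

    sources:         {source: {cell_id: {gene_name: count}}}
    alias_to_symbol: hypothetical lookup from any alias or stable ID to the
                     unified gene name
    """
    unified = {}
    expressed_in = defaultdict(set)  # unified gene -> sources expressing it

    for src, cells in sources.items():
        unified[src] = {}
        for cell_id, profile in cells.items():
            merged = defaultdict(int)
            for gene, count in profile.items():
                # Counts of aliases that map to the same gene are summed.
                merged[alias_to_symbol.get(gene, gene)] += count
            unified[src][cell_id] = dict(merged)
            for gene, count in merged.items():
                if count > 0:
                    expressed_in[gene].add(src)

    # Keep genes expressed in at least two sources; genes seen in exactly one
    # source are source-specific (never-expressed genes are dropped as well).
    keep = {g for g, srcs in expressed_in.items() if len(srcs) > 1}
    filtered = {
        src: {
            cid: {g: c for g, c in prof.items() if g in keep}
            for cid, prof in cells.items()
        }
        for src, cells in unified.items()
    }
    return filtered, sorted(keep)
```

For example, if one source labels a gene by its Ensembl stable ID and another by its symbol, both end up under the same unified name and the gene survives the filter, whereas a gene reported by only one source is removed.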
3. The method of claim 1, wherein constructing a specific background gene pool and identifying contaminating genes for the cells of each sample in the initial cell expression profile by means of an expression proportion threshold and an average expression quantity threshold comprises: constructing the specific background gene pool, performing unsupervised clustering on the cells of a single sample, and calculating the expression proportion and the average expression quantity of each gene of the specific background gene pool in each cell group; and determining as contaminating genes those genes whose expression proportion exceeds the expression proportion threshold and whose average expression quantity exceeds the average expression quantity threshold in each cell group.
4. The method of claim 3, wherein eliminating background RNA contamination according to the identification result to obtain a purified cell expression profile comprises: taking the minimum expression quantity of each contaminating gene in each cell group as the contamination expression quantity, and removing the contamination expression quantity; and performing the contaminating-gene determination and removal on all sample cells in the initial cell expression profile to obtain the purified cell expression profile.
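The identification and removal steps of claims 3 and 4 can be sketched in Python as below. This is an illustration under stated assumptions, not the patented procedure: the cell-group labels are taken as given (the unsupervised clustering is not shown), a gene is flagged only if it exceeds both thresholds in every group, and the "minimum expression quantity in each cell group" is read as the minimum over the cells of that group, subtracted from every cell of the group. All function and parameter names are hypothetical.

```python
def find_contaminating_genes(cells, groups, background_pool,
                             ratio_threshold, mean_threshold):
    """Flag background-pool genes that look like ambient RNA contamination.

    cells:  {cell_id: {gene: count}} for a single sample
    groups: {cell_id: group label} from a prior unsupervised clustering
    """
    members = {}
    for cell_id, label in groups.items():
        members.setdefault(label, []).append(cell_id)

    contaminating = []
    for gene in background_pool:
        flagged = True
        for ids in members.values():
            values = [cells[c].get(gene, 0) for c in ids]
            ratio = sum(v > 0 for v in values) / len(values)  # expressing cells
            mean = sum(values) / len(values)                  # average quantity
            # Assumption: both thresholds must be exceeded in every group.
            if ratio <= ratio_threshold or mean <= mean_threshold:
                flagged = False
                break
        if flagged:
            contaminating.append(gene)
    return contaminating, members


def remove_contamination(cells, members, contaminating):
    """Subtract, per group, the group-wise minimum of each contaminating gene."""
    cleaned = {cid: dict(prof) for cid, prof in cells.items()}
    for ids in members.values():
        for gene in contaminating:
            floor = min(cells[c].get(gene, 0) for c in ids)
            for c in ids:
                cleaned[c][gene] = cells[c].get(gene, 0) - floor
    return cleaned
```

A gene that is present in nearly every cell of every group, regardless of cell identity, is unlikely to reflect real biology, which is the rationale for requiring the thresholds to hold across all groups in this sketch.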
5. The method of claim 1, wherein calculating the average sequencing depth of the cells of each sample based on the purified cell expression profile comprises: determining, based on the purified cell expression profile, the total number of transcripts of all cells in the sample and the number of cells in the sample; and calculating the average sequencing depth of the cells of each sample from the total number of transcripts of all cells in the sample and the number of cells in the sample.
6. The method of claim 5, wherein correcting the sequencing depth of each sample's cells to obtain a depth-balanced cell expression profile comprises: comparing the average sequencing depths of the samples, selecting the minimum of the average sequencing depths as a reference, and calculating a maximum correction depth; and, if the average sequencing depth of a sample's cells is greater than the maximum correction depth, calculating the number of transcripts to be removed from that sample's cells and reducing the gene expression quantities accordingly, thereby completing the sequencing depth correction and obtaining the depth-balanced cell expression profile.
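Claims 5 and 6 can be illustrated with the following Python sketch. The claims do not state how the maximum correction depth is derived from the minimum average depth, so the multiplier `tolerance` is a hypothetical parameter, and removing transcripts by random downsampling without replacement is likewise an assumption. Counts are assumed to be non-negative integers.

```python
import random


def average_depth(cells):
    """Total transcript count of a sample divided by its number of cells."""
    return sum(sum(p.values()) for p in cells.values()) / len(cells)


def balance_depth(samples, tolerance=1.0, seed=0):
    """Downsample samples whose average depth exceeds the maximum depth.

    samples:   {sample: {cell_id: {gene: integer count}}}
    tolerance: hypothetical multiplier; max depth = tolerance * min avg depth
    """
    rng = random.Random(seed)
    depths = {s: average_depth(cells) for s, cells in samples.items()}
    max_depth = tolerance * min(depths.values())

    balanced = {}
    for s, cells in samples.items():
        if depths[s] <= max_depth:
            # Shallow enough already: copy the profiles unchanged.
            balanced[s] = {c: dict(p) for c, p in cells.items()}
            continue
        keep_fraction = max_depth / depths[s]
        balanced[s] = {}
        for c, profile in cells.items():
            # Expand counts into individual transcripts, then sample a subset.
            transcripts = [g for g, n in profile.items() for _ in range(n)]
            n_keep = round(len(transcripts) * keep_fraction)
            counts = {}
            for g in rng.sample(transcripts, n_keep):
                counts[g] = counts.get(g, 0) + 1
            balanced[s][c] = counts  # genes with no kept transcript are omitted
    return balanced, depths
```

With `tolerance=1.0`, every deeper sample is brought down to the depth of the shallowest one; a larger value would leave moderately deeper samples untouched.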
7. The method of claim 1, wherein calculating, based on the anchor cells, the centroid of the depth-balanced cell expression profile and the source correction factor, and correcting the depth-balanced cell expression profile of the same cell type comprises: calculating the centroid of the expression profile of the cell type based on all anchor cells of that cell type; calculating the difference between the expression profile of each anchor cell of each source and the centroid, calculating the average of these expression profile differences, and taking the average as the source correction factor; and completing the correction by subtracting the source correction factor from the expression profiles of the cells of that cell type.
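The correction in claim 7 amounts to a per-source mean shift toward a common centroid, which the following Python sketch illustrates for a single cell type. The function names are hypothetical, expression profiles are represented as equal-length lists, and the centroid is assumed to be the unweighted mean over the pooled anchors of all sources.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def source_correction_factors(anchors_by_source):
    """anchors_by_source: {source: [anchor expression vectors]}, one cell type."""
    all_anchors = [a for anchors in anchors_by_source.values() for a in anchors]
    centroid = mean_vector(all_anchors)
    factors = {}
    for src, anchors in anchors_by_source.items():
        # Difference of each anchor from the centroid, averaged per source.
        diffs = [[x - c for x, c in zip(a, centroid)] for a in anchors]
        factors[src] = mean_vector(diffs)
    return centroid, factors


def correct_cells(cells_by_source, factors):
    """Subtract each source's correction factor from its cells' profiles."""
    return {
        src: [[x - f for x, f in zip(cell, factors[src])] for cell in cells]
        for src, cells in cells_by_source.items()
    }
```

Note that the subtraction can yield negative values for lowly expressed genes; whether and how these are handled is not stated in the claim, so the sketch leaves them as they are.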
8. An integrated analysis system for multi-source single-cell transcriptome data, the system comprising: a name unification and gene filtering module, configured to compile all gene names in single-cell transcriptome data from different sources, convert each gene name into a unified version, mark genes expressed in only a single source as source-specific genes, and filter them out to obtain an initial cell expression profile over the resulting gene set; a contamination elimination module, configured to construct a specific background gene pool, identify contaminating genes for the cells of each sample in the initial cell expression profile by means of an expression proportion threshold and an average expression quantity threshold, and eliminate background RNA contamination according to the identification result to obtain a purified cell expression profile; a difference correction module, configured to calculate the average sequencing depth of the cells of each sample based on the purified cell expression profile, and correct the sequencing depth of each sample's cells to obtain a depth-balanced cell expression profile; a type identification module, configured to identify cell types through cell grouping and characteristic gene expression analysis based on the single-cell transcriptome data from the different sources, to obtain each cell type; and a cell expression profile correction and integration module, configured to construct anchor cells for each cell type, including clustering cells of the same source and the same cell type using a k-means algorithm, randomly selecting seed cells from the clustered cells, determining the similar cells corresponding to each seed cell, and averaging the expression profiles of the similar cells to generate an anchor cell, and further configured to calculate, based on the anchor cells, the centroid of the depth-balanced cell expression profile and a source correction factor, and correct the depth-balanced cell expression profile of the same cell type to obtain an integrated cell expression profile.
9. The integrated analysis system for multi-source single-cell transcriptome data according to claim 8, wherein the name unification and gene filtering module is further configured to: convert all gene names of the single-cell transcriptome data from different sources into a unified version based on the stable IDs and gene names of the NCBI and Ensembl reference genomes; and combine the gene expression quantities of all cells within each source, determine the expressed genes of each source, and mark genes expressed in only a single source as source-specific genes and delete them.

Description

Integrated analysis method and system for multi-source single cell transcriptome data

Technical Field

The invention relates to the technical field of data analysis, and in particular to an integrated analysis method and system for multi-source single-cell transcriptome data.

Background

Single-cell transcriptome sequencing (scRNA-seq) overcomes the inability of conventional bulk RNA sequencing (bulk RNA-seq) to resolve transcriptional heterogeneity between individual cells. It captures gene expression characteristics at single-cell resolution and provides strong data support for cell subpopulation classification, rare cell type identification, analysis of dynamic changes in cell state, and analysis of intercellular regulatory networks. Single-cell transcriptome sequencing is widely used in fields such as medicine, biology, and agriculture, covering research on human tissues and organs, model organisms, and crops. With its rapid adoption, research teams worldwide have accumulated massive amounts of multi-source single-cell transcriptome data. Integrated analysis of such multi-source data has considerable scientific value and practical significance.
On the one hand, many biological samples are difficult and costly to obtain; integrating sample data from different channels effectively lowers the research threshold and reduces the resource waste caused by repeated experiments. On the other hand, single-cell transcriptome sequencing experiments and data analysis are expensive; integrating multi-source data maximizes data utilization, helps uncover more comprehensive and reliable biological patterns behind the data, and makes cross-study, cross-species, and cross-platform comparative analysis possible. However, integrated analysis of multi-source single-cell transcriptome data faces a series of problems, including background RNA contamination, differences in reference genomes and gene annotations, non-uniform sequencing depth, and pronounced batch effects. First, experimental procedures such as single-cell isolation and library construction readily release large amounts of free RNA, producing high levels of background RNA contamination that distort gene expression signals and interfere with the identification of genuine biological features. Second, genome sequences differ between versions, and gene names, gene IDs, and gene function annotations may change or differ by alias, so that the same gene cannot be matched accurately across data from different sources. Third, the sequencing throughput of different platforms, the technical parameter settings of experimental batches, the RNA quality of samples, and other factors can cause significant differences in sequencing depth between data from different sources. Finally, different experimental batches, sequencing platforms, and sample processing conditions introduce batch effects caused by non-biological factors, so that the data contain technical variation unrelated to the research objective.
Consequently, conventional integrated analysis of multi-source single-cell transcriptome data often suffers from low accuracy and poor integration quality owing to background RNA contamination, differences in reference genomes and gene annotations, non-uniform sequencing depth, and pronounced batch effects.

Disclosure of Invention

In view of the above technical problems, an integrated analysis method and system for multi-source single-cell transcriptome data are provided, which can eliminate technical variation in multi-source data and improve the quality and accuracy of data integration.

A method for integrated analysis of multi-source single-cell transcriptome data, the method comprising: compiling all gene names in single-cell transcriptome data from different sources, converting each gene name into a unified version, marking genes expressed in only a single source as source-specific genes, and filtering them out to obtain an initial cell expression profile over the resulting gene set; constructing a specific background gene pool, identifying contaminating genes for the cells of each sample in the initial cell expression profile by means of an expression proportion threshold and an average expression quantity threshold, and eliminating background RNA contamination according to the identification result to obtain a purified cell expression profile; calculating the average sequencing depth of the cells of each sample based on the purified cell expression profile, and correcting the sequencing depth of each sample's cells to obtain a depth-balanced cell expressi