CN-121999887-A - Cell type annotation method based on multi-feature contrast learning

Abstract

The invention relates to the technical field of cell type annotation, and in particular to a cell type annotation method based on multi-feature contrast learning. The method comprises: obtaining a gene expression matrix, extracting gene names, extracting gene base sequences from a transcriptome sequence database based on the gene names, and matching the gene base sequences against motif sequences to obtain a gene motif matching matrix; multiplying the gene expression matrix by the gene motif matching matrix to obtain a cell motif matrix; preprocessing the gene expression matrix and the cell motif matrix, and constructing a data set from the preprocessed gene expression matrix and the preprocessed cell motif matrix; training a dual-feature coding model using a training set and a verification set, wherein the dual-feature coding model comprises a gene expression encoder, a motif feature encoder and a classifier; and inputting a test set into the trained dual-feature coding model to obtain annotated cell types. The invention makes the annotation results more biologically meaningful and facilitates downstream functional analysis.

Inventors

JIANG LU
LI YUANJIE
LU HAI

Assignees

Dalian Maritime University (大连海事大学)

Dates

Publication Date: 2026-05-08
Application Date: 2026-01-07

Claims (8)

1. A cell type annotation method based on multi-feature contrast learning, comprising the steps of: obtaining a gene expression matrix, extracting a gene name, extracting a gene base sequence from a transcriptome sequence database based on the gene name, and matching the gene base sequence with a motif sequence to obtain a gene motif matching matrix; multiplying the gene expression matrix by the gene motif matching matrix to obtain a cell motif matrix; preprocessing the gene expression matrix and the cell motif matrix, constructing a data set based on the preprocessed gene expression matrix and the preprocessed cell motif matrix, and dividing the data set into a training set, a test set and a verification set; training a dual-feature coding model using the training set and the verification set, wherein the dual-feature coding model comprises a gene expression encoder, a motif feature encoder and a classifier; and inputting the test set into the trained dual-feature coding model to obtain annotated cell types.
2. The cell type annotation method based on multi-feature contrast learning according to claim 1, wherein matching the gene base sequence with the motif sequence to obtain the gene motif matching matrix comprises: inputting the gene base sequence and a motif database file into a sequence matching tool for matching analysis to obtain a matching result file; and counting the number of occurrences of each motif in the base sequence of each gene, and constructing the gene motif matching matrix based on those counts.
3. The cell type annotation method based on multi-feature contrast learning according to claim 1, wherein the preprocessing of the gene expression matrix comprises: filtering out cells in which mitochondrial gene expression exceeds a first preset threshold of the cell's total expression, filtering out cells in which the number of detected genes is lower than a second preset threshold, and filtering out genes expressed in fewer than a third preset number of cells, to obtain a filtered gene expression matrix; and normalizing the filtered gene expression matrix, and selecting the highly variable genes in the normalized gene expression matrix as the preprocessed gene expression matrix.
4. The method of claim 3, wherein the first preset threshold is 5%, the second preset threshold is 200, and the third preset number is 3.
5. The cell type annotation method based on multi-feature contrast learning according to claim 1, wherein the preprocessing of the cell motif matrix comprises: screening the cell motif matrix so that the cell motif matrix and the gene expression matrix comprise the same set of cells; and normalizing the screened cell motif matrix and scaling its values to the interval [0,1] to obtain the preprocessed cell motif matrix.
6. The cell type annotation method based on multi-feature contrast learning according to claim 1, wherein the workflow of the dual-feature coding model comprises: inputting the preprocessed gene expression matrix into the gene expression encoder to obtain gene features; inputting the preprocessed cell motif matrix into the motif feature encoder to obtain motif features; and concatenating the gene features and the motif features, and inputting the concatenated features into the classifier to obtain a cell type prediction result.
7. The cell type annotation method based on multi-feature contrast learning according to claim 1, wherein the training process of the dual-feature coding model is optimized using a mixed loss function calculated as: $\mathcal{L} = \lambda_{con}\,\mathcal{L}_{con} + \lambda_{align}\,\mathcal{L}_{align} + \lambda_{cls}\,\mathcal{L}_{cls}$, wherein $\mathcal{L}$ is the mixed loss function, $\mathcal{L}_{con}$ is the contrast loss, $\lambda_{con}$ is the weight corresponding to the contrast loss, $\mathcal{L}_{align}$ is the alignment loss, $\lambda_{align}$ is the weight corresponding to the alignment loss, $\mathcal{L}_{cls}$ is the classification loss, and $\lambda_{cls}$ is the weight corresponding to the classification loss.
8. The cell type annotation method based on multi-feature contrast learning according to claim 7, wherein the contrast loss is calculated over sample pairs and is used to pull samples of the same type together and push samples of different types apart, the terms of its calculation formula being: the number of positive sample pairs; an indicator function whose value is 1 when samples i and j belong to the same cell type and 0 otherwise; the Euclidean distance between sample i and sample j in the embedding space; the number of negative sample pairs; and a margin parameter defining the minimum distance to be maintained between the samples of a negative pair; the alignment loss is used to minimize the difference between the gene embedding and the motif embedding of the same cell, the terms of its calculation formula being: the gene embedding of the i-th sample; the mean gene embedding of all samples of the cell type to which the i-th sample belongs; the motif embedding of the i-th sample; the mean motif embedding of all samples of the cell type to which the i-th sample belongs; and the total number of samples; and the classification loss is a cross-entropy loss calculated from the true cell type labels, the terms of its calculation formula being: the one-hot encoded true label of cell i, whose value is 1 only if the cell belongs to type k; and the probability, predicted by the model, that cell i belongs to type k.
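The loss terms of claims 7 and 8 can be illustrated with a minimal pure-Python sketch. The calculation formulas themselves are not reproduced in this text, so the specific forms below are assumptions chosen only to be consistent with the listed terms: a pairwise contrastive loss with squared distances and a squared hinge on the margin, an alignment loss that compares the class-centred gene and motif embeddings of each cell (which assumes both embeddings have the same dimension), and a mean cross-entropy. All function names, the default margin, and the default weights are hypothetical and are not taken from the patent.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrast_loss(emb, labels, margin=1.0):
    """Assumed form: mean squared distance over positive pairs plus
    mean squared hinge max(0, margin - d) over negative pairs."""
    pos_sum, neg_sum, n_pos, n_neg = 0.0, 0.0, 0, 0
    for i in range(len(emb)):
        for j in range(i + 1, len(emb)):
            d = euclidean(emb[i], emb[j])
            if labels[i] == labels[j]:      # indicator function equals 1
                pos_sum += d ** 2
                n_pos += 1
            else:                           # indicator function equals 0
                neg_sum += max(0.0, margin - d) ** 2
                n_neg += 1
    # Guard against batches with no positive or no negative pairs.
    return pos_sum / max(n_pos, 1) + neg_sum / max(n_neg, 1)


def class_means(emb, labels):
    """Mean embedding of all samples sharing a cell type."""
    sums, counts = {}, {}
    for vec, c in zip(emb, labels):
        acc = sums.setdefault(c, [0.0] * len(vec))
        for k, x in enumerate(vec):
            acc[k] += x
        counts[c] = counts.get(c, 0) + 1
    return {c: [x / counts[c] for x in s] for c, s in sums.items()}


def alignment_loss(gene_emb, motif_emb, labels):
    """Assumed form: mean squared difference between the class-centred
    gene embedding and the class-centred motif embedding of each cell."""
    g_mean = class_means(gene_emb, labels)
    m_mean = class_means(motif_emb, labels)
    total = 0.0
    for g, m, c in zip(gene_emb, motif_emb, labels):
        total += sum(((gk - gm) - (mk - mm)) ** 2
                     for gk, gm, mk, mm in zip(g, g_mean[c], m, m_mean[c]))
    return total / len(labels)


def classification_loss(probs, labels):
    """Mean cross-entropy; labels are integer cell type indices,
    equivalent to one-hot true labels."""
    return -sum(math.log(p[c]) for p, c in zip(probs, labels)) / len(labels)


def mixed_loss(l_con, l_align, l_cls, w_con=1.0, w_align=1.0, w_cls=1.0):
    """Weighted sum of the three losses, as in claim 7."""
    return w_con * l_con + w_align * l_align + w_cls * l_cls
```

In a training loop these quantities would normally be computed on tensors by an automatic differentiation framework; plain lists are used here only to keep the sketch self-contained.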

Description

Cell type annotation method based on multi-feature contrast learning

Technical Field

The invention relates to the technical field of cell type annotation, and in particular to a cell type annotation method based on multi-feature contrast learning.

Background

The rapid development of single-cell transcriptome sequencing (scRNA-seq) technology provides an important tool for resolving cellular heterogeneity in complex tissues. Cell type annotation is a key step in scRNA-seq data analysis; in essence, it classifies cells into known biological types or states by quantifying intracellular gene expression patterns. The process relies on how well gene expression signatures match an established cell type reference database, and its accuracy directly affects the reliability of downstream analysis results. In recent years, with the progress of epigenomic technology, single-cell multi-omics data such as single-cell chromatin accessibility sequencing (scATAC-seq) have provided a new dimension for identifying cell identity: motif features in the DNA sequence serve as functional markers of transcription factor binding sites and carry key information about gene regulation.

Existing cell type annotation methods are mainly based on the gene expression features of single-omics data (such as scRNA-seq) and struggle to capture the full complexity of cell identity. Although methods such as scJoint attempt to integrate transcriptomic and epigenomic data, their technical route is limited to converting the scATAC-seq peak matrix into a gene activity score matrix and aligning the omics layers through a shared latent space. While this approach improves the recognition rate of rare cell types, it cannot effectively use the regulatory information carried by motif features.
Because the conservation of motifs in DNA sequences directly reflects the binding potential of transcription factors, the neglect of motifs in the prior art loses key regulatory-layer information when judging cell state; deviations are especially likely when annotating cell subpopulations that differ markedly in transcription factor activity.

Disclosure of Invention

Existing methods do not effectively integrate gene expression features with DNA sequence motif regulatory features, which leaves cell type annotation information incomplete and accuracy insufficient in fine-typing scenarios that depend on transcription factor activity. To address this technical problem, a cell type annotation method based on multi-feature contrast learning is provided. The invention uses sequence matching to analyze single-cell gene expression data, constructs a cell motif matrix reflecting the binding potential of transcription factors, and designs a dual-stream contrast learning model comprising a gene expression encoder and a motif feature encoder. It thereby makes full use of the complementary information between gene expression level features and the features of the regulatory sequences upstream of genes, aligns and enhances cell representations in a unified low-dimensional embedding space, and markedly improves the accuracy and robustness of cell type annotation.
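The construction of the cell motif matrix described above can be sketched as follows. This is a minimal illustration under stated assumptions: motifs are treated as exact strings and counted with overlapping matches, whereas a real sequence matching tool would typically score position weight matrices; the gene expression matrix is assumed to be laid out as cells by genes; and the helper names `count_motif`, `gene_motif_matrix`, and `cell_motif_matrix` are hypothetical.

```python
def count_motif(sequence, motif):
    """Count occurrences of an exact motif string in a sequence,
    allowing overlapping matches (a simplifying assumption)."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)
    return count


def gene_motif_matrix(gene_seqs, motifs):
    """Rows are genes, columns are motifs, entries are occurrence counts."""
    return [[count_motif(seq, m) for m in motifs] for seq in gene_seqs]


def cell_motif_matrix(expression, gene_motif):
    """Multiply a cells-by-genes expression matrix by a genes-by-motifs
    matching matrix to obtain a cells-by-motifs matrix."""
    n_genes = len(gene_motif)
    n_motifs = len(gene_motif[0])
    return [
        [sum(row[g] * gene_motif[g][m] for g in range(n_genes))
         for m in range(n_motifs)]
        for row in expression
    ]


# Toy data: two genes, two motifs, two cells (all values are made up).
gene_seqs = ["ACGTACGT", "TTTT"]
motifs = ["ACG", "TT"]
expression = [[1, 2],   # cell 0: expression of gene 0 and gene 1
              [0, 4]]   # cell 1

gm = gene_motif_matrix(gene_seqs, motifs)   # [[2, 0], [0, 3]]
cm = cell_motif_matrix(expression, gm)      # [[2, 6], [0, 12]]
```

Each entry of the resulting matrix weights a motif's occurrence counts by how strongly the genes carrying that motif are expressed in the cell, which is how the product comes to reflect motif activity per cell.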
The invention adopts the following technical means:

A cell type annotation method based on multi-feature contrast learning, comprising the steps of: obtaining a gene expression matrix, extracting a gene name, extracting a gene base sequence from a transcriptome sequence database based on the gene name, and matching the gene base sequence with a motif sequence to obtain a gene motif matching matrix; multiplying the gene expression matrix by the gene motif matching matrix to obtain a cell motif matrix; preprocessing the gene expression matrix and the cell motif matrix, constructing a data set based on the preprocessed gene expression matrix and the preprocessed cell motif matrix, and dividing the data set into a training set, a test set and a verification set; training a dual-feature coding model using the training set and the verification set, wherein the dual-feature coding model comprises a gene expression encoder, a motif feature encoder and a classifier; and inputting the test set into the trained dual-feature coding model to obtain annotated cell types.

Further, matching the gene base sequence with the motif sequence to obtain the gene motif matching matrix comprises: inputting the gene base sequence and a motif database file into a sequence matching tool for matching analysis to obtain a matching result file; and counting the number of occurrences of each motif in the base sequence of each gene, and constructing the gene motif matching matrix based on those counts.

Further, the preprocessing of the gene expression matrix comprises the following steps: filtering cells with mitochondrial gene expression levels in the gene