CN-121983133-A - Single-cell-level dynamic gene regulatory network inference method and system based on a graph neural network and contrastive learning

Abstract

The invention provides a single-cell-level dynamic gene regulatory network inference method based on a graph neural network and contrastive learning. It addresses the problem that existing single-omics methods cannot characterize cell heterogeneity and therefore struggle to reveal dynamic changes in gene regulatory relationships at single-cell resolution. The method first generates cell tokens and gene tokens through contrastive learning and graph representation learning pre-training. It then performs multimodal fusion of the cell tokens, the gene tokens and the imputed gene expression data to construct a context-specific representation of each gene in each cell. A graph neural network with a cross-attention mechanism models the cell-specific regulatory interactions between transcription factors and target genes, which yields dynamic gene regulatory network inference at the single-cell level. Finally, the model parameters are optimized by combining contrastive learning with expression reconstruction. The invention can accurately infer cell-state-dependent dynamic regulatory relationships and provides an important computational tool for developmental analysis and research.

Inventors

ZHENG RUIQING
LI MIN
SHI XINGYUAN
ZENG YANPING

Assignees

Central South University (中南大学)

Dates

Publication Date: 20260505
Application Date: 20251209

Claims (10)

1. A single-cell-level dynamic gene regulatory network inference method based on a graph neural network and contrastive learning, characterized by comprising the following steps: preprocessing raw scRNA-seq data to obtain a gene expression matrix; cell token pre-training: determining the neighborhood relation of the cells based on the gene expression matrix to construct positive samples, and pre-training the cell tokens; gene token pre-training: extracting the gene tokens with an inductive graph neural network guided by a prior gene regulatory network; gene expression matrix imputation: based on the preprocessed gene expression matrix, performing a weighted fusion of the gene expression of each cell with the average gene expression of that cell's positive-sample neighborhood to obtain an imputed gene expression matrix; feature fusion: performing multimodal fusion of the cell tokens, the gene tokens and the imputed gene expression data to generate a specific representation vector of each gene in each cell; inputting the cell-gene-specific representation vectors into a graph neural network model and calculating the regulatory relationships between transcription factors and target genes in the cell based on a multi-head cross-attention mechanism, so as to output the gene regulatory network of the cell; expression reconstruction: reconstructing the gene expression of the cell from the cell embedding vector through a feedforward network decoder; model training: training the parameters of the graph neural network model by jointly optimizing a contrastive learning loss function and a weighted reconstruction loss function; regulatory network inference: obtaining the specific representation vectors of all genes of a target cell, inputting them into the trained graph neural network, and inferring the gene regulatory network of the target cell.
2. The method of claim 1, wherein preprocessing the raw scRNA-seq data comprises removing cells in which the number of expressed genes is less than a preset value, removing genes expressed in fewer than a preset number of cells, normalizing the gene expression values of each cell to the same level, and screening for highly variable genes; and/or the neighborhood relation of the cells is determined using a K-nearest-neighbor algorithm; and/or the cell embedding representations are learned using a momentum contrast learning framework.
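The preprocessing and neighborhood steps of claim 2 can be illustrated with a minimal NumPy sketch. The function names `preprocess` and `knn_indices`, every threshold value, the choice of total-count scaling as the normalization, the use of plain variance as a stand-in for highly variable gene screening, and the Euclidean distance are all assumptions made for illustration; the claim does not fix any of them.

```python
import numpy as np

def preprocess(counts, min_genes=200, min_cells=3, target_sum=1e4, n_hvg=2000):
    """counts: (cells, genes) raw count matrix. All defaults are illustrative assumptions."""
    counts = np.asarray(counts, dtype=float)
    # Remove cells that express fewer than min_genes genes.
    counts = counts[(counts > 0).sum(axis=1) >= min_genes]
    # Remove genes expressed in fewer than min_cells cells.
    counts = counts[:, (counts > 0).sum(axis=0) >= min_cells]
    # Scale every cell to the same total expression (assumed normalization).
    totals = counts.sum(axis=1, keepdims=True)
    x = counts / np.maximum(totals, 1e-12) * target_sum
    # Keep the n_hvg genes with the largest variance (simple stand-in for HVG screening).
    keep = np.sort(np.argsort(x.var(axis=0))[::-1][:n_hvg])
    return x[:, keep]

def knn_indices(x, k=15):
    """Indices of the k nearest cells (squared Euclidean distance) for each cell, excluding itself."""
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    np.fill_diagonal(d2, np.inf)  # a cell is never its own neighbor
    return np.argsort(d2, axis=1)[:, :k]
```

The neighbor indices returned by `knn_indices` would play the role of the positive-sample neighborhood used in cell token pre-training and in the imputation step.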
3. The method of claim 1, wherein the inductive graph neural network model comprises two SAGEConv convolution layers, and a batch normalization layer, a PReLU activation function and a Dropout layer are disposed between the SAGEConv convolution layers; pre-training with the inductive graph neural network model guided by the prior gene regulatory network comprises: defining positive sample edges based on the edge index of the prior gene regulatory network; generating negative sample edges through a structured negative sampling method; calculating the dot product of the gene tokens at the two ends of each positive or negative sample edge and passing it through a sigmoid function to obtain the score of the corresponding edge; and training the inductive graph neural network model by maximizing the log-likelihood of the positive sample edges and minimizing that of the negative sample edges; wherein structured negative sampling means that, for each positive sample edge, the head node gene is fixed and a tail node gene is randomly sampled to construct a negative sample edge, the sampled tail node gene being one that has no connecting edge with the head node gene in the prior gene regulatory network.
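The structured negative sampling and edge scoring described in claim 3 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the gene tokens `z` are a random stand-in for the output of the SAGEConv encoder, the function names are hypothetical, and excluding the head gene itself from the candidate tails is an added assumption.

```python
import numpy as np

def structured_negative_edges(edges, n_genes, rng):
    """For each positive edge (head, tail), keep the head and sample a tail gene
    that has no edge with that head in the prior network."""
    neighbors = {}
    for h, t in edges:
        neighbors.setdefault(h, set()).add(t)
    negatives = []
    for h, _ in edges:
        # Assumption: the head itself is also excluded (no self-loops).
        candidates = [g for g in range(n_genes) if g != h and g not in neighbors[h]]
        if not candidates:
            continue  # head is connected to every other gene; nothing to sample
        negatives.append((h, int(rng.choice(candidates))))
    return negatives

def link_loss(z, pos_edges, neg_edges, eps=1e-12):
    """Negative log-likelihood with edge score = sigmoid(dot product of the two gene tokens)."""
    def scores(edges):
        e = np.asarray(edges)
        return 1.0 / (1.0 + np.exp(-(z[e[:, 0]] * z[e[:, 1]]).sum(axis=1)))
    pos = -np.log(scores(pos_edges) + eps).mean()        # push positive scores toward 1
    neg = -np.log(1.0 - scores(neg_edges) + eps).mean()  # push negative scores toward 0
    return pos + neg
```

Minimizing `link_loss` with respect to the encoder parameters corresponds to maximizing the log-likelihood of the positive edges while lowering the scores of the sampled negative edges.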
4. The method of claim 1, wherein the feature fusion generates the specific representation vector by a formula [formula not reproduced in the source text] whose terms are: the cell-gene-specific representation vector of the j-th gene in the i-th cell, whose size is the hidden dimension; the cell token of the i-th cell; the gene token of the j-th gene; and the imputed gene expression value of the j-th gene in the i-th cell.
5. The method according to claim 1, wherein calculating the regulatory relationships between transcription factors and target genes in the cell based on a multi-head cross-attention mechanism comprises: applying nonlinear transformations to the specific representation vectors of the transcription factors and the specific representation vectors of the target genes through different multi-layer perceptrons to obtain a query matrix, a key matrix and a value matrix [formulas not reproduced in the source text], wherein two different multi-layer perceptrons are used; the query matrix, the key matrix and the value matrix are each generated with their own trainable weight matrix and bias vector; the inputs are the fused feature vectors of the transcription factors in the n cells and the fused feature vectors of the target genes in the n cells; the remaining symbols denote the dimension of the fused feature vector and the number of attention heads; and the outputs of the two perceptrons are the transcription factor representation and the target gene representation, respectively; constructing a mask matrix from the prior gene regulatory network to filter out non-existent regulatory relationships in the attention calculation; calculating, based on the query matrix, the key matrix and the mask matrix, the attention score between each transcription factor and each target gene through the multi-head cross-attention mechanism [formula not reproduced in the source text], wherein the formula takes the representations of a transcription factor and a target gene in a given attention head of a given cell and yields their attention score, namely their correlation; and first normalizing the calculated attention scores and then averaging them over the head dimension to obtain an adjacency matrix representing the strength of the regulatory relationships, namely the gene regulatory network of the given cell [formulas not reproduced in the source text], each entry of which represents the regulatory relationship between a transcription factor and a target gene in the gene regulatory network of that cell.
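Because the formulas of claim 5 are not legible in this text, the following NumPy sketch only illustrates the general technique of masked multi-head cross-attention for a single cell. Several choices are assumptions rather than the claimed formulas: single linear projections `w_q` and `w_k` stand in for the multi-layer perceptrons, target genes supply the queries and transcription factors the keys (chosen to agree with the weighted sum over transcription factors in claim 6), scores use the scaled dot product, and the normalization is a softmax over transcription factors. The function name `cell_grn` is hypothetical.

```python
import numpy as np

def cell_grn(h_tf, h_tg, w_q, w_k, mask, n_heads):
    """h_tf: (p, d) TF vectors of one cell; h_tg: (q, d) target-gene vectors;
    mask: (q, p) boolean prior network; returns a (q, p) regulation-strength matrix."""
    d_head = w_q.shape[1] // n_heads
    # Project and split into heads: (heads, genes, d_head).
    q = (h_tg @ w_q).reshape(-1, n_heads, d_head).transpose(1, 0, 2)
    k = (h_tf @ w_k).reshape(-1, n_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, q, p)
    # Edges absent from the prior network are excluded from the attention.
    scores = np.where(mask[None], scores, -1e9)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores) * mask[None]
    weights /= np.maximum(weights.sum(axis=-1, keepdims=True), 1e-12)
    # Average over heads to get one adjacency matrix for this cell.
    return weights.mean(axis=0)
```

In this sketch a target gene with no candidate regulator in the prior network receives an all-zero row instead of undefined values, which is a design choice of the example and not something stated in the claim.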
6. The method according to claim 1, wherein generating the cell embedding vector specifically comprises: computing a weighted sum of the transcription factor representations using the attention scores of the gene regulatory network of the cell to obtain a context-aware representation of each target gene [formula not reproduced in the source text], whose terms are the attention scores, the transcription factor representations, and the resulting context-aware representation of the target gene; applying a nonlinear transformation to the context-aware representation through a first multi-layer perceptron to obtain a transformed target gene representation; applying a residual connection between the transformed target gene representation and the initial transformed features of the target gene to obtain an enhanced target gene representation; applying layer normalization to the enhanced target gene representation and then extracting features through a second multi-layer perceptron; and summing the outputs of the second multi-layer perceptron for all target genes along the gene dimension to generate a cell embedding vector that characterizes the overall state of the cell.
7. The method according to claim 1, wherein the model training step comprises: constructing a contrastive learning loss function that, based on the cell embedding vectors, pulls positive samples closer together and pushes negative samples apart; constructing a weighted reconstruction loss function that assigns a higher weight to the reconstruction error of non-zero values in the gene expression matrix; and jointly optimizing the contrastive learning loss and the weighted reconstruction loss to train the parameters of the graph neural network and the decoder, while keeping the parameters of the pre-trained cell tokens and gene tokens fixed.
8. The method of claim 7, wherein the weighted reconstruction loss function is given by a formula [formula not reproduced in the source text] whose terms are: the weighted reconstruction loss; the mean square error operator; the part of the preprocessed gene expression matrix whose values are 0 and the part whose values are non-zero; and the predicted values of the graph neural network model at the positions corresponding to the zero-valued part and to the non-zero part, respectively.
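One way to realize a reconstruction loss of the kind described in claims 7 and 8 is sketched below. The additive form and the value of `nonzero_weight` are assumptions, since the claimed formula and its weights are not legible in this text; the only property taken from the claims is that errors on non-zero entries count more than errors on zero entries.

```python
import numpy as np

def weighted_reconstruction_loss(x, x_hat, nonzero_weight=2.0):
    """MSE on zero entries plus a more heavily weighted MSE on non-zero entries.
    x: preprocessed expression matrix; x_hat: model predictions of the same shape."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    zero = x == 0
    # Mean square error over the zero-valued part of the matrix.
    mse_zero = ((x_hat[zero] - x[zero]) ** 2).mean() if zero.any() else 0.0
    # Mean square error over the non-zero part of the matrix.
    mse_nonzero = ((x_hat[~zero] - x[~zero]) ** 2).mean() if (~zero).any() else 0.0
    return mse_zero + nonzero_weight * mse_nonzero
```

Separating the two parts keeps the many zero entries of sparse scRNA-seq data from dominating the error signal of the expressed genes.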
9. The method of claim 7, wherein the contrastive learning loss function is given by a formula [formula not reproduced in the source text] whose terms are: the contrastive learning loss; the cell embedding vector of a single target cell; the cell embedding vector of a single positive sample that forms a positive sample pair with the target cell; the set of negative-sample cell embedding vectors in the contrastive learning dynamic queue; the number of negative samples in that queue; and the temperature coefficient, which is a hyperparameter greater than zero.
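The terms listed in claim 9 (one query cell, one positive sample, a queue of negatives and a temperature) match the InfoNCE loss commonly used with momentum contrast. The sketch below assumes that form; the claimed formula itself is not legible in this text, and the L2 normalization of the embeddings, the default temperature and the function name `info_nce` are additional assumptions.

```python
import numpy as np

def info_nce(q, k_pos, queue, temperature=0.07):
    """q: (d,) embedding of the target cell; k_pos: (d,) embedding of its positive sample;
    queue: (K, d) embeddings of the negative samples; temperature must be > 0."""
    # Assumption: embeddings are compared by cosine similarity.
    q = q / np.linalg.norm(q)
    k_pos = k_pos / np.linalg.norm(k_pos)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    # Logit 0 is the positive pair; the remaining K logits are the negatives.
    logits = np.concatenate(([q @ k_pos], queue @ q)) / temperature
    logits -= logits.max()  # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))
```

The loss decreases as the target cell moves closer to its positive sample and further from the queued negatives, which is the pulling and pushing behavior described in claim 7.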
10. A single-cell level dynamic gene regulation network inference system based on a graph neural network and contrast learning, comprising a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to implement the method of any one of claims 1-9.

Description

Single-cell-level dynamic gene regulatory network inference method and system based on a graph neural network and contrastive learning

Technical Field

The invention belongs to the field of bioinformatics and computational biology, and relates to a single-cell dynamic gene regulatory network inference method and system based on a graph neural network and contrastive learning.

Background

Gene regulatory networks (GRNs) are central to understanding the mechanisms of cellular life activities and function. Traditional single-omics GRN inference methods are mostly based on bulk cell data; they yield only an averaged regulatory relationship and cannot reveal heterogeneity at the single-cell level. Single-cell RNA sequencing (scRNA-seq) is an important tool in current genomics research: it analyzes the transcriptomes of individual cells through high-throughput sequencing and reveals differences in gene expression across developmental, differentiation and pathological states. However, accurate inference of dynamic gene regulatory networks at single-cell resolution still faces major challenges: cell heterogeneity makes dynamic changes in regulatory relationships difficult to capture; single-cell data are high-dimensional, sparse and noisy, which seriously affects the accuracy of GRN inference; and a computational model that captures cell heterogeneity and dynamic regulatory relationships at the same time is lacking.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a method and a system capable of accurately inferring a single-cell dynamic gene regulatory network.
In order to achieve the above purpose, the invention adopts the following technical scheme: a single-cell-level dynamic gene regulatory network inference method based on a graph neural network and contrastive learning, comprising the following steps: preprocessing raw scRNA-seq data to obtain a gene expression matrix; cell token pre-training: determining the neighborhood relation of the cells based on the gene expression matrix to construct positive samples, and pre-training the cell tokens; gene token pre-training: extracting the gene tokens with an inductive graph neural network model guided by a prior gene regulatory network; gene expression matrix imputation: based on the preprocessed gene expression matrix, performing a weighted fusion of the gene expression of each cell with the average gene expression of that cell's positive-sample neighborhood to obtain an imputed gene expression matrix; feature fusion: performing multimodal fusion of the cell tokens, the gene tokens and the imputed gene expression data to generate a specific representation vector of each gene in each cell; inputting the cell-gene-specific representation vectors into a graph neural network model and calculating the regulatory relationships between transcription factors and target genes in the cell based on a multi-head cross-attention mechanism, so as to output the gene regulatory network of the cell; expression reconstruction: reconstructing the gene expression of the cell from the cell embedding vector through a feedforward network decoder; model training: training the parameters of the graph neural network model by jointly optimizing a contrastive learning loss function and a weighted reconstruction loss function; regulatory network inference: obtaining the specific representation vectors of all genes of a target cell, inputting them into the trained graph neural network, and inferring the gene regulatory network of the target cell.
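The imputation step in the scheme above, a weighted fusion of each cell's expression with the mean expression of its positive-sample neighborhood, can be sketched in a few lines. The mixing weight `alpha`, its default value and the function name `impute` are assumptions for illustration; the text does not state how the two terms are weighted.

```python
import numpy as np

def impute(x, neighbors, alpha=0.5):
    """x: (cells, genes) preprocessed expression matrix;
    neighbors: (cells, k) indices of each cell's positive-sample neighbors;
    alpha: assumed weight on the cell's own expression, between 0 and 1."""
    # Mean expression over each cell's neighborhood: (cells, k, genes) -> (cells, genes).
    neighbor_mean = x[neighbors].mean(axis=1)
    # Weighted fusion of own expression and neighborhood average.
    return alpha * x + (1.0 - alpha) * neighbor_mean
```

With `alpha = 1` the matrix is left unchanged, and smaller values borrow more signal from neighboring cells to fill in zero entries caused by sparsity.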
Further, preprocessing the raw scRNA-seq data comprises removing cells in which the number of expressed genes is less than a preset value, removing genes expressed in fewer than a preset number of cells, normalizing the gene expression values of each cell to the same level, and screening for highly variable genes; and/or the neighborhood relation of the cells is determined using a K-nearest-neighbor algorithm; and/or the cell embedding representations are learned using a momentum contrast learning framework. Further, the inductive graph neural network model comprises two SAGEConv convolution layers, wherein a batch normalization layer, a PReLU activation function and a Dropout layer are arranged between the SAGEConv convolution layers, and the output of the last SAGEConv convolution layer is used as the gene token; pre-training with the inductive graph neural network model guided by the prior gene regulatory network comprises: defining positive sample edges based on the edge index of the prior gene regulatory network; generating negative sample edges through a structured negative sampling method; calculating the dot product of the gene tokens at the two ends of each positive or negative sample edge and passing it through a sigmoid function to obtain the score of the corresponding edge; and training the inductive graph neural network model through