CN-121980393-A - Single-cell sequencing data timing sequence analysis method based on Kalman filtering

CN121980393ACN 121980393 ACN121980393 ACN 121980393ACN-121980393-A

Abstract

The invention discloses a single-cell sequencing data quasi-time sequence analysis method based on Kalman filtering, which comprises the steps of processing original single-cell sequencing data, carrying out quasi-time sequence initial sequencing and defining a state space model, constructing a state transition matrix, an observation matrix, a process noise covariance matrix and an observation noise covariance matrix, initializing the state covariance matrix, sequentially executing a prediction step and an updating step of Kalman filtering on each cell according to the quasi-time sequence initial sequence to obtain a corrected cell hiding state, adjusting pseudo-time sequencing based on the corrected hiding state, repeating iteration until the track converges, and reconstructing a cell quasi-time sequence track based on the converged cell hiding state and the optimized pseudo-time. According to the Kalman filtering-based single-cell sequencing data timing sequence analysis method, a state space model adapting to single-cell data characteristics is constructed, and Kalman filtering is utilized to realize stable and accurate reconstruction of a cell dynamic track.

Inventors

SUN XIAONAN
YU FENG
XIE XUBIN
WANG XUANXUAN
ZHOU WEIWEN
LI CHENPU

Assignees

浙江大学

Dates

Publication Date: 20260505
Application Date: 20260403

Claims (10)

1. A single-cell sequencing data timing sequence analysis method based on Kalman filtering is characterized by comprising the following steps of: performing quality control, standardization treatment, differential expression gene screening and dimension reduction treatment on the original single cell sequencing data to obtain a low-dimension single cell expression characteristic matrix; Performing quasi-time sequence initial sequencing based on the single cell expression feature matrix and defining a state space model, wherein the state space model comprises a hidden state, an observed value, a state transition equation, an observed equation, a process noise covariance matrix and an observed noise covariance matrix; constructing a state transition matrix, an observation matrix, a process noise covariance matrix and an observation noise covariance matrix; Selecting the expression characteristics of the track starting point cells as an initial hidden state according to the quasi-time sequence initial sequencing result, and initializing a state covariance matrix; sequentially executing a prediction step and an updating step of Kalman filtering on each cell according to the initial sequence of the pseudo-time sequence to obtain a corrected cell hiding state, adjusting pseudo-time sequencing based on the corrected hiding state, and repeating iteration until the track converges; reconstructing a cell timing trajectory based on the converged cell hiding state and the optimized pseudo-time.
2. The Kalman filter-based single cell sequencing data timing sequence analysis method of claim 1, wherein the method comprises the steps of, The quality control comprises filtering low quality cells with excessively high mitochondrial gene occupancy, excessively low gene detection number or excessively low sequencing depth; The normalization treatment comprises the steps of eliminating the influence of sequencing depth difference on gene expression values by adopting a Log conversion, median centering or SCTransform method, so as to ensure that the gene expression levels among different cells are comparable; Screening the differential expression genes among different cell populations by Wilcoxon rank sum test or MAST model; the dimension reduction treatment adopts principal component analysis to reduce the dimension of the expression matrix of the differential expression genes, determines the number K of the optimal principal components according to an elbow rule, and selects the first K principal components to form a low-dimension single cell expression characteristic matrix.
3. The Kalman filter-based single cell sequencing data timing sequence analysis method of claim 1, wherein the method comprises the steps of, The quasi-time sequence initial sequencing is realized by adopting a Slingshot method, specifically, a cell cluster is obtained by clustering the single cell expression feature matrix after pretreatment, a minimum spanning tree is constructed based on the connection relation among the cell clusters, and the shortest path length from each cell to a root node is calculated by taking the root node of the tree as a reference and is taken as initial pseudo time.
4. The Kalman filter-based single cell sequencing data timing sequence analysis method of claim 1, wherein the method comprises the steps of, Hiding states in the state space model t Defined as the true transcriptome status of the cell at pseudo-time t, t ∈R d D is the number of the first K principal components; Observations of t Defined as a cell expression feature vector corresponding to a pseudo time t in a low-dimensional single cell expression feature matrix, t ∈R d ; The state transition equation is t = A· t-1 + t Wherein A is d x d dimension state transition matrix, which represents state transition rule of cells from pseudo time t-1 to t, t For process noise, obeys a Gaussian distribution with an average value of 0 and a process noise covariance matrix of Q, i.e t ~N(0,Q); The observation equation is t = H· t + t Wherein H is d x d dimension observation matrix, representing mapping relation between hidden state and observation value, t For observation noise, obeying a Gaussian distribution with an average value of 0 and an observation noise covariance matrix of R, i.e t ~N(0,R)。
5. The Kalman filter-based single cell sequencing data timing sequence analysis method of claim 4, wherein the method comprises the steps of, The construction of the state transition matrix A is combined with biological priori knowledge, if a gene regulation network exists, the state transition matrix A is constructed based on the inter-gene regulation coefficients, and if the priori knowledge does not exist, the state transition matrix A is initialized to be an identity matrix, and then the identity matrix is regulated through iterative optimization; the construction of the observation matrix H is based on a dimension reduction method in a preprocessing stage, if the dimension reduction method of PCA is adopted, H is a load matrix of PCA, if other dimension reduction methods are adopted, H is initialized to be an identity matrix, and the optimal value is obtained through least square fitting.
6. The Kalman filter-based single cell sequencing data timing sequence analysis method of claim 4, wherein the method comprises the steps of, The initial value of the process noise covariance matrix Q is set based on the fluctuation range of the cell state, and 0.1 time of the unit matrix is taken; the initial value of the observed noise covariance matrix R is obtained by calculating the variance of gene expression values in the same cell population based on technical noise estimation of single cell sequencing data; The process noise covariance matrix Q and the observed noise covariance matrix R are in the form of full order matrixes, and the elements ij Characterizing process noise correlation of an ith feature to a jth feature, element ij And (3) representing the correlation of the observed noise of the ith feature and the jth feature, and dynamically updating the correlation by a maximum likelihood estimation method.
7. The Kalman filter-based single cell sequencing data timing sequence analysis method of claim 1, wherein the method comprises the steps of, The prediction step is based on the optimal hidden state estimation value of the previous moment t-1|t-1 And a state covariance matrix P t-1|t-1 for predicting a hidden state prior estimated value at the current moment t|t-1 = A· t-1|t-1 And a state covariance a priori estimate matrix P t|t-1 = A·P t-1|t-1 ·A T +q, where a T is a transpose of state transition matrix a.
8. The Kalman filter-based single cell sequencing data timing sequence analysis method of claim 7, wherein the method comprises the steps of, The updating step calculates the Kalman gain K t = P t|t-1 ·H T ·(H·P t|t-1 ·H T + R) -1 by using the observed value at the current moment t Correcting the predicted value to obtain an optimal hidden state estimated value t|t = t|t-1 + K t ·( t - H· t|t-1 ) And a state covariance best estimate matrix P t|t = (I - K t ·H)·P t|t-1 , where H T is the transpose of the observation matrix H, (·) -1 represents the matrix inversion and I is the identity matrix.
9. The Kalman filter-based single cell sequencing data timing sequence analysis method of claim 1, wherein the method comprises the steps of, The method for adjusting pseudo time sequence based on the corrected hidden state comprises the following steps of based on the optimal hidden state estimation values of all cells t|t Calculating Euclidean distance matrix among cells, and adjusting the pseudo time of each cell by adopting a dynamic time regulation method based on the initial sequencing of the pseudo time sequence, wherein convergence is judged to repeatedly execute the prediction step, the updating step and the pseudo time adjustment step until the average value of the pseudo time difference of two adjacent iterations is smaller than a preset threshold value.
10. The Kalman filter-based single cell sequencing data timing sequence analysis method of claim 1, wherein the method comprises the steps of, After the corrected cell hiding state is obtained, a backward smoothing step is executed based on a Rauch-tune-Striebel smoothing algorithm to correct the filtering state, the smoothed cell hiding state is obtained, pseudo-time ordering is adjusted based on the smoothed hiding state, iteration is repeated until the track converges, the backward smoothing step is used for calculating smoothing gain from the last cell backward iteration, correcting the current filtering state by utilizing information of future cells to obtain a more stable smoothing state result and a smoothing covariance matrix, a first dimension is extracted from a smoothing state vector to serve as a quasi-time sequence result and normalized to a [0,1] interval, uncertainty of each cell quasi-time sequence is calculated based on the smoothing covariance matrix, and square root of covariance matrix tracks is used as a quantization index.

Description

Single-cell sequencing data timing sequence analysis method based on Kalman filtering Technical Field The invention belongs to the technical field of intersection of bioinformatics and signal processing, and particularly relates to a single-cell sequencing data timing sequence analysis method based on Kalman filtering. Background The single-cell RNA sequencing technology can capture gene expression information under single-cell resolution, and provides a powerful tool for analyzing cell heterogeneity and revealing cell development and differentiation rules. The quasi-time sequence analysis (Pseudotime Analysis) is one of the core technologies of single cell sequencing data analysis, and the core aim is to infer continuous state transition tracks of cells along the development, differentiation and other processes from discrete single cell snapshot data so as to restore the dynamic change process of the cells. The existing quasi-timing analysis method mainly comprises a method based on dimension reduction clustering (such as Monocle series and Slingshot), a method based on a probability model (such as Wishbone and DPT) and the like, but the methods have a plurality of key defects in practical application. Firstly, the weak noise resistance is a common problem of the existing method, single-cell sequencing data has high noise characteristics generally, and the single-cell sequencing data comprises technical noise such as reverse transcription efficiency difference, sequencing depth fluctuation and biological noise such as random fluctuation of gene expression, and a large number of 'missing values' (Dropout events) at the same time, so that the existing method is difficult to effectively distinguish real gene expression signals from noise, and the reconstructed track has deviation. Secondly, the track topology has poor adaptability, most methods have strong assumptions on the topology structure of the track, such as linearity, tree shape and the like, are difficult to adapt to complex cell fate differentiation processes, such as multi-branch differentiation, cell state reversal and the like, and are sensitive to initial pseudo-time sequencing. Again, the existing methods rely on data-driven dimension reduction and clustering to integrate known biological prior knowledge, such as gene regulation networks, cell cycle rules, etc., with difficulty, resulting in possible violation of biological logic by the reconstructed trajectory. Finally, the suitability of large-scale data is poor, for millions of large-scale single-cell data sets, the existing method mostly adopts a batch processing mode, the calculation complexity is high, the memory occupation is large, and efficient track reconstruction is difficult to realize. Disclosure of Invention The invention provides a single-cell sequencing data timing sequence analysis method based on Kalman filtering, which at least solves one technical problem, and concretely adopts the following technical scheme: a single-cell sequencing data timing sequence analysis method based on Kalman filtering comprises the following steps: performing quality control, standardization treatment, differential expression gene screening and dimension reduction treatment on the original single cell sequencing data to obtain a low-dimension single cell expression characteristic matrix; Performing quasi-time sequence initial sequencing based on the single cell expression feature matrix and defining a state space model, wherein the state space model comprises a hidden state, an observed value, a state transition equation, an observed equation, a process noise covariance matrix and an observed noise covariance matrix; constructing a state transition matrix, an observation matrix, a process noise covariance matrix and an observation noise covariance matrix; Selecting the expression characteristics of the track starting point cells as an initial hidden state according to the quasi-time sequence initial sequencing result, and initializing a state covariance matrix; sequentially executing a prediction step and an updating step of Kalman filtering on each cell according to the initial sequence of the pseudo-time sequence to obtain a corrected cell hiding state, adjusting pseudo-time sequencing based on the corrected hiding state, and repeating iteration until the track converges; reconstructing a cell timing trajectory based on the converged cell hiding state and the optimized pseudo-time. Further, the quality control includes filtering low quality cells with excessively high mitochondrial gene duty cycle, excessively low gene detection number, or excessively low sequencing depth; The normalization treatment comprises the steps of eliminating the influence of sequencing depth difference on gene expression values by adopting a Log conversion, median centering or SCTransform method, so as to ensure that the gene expression levels among different cells are comparable; Screening the differential expression