CN-122020104-A - Single-cell trajectory inference method based on adaptive feature selection
Abstract
The invention belongs to the field of bioinformatics and relates to a single-cell trajectory inference method based on adaptive feature selection. First, an initial gene expression matrix is obtained through data preprocessing and highly variable gene screening. Second, a two-dimensional evaluation strategy calculates, for each gene, a highly variable gene score reflecting expression variability and a trajectory importance score reflecting its relation to the differentiation trajectory. A dynamic weight fusion mechanism then adaptively adjusts the fusion weights of the two scores based on performance feedback and highlights key genes through nonlinear enhancement. Next, an intelligent inflection point detection algorithm adaptively determines the optimal number of features. Finally, a variational autoencoder model is reconstructed on the selected feature subset and used for trajectory inference, and multiple rounds of iterative optimization form a performance-driven feature selection closed loop. The invention achieves high-precision adaptive single-cell trajectory inference and solves the technical problems of a single feature selection criterion and fixed weights in conventional methods.
Inventors
- LUO JING
- ZHOU SHUSEN
- LIU TONG
- LIU CHANJUAN
- WANG QINGJUN
- ZANG MUJUN
Assignees
- Ludong University (鲁东大学)
Dates
- Publication Date: 20260512
- Application Date: 20260413
Claims (6)
- 1. A single-cell trajectory inference method based on adaptive feature selection, comprising the following steps: Step 1, data preprocessing and two-dimensional gene importance assessment: inputting single-cell RNA sequencing data; obtaining an initial gene expression matrix through normalization, log transformation and highly variable gene screening; adopting a two-dimensional gene assessment strategy to calculate, respectively, a highly variable gene score based on expression variability and a trajectory importance score related to the differentiation trajectory; and outputting the two score vectors; Step 2, dynamic weight fusion and nonlinear enhancement: taking the score vectors output in Step 1 as input, adaptively adjusting the fusion weights through a performance feedback closed loop, introducing a nonlinear enhancement term to highlight key genes, and outputting a comprehensive score vector; Step 3, adaptive determination of the number of features: taking the comprehensive score vector output in Step 2 as input, adopting an intelligent inflection point detection algorithm that automatically determines the optimal number of features through descending sorting, smoothing filtering, inflection point detection and calculation of the cumulative explained variance ratio, and outputting a feature selection mask; Step 4, dimension-adaptive model reconstruction and training: taking the feature selection mask output in Step 3 as input, extracting a feature subset according to the mask to construct new input data, saving the parameters of the original variational autoencoder, reconstructing the variational autoencoder model for the new dimension, completing training on the new feature subset, and outputting a model adapted to the new dimension; Step 5, performance-driven iterative optimization and trajectory inference: taking the model output in Step 4 as input and performing multiple rounds of iterative optimization, wherein each round performs posterior estimation, constructs a backbone network, performs trajectory inference, adjusts the fusion weights of Step 2 according to the performance trend, and records the corresponding feature selection result; and finally outputting the trajectory inference result under the optimal feature set.
- 2. The single-cell trajectory inference method based on adaptive feature selection according to claim 1, wherein in Step 1 the highly variable gene score is calculated as follows: the highly variable gene rank is used preferentially, the rank being mapped to an interval by a rank conversion formula; if rank information is unavailable, the normalized dispersion score is used; if dispersion information is unavailable, a coefficient of variation is calculated from the mean and variance of gene expression; and all scores are normalized to the interval to obtain the highly variable gene score vector; and wherein the trajectory importance score is calculated in the following order of priority: if pseudotime information is available, the absolute value of the Spearman rank correlation coefficient between gene expression and pseudotime is used; if true cell type labels are available, the value calculated by one-way analysis of variance is used as an inter-group difference score; if no trajectory information is available, the standard deviation of gene expression is used as a substitute score; and all scores are normalized to the interval to obtain the trajectory importance score vector.
- 3. The single-cell trajectory inference method based on adaptive feature selection according to claim 1, wherein in Step 2 the dynamic weight fusion mechanism maintains two dynamic weight parameters, which represent the contributions of the highly variable gene score and the trajectory importance score respectively and sum to 1; the fusion calculation combines linear weighting with nonlinear enhancement: the two scores are multiplied by their corresponding weights and added to obtain a preliminary linear fusion score, and a nonlinear enhancement term, obtained by squaring the preliminary fusion score and multiplying it by a fixed coefficient, is introduced on this basis to form the final comprehensive fusion score; the dynamic weight adjustment relies on the performance feedback closed loop: after each round of iteration the system calculates a comprehensive performance score, compares the score of the current round with that of the previous round, and updates the fusion weights with a tiered adjustment strategy, comprising: strengthening the current weights in their dominant direction when performance improves significantly; fine-tuning the weights with an added random perturbation when performance improves slightly; making a small-range random adjustment when performance is stable or slightly reduced; and returning to the historical weights that gave the best performance when performance is significantly reduced.
- 4. The single-cell trajectory inference method based on adaptive feature selection according to claim 1, wherein in Step 3 the intelligent inflection point detection algorithm for adaptively determining the number of features comprises: sorting the fusion scores in descending order to obtain an ordered sequence; processing the sequence with a two-stage smoothing strategy, in which a sliding-window average performs the first smoothing and a Savitzky-Golay filter performs the second smoothing; calculating the first-order and second-order differences of the smoothed sequence and taking the position of the minimum of the second-order difference as a candidate inflection point; calculating the ratio of the cumulative sum of the sorted sequence to its total sum and taking the position where this cumulative ratio first reaches a preset threshold as an information contribution inflection point; combining the candidate inflection point and the information contribution inflection point, subject to a preset minimum and maximum number of features, to automatically determine the optimal number of features; and generating a feature selection mask according to the optimal number of features.
- 5. The single-cell trajectory inference method based on adaptive feature selection according to claim 1, wherein in Step 4 the dimension-adaptive model reconstruction comprises: extracting a new feature subset from the original gene expression matrix according to the feature selection mask to construct a new input data matrix of reduced dimension; creating a new encoder whose input layer dimension is set to the new number of features while the subsequent hidden layer dimensions remain consistent with the original model; creating a new decoder whose output layer dimension is set to the new number of features while its hidden layer dimensions correspond in reverse order to those of the original model; reinitializing the latent space layer while keeping the number of discrete states and the latent variable dimension unchanged; directly loading the pre-trained weights of the original model for the hidden layers and latent variable layers whose dimensions are unchanged; and adopting a random initialization or partial mapping strategy for the input and output layers whose dimensions have changed.
- 6. The single-cell trajectory inference method based on adaptive feature selection according to claim 1, wherein in Step 5 the performance-driven iterative optimization comprises: performing posterior estimation by Monte Carlo sampling; constructing the backbone network by a specified method and filtering low-confidence edges with a threshold; constructing a directed acyclic graph from the backbone network and a root node and calculating cell pseudotime; recording the performance indexes, the number of selected features and the fusion weights of each iteration in a history queue; and updating the best performance record and the corresponding feature mask when the comprehensive performance exceeds the historical best; wherein the final output trajectory inference result comprises the backbone network, the cell pseudotime, the cell state assignment and the uncertainty estimate, and the system also outputs the complete iteration history.
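The scoring and fusion rules of claims 2 and 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the [0, 1] target interval, the rank convention (0 = most variable), the use of the ANOVA F statistic, and the coefficient `gamma` are all assumptions introduced here, since the claims do not fix them. `X` is taken to be a cells-by-genes matrix of non-negative expression values.

```python
import numpy as np
from scipy.stats import f_oneway, spearmanr


def minmax(x):
    """Scale a vector to [0, 1]; this target interval is an assumption."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)


def hvg_score(X, hvg_rank=None, dispersion_norm=None):
    """Variability score per gene: rank, else normalized dispersion, else CV."""
    if hvg_rank is not None:
        # Assumed convention: rank 0 is the most variable gene.
        return minmax(-np.asarray(hvg_rank, dtype=float))
    if dispersion_norm is not None:
        return minmax(dispersion_norm)
    # Fallback: coefficient of variation from per-gene mean and variance.
    mean = X.mean(axis=0)
    return minmax(np.sqrt(X.var(axis=0)) / (mean + 1e-8))


def trajectory_score(X, pseudotime=None, labels=None):
    """Trajectory score per gene: |Spearman|, else ANOVA statistic, else std."""
    if pseudotime is not None:
        rho = np.array([spearmanr(X[:, j], pseudotime)[0]
                        for j in range(X.shape[1])])
        return minmax(np.abs(np.nan_to_num(rho)))
    if labels is not None:
        labels = np.asarray(labels)
        groups = [X[labels == g] for g in np.unique(labels)]
        # One-way ANOVA per gene (axis 0); the F statistic is an assumed choice.
        return minmax(np.nan_to_num(f_oneway(*groups).statistic))
    return minmax(X.std(axis=0))


def fuse(s_hvg, s_traj, w_hvg=0.5, gamma=0.1):
    """Linear weighting plus a squared enhancement term (weights sum to 1)."""
    w_traj = 1.0 - w_hvg
    linear = w_hvg * s_hvg + w_traj * s_traj
    return linear + gamma * linear ** 2
```

Calling `fuse(hvg_score(X), trajectory_score(X, pseudotime=t))` yields one comprehensive score per gene. Because the squared term grows faster for high linear scores, it widens the gap between top-ranked genes and the rest, which is one plausible reading of "highlighting key genes".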
Description
Single-cell trajectory inference method based on adaptive feature selection
Technical Field
The invention belongs to the field of bioinformatics, and particularly relates to a single-cell trajectory inference method based on adaptive feature selection.
Background
Single-cell trajectory inference is one of the core analytical tasks for revealing dynamic changes in cell differentiation, lineage commitment and disease progression. Feature selection is a key preliminary step in single-cell trajectory inference and directly affects the performance and biological interpretability of the model. Existing single-cell trajectory inference methods have the following technical defects:
① Single feature selection criterion: only highly variable genes are used as input features, so genes that are expressed relatively stably but play a decisive role in the cell differentiation trajectory are ignored.
② Fixed feature weights: the weights of highly variable genes and trajectory-related genes are fixed throughout the analysis, which ignores the characteristics of different data sets and cannot be optimized dynamically as the model is trained.
③ Fixed number of features: the complexity and information distribution of different data sets are ignored, and the number of features cannot be determined from the internal structure and characteristics of the data.
④ Lack of performance-driven closed-loop optimization: feature selection is usually carried out independently as a preprocessing step, and the feedback relation between the feature selection result and the subsequent trajectory inference performance is ignored.
⑤ Incomplete feature importance assessment: gene importance is assessed with a single index, which can hardly describe the biological importance of a gene comprehensively, so the feature selection result is biased toward one aspect of the information and other key features are lost.
Therefore, the art needs a single-cell trajectory inference method that can adaptively fuse multidimensional gene importance information, dynamically adjust feature weights, intelligently determine the number of features, and perform performance-driven closed-loop optimization.
Disclosure of Invention
The invention provides a single-cell trajectory inference method based on adaptive feature selection, which aims to solve the problems of a single feature selection criterion, fixed weights, a rigid number of features and the lack of a performance feedback closed loop in the prior art. The technical scheme comprises the following five steps:
Step 1, data preprocessing and two-dimensional gene importance assessment. Single-cell RNA sequencing data are input, and an initial gene expression matrix is obtained through normalization, log transformation and highly variable gene screening. A two-dimensional gene assessment strategy is adopted to calculate, respectively, a highly variable gene score based on expression variability and a trajectory importance score related to the differentiation trajectory, and the two score vectors are output.
Step 2, dynamic weight fusion and nonlinear enhancement. The score vectors output in Step 1 are taken as input. The fusion weights are adaptively adjusted through a performance feedback closed loop, a nonlinear enhancement term is introduced to highlight key genes, and a comprehensive score vector is output.
Step 3, adaptive determination of the number of features. The comprehensive score vector output in Step 2 is taken as input. An intelligent inflection point detection algorithm automatically determines the optimal number of features through descending sorting, smoothing filtering, inflection point detection and calculation of the cumulative explained variance ratio, and a feature selection mask is output.
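The inflection point detection described for Step 3 can be sketched in Python as follows, using the sub-steps named in claim 4. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `select_feature_mask`, all default parameter values, and the rule for combining the two inflection points (their mean) are hypothetical, because the source does not specify them. Scores are assumed to be non-negative with a positive sum.

```python
import numpy as np
from scipy.signal import savgol_filter


def select_feature_mask(scores, window=5, sg_window=11, sg_poly=3,
                        cum_threshold=0.9, min_features=10, max_features=2000):
    """Return (k, mask): the chosen feature count and a boolean gene mask."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    order = np.argsort(scores)[::-1]           # gene indices, highest score first
    sorted_scores = scores[order]

    # Stage 1: sliding-window mean (window assumed odd; edge padding keeps length n).
    pad = window // 2
    padded = np.pad(sorted_scores, pad, mode="edge")
    smooth = np.convolve(padded, np.ones(window) / window, mode="valid")

    # Stage 2: Savitzky-Golay filter; shrink the window to an odd size that fits.
    sg = min(sg_window, n if n % 2 else n - 1)
    if sg > sg_poly:
        smooth = savgol_filter(smooth, sg, sg_poly)

    # Candidate inflection point: minimum of the second-order difference.
    second_diff = np.diff(smooth, n=2)
    knee_curvature = int(np.argmin(second_diff)) + 1   # centre of the 3-point stencil

    # Information contribution inflection point: the cumulative share of the
    # total score first reaches the preset threshold.
    cum_ratio = np.cumsum(sorted_scores) / sorted_scores.sum()
    knee_cumulative = int(np.searchsorted(cum_ratio, cum_threshold))

    # Combine the two positions (mean is an assumed rule) and clamp to the range.
    k = int(round(((knee_curvature + 1) + (knee_cumulative + 1)) / 2))
    k = int(np.clip(k, min_features, min(max_features, n)))

    mask = np.zeros(n, dtype=bool)
    mask[order[:k]] = True
    return k, mask
```

The returned mask is aligned with the original gene order, so `X[:, mask]` extracts the selected feature subset from a cells-by-genes matrix. Clamping to the preset minimum and maximum keeps the selection usable even when the score curve has no clear inflection point.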
Step 4, dimension-adaptive model reconstruction and training. The feature selection mask output in Step 3 is taken as input. A feature subset is extracted according to the mask to construct new input data. The parameters of the original variational autoencoder are saved, the variational autoencoder model is reconstructed for the new dimension, training is completed on the new feature subset, and a model adapted to the new dimension is output.
Step 5, performance-driven iterative optimization and trajectory inference. The model output in Step 4 is taken as input and multiple rounds of iterative optimization are performed. Each round performs posterior estimation, backbone network construction and trajectory inference, adjusts the fusion weights of Step 2 according to the performance trend, and records the corresponding feature selection result. Finally, the trajectory inference result under the optimal feature set is output.
A single-cell trajectory inference method based on adaptive feature selection comprises the following implementation process of step 1