CN-122024850-A - Cross-omics sparse feature selection system and method based on hierarchical causal modeling


Abstract

The invention provides a cross-omics sparse feature selection system and method based on hierarchical causal modeling. The system comprises a data input and preprocessing module, a hierarchical causal structure learning module, a causally guided sparse feature selection module, and a model retraining and integration module. The data input and preprocessing module receives raw multi-omics data from multiple sample types. The hierarchical causal structure learning module is connected to the data input and preprocessing module and constructs a cross-omics hierarchical causal topology. The causally guided sparse feature selection module is connected to the hierarchical causal structure learning module. The model retraining and integration module builds a three-layer weighted ensemble discrimination model from the screened markers, optimizes the fusion weight of each layer by gradient descent, and outputs the final prediction result. The invention makes full use of the protein-metabolism biological hierarchy and the complementary information in serum and urine, has good generalization capability, can be widely applied to marker discovery and predictive modeling for cancers, metabolic diseases and the like, and improves the accuracy and reliability of precision medicine.

Inventors

CAO SHENG
GE HONGLIANG

Assignees

杭州零基医药科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260129

Claims (10)

1. A cross-omics sparse feature selection system based on hierarchical causal modeling, comprising: a data input and preprocessing module for receiving raw multi-omics data from multiple sample types, the module embedding a tunable batch correction unit that is based on a robust location statistic and removes cross-batch and cross-platform systematic bias by iterative weighting; a hierarchical causal structure learning module, connected to the data input and preprocessing module, for constructing a cross-omics hierarchical causal topology which, following a structural equation model, explicitly defines proteomic features as upstream potential regulatory variables and metabolomic features as downstream response variables, introduces clinical confounders as covariates, and describes the directional dependencies among the features; a causally guided sparse feature selection module, connected to the hierarchical causal structure learning module and containing a causal gating sparse regularization mechanism and a counterfactual intervention importance calculation unit, for automatically pruning non-causal paths by optimizing a sparse loss function containing differentiable gating variables, and for screening out a sparse, stable biomarker set with causal semantics by evaluating the change in prediction under virtual intervention together with a Bootstrap stability score; and a model retraining and integration module for constructing a three-layer weighted ensemble discrimination model based on the screened markers, optimizing the fusion weight of each layer by gradient descent, and outputting a final prediction result.
2. The cross-omics sparse feature selection system based on hierarchical causal modeling of claim 1, wherein the tunable batch correction unit performs the following steps: a. grouping the data by potential bias source; b. calculating a robust location statistic for each group of data; c. weighting the statistic by a preset tunable parameter to calculate a systematic bias correction; d. updating the data with the correction, wherein the tunable parameter controls the proportion of bias removed in a single iteration; e. repeating the above process until the correction converges.
3. The cross-omics sparse feature selection system based on hierarchical causal modeling of claim 1, wherein the causal gating sparse regularization mechanism, based on the path coefficients of the structural equation model, introduces continuously differentiable gating variables and defines significant path coefficients using a structural mask generated from the hierarchy constraints and biological prior knowledge; and the optimization objective of the module includes an L1-norm sparsity penalty term on the significant path coefficients and an entropy regularization term on the gating variables, so as to achieve automatic selection and sparsification of paths.
4. A cross-omics sparse feature selection method based on hierarchical causal modeling, characterized by comprising the following steps: S1, data acquisition and adaptive cleaning: acquiring multi-omics data of a subject, filling missing values by KNN or an associated imputation method, and removing outlier samples using the local outlier factor (LOF) algorithm; S2, robust systematic bias correction: eliminating batch effects and systematic bias in the data with an iterative algorithm based on a robust location statistic and a tunable parameter; S3, hierarchical structural equation modeling: constructing a structural equation model comprising an upstream protein layer, a downstream metabolic layer and a covariate layer, and imposing unidirectional causal path constraints that allow paths only from upstream to downstream; S4, causal gating sparse screening: introducing learnable causal gating variables, training the model with sparse regularization to automatically prune non-causal paths, and determining the final stable causal markers by combining a counterfactual intervention score with a stability evaluation over multiple Bootstrap resamplings; S5, multi-layer ensemble modeling: training a base model for each omics type on the causal markers, and outputting a final prediction result through a three-layer weighted integration strategy over the omics layer, the sample layer and the body-fluid layer, with the weights optimized by gradient descent.
5. The cross-omics sparse feature selection method based on hierarchical causal modeling according to claim 4, wherein in step S4 the counterfactual intervention score is obtained by holding the other inputs of the model fixed, applying a virtual perturbation to the value of the target feature, and calculating the magnitude of the difference between the model's predicted output probability vectors before and after the perturbation, thereby quantifying the causal contribution of the feature to the prediction result.
6. The cross-omics sparse feature selection method based on hierarchical causal modeling according to claim 4, wherein in step S4 the stability evaluation comprises performing Bootstrap resampling on the raw data multiple times, independently performing causal gating sparse screening on each resampled data set, counting the frequency with which each feature is selected, and retaining the features whose selection frequency exceeds a preset threshold.
7. The cross-omics sparse feature selection method based on hierarchical causal modeling according to claim 4, wherein the three-layer weighted integration strategy of step S5 automatically searches for the optimal weights by minimizing the logarithmic loss function on the validation set, with all weights constrained to be non-negative and to sum to 1.
8. The cross-omics sparse feature selection method based on hierarchical causal modeling according to claim 4, wherein the method is applied to efficacy discrimination in diabetic kidney disease (DKD), and wherein the causal markers comprise the inflammatory factor receptors TNFR1 and TNFR2, the complement components C3 and CFB, and the lipid metabolites PC, LPC, SM and Cer driven thereby.
9. The cross-omics sparse feature selection method based on hierarchical causal modeling according to claim 4, wherein the method is applied to cancer immunotherapy response prediction, and wherein the causal markers comprise immune cytokines in the tumor microenvironment and the metabolic reprogramming products regulated thereby.
10. A computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the method according to any of claims 4 to 9.
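
The correction loop of claim 2, the counterfactual intervention score of claim 5 and the Bootstrap selection frequency of claim 6 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The claims do not fix the robust location statistic, the form of the perturbation or the screening routine, so the sketch assumes the per-feature median, an additive shift `delta`, a mean L2 distance between probability vectors, and a caller-supplied `select_fn` standing in for the causal gating sparse screening. All function and parameter names are hypothetical.

```python
import numpy as np


def robust_batch_correct(X, batches, alpha=0.5, tol=1e-6, max_iter=100):
    """Iteratively remove per-batch location shifts (claim 2, steps a-e).

    Assumption: the robust location statistic is the per-feature median and
    the bias of a batch is the offset of its median from the pooled median.
    alpha in (0, 1] is the share of that offset removed per iteration.
    """
    X = np.array(X, dtype=float)                  # work on a copy
    batches = np.asarray(batches)
    for _ in range(max_iter):
        pooled = np.median(X, axis=0)
        largest = 0.0
        for b in np.unique(batches):              # a. group by bias source
            rows = batches == b
            loc = np.median(X[rows], axis=0)      # b. robust location
            shift = alpha * (loc - pooled)        # c. weighted correction
            X[rows] -= shift                      # d. update the data
            largest = max(largest, float(np.abs(shift).max()))
        if largest < tol:                         # e. stop once converged
            break
    return X


def intervention_score(predict_proba, X, j, delta=1.0):
    """Mean change in predicted probabilities when only feature j is shifted.

    Assumption: the virtual perturbation is an additive shift `delta` and the
    difference magnitude is the L2 norm, averaged over samples.
    """
    X = np.asarray(X, dtype=float)
    X_do = X.copy()
    X_do[:, j] += delta                           # all other inputs stay fixed
    diff = predict_proba(X_do) - predict_proba(X)
    return float(np.mean(np.linalg.norm(diff, axis=1)))


def selection_frequency(X, y, select_fn, n_boot=200, seed=0):
    """Fraction of Bootstrap resamples in which each feature is selected.

    select_fn(X, y) must return a boolean mask over the columns of X; it is a
    placeholder for whatever screening routine is applied to each resample.
    """
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        counts[select_fn(X[idx], y[idx])] += 1
    return counts / n_boot
```

A feature would then be retained when its entry in the returned frequency vector exceeds the preset threshold of claim 6, for example `selection_frequency(X, y, select_fn) > 0.8`, where 0.8 is only an illustrative value.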

Description

Cross-omics sparse feature selection system and method based on hierarchical causal modeling

Technical Field

The invention relates to the technical field of biomedical data processing and intelligent analysis, and in particular to a cross-omics sparse feature selection system and method based on hierarchical causal modeling.

Background

With the rapid development of precision medicine, multi-omics data (such as metabolomics and proteomics) from multiple sample types (such as serum and urine) play an increasingly prominent role in disease diagnosis, efficacy monitoring and personalized treatment. Various methods for omics feature screening and discriminant modeling have been proposed in the prior art, such as sparse-regression feature selection based on LASSO (least absolute shrinkage and selection operator), feature importance scoring with Random Forests (RF), and wrapper methods such as SFFS (sequential forward floating selection). These methods aim to screen key features from high-dimensional omics data and to build ensemble models for disease prediction and efficacy assessment.

However, in multi-omics, high-dimensional data scenarios, existing feature screening and modeling approaches still commonly suffer from the following drawbacks:

(1) Screening is driven by correlation or significance, so causal contributions are hard to distinguish from spurious correlations. Univariate statistical screening, correlation screening and most screening approaches based on model weights or importance generally take the significance of differences, correlation strength or predictive contribution as the screening criterion. It is therefore difficult to distinguish features that have a real causal influence on treatment efficacy or disease prediction from associated features induced by common drivers, collinear structures, confounders or batch effects. This problem is further accentuated when complex hierarchical dependencies exist between the two body fluids (urine and serum) and between the proteome and the metabolome, and the result is often a candidate marker set that lacks interpretability and reproducibility.

(2) Selection is unstable in the high-dimensional, small-sample setting, and reproducibility is poor. Body-fluid proteome and metabolome data often have far more variables than samples, and measurement noise and missing values are common. Whether a regularization method such as LASSO or Elastic Net is used, or a method based on tree-model importance or network screening, the screening results can be highly sensitive to sample perturbation, training-set partitioning and parameter settings. The selected feature set tends to change markedly across different resamplings, centers or batches, so reproducibility is insufficient and it is difficult to form a stable feature panel for subsequent validation and deployment.

(3) The relationships between body-fluid sources and omics layers are insufficiently characterized, and generalization capability is limited. Although multi-omics fusion modeling methods (such as PLS-DA, multi-omics factor analysis, stacking ensembles and the like) can improve prediction performance, they often handle data from different omics layers and different body-fluid sources by concatenation or simple integration, and cannot explicitly describe the directional dependencies and hierarchical structure between urine and serum or between protein and metabolism. When cross-omics transfer effects or "upstream drive, downstream response" mechanisms are present, the lack of structural constraints leaves the model with an inadequate representation of the relationships between features, which degrades its generalization performance on independent cohorts or real-world data.

(4) Feature importance measures oriented to intervention semantics are lacking, so interpretability and verifiability are insufficient. Interpretation methods such as tree-model or ensemble-learning importance, linear model weights, permutation importance or SHAP still explain association or predictive contribution. Their output can hardly answer the question of how much the discrimination result would change if the level of a given molecular feature were virtually intervened on or altered. The lack of importance assessment with intervention semantics limits the biological interpretation of feature screening results, the design of experimental verification, and the demonstration of clinical usability.

(5) Batch effects, platform differences and measurement noise readily introduce bias. Body-fluid omics data is affected by sample collection, processing flow, mass spectrometry platform, peak extraction and quantification strategy, etc.