CN-122000065-A - Unplanned ICU readmission risk prediction method, system and device


Abstract

The invention provides a method, system and device for predicting the risk of unplanned ICU readmission. The method comprises: fusing the MIMIC-IV public data set with local private data; performing standardized preprocessing of multi-source medical data through a rule module and a clinical knowledge base; constructing a block-wise feature system covering basic characteristics, laboratory indexes, hemodynamic parameters and respiratory-metabolic dimensions; optimizing feature expression with sliding windows and feature interaction; fine-tuning a large language model for the medical domain, with structured features converted into natural language descriptions by means of prompts; training the model with optimization strategies such as label smoothing and gradient clipping; evaluating model performance by combining internal and external verification; and using the SHAP framework for interpretability analysis of model predictions. The scheme markedly improves the accuracy and generalization capability of ICU readmission risk prediction and its applicability to real-world scenarios.

Inventors

Aziguri Ulamu
HAN YU
ZHANG DEZHENG

Assignees

University of Science and Technology Beijing (北京科技大学)

Dates

Publication Date: 20260508
Application Date: 20260129

Claims (10)

1. A method for predicting the risk of unplanned ICU readmission, comprising: S1, acquiring a public data set and a private data set to form an original data set, and splitting the original data set into a training set and a verification set; S2, extracting multidimensional features from the preprocessed data, constructing a feature set, performing feature space optimization on the feature set to obtain a processed feature set, and converting the processed feature set into structured features in a serialization format suitable for a large language model, wherein the multidimensional features are extracted from four types of dimension data, namely a basic-information and comorbidity dimension, a laboratory test index dimension, a hemodynamic monitoring dimension, and a respiratory and metabolic parameter dimension; S3, converting the structured features into coherent natural language description text, using the text as input to a large language model, and performing medical-domain adaptation on the large language model to obtain a preliminary model; S4, dividing the data from the public data set in the preprocessed data into a training set and an internal verification set, training the preliminary model on the training set, enhancing generalization capability through label smoothing during training, and applying early stopping based on the loss function on the internal verification set to obtain a fine-tuned model; S5, evaluating the performance of the fine-tuned model on the internal verification set to obtain an internal verification result; S6, using the private data set portion of the preprocessed data as an external verification set, testing the fine-tuned model, and comparing the test result with the internal verification result to further evaluate the fine-tuned model and obtain an optimal prediction model; S7, performing interpretability analysis on the optimal prediction model, quantifying the contribution of each feature to the prediction result, and identifying key risk factors among the features; S8, performing lightweight processing on the optimal prediction model and packaging it.
2. The method according to claim 1, wherein in S1, the automatic cleaning is performed by: for numerical data, dynamically generating an outlier decision boundary based on the individual baseline of the sample and the statistical quantiles of the sample population, and identifying and correcting outliers in combination with the absolute safety range of the domain; and filling missing values by: first, on the automatically cleaned data, calculating the Euclidean distance of the data for numerical features, the Hamming distance of the data for categorical features, and the dynamic time warping distance of the data for time-series features, and calculating a mixed similarity from these distances, the mixed similarity being used as the distance model in the K-nearest-neighbor calculation, wherein W_1, W_2 and W_3 are weight terms and W_1 + W_2 + W_3 = 1; second, for a missing value of a continuous variable, computing the fill value from the corresponding values of the neighbors and their weights; and for a categorical variable, calculating the frequency with which each class occurs among the neighbors and randomly sampling a class as the fill value.
3. The method according to claim 1, wherein in S2, the multidimensional features are extracted by: for continuous monitoring index data, calculating statistical features of the data within a sliding window, the statistical features comprising the mean, standard deviation, coefficient of variation and trend slope; for declarative physical-sign index data, extracting frequency-domain features by Fourier transform; for continuous numerical data, first applying a Box-Cox transformation to skewed data so that it approximates a normal distribution, and then standardizing it as $z = \frac{x - \mu}{\sigma}$, wherein $\mu$ represents the mean of the data, $\sigma$ represents the standard deviation of the data on the training set, $x$ represents the data to be standardized, and $z$ represents the standardized data; mapping each categorical feature value through a hash function into a feature space of preset size, thereby converting high-dimensional sparse discrete features into low-dimensional features of fixed dimension; and, after extraction, adding the multidimensional features to the feature set.
4. The method according to claim 3, wherein the processed feature set is obtained by: S301, constructing high-order features based on expert experience and adding them to the feature set, wherein the high-order features comprise index ratio features, physiological index product features and trend-change combination features; S302, for the extracted multidimensional features, mining the interaction relations among them with gradient boosting decision trees to obtain the interaction strength between features; S303, generating new candidate interaction features for feature pairs whose interaction strength exceeds a preset threshold; S304, evaluating the feature importance of the candidate interaction features and adding those that meet the evaluation requirements to the feature set; S305, calculating importance scores for all features in the feature set, sorting them, taking the top N features to form a candidate feature subset, recursively eliminating the least important features from the candidate feature subset using a random forest to obtain a feature subset, and performing principal component analysis on the feature subset to obtain the processed feature set.
5. The method according to claim 1, wherein in S3, the medical-domain adaptation of the large language model is performed by: setting a dialogue prompt template and defining the task objective through the prompt; and adding adapters to the attention layers and the output projection layer of the large language model, initializing the adapter weights with a Kaiming normal distribution, and fine-tuning the large language model such that no more than 5% of its parameters are updated.
6. The method according to claim 1, wherein in S4, the label smoothing is performed as follows: for the classification task in training, the original binary true labels are converted into soft labels with values in the range [ε, 1-ε], the conversion being expressed in terms of the soft label, the true label, the number of classes K, and the smoothing factor ε.
7. The method of claim 6, wherein in S4, the learning rate during training follows a warm-restart schedule, in which η_t represents the current learning rate at the t-th training step, η_min represents the lower limit of the learning rate schedule, η_max represents the upper limit of the learning rate schedule, T_cur represents the number of training steps completed since the start of the current cycle, and T_max represents the total number of training steps in the current cycle.
8. The method according to claim 1, wherein S7 specifically comprises: S701, initializing an interpreter using the optimal prediction model and a background data set as the reference, wherein the background data set is formed by randomly drawing a plurality of training samples and the reference defines the average state of the optimal prediction model; S702, for a single sample to be interpreted, calculating the SHAP value of each feature value, with the background data set as the calculation reference, as the contribution of each feature to the predicted risk of the sample; S703, calculating the SHAP values of all features for all verification set samples, taking, for each feature, the absolute values of its SHAP values over all verification set samples and averaging them to obtain the mean absolute SHAP value, and sorting all features in descending order of mean absolute SHAP value to obtain a feature importance ranking that identifies the key risk factors.
9. A system for predicting the risk of unplanned ICU readmission, wherein the system is configured to perform the method of any one of claims 1-8, the system comprising: a data receiving module for receiving the public data set, the local private data set and real-time input requests from the user; a data processing module comprising a data preprocessing unit, a feature engineering unit, a large language model fine-tuning unit, a training optimization unit, a model verification unit, an interpretability analysis unit and a system deployment unit; the data preprocessing unit being used for preprocessing the original data set to obtain preprocessed data; the feature engineering unit being used for extracting multidimensional features from the preprocessed data, constructing a feature set, performing feature space optimization on the feature set to obtain a processed feature set, and converting the processed feature set into structured features in a serialization format suitable for a large language model; the large language model fine-tuning unit being used for performing medical-domain adaptation on the large language model to obtain a preliminary model; the training optimization unit being used for dividing the data from the public data set in the preprocessed data into a training set and an internal verification set, training the preliminary model on the training set, enhancing generalization capability through label smoothing during training, and applying early stopping based on the loss function on the internal verification set to obtain a fine-tuned model; the model verification unit being used for evaluating the performance of the fine-tuned model on the internal verification set to obtain an internal verification result, testing the fine-tuned model with the private data set portion of the preprocessed data as an external verification set, and comparing the test result with the internal verification result to further evaluate the fine-tuned model and obtain an optimal prediction model; the interpretability analysis unit being used for performing interpretability analysis on the optimal prediction model, quantifying the contribution of each feature to the prediction result and identifying key risk factors among the features; the system deployment unit being used for lightweight processing, containerized deployment and real-time early warning of the optimal prediction model; and a result generation module for sending the prediction result and the risk assessment report to a clinical decision support system.
10. A device for predicting the risk of unplanned ICU readmission, comprising a processor, a memory and a bus, the memory storing instructions and data to be read by the processor, the processor being configured to invoke the instructions and data in the memory to perform the method of any one of claims 1-8, and the bus connecting the functional components and transferring information between them.
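
Claim 2 describes K-nearest-neighbor imputation driven by a mixed similarity built from Euclidean, Hamming and dynamic time warping distances, but the formulas themselves are not reproduced in this text. The following is a minimal sketch under stated assumptions: the mixed similarity is taken to be a weighted sum of the three distances, and the neighbor weights for continuous fill values are taken to be inverse distances. The record layout (`num`, `cat`, `ts`), the default weights and all function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Fraction of categorical positions that differ (normalized Hamming)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def dtw(s, t):
    """Classic dynamic time warping distance with absolute-difference cost."""
    n, m = len(s), len(t)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def mixed_distance(p, q, w=(0.5, 0.3, 0.2)):
    """ASSUMPTION: weighted sum of the three distances, with weights summing to 1."""
    return (w[0] * euclidean(p["num"], q["num"])
            + w[1] * hamming(p["cat"], q["cat"])
            + w[2] * dtw(p["ts"], q["ts"]))

def impute_continuous(target, donors, field, k=2, eps=1e-9):
    """Fill `field` from the k nearest donors.

    ASSUMPTION: neighbors are weighted by inverse mixed distance.
    """
    ranked = sorted(donors, key=lambda d: mixed_distance(target, d))[:k]
    weights = [1.0 / (mixed_distance(target, d) + eps) for d in ranked]
    total = sum(weights)
    return sum(wt * d[field] for wt, d in zip(weights, ranked)) / total

# Toy records with made-up values, for illustration only.
target = {"num": [0.0], "cat": ["a"], "ts": [1.0, 2.0]}
donors = [
    {"num": [0.0], "cat": ["a"], "ts": [1.0, 2.0], "lactate": 2.0},
    {"num": [3.0], "cat": ["b"], "ts": [1.0, 2.0], "lactate": 4.0},
    {"num": [10.0], "cat": ["b"], "ts": [5.0, 9.0], "lactate": 10.0},
]
filled = impute_continuous(target, donors, "lactate", k=2)
```

Because the first donor matches the target on every feature, its inverse-distance weight dominates and the fill value lands very close to that donor's value. In practice the numeric features would need to be scaled before the three distances are combined, since the distances are otherwise on different scales.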
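
Claim 3 lists the sliding-window statistics (mean, standard deviation, coefficient of variation, trend slope) and the z-score standardization. The sketch below shows one way these could be computed with the Python standard library. It assumes the trend slope is an ordinary least-squares slope against the sample index and that the population standard deviation is used; the window size, step and function names are hypothetical choices, not values from the patent.

```python
import statistics

def trend_slope(values):
    """Least-squares slope of values against their index 0..n-1 (needs n >= 2)."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den

def window_features(series, size, step=1):
    """Per-window mean, standard deviation, coefficient of variation and slope."""
    feats = []
    for start in range(0, len(series) - size + 1, step):
        w = series[start:start + size]
        mean = statistics.fmean(w)
        std = statistics.pstdev(w)  # ASSUMPTION: population standard deviation
        feats.append({
            "mean": mean,
            "std": std,
            "cv": std / mean if mean else float("nan"),
            "slope": trend_slope(w),
        })
    return feats

def zscore(x, mu, sigma):
    """z = (x - mu) / sigma, with mu and sigma estimated on the training set."""
    return (x - mu) / sigma

# Made-up heart-rate readings, for illustration only.
heart_rate = [80.0, 82.0, 84.0, 86.0]
features = window_features(heart_rate, size=3)
```

Keeping `mu` and `sigma` as explicit arguments makes it harder to recompute them on verification data by accident, which matches the claim's wording that the standard deviation comes from the training set.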
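
Claim 6 names the ingredients of the label-smoothing conversion (true label, number of classes K, smoothing factor ε), but the formula is not reproduced in this text. As an illustration only, the sketch below uses the commonly used form y_soft = (1 - ε)·y + ε/K, which is an assumption and may differ from the patent's formula: for K = 2 it yields labels in [ε/2, 1 - ε/2] rather than the [ε, 1 - ε] range stated in the claim.

```python
import math

def smooth_label(y, epsilon=0.1, num_classes=2):
    """ASSUMED standard form: y_soft = (1 - epsilon) * y + epsilon / K."""
    return (1.0 - epsilon) * y + epsilon / num_classes

def soft_binary_cross_entropy(p, y_soft):
    """Binary cross-entropy of predicted probability p against a soft label."""
    return -(y_soft * math.log(p) + (1.0 - y_soft) * math.log(1.0 - p))

positive = smooth_label(1.0)   # readmitted
negative = smooth_label(0.0)   # not readmitted
loss = soft_binary_cross_entropy(0.8, positive)
```

With soft targets the loss is minimized at a predicted probability equal to the soft label rather than at 0 or 1, which discourages over-confident outputs.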
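
The symbols defined in claim 7 (η_t, η_min, η_max, T_cur, T_max) are the ones used by cosine annealing with warm restarts, although the schedule formula itself is not reproduced in this text. Assuming that is the intended schedule, η_t = η_min + ½(η_max - η_min)(1 + cos(π·T_cur/T_max)), a minimal sketch with fixed-length cycles looks like this. The function name and the fixed cycle length are hypothetical.

```python
import math

def cosine_warm_restart_lr(step, eta_min, eta_max, cycle_len):
    """Learning rate at a global training step, assuming cosine annealing
    that restarts from eta_max every `cycle_len` steps."""
    t_cur = step % cycle_len  # steps completed within the current cycle
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / cycle_len))

# Hypothetical settings, for illustration only.
schedule = [cosine_warm_restart_lr(s, 1e-6, 1e-4, cycle_len=10) for s in range(20)]
```

The rate starts each cycle at `eta_max`, decays towards `eta_min`, and jumps back to `eta_max` when the next cycle begins.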
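
Step S703 of claim 8 ranks features by mean absolute SHAP value over the verification set. The sketch below covers only that aggregation step and assumes the per-sample SHAP values have already been produced by an explainer. The feature names and numbers are made-up toy values, and the function name is hypothetical.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """Return (feature, mean |SHAP|) pairs sorted in descending order.

    shap_values: one row per sample, one column per feature.
    """
    n = len(shap_values)
    scores = {
        name: sum(abs(row[j]) for row in shap_values) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy SHAP matrix: 2 samples x 3 features (illustrative numbers only).
names = ["lactate", "heart_rate", "age"]
values = [
    [0.2, -0.5, 0.1],
    [-0.4, 0.3, 0.0],
]
ranking = rank_by_mean_abs_shap(values, names)
```

Taking absolute values before averaging keeps features whose contributions change sign across patients from cancelling out in the ranking.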

Description

Unplanned ICU readmission risk prediction method, system and device

Technical Field

The invention relates to the technical fields of medical artificial intelligence, large language models and model training, and in particular to a method, system and device for predicting the risk of unplanned ICU readmission based on a large language model.

Background

Predicting the risk of unplanned ICU readmission is an important problem in critical care medicine, and current methods based on traditional machine learning have clear limitations. First, existing prediction models rely on a single machine learning algorithm (such as logistic regression or random forest), which makes it difficult to capture the complex nonlinear relationships and temporal dynamics in ICU clinical data and limits further improvement in prediction accuracy. Second, most models are trained and validated on a single data set only; the lack of external validation across medical institutions leaves the models with insufficient generalization capability, so they adapt poorly to differences in data characteristics and patient populations between regions. Third, traditional models usually act as black boxes without an interpretability analysis mechanism, so medical staff can hardly understand the basis of a prediction, which reduces acceptance and credibility in practice. Finally, existing methods have limited ability to process multi-source heterogeneous medical data: they cannot effectively integrate multi-modal data such as electronic medical records, vital sign monitoring and laboratory tests, and their real-time prediction efficiency is low, making it difficult to meet the needs of immediate clinical decision-making.
Disclosure of Invention

The invention aims to provide a method, system and device for predicting the risk of unplanned ICU readmission based on a large language model, so as to solve at least one technical problem in the prior art: for example, traditional ICU risk prediction models rely on a single machine learning algorithm and struggle to capture the nonlinear relationships among complex clinical features, and most models lack cross-institution verification and interpretability analysis, which results in low prediction accuracy, poor generalization capability and an inability to meet real-time clinical decision requirements.

Specifically, the invention provides the following scheme. In a first aspect, the present invention provides a method for predicting the risk of unplanned ICU readmission, comprising: S1, acquiring a public data set and a private data set to form an original data set, and splitting the original data set into a training set and a verification set; S2, extracting multidimensional features from the preprocessed data, constructing a feature set, performing feature space optimization on the feature set to obtain a processed feature set, and converting the processed feature set into structured features in a serialization format suitable for a large language model, wherein the multidimensional features are extracted from four types of dimension data, namely a basic-information and comorbidity dimension, a laboratory test index dimension, a hemodynamic monitoring dimension, and a respiratory and metabolic parameter dimension; S3, converting the structured features into coherent natural language description text, using the text as input to a large language model, and performing medical-domain adaptation on the large language model to obtain a preliminary model; S4, dividing the data from the public data set in the preprocessed data into a training set and an internal verification set, training the preliminary model on the training set, enhancing generalization capability through label smoothing during training, and applying early stopping based on the loss function on the internal verification set to obtain a fine-tuned model; S5, evaluating the performance of the fine-tuned model on the internal verification set to obtain an internal verification result; S6, using the private data set portion of the preprocessed data as an external verification set, testing the fine-tuned model, and comparing the test result with the internal verification result to further evaluate the fine-tuned model and obtain an optimal prediction model; S7, performing interpretability analysis on the optimal prediction model, quantifying the contribution of each feature to the prediction result, and identifying key risk factors among the features; S8, performing lightweight processing on the optimal prediction model and packaging it.

Preferably, in step S1, the automatic cleaning is performed as follows: for numerical data, dynamically generating an outlier decision boundary based on the individual baseline of the sample and the statistical quantiles of the sample population, and identifying and correc