CN-117171683-B - Medical image focus characteristic value anomaly elimination method
Abstract
The invention provides a method for eliminating anomalies in medical image lesion (focus) feature values. It combines an outlier detection model that integrates density-based and statistical detection with a weakly supervised model based on ensemble learning. The method comprises: dividing a patient feature data set into a training set and a test set according to a proportion; fitting the integrated outlier detection model on the training set and using the fitted model to detect outliers in the test set data; counting the number of abnormal features per patient; correcting the abnormal values in the training set data and the test set data separately with the weakly supervised model; and judging whether the outlier detection model is effective by using a screening model and a classifier. By reducing abnormal features, the method weakens the influence of abnormal lesion feature values caused by inaccurate delineation of medical images, thereby improving prediction accuracy.
Inventors
- CHEN JIANXIN
- ZHANG SHUN
- RONG JIAN
- LIU BIN
- XU JINGYAN
Assignees
- Nanjing University of Posts and Telecommunications (南京邮电大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20230821
Claims (5)
- 1. A medical image lesion feature value anomaly elimination method, characterized by comprising the following specific steps: Step 1, dividing a data set into a training set and a test set according to a proportion, fitting an outlier detection model that integrates density and statistics on the training set, and using the model fitted on the training set to detect outliers in the test set data, wherein the integrated outlier detection model comprises a density-based LOF (local outlier factor) algorithm and a statistics-based interquartile range (IQR) algorithm, and a feature value in the feature data is determined to be an outlier only when both the LOF algorithm and the IQR algorithm judge it to be an outlier, and is otherwise a normal value; Step 2, for the training set data and the test set data, counting the number of abnormal features of each patient, taking the patient as the unit, drawing a distribution diagram of the number of abnormal features per patient, determining an anomaly threshold from the distribution diagram, and determining the number of abnormal patients, whose abnormal feature count is greater than the threshold, and the number of normal patients, whose abnormal feature count is less than the threshold; Step 3, within the training set data and the test set data respectively, dividing the normal patients into a normal patient training set and a normal patient test set; Step 4, for the normal patient training set and the abnormal patient data in the training set data, and for the normal patient training set and the abnormal patient data in the test set data, applying weak supervision that uses the labels of only part of the data, namely the labels of the normal patient training set: first fitting the relationship between the normal patient data and the labels with an ensemble model, and then using the weight relationship learned from the normal patients to correct the relationship between the abnormal patient data and the labels, so as to obtain corrected labels, wherein the ensemble model is a decision-tree-based random forest model optimized through feature selection, decision tree building, sample sampling, and feature importance calculation; Step 5, for the training set data and the test set data respectively, combining the normal patient training set and the abnormal patient training set to form an original training set, combining the normal patient training set and the data corrected by weak supervision to form a corrected training set, and taking the normal patient test set as the corresponding test set; Step 6, for the training set and the test set respectively, using a screening model and a classifier to judge whether the integrated anomaly detection method is effective by comparing the test set performance of the model fitted on the original training set with the test set performance of the model fitted on the corrected training set, wherein the area under the ROC curve (AUC) is used as the index for evaluating the predictive ability of the two models, and if the AUC on the test set of the model fitted on the weakly supervised corrected training set is better than the AUC on the test set of the model fitted on the original training set, the integrated anomaly detection method is shown to be effective and to weaken, in subsequent analysis, lesion feature value anomalies caused by inaccurate delineation of medical images.
- 2. The medical image lesion feature value anomaly elimination method according to claim 1, wherein the LOF algorithm comprises the following steps: Step 1.1, calculating the k nearest neighbors, that is, finding the set N(i) of the k nearest neighbor points of each point i; Step 1.2, calculating the reachability distance: for points i and j, the reachability distance from point j to point i is defined as dist(i, j); if point j is not among the k nearest neighbors of point i, the two points are not neighbors of each other, and dist(i, j) is then defined as infinity; Step 1.4, calculating the local reachability density: for point i, the local reachability density lrd(i) is the reciprocal of the average reachability distance of point i, namely lrd(i) = 1 / (Σ_{j ∈ N(i)} dist(i, j) / k); Step 1.5, calculating the local outlier factor: for point i, the local outlier factor LOF(i) expresses the density of point i relative to its neighbor points and is defined as the ratio of the density of the neighbors around point i to the density of point i, namely LOF(i) = (Σ_{j ∈ N(i)} lrd(j) / lrd(i)) / k; Step 1.6, judging outliers: by the definition of LOF, the larger the LOF value, the more likely the point is an outlier, and each point i whose LOF(i) is greater than a certain threshold is regarded as an outlier.
- 3. The medical image lesion feature value anomaly elimination method according to claim 1, wherein the interquartile range algorithm comprises the following steps: Step 1.7, sorting each feature value in the feature data by size; Step 1.8, calculating the first quartile Q1, the median Q2, and the third quartile Q3 of the data set; Step 1.9, calculating the interquartile range IQR, where IQR = Q3 - Q1; Step 1.10, calculating the lower bound min_bound for outlier judgment, where min_bound = Q1 - 1.5 × IQR; Step 1.11, calculating the upper bound max_bound for outlier judgment, where max_bound = Q3 + 1.5 × IQR; Step 1.12, checking whether each value lies within the range from min_bound to max_bound, and if not, regarding it as an outlier.
- 4. The medical image lesion feature value anomaly elimination method according to claim 1, wherein in Step 4 the ensemble model is a decision-tree-based random forest model, whose principle is to first construct a plurality of different decision trees to form a random forest, and when the random forest is constructed, the building of each decision tree is further optimized through the following steps: Step 4.1, feature selection: each decision tree is built from a subset of all the features of the training set, the features selected differ from tree to tree, and for each decision tree a portion of the features is randomly drawn for evaluation and selection; Step 4.2, decision tree building: each decision tree consists of a series of nodes, each node corresponds to a feature by which the data is split, the splitting continues in the child nodes, and a maximum depth or a minimum number of leaf nodes is set for the decision tree; Step 4.3, sample sampling: after fitting, the frequency with which each feature is used to split the data set in the random forest is counted and the average impurity is calculated accordingly, where a lower impurity indicates that the feature carries more information and contributes more to classification or regression; Step 4.4, feature importance calculation: all features are ranked by their average impurity to determine the importance of each feature, representative features are screened according to feature importance, the random forest is rebuilt and fitted on the representative features and the labels, and the fitted weight information is used to correct the relationship between the abnormal patient features and the labels, giving the corrected data.
- 5. The method of claim 1, wherein in Step 6 the screening model integrates the glmnet, xgboost, and ranger models to screen features and filter out some of them, a classifier is then fitted on the training set and used to predict the test set, and the AUC on the test set is calculated to indicate classification effectiveness.
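The consensus rule of claims 1 to 3 (a value is abnormal only when both the IQR test and the LOF test flag it) and the per-patient count of Step 2 can be illustrated with a minimal sketch. It is not the patented implementation. Several points are assumptions made for illustration: the detector is applied to one feature column at a time, the distance is the absolute difference of 1-D values, and k = 3, the LOF threshold of 1.5, the quartile interpolation method, and all function names and toy values are hypothetical (the claims only say "a certain threshold"). The density follows the simplified formula in claim 2, which uses the plain neighbor distance; the commonly used LOF definition instead takes the larger of the neighbor's k-distance and the actual distance.

```python
from statistics import quantiles


def iqr_bounds(train_values):
    """Steps 1.7-1.11: fit [min_bound, max_bound] on the training values."""
    q1, _q2, q3 = quantiles(train_values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def _neighbours(x, ref, k, skip=None):
    """Step 1.1: indices of the k reference values closest to x."""
    idx = [j for j in range(len(ref)) if j != skip]
    idx.sort(key=lambda j: abs(ref[j] - x))
    return idx[:k]


def _lrd(x, ref, k, skip=None):
    """Step 1.4: reciprocal of the mean distance from x to its k neighbours."""
    nbrs = _neighbours(x, ref, k, skip)
    mean_dist = sum(abs(ref[j] - x) for j in nbrs) / k
    return 1.0 / max(mean_dist, 1e-12)  # guard against duplicate values


def lof(x, ref, k, skip=None):
    """Step 1.5: mean ratio of the neighbours' density to the density of x."""
    nbrs = _neighbours(x, ref, k, skip)
    own = _lrd(x, ref, k, skip)
    return sum(_lrd(ref[j], ref, k, skip=j) / own for j in nbrs) / k


def consensus_flags(train, values, k=3, lof_threshold=1.5, in_train=False):
    """Flag a value only when BOTH the IQR test and the LOF test call it abnormal.

    Both detectors are fitted on `train`; set in_train=True when `values`
    is the training column itself so a point is not its own neighbour.
    Assumes len(train) > k.
    """
    lo, hi = iqr_bounds(train)
    flags = []
    for i, x in enumerate(values):
        iqr_out = x < lo or x > hi
        lof_out = lof(x, train, k, skip=i if in_train else None) > lof_threshold
        flags.append(iqr_out and lof_out)
    return flags


def abnormal_counts(train_table, table, **kwargs):
    """Step 2: number of abnormal features per patient.

    Both tables map a feature name to a list with one value per patient.
    """
    n_patients = len(next(iter(table.values())))
    counts = [0] * n_patients
    for name, values in table.items():
        flags = consensus_flags(train_table[name], values, **kwargs)
        for p, flag in enumerate(flags):
            counts[p] += int(flag)
    return counts


# Toy, made-up feature column for eight training patients.
train_feature = [1.0, 1.1, 0.9, 1.2, 1.05, 0.95, 1.15, 1.0]
test_flags = consensus_flags(train_feature, [1.02, 9.0])
```

Patients whose count from `abnormal_counts` exceeds a threshold read off the distribution of counts would then be treated as abnormal patients, as Step 2 describes. Requiring agreement between the two detectors makes the rule conservative: a value that only one detector flags is kept as normal.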
Description
Medical image focus characteristic value anomaly elimination method
Technical Field
The invention belongs to the technical field of machine learning and radiomics, and in particular relates to a method for eliminating anomalies in medical image lesion feature values.
Background
With the development of radiomics, it has become increasingly common to extract a large number of features from medical images and to integrate these features with clinical data for prognostic analysis, helping clinicians develop more accurate and personalized treatment strategies. Accurate prognostic analysis is vital for clinicians developing precise, individualized treatment strategies that prolong patient survival. The extracted radiomics features are very important for prognostic analysis, but at present they may be abnormal because the delineation of a patient's images is inaccurate or because of problems in software extraction and the like. How to eliminate these anomalies and ensure the accuracy of the data source has therefore become key to improving the accuracy of prognostic analysis. Although several kinds of outlier detection methods exist, such as statistics-based, clustering-based, density-based, and proximity-distance-based detection, it is still necessary to consider whether an outlier detection model is effective and whether it generalizes, and there is no specific method for proving the validity of outlier detection so as to eliminate lesion feature value anomalies caused by inaccurate medical image delineation.
In view of the above, it is necessary to design a method for eliminating anomalies in medical image lesion feature values and to prove its effectiveness and generalization, so as to solve the above problems.
Disclosure of Invention
The invention aims to provide a medical image lesion feature value anomaly elimination method and to prove the effectiveness and generalization of the method. To achieve the above purpose, the invention provides a medical image lesion feature value anomaly elimination method comprising the following specific steps: Step 1, dividing a data set into a training set and a test set according to a proportion, fitting an outlier detection model that integrates density and statistics on the training set, and using the model fitted on the training set to detect outliers in the test set data; Step 2, for the training set data and the test set data, counting the number of abnormal features of each patient, taking the patient as the unit, drawing a distribution diagram of the number of abnormal features per patient, determining an anomaly threshold from the distribution diagram, and determining the number of abnormal patients, whose abnormal feature count is greater than the threshold, and the number of normal patients, whose abnormal feature count is less than the threshold; Step 3, within the training set data and the test set data respectively, dividing the normal patients into a normal patient training set and a normal patient test set; Step 4, for the normal patient training set and the abnormal patient data in the training set data, and for the normal patient training set and the abnormal patient data in the test set data, applying the idea of weak supervision, in which the labels of only part of the data, namely the labels of the normal patient training set, are used: first fitting the relationship between the normal patient data and the labels with an ensemble model, and then using the weight relationship learned from the normal patients to correct the relationship between the abnormal patient data and the labels to obtain corrected labels; Step 5, for the training set data and the test set data respectively, combining the normal patient training set and the abnormal patient training set to form an original training set, combining the normal patient training set and the data corrected by weak supervision to form a corrected training set, and taking the normal patient test set as the corresponding test set; Step 6, for the training set and the test set respectively, using a screening model and a classifier to judge whether the anomaly detection is effective by comparing the test set performance of the model fitted on the original training set with the test set performance of the model fitted on the corrected training set, wherein the area under the curve (AUC) is used as the index for evaluating the predictive ability of the model fitted on the weakly supervised corrected training set and of the model fitted on the original training set, and if the AUC of the model fitt