CN-121998963-A - Method for identifying solid pulmonary nodules based on multi-modal data fusion
Abstract
An embodiment of the application relates to a method for identifying solid pulmonary nodules based on multi-modal data fusion. The method comprises: obtaining multi-modal data of pulmonary nodules; constructing a training set and a validation set from the multi-modal data; preprocessing, for the training set, the blood samples, CT image data and clinical data in the multi-modal data to obtain methylation features, CT image features and clinical features, respectively; constructing a multi-modal fusion model from the methylation features, the CT image features and the clinical features; validating the multi-modal fusion model against the validation set; and, once validation has passed, obtaining the predicted probability of benign versus malignant nodules from the multi-modal fusion model.
Inventors
- XU QIN
- LI MINGMING
- PU JUE
Assignees
- 北京艾克伦医疗科技有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20260205
Claims (10)
- 1. A method for identifying solid pulmonary nodules based on multi-modal data fusion, the method comprising: acquiring multi-modal data of pulmonary nodules; constructing a training set and a validation set from the multi-modal data; preprocessing, for the training set, a blood sample, CT image data and clinical data in the multi-modal data, respectively, to obtain methylation features, CT image features and clinical features in turn; constructing a multi-modal fusion model from the methylation features, the CT image features and the clinical features; validating the multi-modal fusion model against the validation set; and, after the validation is passed, obtaining a predicted probability of benign versus malignant nodules through the multi-modal fusion model.
- 2. The method according to claim 1, wherein constructing the training set and the validation set from the multi-modal data comprises: acquiring multi-modal data of patients with solid nodules, pure ground-glass nodules and mixed ground-glass nodules; acquiring a first number of multi-modal data and a second number of multi-modal data from the multi-modal data, wherein the numbers of solid nodules, pure ground-glass nodules and mixed ground-glass nodules in the first number of multi-modal data conform to a first proportion, the numbers of solid nodules, pure ground-glass nodules and mixed ground-glass nodules in the second number of multi-modal data conform to a second proportion, and the first number is not equal to the second number; constructing the training set from the first number of multi-modal data; and constructing the validation set from the second number of multi-modal data.
- 3. The method according to claim 1, wherein obtaining the methylation features comprises: capturing, predicting and preprocessing a blood sample to be tested according to a preset methylation panel to obtain a BAM file, wherein the BAM file records the exact position at which each sequencing read aligns to a reference genome and is used to determine whether a site is methylated and whether methylation at the sites changes synchronously; dividing methylation blocks based on a preset rule and the BAM file, wherein the number of cytosine-guanine dinucleotide (CpG) sites in the same methylation block is greater than 2 and the Pearson correlation coefficient of any adjacent CpG sites is greater than 0.5; calculating, from the BAM file, the number of methylated reads and the total number of reads at each CpG site in each methylation block; summing the methylated reads over all sites and dividing by the sum of the total reads over all sites to obtain an average methylation level (AMF) of each methylation block; extracting each individual sequencing read from the BAM file; scoring each sequencing read, wherein, for a sequencing read having k CpG sites, the weight is a first value when all sites are unmethylated, a second value when all sites are methylated, and m/k when m consecutive sites are methylated; calculating a methylation haplotype load (MHL) according to MHL = Σ(weight of each read) / total number of reads, wherein the total number of reads is the total number of sequencing reads covering the methylation block; calculating the Pearson correlation coefficient between each AMF or MHL feature and the benign/malignant label, and screening by the Pearson correlation coefficient to obtain initial methylation features; and determining the methylation features from the initial methylation features by LASSO regression.
- 4. The method of claim 3, wherein obtaining the BAM file further comprises: obtaining a blood sample in the multi-modal data of the pulmonary nodule and a blood sample of a control; extracting cfDNA, constructing a sequencing library, and carrying out enzymatic conversion to obtain an amplified library product; sequencing the amplified library product on a high-throughput sequencer and outputting an original FASTQ file, wherein the FASTQ file comprises a plurality of DNA sequences and their quality information; processing the original FASTQ file by calling fastp to add UMI tags to the paired reads and filtering out low-quality bases to generate a FASTQ file with UMIs, wherein filtering out low-quality bases comprises removing adapters with the cutadapt tool and discarding reads shorter than 50 bp, and wherein the first 12 bases of Read 2 are the UMI; calling the Bismark tool to align the filtered paired reads to the hg19 human reference genome to generate an initial BAM file; and, based on the UMI tags, removing duplicate sequences introduced by PCR amplification, sorting the initial BAM file by chromosomal position with Samtools, retaining reads with a mapping quality > 20, filtering out reads with a C-to-T conversion rate < 95% at non-CpG sites, generating the BAM file, and building an index.
- 5. The method of claim 1, wherein obtaining the CT image features comprises: acquiring a first region of interest (ROI) and a second ROI in the CT image data, determined by a first terminal and a second terminal, respectively; verifying the consistency of the first ROI and the second ROI by means of a Kappa coefficient; after the verification is passed, determining a target ROI; extracting radiological features of the target ROI; calculating an intraclass correlation coefficient (ICC) value of each radiological feature; retaining, as initial radiological features, the radiological features whose ICC values are greater than a first threshold; and calculating the variance of each initial radiological feature, comparing the variance with a preset second threshold, and retaining the initial radiological features whose variance is greater than the second threshold as the CT image features.
- 6. The method of claim 5, wherein the CT image features include morphological features describing the geometry of the nodule, gray-value features describing the distribution of pixel values within the ROI, texture features describing the spatial distribution and relationships of pixel gray values, including contrast and correlation, and higher-order features extracted after filtering the image.
- 7. The method according to claim 1, wherein obtaining the clinical features comprises: forming clinical data from demographic information, medical history information, tumor markers, hematological parameters and derived indices; processing each variable in the clinical data to determine the difference in that variable between the benign and malignant groups, the processing including a t-test or a Mann-Whitney U test for continuous variables, a chi-square test for categorical variables, and a rank-sum test for ordinal variables; retaining the variables with P values smaller than a third threshold to obtain first variables; screening the first variables by least absolute shrinkage and selection operator (LASSO) regression and logistic regression to obtain second variables from the first variables; and binarizing the second variables according to a cut-off value to obtain categorical features, wherein the categorical features are the clinical features.
- 8. The method of claim 1, wherein constructing the multi-modal fusion model from the methylation features, the CT image features and the clinical features comprises: performing Z-score standardization on the methylation features, normalization on the clinical features, and L2 regularization on the CT image features; evaluating the processed methylation features, CT image features and clinical features by their SHAP values, ranking them by SHAP value, and selecting the top n features; performing low-rank decomposition on the top n features through a low-rank matrix to obtain decomposition results; fusing the decomposition results of all modalities by element-wise product fusion according to F = Π(M_i × W_i), wherein Π is the element-wise product operation, M_i is the feature matrix of the i-th modality, and W_i is the weight matrix of the i-th modality; and performing stacking ensemble learning with logistic regression as the base model and a support vector machine as the meta-model, and optimizing with a cross-entropy loss function to obtain the multi-modal fusion model.
- 9. The method of claim 8, wherein the method further comprises: verifying the overall performance of the multi-modal fusion model on an independent test set, wherein the overall performance is evaluated by the area under the ROC curve (AUC), sensitivity, specificity and accuracy.
- 10. The method according to claim 1, wherein the method further comprises: sequentially constructing a methylation feature model, a CT image feature model and a clinical feature model according to the methylation features and the validation set, the CT image features and the validation set, and the clinical features and the validation set, respectively: taking a plurality of methylation features of each sample in the training set and the labels corresponding to the methylation features as inputs, training the methylation feature model, evaluating the performance of the methylation feature model using 5-fold cross-validation, adjusting the parameters until they are optimal, evaluating the methylation feature model on a test set, and completing the training of the methylation feature model, wherein the AUC is greater than or equal to 0.9; taking a plurality of CT image features of each sample in the training set and the labels corresponding to the CT image features as inputs, training the CT image feature model, evaluating the performance of the CT image feature model using 5-fold cross-validation, adjusting the parameters until they are optimal, evaluating the CT image feature model on a test set, and completing the training of the CT image feature model, wherein the AUC is greater than or equal to 0.9; taking a plurality of clinical features of each sample in the training set and the labels corresponding to the clinical features as inputs, training the clinical feature model, evaluating the performance of the clinical feature model using 5-fold cross-validation, adjusting the parameters until they are optimal, evaluating the clinical feature model on a test set, and completing the training of the clinical feature model, wherein the AUC is greater than or equal to 0.9; and constructing the multi-modal fusion model by fusing the outputs of the methylation feature model, the CT image feature model and the clinical feature model.
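The AMF and MHL calculations recited in claim 3 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The claim leaves the "first value" and "second value" unspecified, so the sketch assumes 0 and 1, and it assumes that m is the longest run of consecutive methylated CpG sites on a read. The function names and the 0/1 encoding of reads are hypothetical.

```python
from typing import Sequence


def average_methylation_level(meth_reads: Sequence[int],
                              total_reads: Sequence[int]) -> float:
    """AMF of one block: summed methylated reads / summed total reads."""
    total = sum(total_reads)
    if total == 0:
        raise ValueError("block has no read coverage")
    return sum(meth_reads) / total


def read_weight(read: Sequence[int],
                unmethylated_weight: float = 0.0,  # assumed "first value"
                methylated_weight: float = 1.0     # assumed "second value"
                ) -> float:
    """Weight of one read; `read` holds 1 (methylated) or 0 per CpG site."""
    k = len(read)
    if k == 0:
        raise ValueError("read covers no CpG site")
    if not any(read):
        return unmethylated_weight
    if all(read):
        return methylated_weight
    # Assumption: m is the longest run of consecutive methylated sites.
    longest = run = 0
    for site in read:
        run = run + 1 if site else 0
        longest = max(longest, run)
    return longest / k


def methylation_haplotype_load(reads: Sequence[Sequence[int]]) -> float:
    """MHL = sum of per-read weights / number of reads covering the block."""
    if not reads:
        raise ValueError("no reads cover the block")
    return sum(read_weight(r) for r in reads) / len(reads)


# Toy block with three CpG sites (AMF) and three reads with four sites (MHL).
amf = average_methylation_level([3, 5, 2], [10, 10, 10])
mhl = methylation_haplotype_load([[0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 0, 0]])
```

In a real pipeline the per-site counts and per-read methylation calls would be parsed from the BAM file; that parsing is outside the scope of this sketch.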
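Two of the screening steps in claim 5, the Kappa consistency check and the variance filter, can be sketched as follows. The claim does not say at what granularity the Kappa coefficient is computed, so this sketch assumes Cohen's kappa over flattened per-voxel ROI labels. It also assumes population variance, and the thresholds are placeholders. All names are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List, Sequence


def cohens_kappa(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Cohen's kappa between two annotators' per-voxel ROI labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of voxels with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (list(labels_a).count(c) / n) * (list(labels_b).count(c) / n)
        for c in categories
    )
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def filter_by_variance(features: Dict[str, List[float]],
                       threshold: float) -> List[str]:
    """Keep feature names whose variance across samples exceeds threshold."""
    return [name for name, values in features.items()
            if pvariance(values) > threshold]


# Toy example: 8 voxels, 1 = inside the ROI, 0 = outside.
roi_first = [1, 1, 1, 0, 0, 0, 0, 0]
roi_second = [1, 1, 0, 0, 0, 0, 0, 0]
kappa = cohens_kappa(roi_first, roi_second)

kept = filter_by_variance({"flat": [1.0, 1.0, 1.0],
                           "spread": [0.0, 2.0, 4.0]}, threshold=0.5)
```

The ICC step that sits between these two in the claim is omitted here because the claim does not state which ICC form is used.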
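The element-wise product fusion F = Π(M_i × W_i) in claim 8 can be sketched without external dependencies. The sketch assumes each M_i has shape (samples × d_i) and each W_i has shape (d_i × r), so every modality is projected to a common rank r before the element-wise product. The claim does not state these shapes or how W_i is learned, and the toy weights below are made up for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def fuse(modalities: List[Matrix], weights: List[Matrix]) -> Matrix:
    """Element-wise product over modalities of (M_i @ W_i)."""
    if not modalities or len(modalities) != len(weights):
        raise ValueError("need one weight matrix per modality")
    fused = matmul(modalities[0], weights[0])
    for m, w in zip(modalities[1:], weights[1:]):
        projected = matmul(m, w)  # shape: samples x r
        fused = [[f * p for f, p in zip(f_row, p_row)]
                 for f_row, p_row in zip(fused, projected)]
    return fused


# One sample; methylation (2 features), CT (1 feature), clinical (2 features),
# each projected to rank r = 2 with made-up weights.
methylation, w_meth = [[1.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]]
ct, w_ct = [[3.0]], [[1.0, 2.0]]
clinical, w_clin = [[1.0, 1.0]], [[2.0, 0.0], [0.0, 0.5]]

fused = fuse([methylation, ct, clinical], [w_meth, w_ct, w_clin])
```

The fused matrix would then feed the stacked classifier described in the claim; training the weight matrices and the classifier is not shown.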
Description
Method for identifying solid pulmonary nodules based on multi-modal data fusion
Technical Field
The application relates to the technical field of data processing, and in particular to a method for identifying solid pulmonary nodules based on multi-modal data fusion.
Background
Lung cancer is the leading cause of cancer-related death worldwide, and its prognosis is closely related to the disease stage at diagnosis: the 5-year overall survival rate is as high as 85% for stage IA patients but only 6% for stage IV patients, so early diagnosis is important for improving the cure rate and reducing mortality. Pulmonary nodules are an early manifestation of lung cancer, and accurately distinguishing benign from malignant nodules is the key to clinical diagnosis and treatment decisions, but existing diagnostic methods have obvious limitations.
Currently, low-dose computed tomography (LDCT) has become the main means of lung cancer screening and reduces mortality by 20%, but it falls short in distinguishing benign from malignant nodules: 80%-90% of detected nodules are benign, which can easily lead to overdiagnosis and overtreatment, and the results are affected by nodule size, location and operator experience, so accuracy is limited. Clinical assessment tools (such as the Mayo Clinic model) rely on imaging parameters and risk factors, but their identification accuracy is not high and their sensitivity is strongly affected by nodule characteristics. Invasive diagnostic methods (e.g., bronchoscopy, percutaneous biopsy) can clarify the nature of a nodule, but carry risks of complications such as hemorrhage and pneumothorax and increase the patient's pain and medical burden. Liquid biopsy has attracted interest as a non-invasive detection means; in particular, circulating free DNA (cfDNA) methylation detection has diagnostic potential because specific methylation changes occur early in tumor development.
However, detection of a single biomarker (such as CEA or SCC) has limited accuracy and cannot meet clinical requirements on its own, and clinical data (such as age and smoking history) or imaging features (such as CT radiological features) used alone cannot fully characterize the biology of a nodule, so there is a risk of misdiagnosis and missed diagnosis. In theory, a multi-modal fusion strategy that integrates cfDNA methylation, CT image features and clinical data can improve diagnostic performance through complementary information, but the prior art faces challenges such as large differences in data structure, difficult feature screening and inefficient fusion algorithms, so that the resulting models lack interpretability and clinical applicability. Therefore, developing a technical solution that efficiently integrates multi-modal data, overcomes the limitations of single-modality methods and accurately distinguishes benign from malignant solid pulmonary nodules is an urgent clinical need.
Disclosure of Invention
The application aims to overcome the defects of the prior art by providing a method and a system for identifying solid pulmonary nodules based on multi-modal data fusion.
In order to achieve the above object, the present application provides a method for identifying solid pulmonary nodules based on multi-modal data fusion, the method comprising: acquiring multi-modal data of pulmonary nodules; constructing a training set and a validation set from the multi-modal data; preprocessing, for the training set, a blood sample, CT image data and clinical data in the multi-modal data, respectively, to obtain methylation features, CT image features and clinical features in turn; constructing a multi-modal fusion model from the methylation features, the CT image features and the clinical features; validating the multi-modal fusion model against the validation set; and, after the validation is passed, obtaining a predicted probability of benign versus malignant nodules through the multi-modal fusion model.
In one possible implementation, constructing the training set and the validation set from the multi-modal data specifically includes: acquiring multi-modal data of patients with solid nodules, pure ground-glass nodules and mixed ground-glass nodules; acquiring a first number of multi-modal data and a second number of multi-modal data from the multi-modal data, wherein the numbers of solid nodules, pure ground-glass nodules and mixed ground-glass nodules in the first number of multi-modal data conform to a first proportion, the numbers of solid nodules, pure ground-glass nodules and mixed ground-glass nodules in the second number of multi-modal data conform to a second proportion, and the first number is not equal to the second number; constructing the training set from the first number of multi-modal data; and constructing a veri