CN-122024809-A - Method for constructing colorectal cancer intelligent prediction model based on mass spectrum seroproteomics
Abstract
The invention discloses a method for constructing an intelligent colorectal cancer prediction model based on mass spectrum seroproteomics. Candidate micropeptides with diagnostic potential are screened by combining high-precision mass spectrometry quantification with AI-based function prediction, and an ensemble learning algorithm is then used to construct the model. The workflow is rigorous: an advanced normalization algorithm, an ensemble feature selection strategy and targeted handling of sample imbalance markedly improve the robustness of biomarker screening and the accuracy of the prediction model. In addition, model interpretability analysis provides theoretical support for clinical application of the markers.
Inventors
- LIN AIFU
- Su Xinwan
- SHI CHENGYU
- WANG ZHIZHUO
- HAN MENGYU
- CHENG GUANGQIAN
Assignees
- Zhejiang University (浙江大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20251120
Claims (10)
- 1. A method for constructing an intelligent colorectal cancer prediction model based on mass spectrum seroproteomics, characterized by comprising the following steps: (1) Obtaining raw micropeptidomics data: low-abundance micropeptides are extracted from each blood sample to be tested by ultrafiltration and reverse enrichment; the disulfide bonds in the micropeptides are then removed by conventional mass spectrometry pretreatment and the sulfhydryl groups are blocked by methylation; a peptide fragment mixture derived from the blood micropeptides is obtained by enzymatic digestion; and raw mass spectrometry data of the peptide fragment mixture are acquired by high-resolution liquid chromatography-mass spectrometry; (2) Preprocessing the raw mass spectrometry data: differences between technical replicates are eliminated; matches to the reverse-sequence library, which is used to evaluate the false positive rate of the mass spectrometry database search, are removed; and the parent ion intensities of the peptide fragments are corrected with a delayed normalization algorithm, which effectively removes systematic bias between samples; the parent ion intensity information is then aggregated from the peptide level to the micropeptide level, a dual screening strategy based on missing-value patterns is designed to identify candidate micropeptides, a suitable statistical test is selected automatically according to the distribution characteristics of the data, and FDR correction is applied to control the false positive rate of multiple testing, yielding a micropeptide-sample quantitative matrix; (3) Feature screening, to select the micropeptide markers with the greatest discriminating power: the micropeptide-sample quantitative matrix is divided into a training set and a test set by stratified random sampling; missing-value imputation and Z-score standardization are fitted on the training set to learn the imputation values and standardization parameters, and the learned parameters are applied to transform both the training set and the test set, ensuring a fair assessment of model generalization; the SMOTE algorithm is applied to the training set to handle class imbalance; low-variance filtering is applied to the SMOTE-balanced training set to remove micropeptides whose expression is almost unchanged across all samples; all micropeptides are then ranked comprehensively using recursive feature elimination with cross-validation, combined with the evaluation results of several machine learning algorithms and information-theoretic methods, so as to select a stable micropeptide marker combination with significant diagnostic accuracy and development potential; (4) Model training and validation: a classification prediction model is constructed from the selected micropeptide marker combination; in the model training stage, to address the class imbalance common in clinical samples, SMOTE is applied to the training set to prevent the model from being biased toward the majority class and to improve its ability to identify cancer samples; the optimal model is selected by comparing the performance of several machine learning models on an independent validation set, and its performance is finally verified on an independent test set to evaluate its generalization ability and diagnostic accuracy in real-world scenarios.
- 2. The method of claim 1, wherein the blood samples to be tested in step (1) comprise colorectal cancer samples and normal control samples, and the sample inclusion criteria comprise: pre-treatment blood samples from patients with pathologically diagnosed gastrointestinal tumors, age- and sex-matched healthy control samples, and written informed consent from the donors of all samples.
- 3. The method of claim 1, wherein in step (1) each 300 μL blood sample is pretreated as follows before mass spectrometry:
  a. transfer 300 μL of serum to a 10 kDa ultrafiltration tube and centrifuge at 10,000 × g at 4 °C for 90 min;
  b. transfer the supernatant from step a to a new EP tube, add 200 μL of High Select Top14 high-abundance protein removal resin, incubate at room temperature for 30 min, centrifuge at 12,000 rpm (full speed) for 1 min, and transfer the supernatant to a new EP tube;
  c. take supernatant containing 80 μg of total protein, add an equal volume of acetone and mix well, then add acetone pre-cooled to -20 °C at 4 times the volume of the supernatant and leave at -20 °C for 3 h;
  d. centrifuge the suspension from step c at 13,000 rpm for 15 min;
  e. dissolve the pellet from step d in 300 μL of digestion buffer and incubate overnight at 37 °C, the digestion buffer being prepared by dissolving endoprotease Lys-C and sequencing-grade trypsin in 50 mM NH4HCO3 buffer at pH 8.0, with a mass ratio of endoprotease to the total protein of step c of 100:1 and a mass ratio of trypsin to the total protein of step c of 50:1;
  f. add 1.5 μL of 5 mM TCEP aqueous solution and 6 μL of 11 mM IAM aqueous solution to the 300 μL reaction system of step e, incubate at 37 °C in the dark for 1 h, then adjust to pH < 2 with 50% (v/v) TFA aqueous solution to stop the digestion;
  g. dry the digestion product of step f under vacuum at 37 °C and 4000 rpm using an Alpha 2-4 LSCbasic freeze dryer, then redissolve it in about 30 μL of 0.1% (v/v) TFA aqueous solution;
  h. pack a cut piece of C18 solid-phase extraction membrane with a volume of about 2 cm³ into a chromatographic column, add 20 μL of acetonitrile to wet the packing thoroughly, and centrifuge at 600 × g for 1 min;
  i. to the column of step h, add 20 μL of 50% ACN solution containing 0.1% TFA to condition the packing and centrifuge at 800 × g for 1 min, the 50% ACN solution containing 0.1% TFA being prepared by mixing 500 μL of 10% (v/v) TFA aqueous solution, 25 mL of 100% acetonitrile and 19.5 mL of double-distilled water;
  j. to the column of step i, add 20 μL of 0.1% TFA aqueous solution, centrifuge at 1000 × g for 1 min, and repeat twice so that the packing is in a state suitable for binding the target;
  k. centrifuge the redissolved sample from step g at 12,000 rpm, load the supernatant onto the column of step j so that the target binds to the packing, and centrifuge at 600 × g for 2 min;
  l. to the column of step k, add 0.1% TFA aqueous solution, centrifuge at 1000 × g for 1 min, and repeat 2-3 times to remove non-specifically adsorbed impurities;
  m. to the column of step l, add 50 μL of 40% ACN solution containing 0.1% TFA, centrifuge at 600 × g for 1 min to elute the target, and collect the eluate;
  n. dry the eluate of step m under vacuum at 37 °C and 4000 rpm using an Alpha 2-4 LSCbasic freeze dryer, then redissolve it in 5-10 μL of 0.1% TFA aqueous solution to obtain the solution for mass spectrometry detection.
- 4. The method of claim 1, wherein the acquisition conditions for the mass spectrometry data in step (1) are as follows: chromatographic separation uses a C18 reversed-phase column with an inner diameter of 75 μm, an outer diameter of 360 μm, a length of 150 mm and a packing particle size of 2 μm; each sample is injected twice; gradient elution is performed at a constant flow rate of 300 nL/min with an elution time of 70 min, mobile phase A being water/0.1% formic acid and mobile phase B being 80% acetonitrile/0.1% formic acid; the elution gradient starts at 2% B, increases linearly to 28% B by 58 min, then to 35% B by 65 min, and finally to 98% B by 70 min; mass spectrometric detection uses a Thermo Q Exactive HF-X mass spectrometer controlled by Xcalibur 4.1 software and operated in data-dependent acquisition mode; each acquisition cycle starts with a full-scan mass spectrum with a scan range of 350-1800 m/z and a resolution of 60,000, followed by 20 data-dependent MS/MS scans; the collision energy is 30%, the automatic gain control target is set to 3e6, the maximum injection time is 50 ms, and the resolution of the MS2 spectra is set to 15,000.
- 5. The method of claim 1, wherein the preprocessing in step (2) is carried out as follows: preliminary analysis is performed with Thermo Xcalibur Qual Browser and Proteome software; file names are first pattern-matched so that the data from all technical replicate files belonging to the same biological sample are merged and uniformly numbered into one data table; for the correction, the raw parent ion intensity values ms1_int_sum of each peptide fragment are grouped by sample, the optimize.minimize function of the SciPy library is used with the SLSQP or L-BFGS-B algorithm as the solver to calculate a unique normalization factor for each sample, and the raw parent ion intensity values are multiplied by the corresponding normalization factor to give the corrected parent ion intensity values ms1_int_sum_apex_dn; the parent ion intensity information is then aggregated from the peptide level to the micropeptide level according to the micropeptide identifiers: the data are grouped by micropeptide identifier, and the corrected parent ion intensity values of all peptide fragments belonging to the same micropeptide in the same sample are summed to give the final quantitative abundance value of that micropeptide in that sample.
- 6. The method of claim 1, wherein the dual screening strategy in step (2) comprises: 1) selecting proteins with a high detection rate, that is, a missing rate of < 50%, in at least one group; and 2) identifying proteins whose detection rates differ significantly between groups; missing values are filled with one tenth of the group mean so that the distribution characteristics of the data are preserved.
- 7. The method of claim 1, wherein in step (3) the data are divided into a 70% training set and a 30% validation set.
- 8. The method of claim 1, wherein in step (3): the low-variance feature filtering removes redundant micropeptide features whose expression level varies minimally across all samples, with preliminary dimensionality reduction performed using the feature_selection.VarianceThreshold module of the Scikit-learn library; recursive feature elimination with cross-validation is performed using the feature_selection.RFECV module of the Scikit-learn library, with the extensive_histogram module as the estimator and the model_selection.StratifiedKFold module providing stratified cross-validation, and the optimal feature subset is determined by iteratively removing the least important features; and the training set is balanced using the over_sampling.SMOTE module of the Imbalanced-learn library, applied to the training set only, the SMOTE algorithm generating synthetic samples to balance the numbers of positive and negative samples.
- 9. The method of claim 1, wherein the multi-model training and selection in step (4) trains a plurality of classifiers on the balanced training set using the selected marker features, the classifiers being the ensemble.RandomForestClassifier, linear_model.LogisticRegression and svm.SVC modules of the Scikit-learn library and the XGBClassifier module of the XGBoost library.
- 10. Use of an intelligent predictive model of colorectal cancer constructed by the method of claim 1 to assist in colorectal cancer diagnosis.
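The leakage-free preprocessing described in step (3) of claim 1 (learn the imputation values and Z-score parameters on the training set, then apply them to both the training set and the test set) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `stratified_split`, `fit_preprocessor` and `transform`, the toy data, and the use of the per-feature median as the imputation value are all hypothetical choices made for the example, since the claims do not state which imputation statistic is used at this step. The 30% hold-out fraction follows claim 7. The sketch uses only the Python standard library so that it runs standalone, whereas claims 8 and 9 name Scikit-learn, Imbalanced-learn and XGBoost modules for the later steps.

```python
import random
import statistics


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split sample indices so each class keeps roughly the same proportion."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


def fit_preprocessor(rows):
    """Learn per-feature (median, mean, SD) from the training rows only.

    Assumes every feature has at least one observed (non-None) training value.
    """
    params = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        median = statistics.median(observed)      # assumed imputation value
        filled = [median if v is None else v for v in col]
        mean = statistics.fmean(filled)
        sd = statistics.pstdev(filled) or 1.0     # guard against zero variance
        params.append((median, mean, sd))
    return params


def transform(rows, params):
    """Impute and Z-score any rows using parameters learned on the training set."""
    return [
        [((med if v is None else v) - mean) / sd
         for v, (med, mean, sd) in zip(row, params)]
        for row in rows
    ]


# Toy micropeptide-sample matrix: 10 samples x 2 micropeptides, None = missing.
X = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [None, 40.0], [5.0, 50.0],
     [6.0, 60.0], [7.0, 70.0], [8.0, None], [9.0, 90.0], [10.0, 100.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # 0 = control, 1 = cancer (imbalanced)

train_idx, test_idx = stratified_split(y, test_fraction=0.3)
X_train = [X[i] for i in train_idx]
X_test = [X[i] for i in test_idx]

params = fit_preprocessor(X_train)        # the test set is never seen here
X_train_z = transform(X_train, params)
X_test_z = transform(X_test, params)      # reuses the training-set parameters
```

In the claimed workflow, class balancing with SMOTE would then be applied to `X_train_z` only, so that no synthetic samples and no test-set statistics influence the held-out evaluation.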
Description
Method for constructing colorectal cancer intelligent prediction model based on mass spectrum seroproteomics
(I) Technical field
The invention relates to a method for constructing an intelligent colorectal cancer prediction model based on mass spectrum seroproteomics.
(II) Background art
China is in a key period of rapid economic development and social transformation. As industrialization and urbanization accelerate, residents' lifestyles and dietary structure have changed markedly, and high-salt, high-fat diets and irregular work and rest are increasingly common. At the same time, the combined effects of work pressure and environmental factors are steadily raising the risk of chronic disease. Clinical data indicate that the five-year survival rate of patients with early (stage I) gastric cancer can exceed 90%, whereas that of late (stage IV) patients drops sharply to below 30%. Establishing an efficient, accurate and widely applicable early gastric cancer screening system has therefore become a key problem to be solved in China's public health field.
At present, the diagnosis of early colorectal cancer faces two main technical bottlenecks. On the one hand, early lesions are confined to the mucosal layer and are difficult to identify in conventional imaging examinations. On the other hand, each existing screening method has shortcomings: endoscopy, the gold standard, is highly accurate, but because it is highly invasive, demanding in equipment and costly, its coverage in population screening is only about 20%; pepsinogen (PG) detection is simple to perform, but its accuracy is limited. Highly accurate techniques are difficult to popularize, while easily popularized techniques have low accuracy, and this contradiction severely restricts the effectiveness of early gastric cancer screening.
In recent years, liquid biopsy, which is minimally invasive and convenient, has offered a new way out of this dilemma. However, the currently mainstream approach, methylation detection of circulating tumor DNA (ctDNA), still faces many challenges: the extremely low ctDNA content of peripheral blood, easy sample degradation and strong background noise lead to a high false negative rate (about 30%) and a high false positive rate (about 15%), which limits its clinical value.
In this context, micropeptides show unique advantages as a novel class of biomarkers. These functional small-molecule polypeptides encoded by non-coding RNA have the following characteristics: (1) small molecular weight and stable structure, which make them more stable in blood, less prone to contamination, easier to detect and more resistant to degradation; (2) high tissue specificity, with expression profiles closely related to the occurrence and development of colorectal cancer; and (3) higher detection sensitivity in peripheral blood than traditional protein markers. The development of a novel high-performance detection method based on micropeptides is therefore expected to break through the technical bottleneck of current early gastric cancer screening and to provide key technical support for establishing a tiered screening system suited to China.
(III) Summary of the invention
The invention aims to provide a method for constructing an intelligent colorectal cancer prediction model based on mass spectrum seroproteomics, which screens candidate micropeptides with diagnostic potential by combining high-precision mass spectrometry quantification with AI-based function prediction and then uses an ensemble learning algorithm to construct the model. The method adopts an optimized sample pretreatment workflow comprising ultrafiltration, reverse enrichment by immunoprecipitation of high-abundance classical serum proteins, and trypsin digestion, which markedly increases the proportion of micropeptides in the sample and thereby ensures stable and reproducible detection of the target micropeptides. The resulting model is simple to operate, non-invasive and highly specific, and meets the clinical application standard for early colorectal cancer screening. It can effectively overcome problems such as the poor compliance associated with traditional endoscopy and the high false positive rate of existing liquid biopsy techniques, provides an innovative solution for early colorectal cancer screening, is expected to break through the technical bottleneck of current early gastric cancer screening, and provides key technical support for improving the tiered diagnosis and treatment system in China. The technical scheme adopted by the inven