CN-121983317-A - Biological age assessment method, system and application based on full life cycle DNA methylation
Abstract
The invention discloses a biological age assessment method, system and application based on full life cycle DNA methylation, and belongs to the technical field of bioinformatics. First, quality control and standardization are performed on the acquired raw DNA methylation data. Differential methylation sites significantly associated with calendar age are then screened as features. Next, with calendar age as the target variable, a regression model is built with the LightGBM gradient boosting framework to obtain a methylation clock model, which is used to predict the DNA methylation age of an individual. An epigenetic age acceleration value is calculated from the deviation between the DNA methylation age and the calendar age. Finally, the age acceleration value is analyzed for association with health or physiological indicators at different life stages to evaluate the biological aging state of the individual and predict related health risks. The invention covers the whole life cycle, is suited to the Chinese population, and can provide an effective tool for clinical disease risk prediction, health management and evaluation of anti-aging interventions.
Inventors
- SHI XIAOMING
- QIAN HAO
- Lou Shihao
- LV YUEBIN
- CHEN JIAHAO
- CHEN CHEN
- WANG JUN
- LU YIFU
- SHI WENHUI
- MENG XI
- Zhang Zenghang
Assignees
- National Institute of Environmental Health, Chinese Center for Disease Control and Prevention (中国疾病预防控制中心环境与健康相关产品安全所)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-04-03
Claims (10)
- 1. A method for biological age assessment based on full life cycle DNA methylation, comprising the following steps: S1, standardized preprocessing of methylation data: performing quality control on the acquired raw DNA methylation data, removing low-quality probes and samples, performing background correction, dye bias correction and quantile normalization with the ENmix package, and correcting batch effects with the ComBat algorithm to obtain a high-quality, standardized matrix of CpG site β values; S2, primary screening of aging-related CpG sites: based on the data standardized in step S1, screening differential methylation sites significantly associated with calendar age through an epigenome-wide association study (EWAS) using a linear regression model, the screening criteria being a false discovery rate (FDR) of less than 0.01 and an absolute value of the age-effect regression coefficient greater than 0.002; S3, construction and optimization of the methylation clock: using a machine learning algorithm, with the methylation levels of the differential CpG sites screened in step S2 as features and calendar age as the target variable, building a regression model with the LightGBM gradient boosting framework, with hyperparameter optimization by 5-fold cross-validation combined with Bayesian optimization, to obtain a methylation clock model; S4, calculation and definition of epigenetic age acceleration: predicting the DNA methylation age (DNAmAge) of an individual with the methylation clock model constructed in step S3, regressing DNAmAge linearly on calendar age, and taking the regression residual as the epigenetic age acceleration value DNAMAGEDEV, wherein positive values represent accelerated aging and negative values represent decelerated aging; and S5, verifying the methylation clock model constructed in step S3 on an independent test set by calculating the Pearson correlation coefficient R, the coefficient of determination R², the root mean square error (RMSE) and the mean absolute error (MAE), and performing association analysis between the epigenetic age acceleration value obtained in step S4 and health or physiological indicators at different life stages to evaluate the biological aging state of an individual, predict the risk of age-related diseases and evaluate the effect of anti-aging interventions.
- 2. The biological age assessment method according to claim 1, wherein step S1 specifically comprises: S11, obtaining peripheral blood samples from 5000-6000 subjects aged 3-118 years from a large Chinese population cohort who meet the inclusion criteria, calculating calendar age accurate to the hour from each subject's date of birth and date of enrollment, computing the actual number of years from birth to enrollment with the date of birth as reference, cross-checking the calculated ages by multiple verification, and taking the age with the smallest deviation as the final calendar age; S12, performing epigenome-wide DNA methylation profiling on the peripheral blood samples obtained in step S11 with the Illumina Infinium MethylationEPIC v2.0 BeadChip, and performing quality control on the raw data to standardize it and ensure data quality for subsequent analysis, wherein the quality control comprises: performing background correction and dye bias correction with the ENmix R package; performing quantile normalization and probe design bias correction; excluding samples and probes with a detection failure rate of >5%; excluding probes located on sex chromosomes, non-CpG probes and cross-reactive probes; imputing missing values with the k-nearest neighbor method; correcting batch effects arising from different assay runs with the ComBat algorithm; estimating from the methylation data the proportions of six leukocyte subtypes, namely neutrophils, monocytes, B cells, natural killer cells, CD4+ T cells and CD8+ T cells; and retaining the β values of about 900,000 high-quality CpG sites after quality control for subsequent analysis, the β values after quality control exhibiting the expected bimodal distribution.
- 3. The method for assessing biological age according to claim 2, wherein the screening of differential methylation sites significantly associated with calendar age in step S2 is specifically performed on the post-quality-control methylation β values obtained in step S12, the Benjamini-Hochberg method being adopted to control the false discovery rate, CpG sites with FDR < 0.01 and an absolute value of the age-effect regression coefficient > 0.002 being defined as age-related differential methylation sites, and 3237 differential methylation sites significantly associated with age being finally identified.
- 4. The biological age assessment method according to claim 3, wherein the methylation clock model in step S3 is trained as follows: the overall data set is divided into a training set and an independent test set at a ratio of 7:3; hyperparameter tuning is performed on the training set by combining 5-fold cross-validation with a Bayesian optimization strategy, with minimization of the mean absolute error (MAE) as the objective; and the performance of the tuned model is evaluated on the independent test set, the evaluation metrics comprising the Pearson correlation coefficient R, the coefficient of determination R², the root mean square error (RMSE) and the MAE.
- 5. The biological age assessment method according to claim 4, wherein in step S3, to further improve the interpretability of the model and reduce the feature dimension, the BorutaShap algorithm is used to perform feature selection on the pre-trained model; the BorutaShap algorithm compares the importance of generated shadow features with that of the original features and, over multiple iterations, eliminates redundant CpG sites whose importance is no higher than random noise, finally yielding a core feature set of 26 CpG sites that contribute most to age prediction; and the final LightGBM model is retrained on the whole data set with this feature set under a 5-fold cross-validation strategy, so that the DNA methylation age of each individual is an out-of-sample prediction, thereby obtaining a biological age estimator suitable for the whole life cycle.
- 6. The method of claim 5, wherein the 26 CpG sites of step S3 are cg19283806, cg07553761, cg27099280, cg03890691, cg26614073, cg22551157, cg02378183, cg01124297, cg24866418, cg07582229, cg13206721, cg07023764, cg26638716, cg13984040, cg19076536, cg18902238, cg14558074, cg24455300, cg00481951, cg10351253, cg12773402, cg01885725, cg16604658, cg07927379, cg16867657, cg22454769.
- 7. The method of claim 6, wherein step S4 uses the DNA methylation age deviation DNAMAGEDEV as the core age acceleration index, the index being obtained by linearly regressing the DNA methylation age of the individual on calendar age and taking the standardized residual as its value, wherein positive residuals correspond to accelerated aging and negative residuals correspond to decelerated aging, and wherein aging phenotypes are further classified based on the mean absolute error (MAE) of the final model: individuals with DNAMAGEDEV > MAE are defined as fast agers, individuals with DNAMAGEDEV < -MAE are defined as slow agers, and the rest are defined as normal agers.
- 8. The biological age assessment method according to claim 7, wherein in step S5 the epigenetic age acceleration value is analyzed for association with health or physiological indicators at different life stages, specifically: in young individuals aged 3-18 years, with growth and development indicators and sex hormone-related indicators; in adults aged 18-90 years, with cardiometabolic risk indicators, liver enzymes and systemic inflammation-related indicators; and in elderly individuals aged 90 years and above, with frailty indicators and cognitive function indicators reflecting health resilience and functional maintenance, so as to assess the biological aging state and related health risks at different life stages.
- 9. A biological age assessment system based on full life cycle DNA methylation, comprising: a data acquisition module configured to acquire methylation level data of an individual's biological sample at the 26 CpG sites of claim 6; a processing and calculation module configured to run a trained machine learning model that takes the methylation level data as input and calculates and outputs the DNA methylation age of the individual; and an application output module configured to further calculate an epigenetic age acceleration index based on the DNA methylation age and to generate a health risk assessment report.
- 10. Use of the biological age assessment method based on full life cycle DNA methylation according to any one of claims 1 to 8 in the preparation of a kit or detection device for assessing the biological aging status of an individual or predicting the risk of an age-related disease.
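As an illustration of the screening rule in step S2 and the age acceleration calculation in step S4 of the claims above, the following is a minimal Python sketch under stated assumptions, not the implementation of the invention. It assumes that the per-site age coefficients and p-values from the EWAS linear regression are already available, and it uses the raw regression residual in years (the standardization mentioned in claim 7 is omitted). The function names and site labels are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def screen_sites(site_ids, coefs, pvalues, fdr_cut=0.01, coef_cut=0.002):
    """Keep sites with FDR < fdr_cut and |age coefficient| > coef_cut."""
    fdr = bh_adjust(pvalues)
    return [s for s, b, q in zip(site_ids, coefs, fdr)
            if q < fdr_cut and abs(b) > coef_cut]


def age_acceleration(calendar_age, dnam_age):
    """Residuals of an ordinary least squares fit of DNAmAge on calendar age."""
    n = len(calendar_age)
    mean_x = sum(calendar_age) / n
    mean_y = sum(dnam_age) / n
    sxx = sum((x - mean_x) ** 2 for x in calendar_age)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(calendar_age, dnam_age))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x)
            for x, y in zip(calendar_age, dnam_age)]


def classify(residuals, mae):
    """Label each residual: > MAE is fast, < -MAE is slow, otherwise normal."""
    labels = []
    for r in residuals:
        if r > mae:
            labels.append("fast")
        elif r < -mae:
            labels.append("slow")
        else:
            labels.append("normal")
    return labels
```

For example, with calendar ages [10, 20, 30, 40] and predicted DNA methylation ages [12, 18, 33, 39], the residuals are approximately 0.9, -2.7, 2.7 and -0.9 years, so with an MAE of 2.0 the second individual is labeled slow aging, the third fast aging, and the others normal.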
Description
Biological age assessment method, system and application based on full life cycle DNA methylation
Technical Field
The invention belongs to the technical field of bioinformatics and relates to a biological age assessment method, system and application based on full life cycle DNA methylation.
Background
DNA methylation levels change in a regular manner with age, and an "epigenetic clock" built on these changes can be used to predict biological age. The prior art falls mainly into two types. First-generation clocks (such as Horvath and Hannum) build a model from CpG sites strongly correlated with calendar age and achieve high prediction accuracy in adult populations. Second-generation clocks (such as PhenoAge) additionally introduce clinical phenotypes as training labels, which improves their association with disease risk.
However, the prior art still has the following technical defects. First, the ages in the training data used for model construction are concentrated in the range of 20-80 years, so prediction errors increase markedly in children, adolescents and elderly people over 90 years old, and the needs of full life cycle application cannot be met. Second, existing clocks are built on data from European and American populations, and applying them directly to the Chinese population introduces prediction errors that reduce their accuracy in the local population. Third, existing schemes lack a standardized, stage-specific application method and cannot establish a quantitative association between a single biological age index and the specific health indicators of different life stages, which limits their practical use in health assessment.
Therefore, it is necessary to construct, based on whole life cycle methylation data of the Chinese population, an epigenetic clock method that predicts stably at all ages and supports stage-specific health assessment.
Disclosure of Invention
In view of this, to solve the problems that existing DNA methylation clock schemes are not suited to the Chinese population and provide neither full life cycle coverage nor stage-specific assessment, the invention provides a biological age assessment method, system and application based on full life cycle DNA methylation. To achieve the above purpose, the invention provides the following technical solution: a biological age assessment method based on full life cycle DNA methylation, comprising the following steps:
S1, standardized preprocessing of methylation data: performing quality control and standardization on the acquired raw DNA methylation data to obtain high-quality CpG sites;
S2, primary screening of aging-related CpG sites: based on the high-quality CpG site data standardized in step S1, screening differential methylation sites significantly associated with calendar age through an epigenome-wide association study (EWAS);
S3, construction and optimization of the methylation clock: using a machine learning algorithm, with the methylation levels of the differential CpG sites screened in step S2 as features and calendar age as the target variable, building a regression model with the LightGBM gradient boosting framework to obtain the methylation clock model finally used for predicting biological age, and performing feature selection on the pre-trained methylation clock model with the BorutaShap algorithm to screen out a minimal core set of CpG site features that contribute most to age prediction;
S4, calculation and definition of epigenetic age acceleration: predicting the DNA methylation age of the individual with the methylation clock model constructed in step S3, and calculating an epigenetic age acceleration value based on the deviation between the DNA methylation age and the calendar age;
and S5, verifying the prediction accuracy of the methylation clock model constructed in step S3 on an independent test set, performing association analysis between the epigenetic age acceleration value obtained in step S4 and specific health or physiological indicators of different life stages, and establishing a stage-specific health risk assessment system.
The performance of the model of the invention was compared with the Horvath, Hannum and PhenoAge clocks in the general population. The results showed that the MAE of the model of the invention (4.71 years) was significantly better than that of Horvath1 (MAE = 6.38 years), Horvath2 (MAE = 4.89 years), Hannum (MAE = 5.06 years) and PhenoAge (MAE = 6.79 years).
Further, step S1 specifically includes: S11, obtaining peripheral blood samples from 5000-6000 subjects aged 3-118 years from a large Chinese population cohort who meet the inclusion criteria, calibrating calendar ages of particip