CN-121983138-A - Dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning
Abstract
The invention provides a dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning, belonging to the technical field of molecular breeding and disease prevention and control. The system comprises a data acquisition and preprocessing module, a feature set establishment module, a machine learning model establishment module and a result evaluation module. Dairy cow genome, transcriptome and metabolome data are integrated; candidate regulatory genes are screened through genome-wide association analysis, differential gene expression analysis, cis-eQTL mapping and co-localization analysis, and a gene expression feature set is constructed. A Lasso model screens the core regulatory genes and assigns their weights through the L1 regularization penalty; the model parameters are optimized by combining grid search with cross-validation, and model performance is evaluated with the ROC curve and the AUC value. The invention builds an integrated technical pipeline from gene screening to risk prediction, achieves efficient screening of core regulatory genes and accurate prediction of ketosis risk, and reduces economic losses in dairy farming.
Inventors
- HUANG HETIAN
- REN XIAOLI
- YAN LEI
- WANG YUEQIANG
- ZHANG SHIHENG
- LIANG DONG
- ZHANG ZHEN
- GAO TENGYUN
- SUN YU
- FU TONG
- LI YANGGUANG
Assignees
- Henan Agricultural University (河南农业大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-26
Claims (10)
- 1. A dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning, characterized by comprising a data acquisition and preprocessing module, a feature set establishment module, a machine learning model establishment module and a result evaluation module connected in sequence, the modules cooperating to complete the prediction and verification of dairy cow ketosis regulatory genes, wherein: the data acquisition and preprocessing module comprises a phenotype collection unit, a genome data processing unit and a transcriptome data processing unit, and is used for collecting dairy cow ketosis phenotype data and outputting a ketosis-associated data set, and for processing the multi-omics data and outputting a basic database after quality control, alignment and quantification; the feature set establishment module comprises a genome-wide association analysis unit, a differential gene expression analysis unit, a cis-eQTL analysis unit and a co-localization analysis unit; the machine learning model establishment module comprises a regulatory gene prediction unit, which builds a Lasso model from the screened candidate regulatory genes, screens the core regulatory genes and assigns their weights through the L1 regularization penalty, and optimizes the model parameters by combining grid search with cross-validation; and the result evaluation module uses the ROC curve and the AUC value to evaluate model overfitting and the effectiveness of ketosis prediction.
- 2. The dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning according to claim 1, wherein the multi-omics data comprise dairy cow genome data, transcriptome data and metabolome data; the genome data comprise SNP locus information, the transcriptome data comprise gene expression level information, and the metabolome data comprise the blood BHBA concentration; the phenotype data comprise the disease state of each cow, recorded as a binary label: diseased = case, healthy = control.
- 3. The dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning according to claim 2, wherein the BHBA concentration is classified as follows: BHBA ≤ 1.4 mmol/L is healthy, 1.4 mmol/L < BHBA ≤ 3 mmol/L is subclinical ketosis, and BHBA > 3 mmol/L is clinical ketosis; subsequent phenotype labelling and data set construction follow this ketosis classification.
- 4. The dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning according to claim 1, wherein the quality control of the genome data removes individuals with a genotype call rate below 95%, SNPs with a genotype call rate below 90%, SNPs with a minor allele frequency (MAF) < 0.05, and SNPs deviating from Hardy-Weinberg equilibrium with P < 1.0×10^-6.
- 5. The dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning according to claim 1, wherein the transcriptome data in the data acquisition and preprocessing module are processed by performing quality control on the sequencing data, aligning the reads to the reference genome ARS-UCD2.0, and quantifying expression to obtain an expression matrix of annotated genes.
- 6. The dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning according to claim 1, wherein, in the feature set establishment module, the results of the genome-wide association analysis unit, the differential gene expression analysis unit and the cis-eQTL analysis unit are output as an intermediate database; the co-localization analysis is performed on the genome-wide association results and the cis-eQTL results, and its results are combined with the differential gene expression results and output as a gene expression feature set, in which the screened candidate genes are stored.
- 7. The dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning according to claim 1, wherein the co-localization analysis in the feature set establishment module proceeds as follows: step S1, organizing the basic data, namely the basic information of each sample, including but not limited to pathological state and gene expression levels; step S2, performing genome-wide association analysis with the ketosis-associated phenotype as the dependent variable and the preprocessed SNP loci as independent variables, calculating the P value for the association between each SNP locus and the ketosis phenotype, and screening the significantly associated SNP loci; step S3, screening the top genes differentially expressed between ketosis pathological states, with the ketosis classification phenotype as the dependent variable and gene expression level as the independent variable; step S4, performing cis-eQTL analysis with the expression differences of all genes as dependent variables and the preprocessed SNP loci as independent variables, setting the cis-eQTL distance threshold to 1 Mb, and screening the significant cis-eQTLs associated with changes in gene expression; step S5, performing co-localization analysis by integrating the GWAS results with the matched eQTL information for association verification, and calculating posterior probabilities under a set of hypotheses to quantify the likelihood that an SNP locus regulates both the ketosis phenotype and gene expression; step S6, when an SNP locus is judged to be co-localized, recording it as a target SNP locus that affects both the ketosis phenotype and gene expression of dairy cows, and taking the genes associated with the target SNP loci as candidate regulatory genes; and step S7, establishing the gene expression feature set by integrating the GWAS, cis-eQTL and co-localization results, and constructing the feature set through gene annotation, metabolic pathway analysis and functional enrichment analysis.
- 8. The dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning according to claim 1, wherein the regulatory gene prediction unit of the machine learning model establishment module works as follows: step T1, dividing the feature set into a training set and a validation set by random sampling so that the two sets have consistent distributions, the training set being used for model training, feature screening and parameter optimization, and the validation set for evaluating model performance; step T2, constructing a Lasso regression model on the training set, with the model formulas: $\hat{\beta}=\arg\min_{\beta_0,\beta}\left\{-\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log\hat{p}_i+(1-y_i)\log(1-\hat{p}_i)\right]+\lambda\sum_{j=1}^{m}|\beta_j|\right\}$; $\hat{p}_i=\frac{1}{1+\exp\left(-(\beta_0+x_i^{T}\beta)\right)}$; where $\beta$ is the vector of candidate weights of the feature genes, $n$ is the sample size, $i$ is the sample index, $y_i$ is the true label of the $i$-th sample (diseased = case, coded 1; healthy = control, coded 0), $\hat{p}_i$ is the model-predicted probability that the $i$-th sample is positive, $\lambda$ is the L1 penalty strength, $m$ is the number of feature genes, $\beta_j$ is the regression coefficient of the $j$-th gene, $\beta_0$ is the intercept, $x_i$ is the gene expression vector of the $i$-th sample, and $T$ denotes transposition; step T3, performing feature selection on the candidate regulatory genes through the L1 regularization penalty, with the penalty formula: $\lambda\lVert\beta\rVert_1=\lambda\sum_{j=1}^{m}|\beta_j|$; where $\lVert\beta\rVert_1$ is the L1 penalty term; step T4, automatically shrinking to 0 and removing the weights of genes that do not contribute to ketosis regulation, and optimizing the regularization parameter of the model by grid search with 5 repeats of 10-fold cross-validation, the grid being: $\Lambda=\{\lambda_k\}_{k=1}^{K}$, $\lambda_k=\lambda_{\min}+\frac{k-1}{K-1}\left(\lambda_{\max}-\lambda_{\min}\right)$; where $\Lambda$ is the candidate regularization parameter set, $K$ is the number of grid points, $k$ is the grid point index, $\lambda_{\min}$ is the L1 penalty strength at the grid minimum, $\lambda_{\max}$ is the L1 penalty strength at the grid maximum, $\frac{k-1}{K-1}$ is the linear equidistant division point, and $k=1,\dots,K$; the 5-repeat 10-fold cross-validation formula is: $\overline{\mathrm{AUC}}(\lambda)=\frac{1}{RF}\sum_{r=1}^{R}\sum_{f=1}^{F}\mathrm{AUC}_{r,f}(\lambda)$; where the AUC is computed against the health status label $y$, $\overline{\mathrm{AUC}}(\lambda)$ is the average cross-validation performance for a given $\lambda$, $R=5$ is the number of repetitions, $F=10$ is the number of folds, and $\mathrm{AUC}_{r,f}(\lambda)$ is the AUC value of the $f$-th fold in the $r$-th repetition; and step T5, selecting the optimal L1 penalty strength by the rule: $\lambda^{*}=\arg\max_{\lambda\in\Lambda}\overline{\mathrm{AUC}}(\lambda)$; where $\lambda^{*}$ is the optimal penalty strength, $\overline{\mathrm{AUC}}(\lambda)$ is the average AUC obtained by cross-validation for a given $\lambda$, and $\arg\max$ returns the penalty strength at which $\overline{\mathrm{AUC}}(\lambda)$ reaches its maximum.
- 9. The dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning according to claim 1, wherein the result evaluation module systematically evaluates the output of the machine learning model establishment module around two targets, model overfitting and the effectiveness of ketosis prediction, using the ROC curve and the AUC value as the core evaluation indexes, with the specific steps: step W1, evaluating model overfitting and the effectiveness of ketosis prediction by drawing ROC curves on the training set and on the validation set and calculating their AUC values, the ROC curve formulas being: $\mathrm{TPR}(t)=\frac{\sum_{i}I(\hat{p}_i\ge t,\ y_i=1)}{\sum_{i}I(y_i=1)}$; $\mathrm{FPR}(t)=\frac{\sum_{i}I(\hat{p}_i\ge t,\ y_i=0)}{\sum_{i}I(y_i=0)}$; where $t$ is the current threshold, $t\in[0,1]$, $\hat{p}_i$ is the model-predicted probability of the $i$-th sample, $y_i$ is the true label of the $i$-th sample (diseased = case, coded 1; healthy = control, coded 0), and $I(\cdot)$ is the indicator function, equal to 1 when its condition holds and 0 otherwise; the AUC formula is: $\mathrm{AUC}=\int_{0}^{1}\mathrm{TPR}\left(\mathrm{FPR}^{-1}(u)\right)\,du$; where $u$ is the horizontal-axis variable (the FPR), $\mathrm{FPR}^{-1}(u)$ returns the threshold $t$ corresponding to a given FPR value, $\mathrm{TPR}(t)$ is the true positive rate at threshold $t$, and the integral runs from 0 to 1 along the FPR; and step W2, comparing the AUC values of the training set and the validation set to judge whether the model is overfitted, and evaluating the effectiveness of ketosis prediction from the AUC value of the validation set.
- 10. The dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning according to claim 1, wherein the evaluation criteria of the result evaluation module are as follows: for overfitting, a difference between the training-set and validation-set AUC values within ±0.1 indicates that the model generalizes well; for prediction effectiveness, the ROC curves and AUC values of the training set and the validation set are compared, and a validation-set AUC ≥ 0.8 indicates that ketosis can be effectively predicted from the core regulatory genes.
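The numeric rules stated in claims 3, 8, 9 and 10 can be sketched in a few lines of plain Python. This is a minimal illustration, not the patented implementation: the function names are hypothetical, labels are assumed to be coded 1 for case and 0 for control, the ROC integral is approximated with the trapezoidal rule, and the per-fold AUC values passed to `select_lambda` are assumed to come from a separately trained Lasso model.

```python
from statistics import mean


def classify_bhba(bhba_mmol_per_l):
    """Claim 3 thresholds: <= 1.4 healthy, (1.4, 3] subclinical, > 3 clinical."""
    if bhba_mmol_per_l <= 1.4:
        return "healthy"
    if bhba_mmol_per_l <= 3.0:
        return "subclinical ketosis"
    return "clinical ketosis"


def lambda_grid(lam_min, lam_max, n_points):
    """Linearly spaced candidate L1 penalty strengths (claim 8, step T4)."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    step = (lam_max - lam_min) / (n_points - 1)
    return [lam_min + k * step for k in range(n_points)]


def select_lambda(fold_aucs):
    """Return (best lambda, its mean AUC) over all repeats and folds (step T5).

    fold_aucs maps each candidate lambda to a flat list of its fold AUC
    values, e.g. 5 repeats x 10 folds = 50 numbers per lambda.
    """
    mean_auc = {lam: mean(values) for lam, values in fold_aucs.items()}
    best = max(mean_auc, key=mean_auc.get)
    return best, mean_auc[best]


def roc_points(labels, probs):
    """(FPR, TPR) pairs, predicting 'case' when probability >= threshold.

    Assumes both classes are present (labels: 1 = case, 0 = control).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for y, p in zip(labels, probs) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(labels, probs) if p >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(labels, probs):
    """Trapezoidal area under the ROC curve, integrating TPR along FPR."""
    pts = roc_points(labels, probs)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def evaluate(train_auc, valid_auc):
    """Claim 10 criteria: AUC gap within 0.1, validation AUC >= 0.8."""
    return {
        "generalizes": abs(train_auc - valid_auc) <= 0.1,
        "predictive": valid_auc >= 0.8,
    }
```

A typical call sequence would build the grid with `lambda_grid`, collect fold AUCs for each candidate, pick the winner with `select_lambda`, and then pass the training and validation AUCs of the refitted model to `evaluate`.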
Description
Dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning
Technical Field
The invention relates to the technical field of molecular breeding and disease prevention and control, and in particular to a dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning.
Background
Ketosis is a metabolic disease with a high incidence in perinatal dairy cows; its core cause is abnormal fat catabolism brought on by negative energy balance. Affected cows show symptoms such as anorexia, reduced milk yield and weight loss; severe cases can lead to abortion, infertility and even death, and ketosis can induce complications such as mastitis and metritis, causing large economic losses for the dairy industry. Current research concentrates on applied aspects such as clinical diagnosis and nutritional regulation, while research on the molecular mechanisms behind the onset and development of ketosis lags behind. Regulatory genes play a key role in the course of the disease: screening and identifying them is a prerequisite for revealing the pathogenesis, developing early diagnostic markers and formulating precise prevention and control measures, and early prediction of ketosis risk based on these genes is a key means of reducing morbidity and improving farming returns. In the prior art, gene screening relies mainly on single methods such as genome-wide association analysis (GWAS) or expression quantitative trait locus (eQTL) analysis, and disease prediction relies mainly on clinical phenotype indexes.
However, these technologies have clear limitations: GWAS has difficulty distinguishing causal variants from variants in linkage disequilibrium, so gene screening is not accurate enough, and traditional disease prediction methods are easily disturbed by external factors such as the feeding environment and management level, so their prediction accuracy is limited. Co-localization analysis can integrate multi-omics data and thereby improve the accuracy of causal gene screening, and the Lasso model can combine feature screening with predictive modelling through the L1 penalty, which gives it particular advantages in biological data mining. Yet no existing technology integrates co-localization analysis into a database module, and there is no integrated system that first screens regulatory genes with a Lasso model and then predicts ketosis risk. The prior art therefore cannot efficiently complete the combined workflow from gene screening to risk prediction, and struggles to meet the practical need for precise prevention and control.
Disclosure of Invention
The invention aims to provide a dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning. Co-localization analysis is deeply integrated into the feature set establishment module, multi-omics data such as the genome and transcriptome are integrated, and GWAS and eQTL analysis are combined with co-localization analysis; this addresses the difficulty single gene screening methods have in distinguishing causal variants and improves the accuracy of candidate regulatory gene screening. At the same time, a Lasso model is adopted so that, through the L1 regularization penalty, the elimination of uninformative genes, the assignment of core gene weights and the construction of the prediction model are carried out together. This overcomes the limitation that traditional prediction methods are easily affected by the environment and provides an efficient combined workflow from regulatory gene screening to ketosis risk prediction.
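How an L1 penalty can eliminate uninformative genes while still assigning weights to the rest may be illustrated with the soft-thresholding operator, the proximal map of the L1 norm that many Lasso solvers apply to each coefficient. The sketch below is illustrative only: the gene names and weights are made-up values, and the source does not state which solver the system uses.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; exactly 0.0 when |value| <= lam."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def shrink_weights(weights, lam):
    """Apply soft-thresholding to every gene weight."""
    return {gene: soft_threshold(w, lam) for gene, w in weights.items()}


def selected_genes(weights):
    """Genes whose weight survived the penalty (non-zero coefficient)."""
    return [gene for gene, w in weights.items() if w != 0.0]


# Hypothetical unpenalized weights; the numbers are illustrative only.
raw = {"gene_A": 0.75, "gene_B": -0.125, "gene_C": 0.25, "gene_D": -1.0}
shrunk = shrink_weights(raw, lam=0.25)
core = selected_genes(shrunk)
```

Weights whose magnitude does not exceed the penalty strength become exactly zero, so the corresponding genes drop out of the model, while the remaining genes keep a reduced, signed weight.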
To achieve the above purpose, the invention provides a dairy cow ketosis regulatory gene prediction system based on multi-omics analysis and machine learning, which comprises a data acquisition and preprocessing module, a feature set establishment module, a machine learning model establishment module and a result evaluation module connected in sequence, the modules cooperating to complete the prediction and verification of dairy cow ketosis regulatory genes, wherein: the data acquisition and preprocessing module comprises a phenotype collection unit, a genome data processing unit and a transcriptome data processing unit, and is used for collecting dairy cow ketosis phenotype data and outputting a ketosis-associated data set, and for processing the multi-omics data and outputting a basic database after quality control, alignment and quantification; the feature set establishment module compris