
CN-122024833-A - Method and system for predicting wood-property-related traits of slash pine by combining marker preselection based on genome-wide association analysis with genomic selection


Abstract

The invention discloses a method and a system for predicting wood-property-related traits of slash pine based on marker preselection by genome-wide association analysis (GWAS) combined with genomic selection (GS). The method comprises the following steps: (1) collecting a phenotype data set and a genotype data set for the target traits of slash pine samples as a sample set; (2) constructing and determining a base model; (3) performing GWAS on the target traits to obtain a set of significantly associated SNPs and marker subsets; (4) constructing a prediction model, then training and testing it to obtain an optimal prediction model; and (5) predicting the wood-property-related traits of slash pine using the optimal prediction model. The system comprises a data acquisition module, a quality control module, an association analysis module, a marker preselection module, a model construction module, an evaluation module and a result output module. The invention addresses the poor GS prediction performance for wood-property-related traits of slash pine, such as pulpwood traits.

Inventors

  • DING XIANYIN
  • LUAN QIFU
  • Diao shu

Assignees

  • Research Institute of Subtropical Forestry, Chinese Academy of Forestry (中国林业科学研究院亚热带林业研究所)

Dates

Publication Date
2026-05-12
Application Date
2026-04-10

Claims (10)

  1. A method for predicting wood-property-related traits of slash pine based on marker preselection by genome-wide association analysis combined with genomic selection, characterized by comprising the following steps: step (1), acquiring a phenotype data set and a genotype data set for a target trait of slash pine samples as a sample set; step (2), calculating the narrow-sense heritability of the target trait and evaluating the goodness of fit of different prediction models to determine a base model; step (3), dividing the sample set into a training set and a validation set by k-fold cross-validation, performing genome-wide linkage disequilibrium analysis and genome-wide association analysis on the target trait based on the training set, and identifying single nucleotide polymorphism (SNP) marker loci significantly associated with the target trait to obtain a significantly associated SNP set; step (4), constructing a genomic selection prediction model based on a marker subset and the base model, training the model with the marker subset and the phenotype data of the training set as input, and testing the model on the validation set to obtain an iteration intermediate model; step (5), repeating step (3) and step (4) until k iterations are completed, summarizing the predictive ability of the k iteration intermediate models, and selecting an optimal GS model; and step (6), outputting genomic predicted breeding values for target slash pine individuals or candidate breeding materials using the optimal GS model, thereby predicting the wood-property-related traits of slash pine.
  2. The method according to claim 1, characterized in that in step (1), the target traits of the slash pine samples are pulpwood traits, and the phenotype data include diameter at breast height, basic wood density, fiber length, fiber width, fiber length-to-width ratio, fiber kink index, total fiber kink angle, fiber curl index, fines percentage, cellulose content, lignin content and hemicellulose content.
  3. The method according to claim 1, characterized in that in step (1), the phenotype data set is obtained by collecting phenotype data for the target traits of the slash pine samples and standardizing them as Z-scores; the genotype data set is obtained by collecting genome-wide single nucleotide polymorphism marker data for the individual slash pine samples and applying quality filtering; and the quality filtering consists of filtering the captured SNP sites with VCFtools so that SNP sites with a minor allele frequency below 0.01 or a missing rate above 20% are excluded.
  4. The method according to claim 1, characterized in that in step (2), a pedigree relationship matrix is built from pedigree information to construct a pedigree-based best linear unbiased prediction model, and a genomic relationship matrix is built from genotype data to construct a genome-based best linear unbiased prediction model. Both models have the form: $y = X\beta + Za + e$ (2); in formula (2), $y$ is the vector of single-trait phenotypic observations; $X$ is the incidence matrix of the fixed effects; $\beta$ is the fixed-effect vector; $Z$ is the incidence matrix of the additive genetic effects; $a$ is the additive genetic effect vector; and $e$ is the residual effect vector. The narrow-sense heritability of the trait is estimated as: $h^2 = \hat{\sigma}_a^2 / \hat{\sigma}_p^2$ (3); in formula (3), $h^2$ is the narrow-sense heritability of the trait; $\hat{\sigma}_a^2$ is the estimated additive genetic variance; and $\hat{\sigma}_p^2$ is the phenotypic variance. The relative goodness of fit of the two prediction models is compared using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), and the prediction model with the smaller AIC and/or BIC value is taken as the base model: $\mathrm{AIC} = -2\log L + 2t$ (4); $\mathrm{BIC} = -2\log L + t\log n$ (5); in formulas (4) and (5), $\mathrm{AIC}$ is the Akaike information criterion value; $\mathrm{BIC}$ is the Bayesian information criterion value; $\log L$ is the log-likelihood under restricted maximum likelihood estimation; $t$ is the number of estimated parameters; and $n$ is the number of observations.
  5. The method according to claim 1, characterized in that in step (3), k equals 10; when the genome-wide association analysis is performed on the training set, the first three principal components extracted by principal component analysis are included in the model as covariates; and the significance threshold is determined by Bonferroni correction, with the cutoff P value set to 1.57E-06.
  6. The method according to claim 5, characterized in that in step (3), the random marker subset is a set of SNPs formed by randomly selecting a preset number of SNP markers from the significantly associated SNP set; the significant marker subset is a set of SNPs formed by selecting the top N SNPs ranked by significance in the genome-wide association analysis, N being a preset number of SNPs; and the large-effect marker subset is a set of SNPs formed by selecting SNPs that satisfy a preset phenotypic variance explained (PVE) threshold condition.
  7. The method according to claim 6, characterized in that in step (4), the genomic selection prediction model is genomic best linear unbiased prediction (GBLUP) based on a mixed linear model, or is based on a Bayesian linear regression model; the genomic selection prediction model is constructed using the random marker subset or the significant marker subset, and the large-effect marker subset is incorporated into the genomic selection prediction model as fixed effects; the Bayesian linear regression model is a Bayesian ridge regression, Bayesian LASSO, BayesA, BayesB or BayesC model. The large-effect marker subset is screened as follows: SNPs with a PVE value greater than or equal to 0.0125 are first selected; five large-effect SNP subsets, 0.0125-0.025, 0.0125-0.050, 0.0125-0.075, 0.0125-0.100 and 0.0125-0.125, are then generated using different upper PVE thresholds; and for each interval, only the SNPs falling within the specified PVE range are incorporated into the genomic selection prediction model as fixed-effect covariates.
  8. The method according to claim 7, characterized in that in step (5), the predictive ability of the genomic selection prediction model under different marker subsets is evaluated by calculating the Pearson correlation coefficient between the genomic estimated breeding values predicted by the iteration intermediate model and the actual phenotype data, and an optimal model configuration is selected according to the predictive ability results to obtain the optimal GS model.
  9. The method according to claim 8, characterized in that in step (2), the genome-based best linear unbiased prediction model is used as the base model; in step (4), the genomic selection prediction model is constructed using the significant marker subset, and the large-effect marker subset is incorporated into the genomic selection prediction model as fixed effects; the genomic selection prediction model is GBLUP based on a mixed linear model or a Bayesian LASSO model; the training set accounts for 70-90% of the sample set; and the marker subset contains 3K to 100K markers.
  10. A system for predicting wood-property-related traits of slash pine based on marker preselection by genome-wide association analysis combined with genomic selection, characterized in that the system implements the prediction method according to any one of claims 1 to 9 and comprises a data acquisition module, a quality control module, an association analysis module, a marker preselection module, a model construction module, an evaluation module and a result output module, wherein the data acquisition module acquires the phenotype data and genotype data of the slash pine samples; the quality control module standardizes the phenotype data and quality-filters the genotype data; the association analysis module performs the genome-wide association analysis and screens the significantly associated SNPs; the marker preselection module generates the marker subsets; the model construction module constructs the genomic selection prediction model; the evaluation module performs cross-validation and compares the predictive ability of the prediction models; and the result output module outputs the predicted breeding values for the wood-property-related traits of the target slash pine.
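
The quality filter of claim 3 and the marker subsets of claims 6 and 7 can be illustrated with a minimal NumPy sketch. This is not the patented implementation: the claims name VCFtools for filtering, and the code below is only an in-memory stand-in. All function names, the 0/1/2 genotype coding with NaN for missing calls, and the choice to treat the upper PVE bound as inclusive are assumptions made for illustration.

```python
import numpy as np

def qc_filter(genotypes, maf_min=0.01, max_missing=0.20):
    """Boolean mask of SNP columns passing QC (hypothetical helper).

    genotypes: (n_samples, n_snps) array of 0/1/2 allele counts, NaN = missing.
    SNPs with MAF below maf_min or missing rate above max_missing are dropped.
    """
    missing_rate = np.isnan(genotypes).mean(axis=0)
    alt_freq = np.nanmean(genotypes, axis=0) / 2.0
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    return (maf >= maf_min) & (missing_rate <= max_missing)

def top_n_by_significance(p_values, n_top):
    """Indices of the n_top SNPs with the smallest GWAS P values."""
    return np.argsort(p_values, kind="stable")[:n_top]

def random_subset(candidate_idx, n_pick, seed=0):
    """Randomly pick n_pick SNP indices from a candidate pool, without replacement."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(candidate_idx, size=n_pick, replace=False))

def pve_interval_subsets(pve, lower=0.0125,
                         uppers=(0.025, 0.050, 0.075, 0.100, 0.125)):
    """One large-effect subset per PVE interval [lower, upper].

    Assumption: the upper bound is inclusive; the claims do not say.
    """
    return {(lower, up): np.flatnonzero((pve >= lower) & (pve <= up))
            for up in uppers}

# Toy data: 5 samples x 4 SNPs.
geno = np.array([[0, 1, 2, np.nan],
                 [0, 1, 0, np.nan],
                 [0, 2, 1, 1.0],
                 [0, 0, 1, np.nan],
                 [0, 1, 1, np.nan]])
print(qc_filter(geno))  # monomorphic SNP 0 and mostly-missing SNP 3 fail

p_values = np.array([1e-3, 1e-8, 0.5, 1e-6])
print(top_n_by_significance(p_values, 2))

pve = np.array([0.01, 0.0125, 0.03, 0.06, 0.2])
for interval, idx in pve_interval_subsets(pve).items():
    print(interval, idx)
```

In a k-fold setting the P values and PVE values would be recomputed on each training fold before these helpers are called, so that marker preselection never sees the validation individuals.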

Description

Method and system for predicting wood-property-related traits of slash pine by combining marker preselection based on genome-wide association analysis with genomic selection

Technical Field

The invention relates to the technical field of predicting wood-property-related traits of slash pine, and in particular to a method and a system for predicting wood-property-related traits of slash pine based on marker preselection by genome-wide association analysis combined with genomic selection.

Background

Genomic selection (GS), also known as genomic prediction (GP), is an advanced breeding technique that uses high-throughput DNA genotyping data to predict the genetic value of individuals. Unlike traditional best linear unbiased prediction (BLUP), which relies solely on phenotype and pedigree information, GS builds predictive models by combining the phenotype data and genome-wide marker data of a training population to estimate genomic estimated breeding values (GEBVs). GEBVs reflect the genetic potential of individuals and allow early selection in the absence of phenotypic data. The core advantages of GS include the ability to predict individuals that have not been phenotyped, a shorter breeding cycle with higher genetic gain per unit time, and more efficient selection for low-heritability traits. In breeding programs for animals, crops and forest trees, GS has repeatedly been shown to offer higher potential genetic gain than traditional breeding methods. The predictive performance of GS is typically assessed by predictive ability (PA) and prediction accuracy (PC). PA is the Pearson correlation between the observed phenotypes and the GEBVs in the validation population, while PC is the ratio of PA to the square root of the narrow-sense heritability (h²) of the trait.
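
The two performance measures defined above reduce to a correlation and a division. The following is a small sketch of that arithmetic; the function names and the toy numbers are hypothetical and are not taken from the patent.

```python
import numpy as np

def predictive_ability(observed, gebv):
    """PA: Pearson correlation between observed phenotypes and GEBVs."""
    return float(np.corrcoef(observed, gebv)[0, 1])

def prediction_accuracy(pa, h2):
    """PC: PA divided by the square root of narrow-sense heritability."""
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must lie in (0, 1]")
    return pa / float(np.sqrt(h2))

# Toy validation set (illustrative values only).
observed = np.array([0.4, -1.2, 0.9, 0.1, -0.3])
gebv = np.array([0.2, -0.8, 0.5, 0.3, -0.1])
pa = predictive_ability(observed, gebv)
print("PA:", pa, "PC at h2 = 0.25:", prediction_accuracy(pa, 0.25))
```

Because PC divides by the square root of h², it exceeds PA whenever h² is below 1, which is why the two measures should not be compared directly across traits with different heritabilities.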
Key factors affecting predictive performance include the size of the training population (TP), the number and density of markers, and the statistical model employed. Widely used statistical models currently include parametric methods such as GBLUP, BayesA, BayesB and Bayesian LASSO, and non-parametric algorithms such as random forests and support vector machines. In general, increasing TP size can enhance the ability to capture linkage disequilibrium (LD) patterns and improve predictive performance. Furthermore, studies on forest trees have shown that, under within-generation cross-validation, a few thousand randomly selected single nucleotide polymorphisms (SNPs) are often sufficient to capture most of the genetic variation and achieve prediction accuracy comparable to models using all available markers. Conifers present unique challenges to genomic prediction because of their large and complex genomes, which often exceed 20 Gb, coupled with complex linkage disequilibrium patterns. Advances in high-throughput genotyping technology and falling sequencing costs have driven the development and use of a variety of genotyping platforms, including SNP chips, exome capture sequencing and genotyping-by-sequencing (GBS). These tools have been successfully applied to major conifer species such as Norway spruce (Picea abies), Scots pine (Pinus sylvestris), loblolly pine (Pinus taeda), slash pine (Pinus elliottii) and Masson pine (Pinus massoniana). Genetic improvement of the pulpwood traits of slash pine is limited by its long generation time and by the genetic complexity of these traits, which include growth, wood fiber characteristics and chemical composition. Although GS has shown great potential in tree breeding, its systematic application to the pulpwood traits of slash pine is still limited.
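
The statement above that randomly chosen SNPs can rival the full marker set under cross-validation can be explored with a sketch like the one below. It is an illustration under strong assumptions, not the method of the invention: the data are simulated with independent markers (so there is no LD, unlike real conifer data), a fixed-penalty ridge regression stands in for GBLUP, and every function name and parameter value is hypothetical.

```python
import numpy as np

def simulate(n=200, m=500, n_causal=20, h2=0.8, seed=0):
    """Simulate 0/1/2 genotypes and a phenotype with heritability near h2."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 3, size=(n, m)).astype(float)
    causal = rng.choice(m, size=n_causal, replace=False)
    g = X[:, causal] @ rng.normal(size=n_causal)
    noise_sd = np.sqrt(g.var() * (1.0 - h2) / h2)
    return X, g + rng.normal(scale=noise_sd, size=n)

def kfold_indices(n, k, rng):
    """Shuffle sample indices and split them into k nearly equal folds."""
    return np.array_split(rng.permutation(n), k)

def ridge_predict(X_train, y_train, X_test, lam):
    """Ridge regression on centered markers (all effects shrunk equally)."""
    mu = y_train.mean()
    col_means = X_train.mean(axis=0)
    Xc = X_train - col_means
    # Dual form: solve an n x n system, cheaper when markers outnumber samples.
    K = Xc @ Xc.T
    alpha = np.linalg.solve(K + lam * np.eye(K.shape[0]), y_train - mu)
    return mu + (X_test - col_means) @ (Xc.T @ alpha)

def cv_predictive_ability(X, y, k=10, lam=1.0, seed=0):
    """Mean Pearson correlation between predictions and phenotypes over k folds."""
    rng = np.random.default_rng(seed)
    folds = kfold_indices(len(y), k, rng)
    pas = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = ridge_predict(X[train_idx], y[train_idx], X[test_idx], lam)
        pas.append(np.corrcoef(pred, y[test_idx])[0, 1])
    return float(np.mean(pas))

X, y = simulate()
subset = np.random.default_rng(1).choice(X.shape[1], size=100, replace=False)
print("all markers  :", cv_predictive_ability(X, y, k=5, lam=100.0))
print("random subset:", cv_predictive_ability(X[:, subset], y, k=5, lam=100.0))
```

With independent simulated markers a random subset loses any causal SNP it omits, so the comparison here is pessimistic for subsets; in real populations LD lets neighbouring markers stand in for causal variants, which is the mechanism behind the finding cited above.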
Disclosure of Invention

The technical problem to be solved by the invention is therefore to provide a method and a system for predicting wood-property-related traits of slash pine based on marker preselection by genome-wide association analysis combined with genomic selection, so as to address the poor GS prediction performance for wood-property-related traits of slash pine such as pulpwood traits. To solve this technical problem, the invention provides the following technical solution: a method for predicting wood-property-related traits of slash pine based on marker preselection by genome-wide association analysis combined with genomic selection, comprising the following steps: step (1), acquiring a phenotype data set and a genotype data set for a target trait of slash pine samples as a sample set; step (2), calculating the narrow-sense heritability of the target trait and evaluating the goodness of fit of different prediction models to determine a base model; step (3), dividing the sample set into a training set and a validation set by k-fold cross-validation, performing genome-wide linkage disequilibrium (LD) analysis and genome-wide association analysis (GWAS) on the target trait based on the training set, identi