CN-121983124-A - Marker screening and model construction method for accurate evaluation of parent genome breeding values


Abstract

The invention discloses a marker screening and model construction method for accurate evaluation of parent genome breeding values, and belongs to the technical field of molecular breeding and genomics. The method comprises: obtaining genotyping data and target-trait phenotype data of a reference population and encoding the genotypes numerically; calculating the minor allele frequency (MAF) and linkage disequilibrium score (LD score) of every marker locus; screening, by a joint rule based on MAF and LD score, a core marker subset with higher genetic stability and better representation of the linkage structure; and constructing a differentially weighted genomic prediction model based on the core marker subset to estimate the genomic breeding values of parent individuals. By using allele frequency information and linkage disequilibrium structure information, the invention reduces the interference of low-frequency or weakly linked loci with the prediction model, helps improve the accuracy with which parent breeding values predict offspring phenotypes across generations, and is suitable for genomic prediction and molecular breeding of complex traits.
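The MAF and LD score screening summarized above can be illustrated with a minimal sketch. It assumes a 0/1/2 additively coded genotype matrix with no missing values, and takes the 100 kb window and top 10% cut-offs from the claims; the function names (`minor_allele_frequency`, `ld_scores`, `select_core_markers`) are hypothetical and not part of the patent text.

```python
import numpy as np


def minor_allele_frequency(X):
    """Per-SNP MAF from a 0/1/2 coded matrix (rows: individuals, columns: SNPs)."""
    p = X.mean(axis=0) / 2.0  # frequency of the allele counted by the coding
    return np.minimum(p, 1.0 - p)


def ld_scores(X, chrom, pos, window_bp=100_000):
    """Sum of r^2 between each SNP and the other SNPs on the same chromosome
    that lie within window_bp upstream or downstream of it."""
    chrom = np.asarray(chrom)
    pos = np.asarray(pos)
    scores = np.zeros(X.shape[1])
    for c in np.unique(chrom):
        idx = np.flatnonzero(chrom == c)
        for j in idx:
            near = idx[(np.abs(pos[idx] - pos[j]) <= window_bp) & (idx != j)]
            if near.size == 0:
                continue
            cols = np.concatenate(([j], near))
            with np.errstate(divide="ignore", invalid="ignore"):
                r = np.corrcoef(X[:, cols], rowvar=False)[0, 1:]
            # Monomorphic SNPs yield NaN correlations and contribute nothing.
            scores[j] = np.nansum(r ** 2)
    return scores


def select_core_markers(maf, ld, top_fraction=0.10):
    """Indices of SNPs lying in the top quantile of both MAF and LD score."""
    maf_cut = np.quantile(maf, 1.0 - top_fraction)
    ld_cut = np.quantile(ld, 1.0 - top_fraction)
    return np.flatnonzero((maf >= maf_cut) & (ld >= ld_cut))
```

The per-SNP loop is written for readability rather than speed; for dense marker panels a windowed, vectorized computation would likely be preferable.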

Inventors

YANG BEN
LIU SHIKAI

Assignees

Ocean University of China (中国海洋大学)

Dates

Publication Date: 20260505
Application Date: 20260129

Claims (8)

1. A marker screening and model construction method for accurate evaluation of parent genome breeding values, characterized by comprising the following steps: step one, acquiring whole-genome SNP genotyping data and target-trait phenotype data of a reference population, and performing quality control and numerical coding on the SNP genotyping data to obtain a genotype matrix X and a phenotype vector y; step two, for each SNP locus in the genotype matrix X, calculating its minor allele frequency (MAF) and linkage disequilibrium score (LD score); step three, constructing a joint screening rule based on the MAF and the LD score, and screening, according to preset thresholds or a quantile criterion, the set of SNP loci that satisfy both conditions simultaneously, this set serving as a core feature subset S_core; step four, constructing a genomic prediction model based on the core feature subset S_core, either by giving higher weight to the SNP loci in the core feature subset or by building the prediction model from the core feature subset only, so as to establish a mapping between genotype data and predicted breeding values; step five, genotyping the candidate parent population, obtaining predicted breeding values of the candidate parents with the genomic prediction model, selecting parents according to the predicted breeding values to construct families, measuring the phenotypes of the offspring, and calculating the mean offspring phenotype of each family; step six, taking the family as the unit, calculating the mean or a weighted combination of the predicted breeding values of the parent individuals of the same family as the family breeding value, and calculating the correlation coefficient between the family breeding values and the mean offspring phenotypes of the corresponding families as the evaluation index of the prediction accuracy of parent genome breeding values.
2. The method of claim 1, wherein in step one, the quality control comprises removing individuals and loci with genotype call rates below 80%, and the numerical coding follows an additive model, encoding homozygous reference genotypes as 0, heterozygous genotypes as 1, and homozygous mutant genotypes as 2.
3. The method according to claim 1, wherein in step two, the linkage disequilibrium score is calculated by: partitioning the SNP loci by chromosome; traversing each SNP locus on each chromosome in turn; constructing, with the SNP locus as the center, a sliding window covering a physical distance of 100 kb upstream and downstream; calculating the squared Pearson correlation coefficient r² between the target SNP locus and each other SNP locus in the sliding window; and summing the r² values to obtain the linkage disequilibrium score of the SNP locus.
4. The method of claim 1, wherein in step three, the minor allele frequency threshold and the linkage disequilibrium score threshold are each set based on the corresponding distribution across genome-wide SNP loci, and the core feature subset is the set of SNP loci located simultaneously in the upper quantile interval of the minor allele frequency distribution and the upper quantile interval of the linkage disequilibrium score distribution.
5. The method according to claim 1, wherein in step three, the upper quantile interval of the minor allele frequency distribution is the top 10% quantile interval of the minor allele frequency distribution of genome-wide SNP loci, and the upper quantile interval of the linkage disequilibrium score distribution is the top 10% quantile interval of the linkage disequilibrium score distribution of genome-wide SNP loci.
6. The method of claim 1, wherein in step four, the genomic prediction model is a linear model or a nonlinear model, comprising one or more of a linear mixed model, a random effect model, and a machine learning model, or an equivalent variant thereof.
7. The method of claim 1, wherein in step five, the offspring phenotype is measured for the same target trait as in the reference population, and the offspring phenotype data are obtained using measurement methods, evaluation criteria and test conditions consistent with those used for the phenotype vector y of the reference population.
8. The method according to claim 1, wherein in step six, the correlation coefficient is calculated by: $r = \frac{\sum_{i=1}^{N} (B_i - \bar{B})(P_i - \bar{P})}{\sqrt{\sum_{i=1}^{N} (B_i - \bar{B})^2 \sum_{i=1}^{N} (P_i - \bar{P})^2}}$; wherein $B_i$ represents the breeding value of the i-th family, $\bar{B}$ represents the mean of all family breeding values, $P_i$ represents the mean offspring phenotype of the i-th family, $\bar{P}$ represents the mean of all family mean offspring phenotypes, and N represents the number of families.
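Steps four and six of the claims can be sketched as follows. This is only one possible reading: the claims do not fix a model, so a ridge regression in which core SNPs receive a weaker penalty stands in for the "differentially weighted" model, and each family is assumed to have exactly two genotyped parents whose predicted values are averaged. The names `fit_weighted_ridge`, `predict_gebv`, `family_accuracy`, `core_weight` and `lam` are hypothetical.

```python
import numpy as np


def fit_weighted_ridge(X, y, core_idx, core_weight=2.0, lam=1.0):
    """Ridge regression on centred genotypes with a weaker penalty on core SNPs.

    Scaling column j by sqrt(w_j) is equivalent to dividing its ridge
    penalty by w_j, so core SNPs (w_j = core_weight > 1) are shrunk less.
    """
    mu_x = X.mean(axis=0)
    w = np.ones(X.shape[1])
    w[core_idx] = core_weight
    Z = (X - mu_x) * np.sqrt(w)
    # Dual form: an n x n system, convenient when SNPs outnumber individuals.
    alpha = np.linalg.solve(Z @ Z.T + lam * np.eye(X.shape[0]), y - y.mean())
    beta = np.sqrt(w) * (Z.T @ alpha)  # marker effects on the original scale
    return mu_x, beta


def predict_gebv(X_new, mu_x, beta):
    """Predicted breeding values, expressed as deviations from the mean."""
    return (X_new - mu_x) @ beta


def family_accuracy(gebv_sire, gebv_dam, offspring_mean):
    """Family breeding value (mean of the two parents, an assumption) and its
    Pearson correlation with the mean offspring phenotype of each family."""
    family_bv = (np.asarray(gebv_sire, float) + np.asarray(gebv_dam, float)) / 2.0
    r = np.corrcoef(family_bv, np.asarray(offspring_mean, float))[0, 1]
    return family_bv, float(r)
```

Setting `core_weight=1.0` reduces the fit to an ordinary ridge regression, and restricting `X` to the core columns corresponds to the alternative in claim 1 of building the model from the core subset only.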

Description

Marker screening and model construction method for accurate evaluation of parent genome breeding values

Technical Field

The invention belongs to the technical field of molecular breeding and genomics, and particularly relates to a marker screening and model construction method for accurate evaluation of parent genome breeding values.

Background

With the continuous development of life sciences and information technology, molecular breeding techniques represented by genomic selection (GS) are becoming an important breeding approach, following traditional phenotypic selection and molecular marker-assisted selection. Genomic selection deploys high-density single nucleotide polymorphism (SNP) markers across the whole genome and, combined with individual phenotype data, builds a prediction model between genotypes and target traits, thereby enabling early evaluation and screening of candidate individuals by their genomic estimated breeding values (GEBV). The technology allows breeding decisions to be made without relying on complete phenotype measurement, helps shorten the breeding cycle and reduce breeding costs, has clear advantages for traits with low heritability or traits that are difficult to measure (such as disease resistance and growth efficiency), and has been widely applied in the breeding of livestock, crops and aquaculture species.
In existing applications, genomic selection generally follows this workflow: first, a reference population with known genotype and phenotype data is constructed as the training set, and the mapping between genotype features and target traits is established by statistical modeling; then the resulting prediction model is applied to a candidate population that has genotype information only, so as to achieve early selection and evaluation of individuals. Common modeling methods include prediction methods based on linear mixed models, Bayesian regression models, and other machine learning methods, which have been applied in the breeding of multiple species. However, in actual industrial application, existing genomic selection technology still faces a series of urgent technical problems in parent breeding value prediction and cross-generation application. First, the stability of prediction models in cross-generation scenarios is insufficient. Because the number of genome-wide SNP markers is usually far larger than the number of reference population samples used for modeling, problems such as unstable parameter estimation, feature redundancy and noise accumulation easily arise when building a parent breeding value prediction model, which makes the GEBV of parent individuals more dependent on the training data. Under cross-validation within the reference population, a high correlation between the GEBV of the test set and its own phenotypes is often obtained, but in practical breeding the main use of candidate parent GEBV is to predict the trait performance of the offspring population.
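The conventional workflow described above, together with the within-population cross-validation it is usually judged by, can be sketched as follows. Plain ridge regression is used here as a simple stand-in for the linear mixed model family, and the function names and the `lam`, `k` and `seed` parameters are illustrative assumptions rather than anything specified in the source.

```python
import numpy as np


def ridge_gebv(X_train, y_train, X_test, lam=1.0):
    """Train ridge regression on a reference set and predict a test set."""
    mu_x = X_train.mean(axis=0)
    mu_y = y_train.mean()
    Z = X_train - mu_x
    # Dual form keeps the linear system at n x n even when SNPs >> samples.
    alpha = np.linalg.solve(Z @ Z.T + lam * np.eye(len(y_train)), y_train - mu_y)
    return mu_y + (X_test - mu_x) @ (Z.T @ alpha)


def cross_validated_accuracy(X, y, k=5, lam=1.0, seed=0):
    """Mean correlation between predictions and phenotypes over k folds,
    all drawn from the same reference population."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accuracies = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        pred = ridge_gebv(X[train_idx], y[train_idx], X[test_idx], lam)
        accuracies.append(np.corrcoef(pred, y[test_idx])[0, 1])
    return float(np.mean(accuracies))
```

Because training and test individuals come from the same generation, an accuracy estimated this way says little about how well parent values will predict the phenotypes of their offspring.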
When candidate parent GEBV is used in cross-generation breeding decisions, its correlation with offspring phenotypes, or with family trait indices computed from offspring phenotypes, tends to drop significantly. This shows that prior-art parent breeding value prediction suffers from insufficient cross-generation stability: some markers or effect estimates that stand out in the reference population fail to keep a consistent predictive contribution in the offspring population, so that the ranking of parent breeding values changes between generations, which undermines the reliability of parent breeding scheme design and genetic gain estimation. Second, existing marker utilization strategies struggle to achieve both prediction accuracy and cross-sample stability. The prior art improves the model's ability to explain trait variation by screening markers according to effect size or statistical significance. However, across generations, genetic recombination, allele frequency changes, population structure remodeling and other factors may alter the linkage relationships and statistical properties among SNP markers, so that some markers with high predictive contributions in the training samples predict unstably in subsequent samples or offspring populations. This instability is particularly evident in the application scenario across generatio