CN-121983116-A - Universal SNP molecular marker for predicting multiple phenotypic traits of corn, screening method and prediction system thereof

Abstract

The invention belongs to the technical field of molecular biology and relates in particular to a universal SNP molecular marker for predicting multiple phenotypic traits of corn, together with a screening method and a prediction system. Genotype data are first converted into vectors; the converted data then undergo three rounds of dimensionality reduction (mutual information screening, LD filtering, and fused LASSO and random forest screening) that remove invalid and redundant markers layer by layer. The result is a universal marker set suitable for predicting multiple phenotypic traits of corn; compared with the feature screening used in conventional whole-genome selection, it lowers cost and broadens the range of traits that can be screened. The invention also constructs an integrated model combining a self-attention neural network, LASSO, and random forest. Weight optimization lets the three models complement one another and compensates for the limitations of any single model, enabling accurate and efficient prediction of multiple corn phenotypic traits and shortening the breeding cycle.

Inventors

SONG JUNQIAO
LI JING
SONG XIAOJI
ZENG ZHANKUI
QU JINGTAO
SHI LILI
ZHANG PAN
LU DAOWEN

Assignees

Henan Open University (河南开放大学)

Dates

Publication Date: 20260505
Application Date: 20260126

Claims (10)

1. A screening method for universal SNP molecular markers for predicting multiple phenotypic traits of corn, characterized by comprising the following steps: (1) obtaining genotype data and phenotype data of corn from a database, wherein the phenotype data comprise corn stalk rot resistance, yield, grain moisture content, plant height, ear height, relative ear position, number of standing stalks, silking period, flowering period, and grain test weight; (2) performing quality control on the genomic data, removing SNP markers without genetic variation, and converting the character-type SNP base data into 4-dimensional vectors to obtain initial features; (3) calculating the mutual information between each initial feature and the phenotype data, and retaining the features whose mutual information is greater than or equal to 0.025 to obtain first dimension reduction features; (4) performing LD filtering on the first dimension reduction features to obtain second dimension reduction features; (5) screening the second dimension reduction features based on LASSO regression to obtain features having a linear influence on the phenotype data; (6) screening the second dimension reduction features based on a random forest algorithm to obtain nonlinear features contributing strongly to the phenotype data; (7) taking the union of the features having a linear influence on the phenotype data and the nonlinear features contributing strongly to the phenotype data to obtain third dimension reduction features, which are the candidate SNP molecular markers; wherein steps (5) and (6) may be performed in either order.
2. The screening method according to claim 1, wherein step (1) further comprises, after obtaining the genotype data and the phenotype data of the corn from the database, ensuring one-to-one correspondence between genotypes and phenotypes through unique sample IDs and removing corn samples with missing IDs or incomplete phenotype data; step (2) performs quality control of the genomic data by removing markers for which nunique() == 1 in Pandas; the method for converting the character-type SNP base data into 4-dimensional vectors comprises establishing a base probability mapping table and mapping each base type to one 4-dimensional vector; calculating the mutual information between each initial feature and the phenotype data in step (3) comprises calculating the average mutual information between each initial feature and the phenotype data; and the LD filtering in step (4) comprises, when the LD r² between two or more SNP loci is greater than 0.56, retaining only the SNP locus with the highest mutual information.
3. The universal SNP molecular marker screened by the screening method of claim 1 or 2.
4. The universal SNP molecular marker of claim 3, wherein the site information of the SNP molecular marker is as shown in table 2 of the specification.
5. A method for constructing a prediction model for multiple phenotypic traits of corn, characterized by comprising the following steps: converting the base data of the SNP molecular markers of corn samples into 4-dimensional vectors to obtain vector data; inputting the vector data and the multiple phenotypic trait data of the corn samples into different training models, and dynamically adjusting the weight proportions of the different training models to obtain the prediction model for multiple phenotypic traits of corn; wherein the multiple phenotypic trait data comprise corn stalk rot resistance, yield, grain moisture content, plant height, ear height, relative ear position, number of standing stalks, silking period, flowering period, and grain test weight; and the different training models comprise a self-attention neural network, a LASSO regression model, and a random forest model, the weights of which sum to 1.
6. The method according to claim 5, wherein the SNP molecular marker is the SNP molecular marker according to claim 3 or 4.
7. The method according to claim 5 or 6, wherein the self-attention neural network has a weight of 0.3082 in the prediction system, the LASSO regression model has a weight of 0.3501 in the prediction system, and the random forest model has a weight of 0.3417 in the prediction system.
8. A prediction system for multiple phenotypic traits of corn, characterized by comprising an SNP molecular marker screening module and a multi-model integration module, wherein the SNP molecular marker screening module comprises a data acquisition module, a data processing module, a mutual information screening module, an LD filtering module, and a LASSO and random forest fusion screening module; the data acquisition module collects corn genotype data and phenotype data; the data processing module performs quality control on the corn genomic data to obtain vector features; the mutual information screening module performs the first dimension reduction on the vector features to obtain first dimension reduction features; the LD filtering module performs the second dimension reduction on the first dimension reduction features to obtain second dimension reduction features; the LASSO and random forest fusion screening module performs the third dimension reduction on the second dimension reduction features to obtain the screened SNP molecular markers; and the multi-model integration module constructs a corn multiple-phenotype prediction model according to the construction method of any one of claims 5-7.
9. Use of a product for detecting the SNP molecular marker of claim 3 or 4, or of the corn multiple-phenotype prediction model constructed by the construction method of any one of claims 5-7, or of the corn multiple-phenotype prediction system of claim 8, in predicting corn phenotypes and/or in corn assisted breeding, wherein the multiple phenotypes comprise corn stalk rot resistance, yield, grain moisture content, plant height, ear height, relative ear position, number of standing stalks, silking period, flowering period, and grain test weight.
10. A method for predicting multiple phenotypic traits of corn, characterized in that the base data of the SNP molecular markers of a corn sample are converted into 4-dimensional vectors to obtain vector data, and the vector data are input into a prediction model to obtain results for multiple phenotypic traits of the corn, wherein the prediction model is the prediction model for multiple phenotypic traits of corn constructed by the construction method according to any one of claims 5-7; and the multiple phenotypic traits comprise corn stalk rot resistance, yield, grain moisture content, plant height, ear height, relative ear position, number of standing stalks, silking period, flowering period, and grain test weight.
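
Claims 5 to 7 describe a weighted combination of three models whose weights sum to 1. The following is a minimal Python sketch of such a combination, assuming a plain weighted average of per-model predictions. Only the three weight values come from claim 7; the function names normalize_weights and ensemble_predict, the idea of deriving weights by normalizing validation scores, and the sample predictions are illustrative assumptions, not the patent's implementation.

```python
# Weights from claim 7; they sum to 1 as claim 5 requires.
CLAIM7_WEIGHTS = {
    "self_attention": 0.3082,
    "lasso": 0.3501,
    "random_forest": 0.3417,
}


def normalize_weights(scores):
    """Scale non-negative per-model scores so they sum to 1 (assumed scheme)."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {name: s / total for name, s in scores.items()}


def ensemble_predict(predictions, weights=CLAIM7_WEIGHTS):
    """Return the weighted average of per-model predictions for each sample."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n_samples = len(next(iter(predictions.values())))
    return [
        sum(weights[m] * predictions[m][i] for m in weights)
        for i in range(n_samples)
    ]


# Hypothetical plant-height predictions (cm) for two samples.
preds = {
    "self_attention": [250.0, 230.0],
    "lasso": [260.0, 228.0],
    "random_forest": [255.0, 232.0],
}
combined = ensemble_predict(preds)
```

Because the weights sum to 1, each combined value stays within the range spanned by the three individual predictions for that sample.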

Description

Universal SNP molecular marker for predicting multiple phenotypic traits of corn, screening method and prediction system thereof

Technical Field

The invention belongs to the technical field of molecular biology, and particularly relates to a universal SNP molecular marker for predicting multiple phenotypic traits of corn, a screening method and a prediction system thereof.

Background

Corn is the grain crop with the largest planting area and the highest yield in China and is vital to the country's food security. Maize breeding in China is still in transition from traditional phenotypic selection to biological breeding. For qualitative traits controlled by major genes, developing molecular markers for marker-assisted selection is the most effective approach; for quantitative traits controlled by many minor-effect genes, however, marker-assisted selection performs poorly, and most agronomic traits targeted for improvement in breeding are quantitative. Genomic selection (GS) is a newer breeding method that uses high-density markers covering the whole genome. Through early selection it can shorten the generation interval, improve the accuracy of genomic estimated breeding values (GEBV), and accelerate genetic gain; it predicts well for complex traits that have low heritability and are difficult to measure, and so lets genomic technology guide breeding practice. However, the markers selected by genomic selection differ from trait to trait, and each trait to be improved has its own set of markers to be genotyped. Traits other than the target trait must therefore still be covered by phenotypic screening, which makes data integration harder; and if every trait is predicted by genotyping its own marker set, the cost multiplies.
In addition, when predicting traits, the prior art usually relies on a single modeling approach such as Stacking (stacked ensembling) or Bayesian model averaging (BMA). However, Stacking and Bayesian model averaging have high computational complexity, consume substantial resources, place high demands on hardware, and are prone to overfitting. A method for screening universal markers for multiple agronomic traits, together with an optimized prediction system that addresses the combined prediction of multiple agronomic traits comprehensively, economically, and efficiently, is therefore of particular importance to the field.

Disclosure of Invention

The invention aims to provide a universal SNP molecular marker for predicting multiple phenotypic traits of corn, a screening method and a prediction system thereof, which predict multiple corn phenotypic traits simultaneously, enable early selection of corn germplasm, shorten the generation interval, and reduce genotyping cost and the breeding cycle.
The invention provides a screening method for SNP molecular markers related to multiple phenotypic traits of corn, comprising the following steps:
(1) obtaining genotype data and phenotype data of corn from a database, wherein the phenotype data comprise corn stalk rot resistance, yield, grain moisture content, plant height, ear height, relative ear position, number of standing stalks, silking period, flowering period, and grain test weight;
(2) performing quality control on the genomic data, removing SNP markers without genetic variation, and converting the character-type SNP base data into 4-dimensional vectors to obtain initial features;
(3) calculating the mutual information between each initial feature and the phenotype data, and retaining the features whose mutual information is greater than or equal to 0.025 to obtain first dimension reduction features;
(4) performing LD filtering on the first dimension reduction features to obtain second dimension reduction features;
(5) screening the second dimension reduction features based on LASSO regression to obtain features having a linear influence on the phenotype data;
(6) screening the second dimension reduction features based on a random forest algorithm to obtain nonlinear features contributing strongly to the phenotype data;
(7) taking the union of the features having a linear influence on the phenotype data and the nonlinear features contributing strongly to the phenotype data to obtain third dimension reduction features, which are the candidate SNP molecular markers.
Steps (5) and (6) may be performed in either order.
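
Steps (2) to (4) above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the patented implementation: the base-to-vector table, the median split used to discretize each phenotype, the plug-in mutual information estimator, and the use of a squared correlation of major-allele indicators as a stand-in for LD r² are all assumptions. Only the thresholds 0.025 and 0.56, the averaging of mutual information across phenotypes, and the rule of keeping the SNP with the highest mutual information among linked SNPs come from the text. All function names are hypothetical.

```python
import math
from collections import Counter

# Assumed base-probability table in (A, C, G, T) order; not listed in the source.
BASE_TO_VEC = {
    "A": (1.0, 0.0, 0.0, 0.0),
    "C": (0.0, 1.0, 0.0, 0.0),
    "G": (0.0, 0.0, 1.0, 0.0),
    "T": (0.0, 0.0, 0.0, 1.0),
    "R": (0.5, 0.0, 0.5, 0.0),  # IUPAC code for an A/G heterozygote
    "Y": (0.0, 0.5, 0.0, 0.5),  # IUPAC code for a C/T heterozygote
    "N": (0.25, 0.25, 0.25, 0.25),  # unknown base
}


def drop_monomorphic(snps):
    """Step (2) QC: drop SNPs with one observed genotype (like nunique() == 1)."""
    return {name: calls for name, calls in snps.items() if len(set(calls)) > 1}


def encode(calls):
    """Step (2): map each base call to its 4-dimensional vector."""
    return [BASE_TO_VEC[c] for c in calls]


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def binarize(values):
    """Split a continuous phenotype at its median (a simplifying assumption)."""
    med = sorted(values)[len(values) // 2]
    return [int(v >= med) for v in values]


def mean_mi(calls, phenotypes):
    """Step (3): average mutual information of one SNP across all phenotypes."""
    return sum(mutual_information(calls, binarize(p)) for p in phenotypes) / len(phenotypes)


def ld_r2(calls_a, calls_b):
    """Squared Pearson correlation of major-allele indicators (proxy for LD r^2)."""
    def indicator(calls):
        major = Counter(calls).most_common(1)[0][0]
        return [1.0 if c == major else 0.0 for c in calls]
    a, b = indicator(calls_a), indicator(calls_b)
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov * cov / (va * vb)


def screen(snps, phenotypes, mi_min=0.025, r2_max=0.56):
    """Steps (2)-(4): QC, MI threshold, then greedy LD pruning by descending MI."""
    snps = drop_monomorphic(snps)
    mi = {name: mean_mi(calls, phenotypes) for name, calls in snps.items()}
    ranked = sorted((n for n in snps if mi[n] >= mi_min), key=lambda n: -mi[n])
    kept = []
    for name in ranked:
        if all(ld_r2(snps[name], snps[k]) <= r2_max for k in kept):
            kept.append(name)
    return kept
```

Here snps maps a marker name to a list of per-sample base calls, and phenotypes is a list of per-trait value lists in the same sample order. Visiting SNPs in descending order of mutual information means that, within any group of linked SNPs, the one retained is the one with the highest mutual information. Steps (5) to (7) would then take the union of the markers selected by LASSO and by random forest from the returned list.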
Preferably, step (1) further comprises, after obtaining the genotype data and the phenotype data of the corn from the database, ensuring one-to-one correspondence between genotypes and phenotypes through unique sample IDs and removing corn samples with missing IDs or incomplete phenotype data; step (2) performs quality control of the genomic data by removing markers for which nunique() == 1 in Pandas; the method for converting the SNP base data of the original