CN-121999885-A - Multi-omics fusion phenotype prediction method and system based on the latent space of an autoencoder

Abstract

The invention provides a multi-omics fusion phenotype prediction method and system based on the latent space of an autoencoder, belonging to the technical field of bioinformatics and machine learning. The method comprises: preprocessing multi-source omics data such as genotype, metabolome and hyperspectral data and concatenating them into a high-dimensional matrix; training an autoencoder on the matrix in an unsupervised manner so that the data are nonlinearly compressed into a unified low-dimensional latent space to generate fusion features; constructing a supervised learning model such as XGBoost that takes the fusion features as input to predict agronomic traits; and calculating the genetic advantage and the fusion gain for a target trait and recommending an optimal omics usage strategy according to a preset decision rule. By means of the autoencoder, the invention effectively integrates heterogeneous multi-omics data, filters noise and mines nonlinear relationships between omics layers, thereby significantly improving the prediction accuracy for complex traits. The proposed quantitative decision framework also provides a scientific basis for efficiently allocating omics resources in breeding practice, balancing prediction performance against cost control.
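As an illustration only, the concatenation and latent-space compression steps summarized above might be sketched as follows. This is a minimal NumPy sketch, not the patented implementation: the synthetic data, the column counts, the single hidden layer, the linear decoder, and the names `zscore`, `fuse_blocks`, `train_autoencoder` and `make_synthetic_blocks` are all assumptions introduced here. The supervised step (for example XGBoost) is omitted; `latent` is the feature matrix that would be passed to it.

```python
import numpy as np


def zscore(block):
    """Scale every feature column to zero mean and unit variance."""
    std = block.std(axis=0)
    std[std == 0] = 1.0  # keep constant columns at zero instead of dividing by 0
    return (block - block.mean(axis=0)) / std


def fuse_blocks(blocks):
    """Standardize sample-aligned omics blocks and concatenate them column-wise."""
    return np.concatenate([zscore(b) for b in blocks], axis=1)


def train_autoencoder(x, latent_dim=3, epochs=2000, lr=0.1, seed=0):
    """Fit a ReLU encoder and linear decoder by full-batch gradient descent on MSE.

    Returns the encoding function and the per-epoch reconstruction loss.
    """
    rng = np.random.default_rng(seed)
    n_features = x.shape[1]
    w_enc = rng.normal(0.0, 0.1, (n_features, latent_dim))
    b_enc = np.zeros(latent_dim)
    w_dec = rng.normal(0.0, 0.1, (latent_dim, n_features))
    b_dec = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        pre = x @ w_enc + b_enc
        z = np.maximum(pre, 0.0)               # ReLU latent features
        err = z @ w_dec + b_dec - x            # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / err.size           # gradient of MSE w.r.t. reconstruction
        g_pre = (g_out @ w_dec.T) * (pre > 0)  # backpropagate through ReLU
        w_dec -= lr * (z.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        w_enc -= lr * (x.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)

    def encode(data):
        return np.maximum(data @ w_enc + b_enc, 0.0)

    return encode, losses


def make_synthetic_blocks(n_samples=60, seed=1):
    """Hypothetical stand-ins for genotype, metabolome and hyperspectral blocks."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_samples, 2))  # signal shared across blocks

    def block(n_cols):
        loadings = rng.normal(size=(2, n_cols))
        noise = 0.3 * rng.normal(size=(n_samples, n_cols))
        return shared @ loadings + noise

    return [block(6), block(4), block(5)]


blocks = make_synthetic_blocks()
fused = fuse_blocks(blocks)            # high-dimensional fusion input matrix
encode, losses = train_autoencoder(fused)
latent = encode(fused)                 # fusion features for the supervised model
print(fused.shape, latent.shape)
```

Standardizing each block before concatenation is one simple way to give the blocks a common scale; the claims describe richer, omics-specific preprocessing that this sketch does not attempt.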

Inventors

Yan Jiapei
LI XINGWANG
GUO MINGYUE
ZENG HAOWEN
GUO LIANGLIANG
XUE ZHIFEI

Assignees

Huazhong Agricultural University (华中农业大学)

Dates

Publication Date: 20260508
Application Date: 20251223

Claims (10)

1. A multi-omics fusion phenotype prediction method based on the latent space of an autoencoder, comprising: S1, preprocessing genomic data, at least one type of metabolomic data and at least one type of hyperspectral phenotype data from the same sample set to obtain a plurality of omics data matrices with aligned samples and uniform scales; S2, concatenating the preprocessed single-omics data matrices along the feature dimension to form a high-dimensional fusion input matrix; S3, constructing an autoencoder neural network comprising an encoder and a decoder, taking the high-dimensional fusion input matrix as input, and training the autoencoder with the objective of minimizing the reconstruction error of the input data; S4, constructing and training a supervised learning prediction model by taking the latent space feature matrix as the input features and the phenotype observations of the target agronomic trait as the output labels; S5, for the target agronomic trait, constructing single-omics prediction models that each use only one type of omics data and evaluating their prediction performance, calculating the genetic advantage GA of genomics and the fusion gain ΔPCC of the multi-omics fusion model, and, based on GA and ΔPCC, outputting a recommended omics usage strategy for the trait through a preset decision rule.
2. The method of claim 1, wherein the supervised learning prediction model in S4 is an XGBoost model.
3. The method of claim 1, wherein the encoder and the decoder of the autoencoder in S3 are both multi-layer fully connected neural networks, and the activation function is ReLU.
4. The method of claim 1, wherein the preprocessing in S1 comprises performing principal component analysis dimension reduction on the genomic data, performing batch effect correction on the metabolomic data, and performing smoothing and characteristic band screening on the hyperspectral data.
5. The method according to claim 1, wherein in S5, the genetic advantage GA is calculated as: GA = PCC_geno - mean(PCC_other); wherein PCC_geno is the prediction performance of the genotype single-omics model, and mean(PCC_other) is the average prediction performance of all other single-omics models; and the fusion gain ΔPCC is calculated as: ΔPCC = PCC_fusion - max(PCC_single); wherein PCC_fusion is the prediction performance of the multi-omics fusion model, and max(PCC_single) is the maximum prediction performance among all single-omics models.
6. The method according to claim 1, wherein the preset decision rule in S5 comprises: if GA > T1, using genomics only is recommended; if GA ≤ T1 and ΔPCC > 0, using multi-omics fusion is recommended; if GA ≤ T1 and ΔPCC ≤ 0, using the single omics type with the best prediction performance is recommended; wherein T1 is a preset positive threshold.
7. The method of claim 1, wherein the metabolomic data comprises five-leaf-stage leaf metabolomic data and/or seed metabolomic data.
8. The method according to claim 1, wherein in S3, the mean squared error is used as the reconstruction loss function of the autoencoder.
9. A system for carrying out the method of any one of claims 1-8, comprising: a data preprocessing module for standardizing and cleaning each type of single-omics data; a feature concatenation module for concatenating the processed multi-omics data into a high-dimensional matrix; an autoencoder module, comprising an encoder and a decoder, for training and generating latent space features; a predictive modeling module for constructing and optimizing a supervised learning prediction model based on the latent space features; and a decision analysis module for calculating the genetic advantage and the fusion gain and outputting an omics usage recommendation according to the decision rule.
10. The system of claim 9, wherein the predictive modeling module integrates the XGBoost algorithm and a hyperparameter optimization tool.
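The GA and fusion-gain formulas of claim 5 and the decision rule of claim 6 can be illustrated with a short sketch. It is a hedged example rather than the applicant's code: the function name `recommend_omics_strategy`, the dictionary layout, the key `"genotype"` and all PCC values and thresholds below are hypothetical.

```python
from statistics import mean


def recommend_omics_strategy(pcc_single, pcc_fusion, t1, geno_key="genotype"):
    """Apply the GA / fusion-gain decision rule to one target trait.

    pcc_single maps each single-omics model name to its prediction
    performance (PCC); pcc_fusion is the PCC of the fusion model;
    t1 is the preset positive threshold T1.
    """
    others = [pcc for name, pcc in pcc_single.items() if name != geno_key]
    ga = pcc_single[geno_key] - mean(others)             # genetic advantage
    fusion_gain = pcc_fusion - max(pcc_single.values())  # fusion gain

    if ga > t1:
        strategy = "genomics only"
    elif fusion_gain > 0:
        strategy = "multi-omics fusion"
    else:
        best = max(pcc_single, key=pcc_single.get)
        strategy = f"best single omics: {best}"
    return ga, fusion_gain, strategy


# Hypothetical PCC values for one trait, for illustration only.
example = {"genotype": 0.40, "metabolome": 0.50, "hyperspectral": 0.30}
ga, fusion_gain, strategy = recommend_omics_strategy(example, pcc_fusion=0.55, t1=0.05)
print(strategy)
```

In this made-up case genomics has no advantage over the other omics types and the fusion model beats the best single-omics model, so the rule falls through to the fusion recommendation.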

Description

Multi-omics fusion phenotype prediction method and system based on the latent space of an autoencoder

Technical Field

The invention relates to the technical field of bioinformatics and machine learning, and in particular to a multi-omics fusion phenotype prediction method and system based on the latent space of an autoencoder.

Background

As modern crop breeding systems evolve rapidly toward greater efficiency and intelligence, accurate prediction of complex agronomic traits has become a key scientific challenge affecting breeding efficiency and the pace of genetic improvement. Traditional phenotype evaluation, which relies on field measurement, has a long cycle and high resource consumption and is easily disturbed by fluctuations in the external environment such as climate and soil conditions, so measurement results are unstable and poorly repeatable. With the demand for genetic improvement continually increasing, an approach that depends on late-stage measurement can hardly support rapid screening and early selection of large-scale materials. The introduction of genomic selection (GS) alleviated this bottleneck to some extent: by predicting phenotypic performance from genome-wide molecular markers, it makes breeding decisions more forward-looking and efficient. However, for many complex traits that are jointly affected by environmental factors, metabolic regulation, developmental state and the like, the proportion of variation that genotype information alone can explain is limited, which creates a significant prediction bottleneck.
A single omics layer carries information in only one dimension and covers only a thin slice of regulation, so it can hardly capture the cross-level, cross-pathway regulatory logic of trait formation; as a result, models perform poorly on traits with complex nonlinear relationships or strong environmental sensitivity. Against this background, the rapid development of multi-omics technology offers a new technical path for breaking through existing prediction capabilities. Metabolomics reflects the physiological and biochemical state of plants at specific growth stages, and hyperspectral phenotyping captures, non-destructively, integrated characteristics such as canopy structure, pigment content and photosynthetic capacity; this information is often closely related to trait formation and is dynamic and environmentally responsive. With improving instrument performance and the spread of high-throughput platforms, researchers can observe crop growth and development at multiple levels, including genotype, metabolite profile, physiological state and external plant phenotype, forming a more complete descriptive framework for the regulatory mechanisms of complex traits. The introduction of multi-omics not only enriches the information sources for predictive modeling but also makes it possible to build models with stronger generalization ability by exploiting the complementarity between different omics layers.
How to achieve effective multi-omics fusion when heterogeneous data differ markedly in structure and noise level and the relationships between omics layers are complex, and how to use that fusion to improve the prediction of complex traits, has become a key scientific problem in breeding informatics and crop phenotype prediction. However: (1) different omics layers differ greatly in dimensionality, are noisy and are heterogeneously distributed; (2) simple concatenation easily introduces redundancy that disturbs model training; (3) complex nonlinear relationships exist among the omics layers that traditional linear models cannot learn; (4) traits depend strongly on the information source, and the optimal omics type differs from trait to trait. A new method is therefore needed that can automatically extract structure shared across omics layers and achieve efficient fusion when multi-omics information is highly heterogeneous, noise levels are inconsistent and relationships among variables are complex. Such a method breaks through the assumptions of traditional linear models, captures latent nonlinear association patterns among multi-layer data such as genotype, metabolome and hyperspectral data, and retains key biological signals while reducing the influence of redundancy and noise, thereby providing a more comprehensive, stable and generalizable predictive feature representation for complex agronomic traits. The establishment of the method provides key technical supp