
Maize Genomics and Genetics 2025, Vol.16, No.6, 304-315 http://cropscipublisher.com/index.php/mgg 311

In simple terms, deep learning behaves like an automatic and efficient "black box", whereas Bayesian models behave like a controllable "toolkit". The former emphasizes computational speed; the latter emphasizes transparency. When the two are combined, prediction performance can often be pushed closer to the ideal.

6.3 Strategies for optimizing training population design and marker density

No matter how advanced a model is, it still needs reliable data to support it. How the training population is built and how marker density is chosen may look like details, but they often determine the overall performance of an MEGP model. Rather than blindly pursuing sample size, it is better to make the data more representative. Typically, researchers first select samples based on the genetic background and environmental characteristics of the target population, and then use cluster analysis or principal component analysis to capture the population structure (Gevartosky et al., 2021).

Interestingly, experimental results show that the training set does not have to be very large. If it is properly designed, a training set accounting for 2% to 13% of the total population is sufficient to keep prediction accuracy at a relatively high level. Cross-population prediction tends to perform better when the training set contains relatives of the validation population (Guo et al., 2019).

Increasing marker density and introducing functional or trait-specific markers are also common optimization methods. They can make the model more sensitive to complex genetic architectures and bring its predictions closer to the real situation (Roth et al., 2022).
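The training-set design idea in Section 6.3 (capture population structure first, then pick a small but representative subset) can be illustrated with a greedy maximin selection on principal-component scores. This is only a minimal sketch under stated assumptions, not the procedure used in the cited studies: the PC scores are assumed to have been computed beforehand from the marker matrix, and the function name `select_training_set`, the `fraction` parameter, and its default value are hypothetical.

```python
import math


def select_training_set(pc_scores, fraction=0.1):
    """Greedy maximin (farthest-point) selection of a representative subset.

    pc_scores: dict mapping line ID -> tuple of principal-component scores
               (assumed to be precomputed from the marker data).
    fraction:  share of the population to use for training (hypothetical default).
    """
    ids = sorted(pc_scores)
    n_select = min(len(ids), max(1, round(fraction * len(ids))))

    # Start from the line closest to the population centroid.
    dims = len(pc_scores[ids[0]])
    centroid = [sum(pc_scores[i][d] for i in ids) / len(ids) for d in range(dims)]
    first = min(ids, key=lambda i: math.dist(pc_scores[i], centroid))
    selected = [first]

    # Distance from every line to its nearest already-selected line.
    nearest = {i: math.dist(pc_scores[i], pc_scores[first]) for i in ids}

    while len(selected) < n_select:
        # Add the unselected line that is currently least well covered.
        remaining = [i for i in ids if i not in selected]
        candidate = max(remaining, key=lambda i: nearest[i])
        selected.append(candidate)
        for i in ids:
            d = math.dist(pc_scores[i], pc_scores[candidate])
            nearest[i] = min(nearest[i], d)
    return selected
```

Because each new line is chosen to be far from those already selected, distinct genetic clusters tend to be represented early, which matches the point that a small, well-designed training set can be enough. The sketch does not account for relatedness to a specific validation population, which the text identifies as an additional factor.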
7 Challenges and Limitations in Multi-Environment Prediction

7.1 Computational complexity and data dimensionality issues

In multi-environment genomic prediction (MEGP), the real difficulty is often not the algorithm itself but the sheer amount of data and the complexity of the relationships within it. With tens of thousands of SNP markers, dozens of environmental covariates, and their interactions, a model can easily get stuck in a "dimensional quagmire". The more parameters there are, the harder they are to estimate, and computation time grows accordingly.

Some researchers have attempted to solve this problem using machine learning or kernel-based methods, such as CatBoost, XGBoost or deep kernel models. These methods can improve efficiency to a certain extent, but they are not a cure-all. Once a model contains too many variables, it may still overfit even if dimensionality reduction is handled properly. Others have proposed reducing the dimension of the genetic data while retaining the main environmental variables, which can maintain predictive ability and reduce the computational burden. The problem is that such trade-offs sometimes sacrifice precision, especially in complex situations where the interaction between genes and the environment is strong. In other words, there is always a tension between computational feasibility and predictive reliability.

7.2 Limited transferability across populations and environments

Even if a model performs well on the original data, the results are often less satisfactory once the environment changes. The portability problem of MEGP models has been raised repeatedly: prediction accuracy is highest when the environments of the training set and the test set are similar, but it decreases significantly once the model is applied to a new region or to untested conditions (Li et al., 2021).
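One way to make this drop in accuracy visible is leave-one-environment-out validation: each environment is held out in turn, a model is trained on the remaining environments, and accuracy is measured as the correlation between predicted and observed phenotypes in the held-out environment. The sketch below is illustrative only. It assumes the data are available as (genotype, environment, phenotype) records, and it uses a genotype-mean predictor as a placeholder baseline rather than any MEGP model from the cited work; the function names are hypothetical.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation; returns NaN if either sequence has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def leave_one_environment_out(records):
    """records: list of (genotype, environment, phenotype) tuples.

    For each environment, predict every genotype from its mean phenotype
    in the other environments (placeholder baseline) and report the
    correlation with the observed values. Returns {environment: accuracy}.
    """
    environments = sorted({env for _, env, _ in records})
    accuracy = {}
    for held_out in environments:
        # Training step: genotype means over all other environments.
        sums, counts = defaultdict(float), defaultdict(int)
        for geno, env, y in records:
            if env != held_out:
                sums[geno] += y
                counts[geno] += 1
        # Test step: genotypes seen both in training and in the held-out environment.
        observed, predicted = [], []
        for geno, env, y in records:
            if env == held_out and counts[geno] > 0:
                observed.append(y)
                predicted.append(sums[geno] / counts[geno])
        accuracy[held_out] = pearson(observed, predicted)
    return accuracy
```

Replacing the genotype-mean baseline with a fitted model leaves the splitting logic unchanged. An environment whose genotype ranking differs from the others would be expected to show lower held-out accuracy under this scheme.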
More interestingly, the key factor influencing prediction performance seems to be environmental proximity rather than genetic similarity. That is, the model relies more on "environmental memory" than on genetic relationships. In some cases, adding relatives of the validation population to the training set can improve cross-environment prediction, but this is not always achievable in actual breeding. A further difficulty is that the genotype × environment (G×E) interaction itself is extremely complex. Even the most advanced nonlinear models may not fully capture all variation in a new environment (Alves et al., 2021). The generalization of MEGP models therefore remains an unsolved problem.
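The notion of environmental proximity can be made concrete by comparing environments on their covariates. The following is a minimal sketch, assuming each environment is described by a fixed-length vector of numeric covariates (for example temperature and rainfall); standardized Euclidean distance is one simple choice of similarity measure and is not claimed to be the one used in the cited studies. All names are hypothetical.

```python
import math


def standardize(env_covariates):
    """Scale each covariate to zero mean and unit variance across environments."""
    names = sorted(env_covariates)
    n_cov = len(env_covariates[names[0]])
    means = [sum(env_covariates[e][k] for e in names) / len(names)
             for k in range(n_cov)]
    sds = []
    for k in range(n_cov):
        var = sum((env_covariates[e][k] - means[k]) ** 2 for e in names) / len(names)
        sds.append(math.sqrt(var) or 1.0)  # guard against constant covariates
    return {e: [(env_covariates[e][k] - means[k]) / sds[k] for k in range(n_cov)]
            for e in names}


def rank_training_environments(env_covariates, target, training):
    """Return training environments ordered from most to least similar to target."""
    z = standardize(env_covariates)
    return sorted(training, key=lambda e: math.dist(z[e], z[target]))


# Hypothetical covariates: (mean temperature in deg C, seasonal rainfall in mm).
covariates = {
    "E1": (25.0, 600.0),
    "E2": (26.0, 620.0),
    "E3": (18.0, 300.0),
    "New": (25.8, 615.0),
}
ranking = rank_training_environments(covariates, "New", ["E1", "E2", "E3"])
```

A ranking of this kind could be used to check, before deployment, whether a target environment has any close neighbor among the training environments; if it does not, the transferability concerns described above apply with particular force.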
