
Legume Genomics and Genetics 2025, Vol.16, No.2, 91-99 http://cropscipublisher.com/index.php/lgg

additive, and sometimes dominant. How to combine them depends on the target trait and the available data. Environmental information should not be overlooked either. Factors such as planting location, year, and management practice are heterogeneous, but including them makes the model more stable. Especially when genotype-by-environment interaction is considered, these variables supply additional information and help improve the adaptability of predictions (Norberg et al., 2019).

4.2 Accuracy comparison and error analysis across ML models

Everyone wants to find the "strongest model", but in practice different algorithms perform quite differently on different datasets. Ensemble methods such as Random Forest (RF) and Gradient Boosting Machine (GBM) are reliable at capturing nonlinear and complex relationships. SVM is not the most complex model, but it offers fast computation and acceptable accuracy, especially when the amount of data is large (Chakraborty et al., 2020). Deep learning is also often mentioned. The multi-layer perceptron (MLP), for example, can achieve high accuracy, but it demands large sample sizes and careful hyperparameter tuning; if it is poorly tuned, it overfits easily (Chandra and Goyal, 2021; Nguyen et al., 2021).

Which model is best? There is no one-sentence answer. Usually, metrics such as R², RMSE, and MAE must be compared to see the actual error. Performance on the training set and on the test set should also be compared to check for overfitting (Robinson et al., 2017; Zhou et al., 2021). In the end, model choice rarely rests on a single candidate: several types are tried, and the one whose performance is stable under cross-validation is selected.
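The comparison procedure described in Section 4.2 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic (a single hypothetical predictor `x` and a simulated trait `y`), and three toy learners (a mean baseline, a one-variable least-squares fit, and a 1-nearest-neighbour rule) stand in for RF, GBM, SVM, or MLP, which would normally come from a machine-learning library. The sketch only shows how R², RMSE, and MAE are computed and how a gap between training and test error under k-fold cross-validation signals overfitting.

```python
import math
import random
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy learners (hypothetical stand-ins). Each returns a predict(x) function.
def fit_mean(xs, ys):
    y_bar = mean(ys)
    return lambda x: y_bar


def fit_linear(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_1nn(xs, ys):
    # Memorises the training data, so training error is zero by construction.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def cross_validate(fit, xs, ys, k=5, seed=0):
    """Average train/test metrics over k shuffled folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    rows = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        x_tr = [xs[j] for j in train_idx]
        y_tr = [ys[j] for j in train_idx]
        x_te = [xs[j] for j in test_idx]
        y_te = [ys[j] for j in test_idx]
        model = fit(x_tr, y_tr)
        pred_tr = [model(x) for x in x_tr]
        pred_te = [model(x) for x in x_te]
        rows.append({
            "train_rmse": rmse(y_tr, pred_tr),
            "test_rmse": rmse(y_te, pred_te),
            "test_mae": mae(y_te, pred_te),
            "test_r2": r2(y_te, pred_te),
        })
    return {key: mean([row[key] for row in rows]) for key in rows[0]}


# Synthetic data (assumption): a linear signal plus Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0.0, 10.0) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

models = {"mean": fit_mean, "linear": fit_linear, "1nn": fit_1nn}
results = {name: cross_validate(fit, xs, ys) for name, fit in models.items()}
for name, scores in results.items():
    print(name, {key: round(value, 3) for key, value in scores.items()})
```

In this toy setting the 1-nearest-neighbour rule reproduces its training data exactly, so its training RMSE is zero while its test RMSE is not; that gap is the overfitting signal the text refers to. With real breeding data the same loop would wrap library models, and the candidate with stable test metrics across folds would be preferred.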
4.3 Optimization of multi-trait joint prediction strategies

When there is more than one target, for instance predicting yield and protein content at the same time, a different approach is needed. Multi-task learning frameworks and model stacking are commonly used at present. They share genetic and environmental information across traits to improve overall performance. Adding feature selection or hybrid models can usually improve both accuracy and interpretability (Chakraborty et al., 2020). However, how these strategies are combined still depends on specific requirements, such as whether cross-environment prediction is needed or whether multiple correlated traits are involved. No single recipe applies.

5 Multi-Environment Trials and Environmental Interaction Modeling

5.1 Impact of G×E interactions on prediction accuracy

Soybean performance across environments is not easy to predict. Even for the same genotype, changes in climate or in planting and management practices can shift trait values. This genotype-by-environment (G×E) interaction often leads breeders to underestimate or overestimate the true potential of a line (Burgueno et al., 2011; Li et al., 2024). Especially for traits such as yield or protein content, which are influenced by multiple factors, predictions that ignore G×E are often unreliable, and the stability of the selected materials is hard to guarantee. This is where multi-environment trials (MET) and more complex statistical modeling come into play: not all models can capture this interaction, but those that do come closer to real performance.

5.2 Model expansion with environmental covariates (e.g., E-GP, reaction norm models)

The poor performance of many predictive models may lie not in the "genes" but in having overlooked the "environment".
Once data on soil, climate, and management practices are incorporated into the model, the picture changes. Newer methods such as the reaction norm model, E-GP, and factor-analytic mixed models are aimed precisely at the G×E problem. They can make the model both more accurate and easier to interpret (Piepho and Williams, 2024). Moreover, their advantage lies not only in handling current data better but also in predicting genotype performance in untested environments, and even in helping to design management plans suited to a given environment (Mumford et al., 2023). Such models are useful tools for breeding varieties that tolerate climate fluctuations.
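The idea of a reaction norm can be sketched in a few lines, under strong simplifying assumptions. The example below fits, for each genotype separately, a straight line of phenotype against a single environmental covariate and then evaluates that line at a covariate value no trial covered. The genotype names `G1` and `G2`, the covariate (seasonal rainfall in mm), and all yield values are hypothetical. This per-genotype regression is a deliberately simplified stand-in: the reaction norm and E-GP models discussed above are typically fitted as mixed models with many covariates and genomic relationships, which this sketch does not attempt.

```python
from statistics import mean


def fit_reaction_norms(records):
    """Fit intercept and slope per genotype.

    records: iterable of (genotype, env_covariate, phenotype) tuples.
    Returns {genotype: (intercept, slope)}.
    """
    by_genotype = {}
    for genotype, w, y in records:
        by_genotype.setdefault(genotype, []).append((w, y))

    norms = {}
    for genotype, obs in by_genotype.items():
        w_bar = mean([w for w, _ in obs])
        y_bar = mean([y for _, y in obs])
        sww = sum((w - w_bar) ** 2 for w, _ in obs)
        swy = sum((w - w_bar) * (y - y_bar) for w, y in obs)
        slope = swy / sww  # sensitivity of the genotype to the covariate
        norms[genotype] = (y_bar - slope * w_bar, slope)
    return norms


def predict(norms, genotype, env_covariate):
    """Predicted phenotype of a genotype at a given covariate value."""
    intercept, slope = norms[genotype]
    return intercept + slope * env_covariate


# Hypothetical trial data: (genotype, rainfall in mm, yield in t/ha).
trials = [
    ("G1", 300, 3.0), ("G1", 400, 3.1), ("G1", 500, 3.2),  # stable line
    ("G2", 300, 2.5), ("G2", 400, 3.2), ("G2", 500, 3.9),  # responsive line
]
norms = fit_reaction_norms(trials)

# Predict in two environments that were not part of the trials.
for rainfall in (350, 600):
    ranking = sorted(norms, key=lambda g: predict(norms, g, rainfall), reverse=True)
    print(rainfall, ranking)
```

In this made-up data the two lines cross: the stable genotype ranks first in the drier untested environment and the responsive genotype ranks first in the wetter one. Such rank changes are the practical face of G×E, and they are why a prediction that ignores the environmental covariate can select the wrong material.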
