
Legume Genomics and Genetics 2025, Vol.16, No.2, 91-99
http://cropscipublisher.com/index.php/lgg

6.2 Trait prediction of breeding materials in China’s Huang-Huai Region
Breeding conditions in the Huang-Huai region are far from ideal. The climate fluctuates widely and management practices vary considerably, so relying solely on traditional methods is difficult. Against this backdrop, machine learning models are valued because they can account for complex genotype-by-environment interactions. When combined with feature selection and ensemble models, customized predictions can better meet the region's practical needs. Whether the target is a variety with broad adaptability or material with relatively stable traits, the model can identify candidates in advance, which gives local breeding programs a head start (Parmley et al., 2019).

6.3 Prediction evaluation and selection efficiency in commercial high-protein/high-yield varieties
In commercial settings, target traits such as high protein and high yield were in the past predicted mainly with linear models, but machine learning is gradually becoming the primary approach. This is not because traditional models are ineffective; rather, the advantages of ML emerge when trait inheritance is complex and nonlinear interactions are pronounced. In particular, ensemble models or deep learning architectures combined with feature optimization can both improve prediction accuracy and shorten computing time (Yoosefzadeh-Najafabadi et al., 2021). For breeding companies, the efficiency of identifying high-quality varieties has improved. In other words, turning promising material into market-ready products more quickly and accurately is no longer a luxury.

7 Conclusions and Future Perspectives
Machine learning has performed well in genomic prediction in recent years, but it is not without flaws.
Methods such as regularized regression, ensemble models, and even deep learning architectures can better handle the complex genotype-phenotype relationships in high-dimensional data, especially when the genetic architecture of a trait is complex. However, no matter how high the accuracy, a model remains sensitive to its data: the scale and quality of the data and the target trait all directly affect performance. Deep models sometimes take several times longer to train yet perform only slightly better than traditional linear models. Moreover, many ML methods are inherently "black boxes" that cannot explain why a prediction succeeds, which is a serious drawback for breeding decisions, especially in the context of biological research.

Relying on a single data source is also not enough. More and more researchers are incorporating phenotypic, multi-omics (such as genomic, transcriptomic, and proteomic), and even remote sensing data to make models more comprehensive. Deep learning is popular in this regard and handles heterogeneous, multimodal data reasonably well. Once a model can capture signals from such complex combinations, it can predict traits influenced by both environmental and genetic factors more reliably. Moreover, combining public and commercial data, supplemented by interpretable modeling, appears to be a feasible approach for real breeding scenarios rather than just laboratory demonstrations.

Ultimately, the future development of ML-based breeding is a matter of addressing deficiencies in several directions. One is speed: no one wants to train a model for several days. Another is interpretability: the goal is not only to predict accurately but also to explain clearly. A third is how to connect multi-source data.
These data sources should not remain scattered across separate places. Automated and interpretable tools such as AutoML and iML are already emerging. Their goal is simple: to let breeders get started with these methods rather than being deterred by a collection of complex models. As data accumulate and species are studied in greater depth, these models will eventually have to serve as a core pillar of smart agriculture, especially in developing soybean varieties that are high-yielding, stress-resistant, and relatively high in protein.

Acknowledgments
We would like to express our gratitude to the reviewers for their valuable feedback, which helped improve the manuscript.
