
Legume Genomics and Genetics 2025, Vol.16, No.2, 91-99
http://cropscipublisher.com/index.php/lgg

Feature Review Open Access

Genomic Prediction of Yield and Protein Traits in Soybean Using Machine Learning Models

Xingde Wang, Tianxia Guo
Institute of Life Sciences, Jiyang College, Zhejiang A&F University, Zhuji, 311800, Zhejiang, China
Corresponding email: tianxia.guo@jicat.org

Legume Genomics and Genetics, 2025 Vol.16, No.2 doi: 10.5376/lgg.2025.16.0010
Received: 20 Feb., 2025 Accepted: 06 Apr., 2025 Published: 27 Apr., 2025

Copyright © 2025 Wang and Guo. This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Preferred citation for this article:
Wang X.D., and Guo T.X., 2025, Genomic prediction of yield and protein traits in soybean using machine learning models, Legume Genomics and Genetics, 16(2): 91-99 (doi: 10.5376/lgg.2025.16.0010)

Abstract Soybean is a globally significant food and plant-protein crop, and yield and protein content are its core breeding targets. However, these are complex quantitative traits shaped by genetic background, environment, and their interaction, which limits the efficiency of traditional phenotypic selection and genetic improvement. To enhance breeding efficiency and prediction accuracy, this study explored the applicability and effectiveness of multiple machine learning algorithms in the genomic prediction of soybean yield and protein traits. Using genotype (SNP) and phenotype data from multiple soybean breeding populations, models including RR-BLUP, support vector machine (SVM), random forest (RF), gradient boosting machine (GBM), and deep neural network (DNN) were constructed.
The models were combined with feature selection methods such as principal component analysis (PCA), LASSO, and Boruta, and their prediction accuracy and stability were systematically evaluated. The results show that nonlinear models (such as RF and GBM) generalize better for complex traits across multiple environments. A multi-trait joint prediction strategy further improved model performance on composite indicators such as protein yield. This study demonstrates the potential of machine learning techniques in the genomic prediction of complex quantitative traits, providing an efficient tool for assisted selection in soybean breeding and laying the foundation for intelligent, high-throughput breeding decision-making systems.

Keywords Soybeans; Genomic prediction; Machine learning; Yield traits; Protein content

1 Introduction
Soybean (Glycine max) has long been a mainstay of the global food system, especially as a source of protein and oil. It is indispensable both for human consumption and as feed. Demand for plant protein is increasing, and the value of soybean is expanding across food, industry, and global food security (Van Der Laan et al., 2024). Its strong adaptability, high protein content, and ability to grow in different climates make it a leading choice for filling nutritional gaps in the face of climate change and shifting diets (Gill et al., 2022). However, traditional breeding does not always keep pace with these demands. Breeding high-yield, high-protein varieties is no easy task: these traits usually involve multiple genes and often interact with the environment. Phenotyping alone is time-consuming and laborious, and long breeding cycles and high costs further limit how quickly breeding can be accelerated (Ray et al., 2022).
For this reason, an increasing number of studies are turning to genomic prediction (GP) and machine learning (ML). Supported by high-throughput phenotyping and genomic data, these tools can improve selection efficiency and, when multiple data sources are fused, can show stronger predictive power than traditional statistical models. Methods such as random forests and deep learning in particular often give more stable predictions when genomic, phenotypic, and environmental variables are combined (Yoosefzadeh-Najafabadi et al., 2021). The emergence of such models can be regarded as a "paradigm shift" in breeding. This study explored the application of genomic prediction and machine learning models in predicting soybean yield and protein traits, compared the predictive performance of various genomic and machine learning models,
