2.3 Genetic control of protein content and amino acid composition
Protein is not only a matter of "how much", but also of "what kind", and the genetic control of total content and of amino acid composition is actually quite different. Varieties such as Danbaikong carry high-protein QTL alleles on chromosome 20. In some genetic backgrounds, protein content can be raised without a significant yield penalty, although protein is often negatively correlated with both oil content and yield (Patil et al., 2017). Studies have found that the genes influencing total protein content and the mechanisms regulating the amino acid profile may be distinct, yet the two often operate within the same network (Duan et al., 2023; Hu et al., 2023), which is why protein traits appear particularly "tangled". Breakthroughs in genomics, transcriptomics and proteomics are now changing this picture: Guo et al. (2022) and Liu et al. (2023) have identified many candidate genes and regulatory modules, laying the foundation for breeding high-protein, nutritionally fortified soybean varieties.

3 Overview of Genomic Prediction and Machine Learning Methods
3.1 Common machine learning algorithms (RR-BLUP, SVM, RF, GBM, DNN, etc.)
The earliest and still most commonly used models in genomic prediction are rather "old-fashioned" linear models such as RR-BLUP and GBLUP. They are fast to fit and straightforward in logic, and work especially well when traits are mainly controlled by additive effects (Rao et al., 2025). However, these linear methods struggle with non-additive effects and complex interactions, and this is where SVM and SVR, which can handle nonlinearity, come in handy (Wang et al., 2022). Ensemble methods such as Random Forest (RF) and GBM later gained popularity because they are better at capturing complex relationships. In recent years, deep learning has also joined the fray: architectures such as MLPs, CNNs and other DNNs have been employed to capture higher-order patterns from large numbers of markers, especially when non-additive structure is pronounced (Montesino-Lopez et al., 2021). Of course, they are not always "invincible"; much still depends on the volume of data and the type of trait.
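To make the linear-versus-nonlinear contrast concrete, here is a minimal sketch that fits a ridge regression (a close analogue of RR-BLUP, with the penalty weight standing in for the shrinkage parameter) and a random forest to a simulated 0/1/2 genotype matrix; the data, sizes and hyperparameters are illustrative and not taken from the studies cited above.

```python
# Minimal sketch: ridge regression (an RR-BLUP analogue) vs. a random forest
# on simulated 0/1/2 genotypes; sizes and hyperparameters are illustrative.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_lines, n_snps = 300, 2000
X = rng.integers(0, 3, size=(n_lines, n_snps)).astype(float)   # 0/1/2 allele dosages
beta = np.zeros(n_snps)
causal = rng.choice(n_snps, 50, replace=False)                 # 50 causal markers
beta[causal] = rng.normal(0.0, 0.5, 50)
y = X @ beta + rng.normal(0.0, 1.0, n_lines)                   # additive trait + noise

models = {
    "ridge (RR-BLUP-like)": Ridge(alpha=100.0),                # alpha ~ shrinkage strength
    "random forest": RandomForestRegressor(n_estimators=300, random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean 5-fold CV R^2 = {score:.2f}")
```

On a purely additive simulated trait like this one, the linear model usually comes out ahead, which mirrors the point above that the best model depends on trait architecture and data volume.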
3.2 Feature selection and dimensionality reduction techniques (PCA, LASSO, Boruta, etc.)
High-dimensional data are much more troublesome to process: when there are far more features than samples, models are prone to overfitting. PCA is often used to compress the dimensions first, retaining most of the variation while eliminating redundancy (Monaco et al., 2021; Conard et al., 2023). Regularization methods such as LASSO are also commonly used; they shrink the coefficients of weakly informative variables exactly to zero, thereby filtering out irrelevant features (Lourenco et al., 2022). Wrapper methods such as Boruta are even more "meticulous": they repeatedly compare the real features against shuffled "shadow features" and keep only those that contribute more than chance (Lopez et al., 2023). These techniques are particularly crucial for deep learning, because such models easily overfit when features are many and samples are few.
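As a minimal sketch of the first two strategies, the code below runs PCA and LASSO on a simulated genotype matrix; all settings are illustrative. (Boruta, the wrapper approach, is available in the Python boruta package but is omitted here to keep the example dependency-light.)

```python
# Minimal sketch: PCA and LASSO on simulated genotypes; settings illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(300, 2000)).astype(float)
y = X[:, :50] @ rng.normal(0.0, 0.5, 50) + rng.normal(0.0, 1.0, 300)

# PCA: compress the marker matrix into the components explaining 95% of variance
pcs = PCA(n_components=0.95).fit_transform(X)
print(f"PCA: {X.shape[1]} markers -> {pcs.shape[1]} components")

# LASSO: the L1 penalty shrinks coefficients of irrelevant markers exactly to zero
lasso = LassoCV(cv=5, random_state=1).fit(X, y)
kept = np.flatnonzero(lasso.coef_)
print(f"LASSO kept {kept.size} of {X.shape[1]} markers")
```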
3.3 Model training, cross-validation, and generalization performance evaluation
Whether a model is well trained cannot be judged merely by its performance on the training set. Cross-validation is a basic skill here: whether K-fold or a simple hold-out split, the purpose is to test whether performance remains stable on new data (Lopez et al., 2023). Hyperparameter tuning is also indispensable, especially for nonlinear models; grid search, or automatic optimization with AutoML, has basically become standard practice. For evaluation, metrics such as predictive correlation, RMSE or AUC are generally reported (Abdollahi-Arpanahi et al., 2020). Many studies now also use ensemble prediction, combining the results of multiple models; although more complex, the results are usually more stable (Azodi et al., 2019). As for interpretability, there are now ways to open up "black box" models such as deep networks: tools like SHAP can show which variables contribute most to a prediction (Watson, 2021).
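A minimal sketch of this workflow, assuming simulated data: a small grid search is tuned inside each training fold while an outer K-fold loop supplies held-out predictions, which are then scored by predictive correlation and RMSE.

```python
# Minimal sketch: nested evaluation with a small grid search tuned inside
# each training fold and held-out predictions scored outside; illustrative.
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(300, 500)).astype(float)
y = X[:, :30] @ rng.normal(0.0, 0.5, 30) + rng.normal(0.0, 1.0, 300)

inner = GridSearchCV(                                    # tuning: inner loop
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [2, 3]},
    cv=3,
)
outer = KFold(n_splits=5, shuffle=True, random_state=0)  # testing: outer loop
y_hat = cross_val_predict(inner, X, y, cv=outer)         # out-of-fold predictions

r, _ = pearsonr(y, y_hat)                                # predictive correlation
rmse = float(np.sqrt(np.mean((y - y_hat) ** 2)))
print(f"predictive correlation = {r:.2f}, RMSE = {rmse:.2f}")
```

For interpretability, a tool such as the shap package can then be applied to the fitted model to attribute predictions to individual markers, in the spirit of the SHAP analyses cited above.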
4 Model Construction and Prediction Performance Comparison
4.1 Input variable design: SNP encoding and integration of environmental factors
Not all input data can be used for prediction directly. For complex traits such as soybean yield and protein, the variables need to be designed before modeling. SNP information often needs to be encoded first - sometimes
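A minimal sketch of one common input design, assuming additive 0/1/2 allele-dosage encoding (the text above is truncated, so this is an assumption, not the authors' pipeline): each SNP call becomes a count of the non-reference allele, and standardized environmental covariates are appended as extra feature columns. The helper encode_additive and all values are hypothetical toy data.

```python
# Minimal sketch of one common input design (an assumption, not the paper's
# pipeline): additive 0/1/2 SNP encoding plus standardized environmental
# covariates appended as extra feature columns. All names/values are toy data.
import numpy as np

def encode_additive(genotypes, ref="A"):
    """Count non-reference alleles per call, e.g. 'AA' -> 0, 'AG' -> 1, 'GG' -> 2."""
    return np.array(
        [[sum(allele != ref for allele in call) for call in row] for row in genotypes],
        dtype=float,
    )

genotypes = [["AA", "AG", "GG"],      # 2 soybean lines x 3 SNPs (toy data)
             ["AG", "GG", "AA"]]
env = np.array([[23.5, 410.0],        # e.g. mean temperature (C), rainfall (mm)
                [26.1, 380.0]])

X_snp = encode_additive(genotypes)
env_std = (env - env.mean(axis=0)) / env.std(axis=0)   # covariates on a common scale
X = np.hstack([X_snp, env_std])       # design matrix: markers + environment
print(X)
```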