
Cotton Genomics and Genetics 2025, Vol.16, No.3, 148-162
http://cropscipublisher.com/index.php/cgg

predictions, but also provided an explanation of the factors affecting yield, which has reference value for practical agronomic decision-making. Support vector machines (SVM) perform well in small-sample modeling and high-dimensional data processing. Studies have combined SVM with cotton gene expression data to predict disease resistance genes with high accuracy. In addition, clustering algorithms can be used for genetic diversity analysis and kinship grouping of cotton germplasm resources, providing a basis for choosing parental combinations. It should be noted that machine learning models often have many hyperparameters, which need to be tuned through methods such as cross-validation to prevent overfitting and improve generalization. In cotton genomic selection, machine learning algorithms help capture non-additive interaction effects between markers and improve prediction accuracy. For example, Zhao et al. (2023) integrated machine learning methods into gene regulatory network analysis to identify key regulatory genes affecting cottonseed yield. This shows that machine learning can be used not only to predict trait values but also to discover important breeding factors. In breeding practice, machine learning models can also integrate phenotypic imaging data to automate the measurement and evaluation of cotton agronomic traits.

3.2 Construction and optimization of deep learning models

As a subfield of machine learning, deep learning is characterized by multi-layer neural networks that can automatically extract complex data features, and it is emerging in cotton genetic improvement research. Compared with traditional machine learning, deep learning models (such as convolutional neural networks, CNN; recurrent neural networks, RNN; and graph neural networks, GNN) have end-to-end learning capabilities and are particularly suitable for processing large-scale, high-dimensional genomic and phenotypic data. In cotton breeding, a typical application of deep learning is to combine it with high-throughput phenotypic imaging for trait prediction. For example, Li et al. (2024) used a deep convolutional neural network to analyze high-throughput image data of cotton fruiting branch angles, extracted phenotypic features related to genotype, and then used GWAS to locate key genes affecting fruiting branch angle, greatly improving the efficiency of genetic analysis of the trait. This combination of "deep phenotype+genome" provides a new paradigm for quantitative trait improvement.

In addition, deep learning can be used directly to build genomic prediction models. Budhlakoti et al. (2022) describe the DeepGS model, which feeds genome-wide SNP genotypes into a multi-layer neural network to capture the nonlinear relationship between markers and phenotypes and is reported to predict more accurately than GBLUP. In cotton, the study by Zhao et al. (2024) stands out: they combined convolutional neural networks with Transformers to develop DeepFDML, a deep model that specifically predicts functional methylation sites in the cotton genome. DeepFDML was trained as a CNN-Transformer hybrid network on thousands of known functional methylation sites and raised the area under the ROC curve from 0.65 to 0.82, significantly outperforming traditional methods. This shows that deep learning has distinct advantages in mining complex "gene-epigene-phenotype" relationships. Deep models, however, usually require large amounts of data, and training is computationally intensive, requiring high-performance computing resources such as GPUs (Yan et al., 2024).
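For readers unfamiliar with the metric quoted for DeepFDML, the area under the ROC curve (AUC) equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that calculation, not the evaluation code of the cited study; the function name `roc_auc` and the toy labels and scores are illustrative assumptions.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]     # scores of true functional sites (positives)
    neg = scores[~labels]    # scores of non-functional sites (negatives)
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative example")
    # Compare every positive score with every negative score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

# Toy example (made-up values): 2 positives, 2 negatives.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly: 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a change from 0.65 to 0.82 means a considerably larger share of positive/negative pairs is ordered correctly.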
In terms of model optimization, regularization, dropout and other techniques can alleviate overfitting, and hyperparameter tuning and architecture improvements (such as adding attention mechanisms or autoencoder pre-training) can improve model performance.

3.3 Integration and analysis of multi-environment data

In actual breeding, different ecological environments have a significant impact on the expression of cotton traits, so integrating multi-environment data to improve the generalization ability of models is an important topic. In traditional breeding experiments, multi-location regional trials are often used to evaluate the adaptability and stability of varieties. However, incorporating environmental factors into genomic prediction models remains challenging because environmental variables are often difficult to quantify and genes interact with the environment (G×E). Current research suggests that multi-environment genomic prediction can be improved by combining statistical models with machine learning. For example, the so-called “reaction norm model” adds environmental covariates to GS, or constructs environmental principal components, to explain G×E variation in genomic prediction (Budhlakoti et al., 2022). In cotton, approaches such as that of Jarquín et al. have been used to improve yield predictions across locations, but their application is not yet widespread.
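The idea of adding environmental covariates to genomic selection, usually called a reaction norm model, can be made concrete with a small sketch. It assumes the common formulation in which the G×E covariance between two records is the element-wise (Hadamard) product of a genomic similarity matrix and an environmental similarity matrix. The data are simulated, and the helper names `similarity_kernel` and `gxe_kernel` are hypothetical; this is an illustration of the covariance structure, not the pipeline of any cited study.

```python
import numpy as np

def similarity_kernel(features):
    """Linear kernel from column-standardized features, scaled to mean diagonal 1."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)
    sd = X.std(axis=0)
    keep = sd > 0                      # drop columns with no variation
    X = X[:, keep] / sd[keep]
    K = X @ X.T
    return K / np.mean(np.diag(K))

def gxe_kernel(G, Omega, geno_idx, env_idx):
    """Covariance of G×E effects between records: expanded G times expanded Omega, element-wise."""
    geno_idx = np.asarray(geno_idx)
    env_idx = np.asarray(env_idx)
    G_rec = G[np.ix_(geno_idx, geno_idx)]      # genomic similarity per record pair
    O_rec = Omega[np.ix_(env_idx, env_idx)]    # environmental similarity per record pair
    return G_rec * O_rec

rng = np.random.default_rng(0)
markers = rng.integers(0, 3, size=(5, 200))    # 5 lines x 200 SNPs coded 0/1/2 (simulated)
env_cov = rng.normal(size=(3, 4))              # 3 environments x 4 covariates (simulated)

G = similarity_kernel(markers)                 # genomic relationship among lines
Omega = similarity_kernel(env_cov)             # similarity among environments

# One record per line-by-environment combination (15 records in total).
geno_idx = np.repeat(np.arange(5), 3)
env_idx = np.tile(np.arange(3), 5)
K_gxe = gxe_kernel(G, Omega, geno_idx, env_idx)
```

In a kernel-based mixed model, `G`, `Omega` and `K_gxe` would each enter as the covariance of a separate random effect, so that a line can borrow information from related lines grown in similar environments.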
