Cotton Genomics and Genetics 2025, Vol.16, No.3, 148-162
http://cropscipublisher.com/index.php/cgg

at one time (Sun et al., 2022). However, strict quality control (QC) is crucial for both chip and sequencing data. Common QC steps include removing markers with high missing or error rates, filtering out markers with low minor allele frequency (MAF) to reduce background noise, and removing samples with abnormal heterozygosity or duplicate identity. High-quality data help ensure reliable model training. Cotton GS studies reportedly retain only thousands to tens of thousands of high-quality SNPs for modeling. It is also necessary to consider how homologous fragments and structural variations unique to polyploid cotton affect genotyping. The recently constructed cotton genome variation database (CottonGVD) integrates a large amount of SNP, Indel, and structural variation information from different cotton species. Using these resources, breeders can obtain the genotype of the target material more comprehensively. In essence, the quality of the genotype data directly affects the accuracy of the prediction model. Only strict QC, grounded in a full understanding of the genetic variation and diversity structure of cotton, can lay a solid foundation for subsequent GS modeling (Peng et al., 2021).

2.3 Association analysis methods between genetic variation and prediction models

Association analysis between genetic variation and trait phenotype is a key link in building genomic prediction models. In cotton breeding, traditional genome-wide association studies (GWAS) and quantitative trait locus (QTL) mapping have been widely used to discover loci that affect yield, quality, and resistance (Tan et al., 2024). However, GWAS can usually detect only a few significant main-effect loci and struggles to capture most of the minor-effect genes underlying complex quantitative traits.
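Before turning to the prediction models themselves, the QC steps described above (missing-rate and MAF filters on markers, heterozygosity and duplicate checks on samples) can be made concrete with a minimal sketch. Everything specific here is an assumption for illustration and not taken from the cited studies: the function name `qc_filter`, the 0/1/2 genotype coding with NaN for missing calls, and all threshold values. Filtering on genotyping error rate is left out because it would require replicate genotype calls.

```python
import numpy as np


def qc_filter(genotypes, max_missing=0.10, min_maf=0.05,
              max_het=0.30, dup_identity=0.99):
    """Return (filtered_matrix, kept_sample_idx, kept_marker_idx).

    genotypes: samples x markers array coded 0/1/2 (alternate-allele count),
    with np.nan for a missing call. All thresholds are illustrative.
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)

    # 1. Markers: drop those with a high missing-call rate.
    missing_rate = 1.0 - called.mean(axis=0)
    # 2. Markers: drop those with a low minor allele frequency (MAF).
    n_called = np.maximum(called.sum(axis=0), 1)
    alt_freq = np.nansum(g, axis=0) / (2.0 * n_called)
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    keep_markers = (missing_rate <= max_missing) & (maf >= min_maf)
    g, called = g[:, keep_markers], called[:, keep_markers]

    # 3. Samples: drop those with abnormally high heterozygosity.
    het = (g == 1).sum(axis=1) / np.maximum(called.sum(axis=1), 1)
    keep_samples = het <= max_het

    # 4. Samples: drop the later member of any near-identical pair.
    n_samples = g.shape[0]
    for i in range(n_samples):
        if not keep_samples[i]:
            continue
        for j in range(i + 1, n_samples):
            if not keep_samples[j]:
                continue
            both = called[i] & called[j]
            if both.any() and np.mean(g[i, both] == g[j, both]) >= dup_identity:
                keep_samples[j] = False

    return g[keep_samples], np.flatnonzero(keep_samples), np.flatnonzero(keep_markers)


# Toy input: 5 samples x 5 markers (made-up values, not real cotton data).
nan = float("nan")
toy = [
    [0, 2, 0, nan, 0],
    [2, 0, 0, nan, 0],
    [0, 2, 0, nan, 2],
    [1, 1, 0, 1, 1],    # heterozygous at every called marker
    [0, 2, 0, nan, 0],  # identical to the first sample
]
filtered, kept_samples, kept_markers = qc_filter(toy)
print(filtered.shape, kept_samples, kept_markers)
```

Markers are filtered before samples in this sketch so that sample heterozygosity and identity are computed only on markers that passed QC; other orderings are possible and may suit a given dataset better.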
Genomic prediction models make better use of minor-effect genes by integrating the effects of all markers. Commonly used GS statistical models include GBLUP, RR-BLUP, BayesA/B/C, and LASSO. In principle these methods resemble multivariate regression; they differ in their prior assumptions about marker effects. In practice, the model can be selected according to the genetic architecture of the trait. For example, for traits such as fiber quality that may be controlled by a few large-effect QTLs, Bayesian models sometimes perform better, while for highly polygenic traits such as yield, RR-BLUP and other models that assume uniformly small effects are more robust (Budhlakoti et al., 2022). Machine learning algorithms such as random forests (RF) and support vector machines (SVM) do not need to assume linear additivity and can capture nonlinear interactions between markers, so they have been introduced into GS to improve prediction. For example, one study used a machine learning model to predict the response genes of cotton under low-temperature stress, with an accuracy significantly higher than that of the traditional linear model. During GS model training, cross-validation or an independent validation set is usually used to evaluate prediction accuracy, measured for example by the correlation coefficient between predicted and observed values or by the root mean square error. It is worth noting that cotton is an allotetraploid and that interaction between its A and D subgenomes may affect the prediction model, so homologous genes and linkage disequilibrium structure should be fully considered. Based on association analysis and combined with genetic parameters (such as marker variance and heritability), training set selection and weighted modeling strategies can also be optimized (Billings et al., 2022).
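To illustrate how a whole-genome model and its cross-validated accuracy fit together, the sketch below uses plain ridge regression on all markers as a stand-in for RR-BLUP and reports the mean Pearson correlation across k folds. It is a simplified illustration under stated assumptions, not the procedure of any cited study: the shrinkage parameter `lam` is fixed by hand (RR-BLUP software typically estimates it from variance components), the function names are hypothetical, and the genotypes and phenotypes are simulated.

```python
import numpy as np


def ridge_marker_effects(X, y, lam):
    """Solve (X'X + lam*I) b = X'y; X and y are assumed to be centered."""
    n_markers = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_markers), X.T @ y)


def cross_validated_accuracy(X, y, lam=1.0, k=5, seed=0):
    """Mean Pearson r between observed and predicted phenotypes over k folds."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        # Center with training-set statistics only, to avoid leakage.
        x_mean = X[train_idx].mean(axis=0)
        y_mean = y[train_idx].mean()
        beta = ridge_marker_effects(X[train_idx] - x_mean,
                                    y[train_idx] - y_mean, lam)
        y_hat = (X[test_idx] - x_mean) @ beta + y_mean
        scores.append(np.corrcoef(y[test_idx], y_hat)[0, 1])
    return float(np.mean(scores))


# Simulated example: 200 lines, 500 markers, many small additive effects.
rng = np.random.default_rng(42)
n_lines, n_markers = 200, 500
X = rng.integers(0, 3, size=(n_lines, n_markers)).astype(float)
true_effects = rng.normal(0.0, 0.1, size=n_markers)
y = X @ true_effects + rng.normal(0.0, 1.0, size=n_lines)

# lam is set to the ratio of residual variance to marker-effect variance
# used in the simulation (1.0 / 0.1**2); with real data it must be estimated.
print(cross_validated_accuracy(X, y, lam=100.0, k=5))
```

Swapping `ridge_marker_effects` for a Bayesian or machine learning predictor while keeping the same fold loop would allow models to be compared on equal terms for a given trait.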
Discovering and modeling associations between genetic variation and phenotype is a process of continuous, iterative optimization. Choosing analysis methods sensibly and adjusting them to the genetic characteristics of cotton can improve the performance of genomic prediction models and promote the successful implementation of predictive breeding.

3 Application of AI Algorithms in Cotton Breeding

3.1 Machine learning algorithms

With its strong nonlinear modeling capabilities, machine learning is increasingly becoming an important tool for cotton breeding data analysis. Classic machine learning algorithms such as random forest (RF), support vector machine (SVM), and gradient boosting trees have been used in cotton genomic prediction and trait mining research. For example, the random forest algorithm aggregates a large number of decision trees and can evaluate the relative importance of each genetic marker to the trait, thereby supporting accurate prediction of complex quantitative traits. Dhaliwal et al. (2022) combined random forest with long-term field trial data to predict the yield performance of cotton under different conservation tillage measures. The model not only gave high-precision
