
Cotton Genomics and Genetics 2025, Vol.16, No.3, 148-162
http://cropscipublisher.com/index.php/cgg

country's intelligent cotton breeding system is still in its early stages, but with support from both the government and the market, the relevant infrastructure is gradually improving. At the national level, a major "digital seed industry" project has been deployed to support research and development of intelligent crop breeding technology, with cotton as one of the key targets. It is foreseeable that, in the near future, a phenotypic big data network covering the major cotton-growing test stations, a database integrating the genome information of major domestic cotton germplasm, and a set of open, shared AI breeding tools will be established for breeders. China's intelligent breeding system can exploit its latecomer advantage, learning from foreign experience and integrating large volumes of local data to achieve leapfrog development. This will raise the level of new cotton variety breeding in China and support the sustainable, healthy development of the national cotton industry.

6. Challenges and Future Directions

6.1 Limitations in data quality and model generalization

Although artificial intelligence shows great promise in cotton breeding, it still faces many challenges. The first is data quality. High-precision genotype and phenotype data are the basis for reliable prediction models, but large-scale, high-quality data are hard to obtain in practical breeding. For example, differences in field management and measurement errors across test sites introduce noise into phenotypic data, and genotyping may miss variants or produce typing errors (Ma et al., 2021). This noise directly degrades model training and prediction accuracy.
Therefore, data quality must be improved through replicated experiments, standardized measurements and strict data cleaning. In addition, cotton is an allopolyploid with a complex genome that makes accurate genotyping difficult, and some structural variants and homologous fragments may go undetected or be placed incorrectly (Sun et al., 2022). This missing information weakens the model's explanatory power for traits.

The second challenge is the limited generalization ability of models. A model trained on a specific population and environment is often difficult to apply directly to materials with a very different genetic background or to different ecological conditions. For example, a drought prediction model built on Xinjiang data may not be applicable to varieties in the cotton region of the Yellow River Basin. Models therefore need some capacity for transfer learning and adaptation. Many current GS models perform significantly worse outside the training set, which is one of the practical problems facing AI-based cotton breeding (Liu and Huang, 2022). Generalization can be improved in several ways: increasing the diversity of the training data to cover more genetic backgrounds and environments; introducing hierarchical models that embed population structure or environmental classes into the model; and using ensemble learning to improve robustness by fusing multiple models. High-dimensional marker data can easily lead to overfitting, so feature selection or regularization is needed to constrain model complexity. Another problem is that AI models depend heavily on their input data. When the genotype of a new material carries allelic variation that does not appear in the training set, the model's predictions for it may be unreliable. This suggests that the training data and model parameters should be updated continuously to keep them synchronized with the latest genetic diversity information.
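The point about allelic variation absent from the training set can be made concrete with a simple coverage check. The following is a minimal sketch, not a method from the cited studies: it assumes genotypes are coded as allele dosages (0, 1, 2) per marker, and the function names, the toy data and the 5% threshold are all hypothetical choices for illustration.

```python
from typing import List, Sequence, Set


def seen_alleles(training: Sequence[Sequence[int]]) -> List[Set[int]]:
    """Collect, per marker, the set of genotype codes observed in the training lines."""
    if not training:
        return []
    n_markers = len(training[0])
    seen: List[Set[int]] = [set() for _ in range(n_markers)]
    for line in training:
        if len(line) != n_markers:
            raise ValueError("all genotype vectors must have the same length")
        for j, code in enumerate(line):
            seen[j].add(code)
    return seen


def novel_markers(seen: List[Set[int]], candidate: Sequence[int]) -> List[int]:
    """Return indices of markers where the candidate carries a code unseen in training."""
    if len(candidate) != len(seen):
        raise ValueError("candidate must have one code per training marker")
    return [j for j, code in enumerate(candidate) if code not in seen[j]]


def needs_retraining(seen: List[Set[int]], candidate: Sequence[int],
                     max_novel_fraction: float = 0.05) -> bool:
    """Flag a candidate whose share of unseen genotype codes exceeds the threshold.

    The 5% default is an arbitrary illustrative value, not a recommendation.
    """
    if not seen:
        return True  # no training data at all: nothing is covered
    return len(novel_markers(seen, candidate)) / len(seen) > max_novel_fraction


# Hypothetical toy data: 3 training lines x 4 markers, coded as 0/1/2 dosages.
training = [
    [0, 0, 2, 1],
    [0, 2, 2, 1],
    [0, 0, 2, 0],
]
seen = seen_alleles(training)
candidate = [2, 0, 2, 1]  # marker 0 carries a code never seen in training

print(novel_markers(seen, candidate))     # [0]
print(needs_retraining(seen, candidate))  # True (1 of 4 markers is novel)
```

A check of this kind only reports where a new line falls outside the genotype states the model has seen; it does not measure prediction error. In practice it could serve as a trigger for adding the new material to the training population before predictions for it are trusted.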
Although the data and model challenges are significant, they are not insurmountable. As agricultural big data platforms are built out at the national level, cotton breeders will be able to share richer data resources. At the same time, advances in machine learning are providing new algorithms that improve small-sample learning and cross-domain generalization. Provided these shortcomings are acknowledged and actively addressed, artificial intelligence can be expected to serve cotton breeding practice more effectively.

6.2 Challenges in integrating and analyzing multi-omics data

The formation of cotton traits involves multiple levels of information, including the genome, transcriptome, epigenome and metabolome, as well as environmental factors. How to integrate multi-omics data effectively to improve breeding models is a frontier topic in AI-based breeding. On the one hand, multi-omics data are voluminous and heterogeneous. For example, a single cotton variety may have hundreds of millions of DNA methylation sites, tens of thousands of expressed genes and thousands of metabolites. These dimensions far exceed those of traditional genotype and phenotype data, and fusion analysis requires substantial computing power and new algorithms. On the
