
Computational Molecular Biology 2025, Vol.15, No.3, 112-121
http://bioscipublisher.com/index.php/cmb

the model considers key, and then directly measure the expression changes (Lin and Jiang, 2021). The results are very consistent with the predictions. Only such independent verification can show that the model's output has real biological significance and make it more convincing in scientific research and clinical applications.

6 Challenges and Latest Developments

6.1 Data sparsity and sample bias issues

In gene expression prediction, deep learning models train well only when the data are both plentiful and diverse, and this is precisely the difficulty. High-quality expression profiles are hard to obtain, covering a species comprehensively is costly, and sample sizes often fall short of what very large models require. Even once data are collected, other problems remain: highly and lowly expressed genes are often unbalanced in number, so the model tends to favor the majority class, and rare low-expression patterns may be ignored altogether (Wang and Hu, 2023). If the training set is concentrated on certain tissues or experimental conditions, the model will generalize poorly to conditions it has not seen. Experimental measurements are also noisy, and a model that overfits this noise loses generalization ability. Addressing these problems requires work on two fronts: continuing to accumulate larger and more diverse datasets, and, during training, adopting strategies such as data augmentation and loss weighting to reduce bias so that the model is more stable and generalizes better (Jaichitra et al., 2023).
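
The two training-side strategies named above, data augmentation and loss weighting, can be illustrated with a minimal sketch. It is not taken from the cited studies. It assumes a binary high/low expression label, uses reverse-complement augmentation as one common choice for DNA input, and uses inverse-frequency class weights (the same heuristic as the "balanced" option in scikit-learn). All function names are hypothetical.

```python
import math
from collections import Counter

# Illustrative sketch only: function names and the binary label are assumptions.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def augment_with_reverse_complement(seqs, labels):
    """Double the training set by adding each sequence's reverse complement
    with the same label (assumes the label does not depend on strand)."""
    aug_seqs = list(seqs) + [reverse_complement(s) for s in seqs]
    aug_labels = list(labels) + list(labels)
    return aug_seqs, aug_labels


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so that rarer classes receive larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def weighted_bce(y_true, y_prob, class_weights, eps=1e-12):
    """Mean binary cross-entropy in which each sample's loss is scaled
    by the weight of its true class."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += class_weights[y] * loss
    return total / len(y_true)


if __name__ == "__main__":
    seqs = ["ACGTTG", "TTGACA", "GGGCCC", "ATATAT"]
    labels = [0, 0, 0, 1]  # 1 = minority class in this toy example
    aug_seqs, aug_labels = augment_with_reverse_complement(seqs, labels)
    weights = inverse_frequency_weights(aug_labels)
    probs = [0.1] * len(aug_labels)  # a model that always leans to the majority class
    print(weights)
    print(weighted_bce(aug_labels, probs, weights))
```

With these weights, a misclassified minority-class sample contributes more to the loss than an equally wrong majority-class sample, which discourages the model from ignoring rare expression patterns. Whether reverse-complement augmentation is appropriate depends on the task, since some regulatory signals are strand-specific.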
6.2 The balance between model interpretability and biological significance

Although deep learning has strong predictive power, once a model becomes too complex, researchers struggle to work out how it reaches its conclusions. They want to know not only what the model outputs but also why. If the model is a complete black box, then however high its accuracy, it does little to advance biological understanding. Some groups have worked on the model architecture itself, adding interpretable modules or prior constraints that tie the internal structure of the network to biological processes and make it inherently more transparent (Zhang et al., 2022). Other teams use visualization methods to open the black box, looking for clues in the motifs or attention distributions the model has learned and turning them into biological hypotheses for verification (Hanczar et al., 2020). Improving interpretability, however, should not come at the cost of performance. This calls for a community-wide effort to develop new networks and algorithms that maintain prediction accuracy while making models more "explainable".

6.3 Frontier research trends and potential breakthrough directions

New ideas continue to emerge in gene expression prediction research. A notable recent trend is the introduction of large pre-trained models into genomics: researchers first train general deep models on massive genomic sequences and then fine-tune them for specific tasks such as predicting gene expression. DNABERT is a case in point. It is pre-trained by treating DNA as a "language" and provides strong sequence representations for downstream predictions (Dong et al., 2024a). Meanwhile, generative deep learning has also begun to be applied to regulatory sequence design.
With the help of these generative models, scientists can synthesize new regulatory elements with specific expression effects, which holds great promise for gene therapy and synthetic biology (Yang et al., 2025). As algorithms, computing power, and data all continue to advance, new models and methods are expected to follow, deepening our understanding of gene regulatory networks and broadening the range of practical applications.

7 Case Study: Prediction of Gene Expression in Specific Species

7.1 Case background and research objectives

For this case, we chose corn (Zea mays), a commonly used model crop. Its genome is large and complex, and gene expression is subject to multiple layers of regulation, which makes prediction more challenging. Distal enhancers in particular influence gene expression through the spatial structure of chromatin. However, traditional methods often consider only the sequences proximal to genes, and three-dimensional genomic
