
Computational Molecular Biology 2025, Vol.15, No.3, 112-121 http://bioscipublisher.com/index.php/cmb

while the Transformer captures global context across the entire sequence. Researchers usually choose, or mix and match, architectures according to the task requirements (Figure 1) (Almotairi et al., 2024).

Figure 1 Hybrid Transformer-CNN architecture for predicting hemolytic activity of peptides (Adapted from Almotairi et al., 2024)

3.2 Model selection and architecture design principles
When developing a gene expression prediction model, there is no one-size-fits-all formula, but some basic principles cannot be bypassed. First, match the model to the input sequence: if the main regulatory information lies at the proximal end of the gene, a model such as a CNN that is good at capturing local features is sufficient. Once distal enhancers come into play, however, architectures that can handle long-distance dependencies, such as Transformers, become necessary (Tu et al., 2024). When weighing model complexity, also consider the amount of data: with few samples, avoid building too deep a network, as appropriate regularization gives more stable results; only with a large number of samples is there room to stack deeper layers. Some groups also use multi-task learning, letting the model predict several related outputs at once, which often improves generalization (Zeng et al., 2015). Finally, do not overlook biological knowledge: incorporating such priors appropriately can also enhance the interpretability of the model. Adhering to these practices makes both the performance and the biological plausibility of the model more reliable.

3.3 Model training, validation and hyperparameter optimization
Training a gene expression prediction model generally follows the standard supervised learning setup: known gene sequences and their corresponding expression values serve as the samples.
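As a minimal sketch of this supervised setup (with hypothetical placeholder data, not taken from any of the cited studies), pairing encoded sequences with expression values and holding out validation and test sets might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 1,000 genes, each represented by a length-200
# sequence window encoded as a (200, 4) matrix, paired with one
# measured expression value per gene.
n_genes = 1000
X = rng.random((n_genes, 200, 4))   # placeholder encoded sequences
y = rng.random(n_genes)             # placeholder expression values

# Shuffle once, then carve out 70% train / 15% validation / 15% test.
idx = rng.permutation(n_genes)
train_idx, val_idx, test_idx = idx[:700], idx[700:850], idx[850:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_val.shape, X_test.shape)
```

The validation split is the one used for tuning structure and hyperparameters; the test split is touched only once, at the very end.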
But simply throwing the data in is not enough; many details matter along the way. For instance, the data must first be divided into a training set, a validation set and a test set. The validation set is used to tune the model structure and hyperparameters, and also helps prevent overfitting (Makarova et al., 2021). An early-stopping strategy is sometimes added to keep the model from over-learning (Dorka et al., 2023). The loss function depends on the type of task: mean squared error is common for regression, while cross-entropy is typical for classification. Hyperparameters - such as the learning rate, network depth, and regularization coefficient - are usually tuned on the validation set through grid or random search. Finally, an independent test set verifies the result. Only after completing such a round can a model with stable performance and strong generalization ability be obtained.

4 Prediction Methods Based on Deep Learning
4.1 Identification and coding strategies of gene regulatory elements
There is more than one way to handle DNA sequences. The most common is, of course, one-hot encoding, which converts bases into sparse vectors and lets the model learn the characteristics of the regulatory elements on its own. Sometimes, however, researchers inject a bit of prior knowledge and directly label potential regulatory regions or specific motif positions in the input, to ensure that the model does not miss key signals.
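The one-hot encoding of bases described above can be sketched in a few lines (the helper name and the A/C/G/T column order are our own illustrative choices, not a published convention):

```python
import numpy as np

# Column order A, C, G, T is an arbitrary but fixed convention here.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq: str) -> np.ndarray:
    """Convert a DNA string into an (L, 4) one-hot matrix.

    Ambiguous bases such as N become all-zero rows, so the model
    receives no spurious signal for them.
    """
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        col = BASE_INDEX.get(base)
        if col is not None:
            mat[i, col] = 1.0
    return mat

# A 5-base example: four known bases plus one ambiguous N.
print(one_hot_encode("ACGTN"))
```

Each row is a sparse vector with a single 1 marking the base identity, which is exactly the input format that convolutional filters scan for motifs.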
