Computational Molecular Biology 2025, Vol.15, No.3, 112-121 http://bioscipublisher.com/index.php/cmb

segments. Deep CNNs are well suited to detecting short sequence motifs in these encoded sequences, and their convolutional kernels often turn out to capture transcription factor binding sites (Choong and Lee, 2017). Transformers take a different approach: positional encodings are typically added so the model can perceive the relative relationships between positions along the sequence. In recent years, some groups have even treated DNA as a kind of "language", using word embeddings or pre-trained models to extract features (Chen et al., 2020). Only with a well-chosen encoding can a model make efficient use of the regulatory information in a sequence.

4.2 End-to-end gene expression prediction models with sequence input
Some studies feed the raw sequence directly into the model and let it predict gene expression values on its own. This end-to-end approach has been shown to work: for instance, CNNs have been used to predict expression levels in different tissues directly from genomic sequence with considerable accuracy. Enformer, which came later, went further still. It introduced the Transformer architecture and can process sequence spans of up to roughly 100 kb at a time; the correlation between its predictions and measured expression is markedly better than that of traditional CNNs, and it can also infer the impact of non-coding variants on expression (Stefanini et al., 2023). Many such end-to-end models use multi-task learning, predicting expression across multiple tissues or conditions simultaneously, to improve generalization. Because they extract regulatory information from the sequence automatically, without manual feature engineering, they have become a powerful tool for studying gene expression regulation (Ramprasad et al., 2024).
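To make the encoding step concrete, here is a minimal sketch of the one-hot representation described above, together with a toy motif scan showing why a convolutional kernel behaves like a binding-site detector. The base order A/C/G/T, the function names, and the PWM-style kernel are illustrative assumptions, not code from any cited model.

```python
import numpy as np

# Base order is an assumption; any fixed order works as long as it is consistent.
BASES = "ACGT"

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA string as an (L, 4) one-hot matrix; unknown bases (e.g. N) stay all-zero."""
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            mat[i, j] = 1.0
    return mat

def scan_motif(onehot: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a (k, 4) kernel along the sequence, scoring each position.

    This is exactly what one convolutional filter computes: a match score
    for a short motif at every offset in the encoded sequence.
    """
    k = kernel.shape[0]
    return np.array([np.sum(onehot[i:i + k] * kernel)
                     for i in range(onehot.shape[0] - k + 1)])
```

For example, a kernel equal to the 4x4 identity matrix scores the motif "ACGT": scanning it over an encoded sequence peaks at the position where "ACGT" occurs.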
4.3 Integrated prediction methods combining multi-omics data
DNA sequence alone cannot explain all of the variation in gene expression; this has long been recognized. As a result, multi-omics integrated prediction methods have gained popularity in recent years. Researchers incorporate epigenetic information, three-dimensional genome structure, and other data into sequence models to give them a more complete regulatory context. For instance, feeding chromatin accessibility and histone modification data into the model as additional features alongside the sequence helps it determine which fragments are active (Dong et al., 2024b). Others have incorporated three-dimensional chromatin interaction frequencies, making it easier for the model to identify the effect of distal enhancers on genes (Merelli et al., 2015). The results can be striking: after introducing long-range interaction information, the correlation of gene expression predictions rose from 0.46 to 0.93. Overall, these multi-omics integration models not only predict more accurately but also reveal more biological detail.

5 Model Performance Evaluation and Interpretability Analysis
5.1 Performance evaluation metrics and benchmark testing
There is no single universal metric for evaluating gene expression prediction models. A common practice is to compute the Pearson correlation between predicted and measured values; this linear correlation coefficient has become an almost obligatory figure in many papers (Mikhaylova and Thornton, 2019). Error metrics such as mean squared error are often reported alongside it (Ji et al., 2023). If the task is reframed as classifying genes into high and low expression, the evaluation changes again, and metrics such as precision and recall come into play.
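The two headline regression metrics just described can be sketched in a few lines; this is a generic illustration of Pearson correlation and mean squared error between predicted and measured expression vectors, with the array names chosen for clarity rather than taken from any cited study.

```python
import numpy as np

def pearson_r(pred: np.ndarray, obs: np.ndarray) -> float:
    """Pearson correlation between predicted and observed expression values."""
    pred_c = pred - pred.mean()
    obs_c = obs - obs.mean()
    return float((pred_c @ obs_c) / np.sqrt((pred_c @ pred_c) * (obs_c @ obs_c)))

def mse(pred: np.ndarray, obs: np.ndarray) -> float:
    """Mean squared error, the deviation metric often reported alongside r."""
    return float(np.mean((pred - obs) ** 2))
```

Note that Pearson r is invariant to linear rescaling of the predictions (a model can have r close to 1 while being badly miscalibrated), which is one reason an error metric like MSE is reported alongside it.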
Generalization is usually tested on fully independent held-out data. The same dataset is also used to compare the method against traditional machine learning models or earlier deep models to quantify the improvement. Some teams even cross-check their predictions against large expression databases, such as GTEx's multi-tissue expression profiles, to see whether the model holds up on real data. Only by combining multiple metrics with rigorous benchmark tests can a model's predictive ability be judged clearly.

5.2 Model interpretability techniques (feature visualization, attention mechanisms, etc.)
Deep learning predictions are often highly accurate, yet researchers have long struggled to figure out exactly what the model is "thinking". To open up these black boxes, a number of approaches have been developed. By