Computational Molecular Biology 2025, Vol.15, No.3, 112-121 http://bioscipublisher.com/index.php/cmb

genomics, decoding the mechanisms of gene regulation and promoting precision medicine in the future might be a key approach (Drusinsky et al., 2024).

2 Genomic Sequence Data and Gene Expression Characteristics
2.1 Sources and processing methods of genomic sequences
For those engaged in genomic research, the first step is usually to find data. Most sequences come from reference genomes in public databases, or from sequencing projects whose results have been released. After obtaining the data, it is not as simple as throwing it into a model and calling it a day: preprocessing such as format conversion and quality control comes first. Which segments to extract depends on the research objective. To analyze promoters, for instance, one extracts the sequence immediately upstream of each gene; if long-range regulation is of interest, more distal regions must be included as well. Next comes the encoding the model requires, converting the four bases A, C, G, and T into numeric form. A common practice is one-hot encoding, in which each base is represented by a sparse vector of length 4. Only after these steps are completed can the sequences be fed into a deep learning model for training (El-Tohamy et al., 2024).

2.2 Measurement and standardization of gene expression data
To measure gene expression, RNA sequencing (RNA-seq) is now the standard, whereas microarrays were more common in the past. RNA-seq counts the reads mapping to each gene's transcripts, but these raw counts cannot be compared directly across samples and must be normalized first. Common metrics include RPKM and FPKM; many prefer TPM because it is more convenient for cross-sample comparison (Zhao et al., 2020).
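As a rough sketch of how these metrics are computed (the counts and gene lengths below are toy values invented for illustration, not from any real dataset):

```python
import numpy as np

# Hypothetical raw read counts for 4 genes in one sample,
# with the corresponding gene lengths in base pairs.
counts = np.array([100.0, 500.0, 250.0, 50.0])
lengths_bp = np.array([1000.0, 2000.0, 500.0, 4000.0])

def rpkm(counts, lengths_bp):
    """Reads Per Kilobase per Million mapped reads:
    depth-normalize first, then length-normalize."""
    reads_per_million = counts.sum() / 1e6
    return counts / reads_per_million / (lengths_bp / 1e3)

def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalize first, then
    depth-normalize, so values within a sample sum to 1e6."""
    rate = counts / lengths_bp
    return rate / rate.sum() * 1e6

print(tpm(counts, lengths_bp).sum())  # 1e6 by construction
```

The reversed order of the two normalization steps is why TPM is easier to compare across samples: TPM values always sum to one million within a sample, whereas RPKM/FPKM totals vary with the length distribution of the expressed genes.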
The purpose of normalization is to remove the differences caused by sequencing depth and gene length, so that expression levels can be compared on the same scale across samples. If necessary, the expression matrix can also be log-transformed and batch effects corrected to make the data cleaner. The processed expression data can then be paired with genomic sequences to train the model (Zhao et al., 2021).

2.3 Dataset integration and feature extraction strategies
For deep learning modeling, the data must not be left scattered. Sequences and expression results from different sources usually have to be merged into a unified dataset to ensure sufficient sample size and diversity. Integration, however, is not simple concatenation: data from different platforms must first be batch-corrected and normalized, or they simply cannot be compared. Preparing the features for model input also requires care. Although deep learning can learn end-to-end from the raw sequence, when data are scarce it can help to add some hand-crafted features, such as k-mer frequencies, GC content, or known transcription factor binding sites. When integrating multi-omics data, signals such as chromatin accessibility and histone modifications must additionally be aligned with the sequences according to their genomic positions. Only when integration and feature extraction are done well can the model capture the biological signal.

3 Fundamentals and Model Architecture of Deep Learning
3.1 Overview of common deep learning models (CNN, RNN, Transformer)
For genomic sequence analysis, the common models fall into a few types: convolutional neural networks (CNNs), recurrent neural networks (RNNs), and the Transformer, which has become very popular in recent years. Let's start with the Transformer.
It relies on its self-attention mechanism to relate positions in a sequence regardless of distance, and can handle regulatory information spanning hundreds of kilobases (Lan, 2024). CNNs, on the other hand, are more adept at mining local features, such as short motifs in DNA, and are frequently used for such tasks. RNNs (such as LSTMs) are naturally strong at handling sequential data and can capture certain long-term dependencies; however, they struggle with extremely long genomic sequences, so they are applied less often. Overall, CNNs are efficient, RNNs retain sequential information, and the Transformer excels at long-range dependencies.
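The local pattern matching that makes CNNs effective on DNA can be illustrated with a minimal numpy sketch. Here a hand-built one-hot motif stands in for a learned convolutional filter, and the sequence and motif are toy examples; in a real CNN the filter weights would be learned from data.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix (A, C, G, T)."""
    m = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        m[i, BASES.index(base)] = 1.0
    return m

def motif_scan(seq, motif):
    """Slide a one-hot motif filter along the sequence and score each
    window -- the cross-correlation a single filter in a CNN's first
    layer computes. A perfect match scores len(motif)."""
    x, w = one_hot(seq), one_hot(motif)
    k = len(motif)
    return np.array([(x[i:i + k] * w).sum() for i in range(len(seq) - k + 1)])

scores = motif_scan("GGTATAAAGC", "TATAAA")
print(int(scores.argmax()))  # prints 2: the motif starts at position 2
```

A trained convolutional layer does the same sliding dot product with many filters in parallel, which is why its first-layer filters are often interpretable as sequence motifs.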