CMB_2025v15n3

Computational Molecular Biology 2025, Vol.15, No.3, 141-150 http://bioscipublisher.com/index.php/cmb 142 as an example. Traditional methods rely on the scanning of known short sequence motifs, making it difficult to comprehensively consider the impact of a broader sequence background on gene expression. Furthermore, although physical and structural simulation methods (such as protein molecular dynamics or energy functions) are precise, their computational costs are high, making it difficult to apply them on a large scale at the whole-genome or proteome scale. In conclusion, traditional methods are often confined to local or existing knowledge and lack the ability to automatically extract deep patterns from massive sequences, which restricts a comprehensive understanding of the functions of biological sequences. This article aims to comprehensively review the current situation and trends of pre-trained language models empowering biological sequence research, providing useful references for researchers in related fields. 2 Background on Biological Sequences and Representation 2.1 Structure and characteristics of DNA, RNA, and protein sequences The sequences of biological macromolecules include nucleic acid sequences (DNA and RNA) and protein sequences, which each have their own characteristics in structure and properties. DNA (deoxyribonucleic acid) is composed of four nucleotides (A, T, C, and G), and usually exists in the form of A double helix and double strands. The two opposite parallel strands maintain a stable structure through base pairing (A-T, C-G). DNA sequences carry genetic information and generate RNA through transcription. RNA (ribonucleic acid) is composed of four bases: A, U, C, and G. It is generally a single-stranded structure, but it can form secondary structures such as hairpins locally. RNA in cells not only serves as a messenger for gene expression but also has catalytic or regulatory functions. DNA and RNA sequences mainly function by encoding proteins or regulatory elements (Pan and Shen, 2018). Protein sequences are composed of 20 kinds of amino acid residues and are the products of gene translation. The type and sequence of amino acids (primary structure) determine how proteins fold into specific spatial conformations (tertiary structure) and thereby perform biological functions. Protein sequences possess diverse chemical properties: The differences in hydrophobicity and charge among various amino acids enable proteins to form complex secondary structures (α -helices, β -folds, etc.) and domain modules. There are often specific conserved motifs or functional domains in sequences that are crucial for protein functions. Therefore, a protein sequence is not merely a string of letters; it also contains rich structural and functional information. Understanding the composition and characteristics of DNA, RNA and protein sequences is the foundation for applying computational models to analyze biological sequences(Helaly et al., 2020). 2.2 Traditional encoding schemes (e.g., one-hot, k-mer, PSSM) When applying computational models to analyze biological sequences, it is first necessary to convert the sequences into digital representations. Traditionally, researchers have proposed various intuitive coding schemes to represent DNA, RNA or protein sequences. One-hot encoding is the most fundamental representation method, which represents the identity of each base or amino acid with a high-dimensional sparse vector (for example, a 20-dimensional vector is used for a protein sequence, and only the position of the residue is 1). Single-heat encoding is simple and straightforward, and is often used as the input feature of traditional machine learning models. However, its drawback is that it fails to reflect the similarity between symbols and does not contain any contextual information (Gupta et al., 2024). The K-mer fragment representation method splits the sequence into consecutive subsequence fragments according to a window of fixed length k, and encodes these fragments as lexical units. For example, DNA sequences can be represented by hexonucleotide fragments of k=6, and each sequence is regarded as a collection of these 6-mer "words". By statistically analyzing the k-mer frequency or mapping the k-mer to an embedding vector, local sequence patterns can be captured to a certain extent. k-mer is widely applied in tasks such as genomic sequence classification and motif discovery. However, it should be noted that k-mer only focuses on local fragments of length k, and long-range relationships beyond the window cannot be reflected (Ng, 2017, Matougui et al., 2020).

Made with FlippingBook

RkJQdWJsaXNoZXIy MjQ4ODYzNA==