
Computational Molecular Biology 2025, Vol.15, No.3, 141-150
http://bioscipublisher.com/index.php/cmb

Research Insight   Open Access

Pretrained Language Models for Biological Sequence Understanding

Haimei Wang
Hainan Institute of Biotechnology, Haikou, 570206, Hainan, China
Corresponding author: haimei.wang@hibio.org

Computational Molecular Biology, 2025, Vol.15, No.3   doi: 10.5376/cmb.2025.15.0014
Received: 02 Apr., 2025   Accepted: 13 May, 2025   Published: 04 Jun., 2025
Copyright © 2025 Wang. This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Preferred citation for this article: Wang H.M., 2025, Pretrained language models for biological sequence understanding, Computational Molecular Biology, 15(3): 141-150 (doi: 10.5376/cmb.2025.15.0014)

Abstract   Pre-trained language models (PLMs) are increasingly becoming transformative tools in the life sciences, capable of autonomously learning rich representations from massive amounts of biological sequence data. Through self-supervised training, they capture complex patterns and long-range dependencies in DNA, RNA, and protein sequences, effectively compensating for the limitations of traditional bioinformatics methods. This paper reviews the progress of PLMs in biological sequence understanding, covering model principles and their applications in protein function prediction, gene expression regulation, and structural modeling. It focuses on a case study using the ESM-2 model to predict the impact of mutations on protein stability and compares it with traditional methods. Finally, it analyzes challenges such as data scarcity, model interpretability, and computational cost, and looks ahead to the deeper integration of artificial intelligence and molecular biology.
These advancements indicate that pre-trained models are driving a paradigm shift in biological sequence research.

Keywords   Pre-trained language model; Biological sequence; Protein function prediction; Gene regulation; Protein structure prediction

1 Introduction
In recent years, with the development of high-throughput sequencing technology, the volume of biological sequence data has grown explosively. This vast repository of DNA, RNA, and protein sequences offers unprecedented opportunities for data-driven approaches. However, most of these sequences lack experimental annotations, and the biological laws they encode are difficult to fully uncover by traditional means. The success of natural language processing (NLP) suggests a way out of this predicament: if we can "read" biological sequences the way we understand human language, it may be possible to extract the structural and functional information implicit in them. Pre-trained language models were introduced into bioinformatics for precisely this reason. By self-supervised pre-training on massive sequence collections, they learn the complex relationships between sequence elements and are regarded as artificial intelligence tools that "understand" the language of life. This approach is expected to reveal the "grammatical" rules behind sequences, such as how combinations of amino acids determine protein folding and function, or how the arrangement of nucleotides affects gene regulation (Yang et al., 2023; Wang et al., 2024). In short, applying language models to biological sequence analysis aims to fully exploit massive data to mine the deep information in biological sequences, providing a brand-new perspective and technical toolkit for understanding living systems.

Traditional bioinformatics methods face many limitations when dealing with biological sequences. First, most traditional tools rely on hand-crafted features and simplifying assumptions.
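As an illustration (not taken from this paper), the self-supervised pre-training objective described above is typically masked-token prediction: a fraction of residues is hidden and the model must recover them from context. The sketch below, with illustrative names and an arbitrary example sequence, shows only the masking step that produces such training targets; the predictive model itself is omitted.

```python
import random

def mask_sequence(seq, mask_rate=0.15, mask_token="<mask>", seed=1):
    """Randomly hide a fraction of residues, as in masked-language-model
    pretraining: a model is then trained to predict the hidden tokens
    from the surrounding sequence context."""
    rng = random.Random(seed)
    tokens = list(seq)
    targets = {}  # position -> original residue the model should recover
    for i, residue in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = residue
            tokens[i] = mask_token
    return tokens, targets

# Arbitrary short protein fragment, for illustration only.
tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
print(tokens)   # sequence with some residues replaced by <mask>
print(targets)  # positions and residues a PLM would learn to predict
```

Because no residue is ever deleted, the masked positions together with `targets` always reconstruct the original sequence, which is what makes the objective self-supervised: the labels come from the data itself.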
For instance, protein functions are often inferred through homologous sequence alignment, with conserved-site information captured via sequence similarity or position-specific scoring matrices (PSSMs). However, such alignment-based strategies often fail for sequences lacking known homologs and cannot uncover functional relationships between distantly related sequences (Song et al., 2021). Furthermore, some feature representations (such as one-hot encoding or fixed-length k-mer fragments) fail to capture sequence context, ignoring long-range dependencies and higher-order patterns. Take the identification of gene regulatory elements
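To make the limitation concrete, the two fixed representations mentioned above can be sketched in a few lines of Python (an illustrative sketch, not code from this paper): one-hot encoding assigns each base a vector independent of its neighbors, and a bag-of-k-mers counts fixed-length fragments, so neither can express dependencies between distant positions.

```python
from itertools import product

BASES = "ACGT"

def one_hot(seq):
    """Encode each base independently; the vector for a position
    carries no information about its sequence context."""
    return [[1 if b == base else 0 for base in BASES] for b in seq]

def kmer_counts(seq, k=3):
    """Bag-of-k-mers: count fixed-length fragments, discarding both
    the order of k-mers and any dependency spanning more than k bases."""
    counts = {"".join(p): 0 for p in product(BASES, repeat=k)}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts

vec = one_hot("ACGT")            # 4 positions x 4 channels
counts = kmer_counts("ACGTACGT")  # ACG and CGT each occur twice
```

Two sequences with the same k-mer counts but different arrangements receive identical features, which is exactly the context-blindness that contextual embeddings from pre-trained language models are meant to overcome.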
