
Computational Molecular Biology 2025, Vol.15, No.3, 141-150 http://bioscipublisher.com/index.php/cmb

Masked language modeling (MLM) is a pre-training task borrowed from BERT: some symbols in a sequence are masked at random, and the model must predict the original symbols at the masked positions from the surrounding context (Uribe et al., 2022). For protein sequences, it is common to randomly mask, for example, 15% of the amino acids and have the model fill in the gaps, thereby learning the internal constraints and patterns of the sequence. Successfully predicting a masked residue means the model has captured how the rest of the sequence constrains that position (similar to inferring the amino acid at a position from the other residues in a multiple sequence alignment). MLM can also be applied to DNA sequences, for example by randomly masking some bases and having the model restore them from the flanking sequence, thereby learning the language rules of genomic sequences.

Autoregressive sequence modeling (next-token prediction) requires the model to predict the next symbol in natural sequence order, step by step, similar to the training of GPT. For biological sequences, autoregressive pre-training can be used to build sequence generation models. For instance, by training a Transformer to predict the next amino acid from the preceding ones, the model gradually learns to generate fragments that resemble natural proteins. Such generative models can be used to design entirely new protein sequences or to extend partial sequence fragments. For DNA, autoregressive models can simulate the continuation of chromosomal sequences, thereby capturing the statistical characteristics of genomes. These self-supervised tasks make full use of the abundance of biological sequence data and can train models without manual annotation.
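The masking procedure described above can be sketched in a few lines. The following is a minimal illustration (not the implementation of any specific model) of BERT-style masking applied to a protein sequence: roughly 15% of positions are selected as prediction targets, and of those, 80% are replaced with a mask token, 10% with a random amino acid, and 10% left unchanged. The `<mask>` token and the 20-letter alphabet are assumptions for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
MASK = "<mask>"

def mask_sequence(seq, mask_rate=0.15, rng=None):
    """BERT-style corruption: select ~mask_rate of positions as targets;
    of those, 80% become MASK, 10% a random residue, 10% unchanged."""
    rng = rng or random.Random(0)
    tokens = list(seq)
    labels = [None] * len(tokens)  # original residue at each selected position
    for i, aa in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = aa           # the model must recover this residue
            r = rng.random()
            if r < 0.8:
                tokens[i] = MASK
            elif r < 0.9:
                tokens[i] = rng.choice(AMINO_ACIDS)
            # else: keep the original token (but still predict it)
    return tokens, labels

tokens, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

During training, the loss is computed only at positions where `labels` is not `None`, which is what forces the model to learn the contextual constraints among the remaining residues.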
In addition, studies have explored other variant tasks, such as masking consecutive fragments, predicting the position of a fragment within the whole sequence, or denoising and reconstructing randomly perturbed sequences (Luo et al., 2020). The common goal of these pre-training tasks is to let the model learn the internal structure and semantics of sequences as fully as possible without supervised signals, so as to provide universal representations for downstream biological analysis.

3.3 Differences between natural language and biological sequence modeling
There are significant differences between modeling natural language and modeling biological sequences. First, the basic alphabet of biological sequences is much smaller than a human language vocabulary (DNA, for example, has only four bases). Although the vocabulary can be expanded with the k-mer method, the model must capture more combinatorial patterns from a small number of letters. Second, biological sequences are often extremely long and lack clear segmentation (e.g., genomic DNA strings thousands of bases together), so models must handle long-range dependencies and implicit hierarchical structure, unlike sentences with clear grammatical boundaries. Furthermore, the "semantics" of a biological sequence corresponds to its biological function, so a model's understanding of sequences must be measured by performance on downstream tasks; in contrast, the semantic learning of natural language models can be verified directly by judging the meaning or grammaticality of sentences.
Finally, biological sequence data contain a large amount of homologous redundancy, which may bias the model toward frequent patterns but also offers an opportunity to exploit evolutionary correlation (such as inputting a group of homologous sequences together to learn covariation). Therefore, while drawing on NLP models, the model architecture and training strategies should be adjusted for these differences to suit the unique attributes and challenges of biological sequences.

4 Applications in Biological Sequence Understanding
4.1 Protein function prediction and variant effect analysis
Pre-trained protein language models provide a new approach to protein functional annotation. Traditional methods infer protein function mainly from sequence similarity, whereas large language models can extract deep features from sequences and predict functional attributes even in the absence of obvious homologs. For instance, when embeddings of protein sequences produced by such models are used for classification tasks, they perform well in remote homology detection and functional domain prediction, significantly improving functional recognition for rare proteins (Sun and Shen, 2025).
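One common way to use such embeddings for annotation transfer is nearest-neighbor lookup in embedding space. The sketch below is a minimal, hypothetical illustration: the 4-dimensional vectors and the function labels are invented stand-ins for real model embeddings (which would typically have hundreds of dimensions) and real annotations; only the cosine-similarity transfer scheme is the point.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def predict_function(query_emb, reference):
    """Assign the query the label of its most similar reference embedding,
    a simple stand-in for embedding-based annotation transfer."""
    label, _ = max(reference, key=lambda item: cosine(query_emb, item[1]))
    return label

# hypothetical labeled embeddings standing in for model outputs
reference = [("kinase",   [0.9, 0.1, 0.0, 0.2]),
             ("protease", [0.1, 0.8, 0.3, 0.0])]
predict_function([0.8, 0.2, 0.1, 0.1], reference)  # → "kinase"
```

Because similarity is computed in learned embedding space rather than on raw residues, this kind of transfer can succeed even when sequence identity to the nearest annotated protein is low, which is the intuition behind the remote homology results cited above.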
