Computational Molecular Biology 2025, Vol.15, No.3, 141-150 http://bioscipublisher.com/index.php/cmb 143

The position-specific scoring matrix (PSSM) uses evolutionary information to represent sequences and is usually constructed from a multiple sequence alignment. For a given protein or DNA sequence, homologous sequences are first collected and aligned; the occurrence frequency of the 20 amino acids (or 4 bases) at each position is then counted to form a probability distribution for that position, which becomes one column of the PSSM. A PSSM retains the evolutionary conservation of each position and is therefore more biologically meaningful than one-hot encoding. However, it assumes that positions are statistically independent, which makes it difficult to express coordinated changes between different positions. In addition, PSSM generation depends on existing homologous sequences, so its effectiveness is limited for sequences lacking rich homologs (Chia and Lee, 2022).

2.3 Need for contextual representation in biological data
Although the traditional encoding methods described above have advanced sequence analysis to some extent, they generally fail to capture a sequence's context dependence. In biological sequences, the functional influence of a given base or amino acid often depends on its sequence background. For instance, the effect of a transcription factor binding site in a DNA sequence may be enhanced or weakened by the combinatorial regulation of adjacent sequences. Similarly, whether an amino acid residue forms part of an enzyme's active site depends on its spatial neighborhood in the protein's tertiary structure. Static encodings such as one-hot or k-mer representations cannot assign different representations to the same element in different environments.
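As a minimal illustration of this limitation, the sketch below one-hot encodes two hypothetical toy DNA sequences: the base 'A' receives exactly the same vector in both, regardless of its surrounding context.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Static one-hot encoding: each base maps to a fixed vector,
    independent of its position or sequence context."""
    idx = {b: i for i, b in enumerate(BASES)}
    return np.eye(len(BASES))[[idx[b] for b in seq]]

enc1 = one_hot("GATC")  # 'A' at position 1
enc2 = one_hot("TTAG")  # 'A' at position 2
# the rows for 'A' are identical in both sequences,
# so the encoding carries no contextual information
```

A contextual model, by contrast, would produce different representations for the two occurrences of 'A' because their neighborhoods differ.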
This is similar to human language, where the meaning of a word changes with context and simple dictionary-style encodings cannot reflect such differences (Fang et al., 2021; Sanabria et al., 2024). In addition, biological sequences contain many long-range dependencies and patterns of coordinated change. For instance, distant amino acid pairs in proteins can remain mutually coupled through coevolution to sustain structural stability or function. Traditional feature representations often assume that sequence positions are independent of each other, or consider only local fragments, making it difficult to capture correlations that span the entire sequence. This limitation may cause a model to miss key functional clues or make misjudgments (He et al., 2024). It is therefore urgently necessary to introduce methods that represent the global context of a sequence, so that the representation of each sequence element dynamically reflects the environment it is in. Such contextualized representations have proven extremely effective in natural language processing and are highly anticipated in the field of biological sequences.

3 Foundations of Pretrained Language Models (PLMs)
3.1 Overview of NLP-based models: BERT, GPT, and transformers
The rise of pre-trained language models is rooted in the successful application of the Transformer neural network architecture to natural language processing. The Transformer models global dependencies in a sequence efficiently through the self-attention mechanism, in which the representation of each position can directly draw on information from any other position in the sequence. Compared with traditional recurrent neural networks (RNNs), Transformers can process sequences in parallel and capture long-range relationships, and thus perform outstandingly in tasks such as language modeling (Kalyan et al., 2021).
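The self-attention mechanism described above can be sketched in a few lines of NumPy. This is a deliberately simplified single-head version without the learned query/key/value projection matrices of a real Transformer layer; it only shows how each position's output mixes information from every other position.

```python
import numpy as np

def self_attention(x):
    """Minimal scaled dot-product self-attention (single head,
    no learned projections: queries, keys and values are x itself).
    x: array of shape (seq_len, d_model)."""
    d = x.shape[-1]
    # pairwise affinities between all positions, scaled by sqrt(d)
    scores = x @ x.T / np.sqrt(d)
    # row-wise softmax turns affinities into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output position is a weighted mix of all input positions
    return weights @ x

out = self_attention(np.random.default_rng(0).normal(size=(5, 8)))
```

Because the attention weights connect every pair of positions directly, the distance between two sequence elements does not limit their interaction, which is what allows Transformers to capture the long-range dependencies discussed above.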
Based on this architecture, several landmark NLP models have emerged, the most representative being the BERT and GPT series. BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional encoder model. It is composed of stacked Transformer encoders and is pre-trained on large-scale text corpora with a masked language modeling task: some words are randomly masked and the model must predict the missing words, thereby learning the semantic representation of each word in context. This pre-training enables BERT to generate deep contextual embeddings, and after fine-tuning on downstream tasks such as question answering and classification it achieves accuracy far exceeding that of previous methods. GPT (Generative Pre-trained Transformer) follows the autoregressive generative paradigm. The GPT series models (such as GPT-2, GPT-3, etc.) use the Transformer decoder and are trained to predict the next word in a sequence, a classic language-modeling objective. Through unsupervised reading of vast amounts of text, GPT models can naturally generate coherent text and demonstrate astonishing capabilities in