areas such as dialogue generation and writing assistance (Ramprasath et al., 2022). BERT and GPT represent the two major categories of pre-trained models, encoder-based and decoder-based, respectively, and their success demonstrates the powerful potential of pre-training combined with the Transformer architecture. The ideas behind both types of models are now being borrowed and extended to the field of biological sequences, providing new tools for analyzing DNA, RNA and protein sequences.

3.2 Self-supervised learning tasks adapted for sequences (e.g., masked language modeling, next-token prediction)
Pre-trained models owe much of their power to self-supervised learning: training tasks are constructed automatically from massive amounts of unlabeled data, allowing the model to capture the underlying statistical regularities of sequences. When the NLP pre-training paradigm is applied to biological sequences, these training tasks must be modified accordingly to fit the characteristics of DNA, RNA or protein sequences (Figure 1) (Kim et al., 2023).

Figure 1 (a) Overall architecture of GPCR-BERT. Amino acid sequences are tokenized and processed through Prot-Bert, followed by a regression head. (b) Structure of the Prot-Bert transformer and the attention layer. The input token embeddings are transformed into keys, queries, and values, which form the attention matrix; the output is passed through a fully connected neural network. This sequence of operations is iterated 30 times to reach the final output embedding of the GPCR sequence. (c) Representation of the top five amino acids most correlated with the first x (red) and second x (blue) of the NPxxY motif within a GPCR, obtained through the attention heads. The thickness of the lines represents the strength of the correlations (weights) (Adapted from Kim et al., 2023)
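To make the two pre-training objectives concrete, the minimal sketch below shows how BERT-style masked language modeling examples and GPT-style next-token prediction examples can be constructed from a protein sequence. It is not taken from the cited works; the token vocabulary, the 15% masking rate, the example sequence and the function names are illustrative assumptions.

```python
import random

# Illustrative sketch (not from the cited papers): building self-supervised
# training examples from a protein sequence, analogous to BERT-style masked
# language modeling and GPT-style next-token prediction.

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
SPECIAL = ["[PAD]", "[MASK]", "[CLS]", "[SEP]"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + AMINO_ACIDS)}

def make_mlm_example(sequence, mask_rate=0.15, seed=0):
    """Mask ~15% of residues; the model must recover them (BERT-style)."""
    rng = random.Random(seed)
    tokens = list(sequence)              # one token per amino acid
    labels = [-100] * len(tokens)        # -100 marks positions that are not scored
    for i, aa in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = VOCAB[aa]        # remember the true residue as the target
            tokens[i] = "[MASK]"         # hide it from the model
    input_ids = [VOCAB[t] for t in tokens]
    return input_ids, labels

def make_next_token_example(sequence):
    """Shift the sequence by one position; the model predicts each residue
    from the preceding prefix (GPT-style, left-to-right)."""
    ids = [VOCAB[aa] for aa in sequence]
    return ids[:-1], ids[1:]             # (inputs, targets)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # arbitrary example protein fragment
print(make_mlm_example(seq))
print(make_next_token_example(seq))
```

In practice, models such as Prot-Bert perform this masking on large unlabeled protein corpora, so the same idea scales directly from this toy construction to pre-training on millions of sequences.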