
For point mutations (single amino acid substitutions) in protein sequences, pre-trained models can also be used to evaluate the impact on function. One approach is to compare the probability (or perplexity) the model assigns to the sequence before and after the mutation: if the model judges the mutated sequence to be significantly less plausible, the mutation may disrupt the protein's structure or function. This zero-shot scoring has been used to identify potentially pathogenic mutations, and its rankings are consistent with experimentally measured effects (a minimal scoring sketch is given at the end of section 4.2). In addition, fine-tuning language-model embeddings together with a small amount of labeled mutation-effect data can further improve prediction accuracy. Overall, pre-trained models offer a new route to large-scale, rapid screening of mutation risk, with higher sensitivity and broader applicability than conservation-based scoring methods.

4.2 Gene expression and regulatory element prediction
The prediction of gene regulation and expression levels is another important field that benefits from sequence language models. Early methods often relied on pre-defined sequence motifs to search for regulatory elements such as promoters and enhancers, but this approach misses unknown combinatorial patterns. Pre-trained models offer a data-driven alternative that automatically learns from genomic sequences which patterns are associated with regulatory function. Specifically, a DNA sequence model trained with self-supervision can be used to extract fixed-length vector representations of gene sequences (such as promoter regions or enhancer fragments), which can then be used to determine whether the sequence has regulatory activity (see the second sketch at the end of this section). Experiments have shown that such embeddings outperform traditional k-mer frequency and motif-scanning methods on tasks such as classifying promoters versus non-promoters and enhancers versus non-enhancers, or predicting splice sites. For instance, by pre-training a masked language model on large-scale genomic sequences, DNABERT achieved state-of-the-art performance at the time on multiple DNA sequence classification tasks, such as identifying elements that promote gene expression. More broadly, genomic language models can be used to assess the functional impact of mutations in non-coding regions: the difference between the model's scores for the sequence before and after a mutation can indicate whether the mutation is likely to alter transcription factor binding or chromatin accessibility and thereby affect gene expression. Although these methods still need to be validated against downstream experimental data, they have shown great potential for unsupervised exploration of gene regulatory patterns, providing a powerful tool for understanding the "grammar" of non-coding DNA.
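To make the zero-shot scoring described above concrete, the following sketch scores a point mutation with a masked protein language model through the Hugging Face transformers API. The ESM-2 checkpoint name is an illustrative assumption, not a model prescribed by the studies discussed here: the mutated position is masked, and the score is the log-probability gap between the mutant and wild-type residues, so strongly negative values flag substitutions the model considers implausible.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    MODEL = "facebook/esm2_t33_650M_UR50D"  # assumed checkpoint, for illustration
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

    def mutation_score(sequence: str, pos: int, wt: str, mut: str) -> float:
        """Masked-marginal score: log P(mut) - log P(wt) at the mutated site.
        Negative values suggest the mutant residue is less plausible to the
        model, i.e. a potentially disruptive substitution."""
        assert sequence[pos] == wt, "wild-type residue does not match sequence"
        tokens = tokenizer(sequence, return_tensors="pt")
        # +1 offset: the tokenizer prepends a beginning-of-sequence token
        tokens["input_ids"][0, pos + 1] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(**tokens).logits
        log_probs = torch.log_softmax(logits[0, pos + 1], dim=-1)
        wt_id = tokenizer.convert_tokens_to_ids(wt)
        mut_id = tokenizer.convert_tokens_to_ids(mut)
        return (log_probs[mut_id] - log_probs[wt_id]).item()

    # Example: score substituting valine for alanine at position 10 (0-based)
    # score = mutation_score(protein_sequence, 10, "A", "V")

The same before-and-after scoring idea carries over to non-coding variants, substituting a genomic language model's tokenizer and checkpoint for the protein ones.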
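The embedding-based classification of regulatory elements can be sketched in the same frozen-feature style. The checkpoint name below is a placeholder, not a real model identifier (actual DNABERT-style models may require k-mer tokenization or custom code); the point is the pipeline: mean-pool the encoder's hidden states into a fixed-length vector, then fit an ordinary classifier on labeled promoter and non-promoter sequences.

    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.linear_model import LogisticRegression

    MODEL = "dna-language-model"  # placeholder; substitute a real DNA checkpoint
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    encoder = AutoModel.from_pretrained(MODEL).eval()

    def embed(seq: str) -> torch.Tensor:
        """Mean-pool the last hidden layer into one fixed-length vector."""
        tokens = tokenizer(seq, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = encoder(**tokens).last_hidden_state  # shape (1, L, d)
        return hidden.mean(dim=1).squeeze(0)

    def train_promoter_classifier(sequences, labels):
        """sequences: DNA strings; labels: 1 = promoter, 0 = non-promoter."""
        features = torch.stack([embed(s) for s in sequences]).numpy()
        return LogisticRegression(max_iter=1000).fit(features, labels)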
4.3 Protein structure and interaction modeling
Pre-trained models are not limited to one-dimensional sequence features; they have also demonstrated the ability to infer three-dimensional structure and molecular interactions. By capturing the evolutionary information implicit in sequences, large language models have become a new route to rapid protein structure prediction. Taking Meta's ESM series as an example, ESMFold couples a single-sequence Transformer with a simplified folding module and can predict tertiary structure at close to experimental accuracy directly from the sequence (a minimal usage sketch follows this section). This achievement is remarkable because high-accuracy structure prediction has traditionally relied on the co-evolutionary signal provided by multiple sequence alignments (as in AlphaFold2), whereas ESMFold demonstrates that a language model can extract sufficient structural information even without homologous input. For modeling protein-protein or protein-nucleic acid interactions, each sequence can be fed separately into the pre-trained model to obtain a representation vector, and the degree of match between the two vectors (via a classifier or a similarity measure) can then be evaluated to judge whether the molecules are likely to bind (see the second sketch following this section). Previous studies have used this strategy to predict protein-protein interaction partners with good results. Overall, the sequence representations provided by pre-trained models have opened new directions for the study of molecular recognition and binding; combining them with physics-based simulation methods is expected to further improve the predicted accuracy of complex interaction interfaces (Esmaeeli et al., 2023).
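As a usage illustration of single-sequence structure prediction, the sketch below follows the interface documented in the public fair-esm package; it assumes the package's esmfold extras and their dependencies are installed, and the example sequence is arbitrary.

    import torch
    import esm

    model = esm.pretrained.esmfold_v1()
    model = model.eval()  # move to GPU with model.cuda() when available

    sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"

    with torch.no_grad():
        pdb_string = model.infer_pdb(sequence)  # predicted structure as PDB text

    with open("prediction.pdb", "w") as handle:
        handle.write(pdb_string)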
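The pairwise-embedding strategy for interaction prediction can be sketched as follows. The small ESM-2 checkpoint and the cosine-similarity scorer are illustrative assumptions; in published work the two embeddings more commonly feed a supervised classifier trained on known interacting pairs.

    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModel

    MODEL = "facebook/esm2_t12_35M_UR50D"  # small ESM-2 checkpoint, assumed
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    encoder = AutoModel.from_pretrained(MODEL).eval()

    def embed(seq: str) -> torch.Tensor:
        """One fixed-length vector per protein via mean pooling."""
        tokens = tokenizer(seq, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**tokens).last_hidden_state
        return hidden.mean(dim=1).squeeze(0)

    def interaction_score(seq_a: str, seq_b: str) -> float:
        """Cosine similarity of the two embeddings as a crude binding score."""
        return F.cosine_similarity(embed(seq_a), embed(seq_b), dim=0).item()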

RkJQdWJsaXNoZXIy MjQ4ODYzNA==