
Computational Molecular Biology 2025, Vol.15, No.3, 141-150 http://bioscipublisher.com/index.php/cmb

5 Case Study: Using ESM-2 for Predicting Mutation Impact on Protein Stability

5.1 Overview of ESM-2 and its training on large-scale protein sequences
ESM-2 is a new-generation large protein language model developed by Meta. Compared with earlier versions, it has a deeper network and more parameters (the largest variant has billions of parameters). ESM-2 was trained in an unsupervised manner on hundreds of millions of diverse protein sequences and learned rich evolutionary patterns and sequence features. Unlike traditional sequence-analysis methods, ESM-2 can generate high-dimensional representation vectors for any protein sequence without relying on multiple sequence alignments, and it performs excellently in downstream tasks such as structure prediction and function prediction. For instance, the ESMFold model uses ESM-2 representations to predict the three-dimensional structure of proteins directly from sequence, demonstrating the pre-trained model's ability to capture the implicit patterns in sequences. This powerful sequence representation lays the foundation for studying mutational effects from single sequences (Figure 2) (Pak et al., 2023).

5.2 Application to missense mutations and ΔΔG prediction
Changes in protein stability are commonly measured by the ΔΔG value (the difference in free energy caused by a mutation, with a positive value indicating decreased stability). The effect of a single-point amino acid substitution on protein stability can be evaluated with the ESM-2 model. One effective approach is to feed the wild-type and mutant sequences separately into ESM-2 to obtain embedding vectors, and then train a regression model on those embeddings to output ΔΔG. Because ESM-2 embeddings condense multi-level sequence features, such a model can achieve relatively high accuracy even when trained on a small mutation database.
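The embedding-difference regression described above can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the `embed()` helper is a placeholder for a real ESM-2 forward pass (e.g. via the fair-esm package) that here returns deterministic random vectors so the example stays self-contained, and the training triples with their ΔΔG values are invented.

```python
# Sketch: ΔΔG regression on ESM-2 embedding differences (toy stand-ins).
import numpy as np

EMBED_DIM = 1280  # per-residue embedding width of the ESM-2 650M model


def embed(sequence: str) -> np.ndarray:
    """Placeholder for a mean-pooled ESM-2 sequence embedding."""
    # In practice: run ESM-2 and average the per-residue representations.
    seed = sum(ord(c) * (i + 1) for i, c in enumerate(sequence)) % (2 ** 32)
    return np.random.default_rng(seed).standard_normal(EMBED_DIM)


def feature(wild_type: str, mutant: str) -> np.ndarray:
    # The embedding difference captures the perturbation the mutation causes.
    return embed(mutant) - embed(wild_type)


def ridge_fit(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    # Closed-form ridge regression: w = (X^T X + alpha * I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)


# Toy training triples: (wild-type, mutant, measured ΔΔG in kcal/mol) -- invented.
train = [
    ("MKTAYIAK", "MKTAYIAR", 0.8),
    ("MKTAYIAK", "MKTGYIAK", 1.5),
    ("MKTAYIAK", "MKTAYIAE", -0.2),
]
X = np.stack([feature(wt, mut) for wt, mut, _ in train])
y = np.array([ddg for _, _, ddg in train])

w = ridge_fit(X, y)
pred_ddg = float(feature("MKTAYIAK", "MKTAYIAR") @ w)
print(f"predicted ΔΔG: {pred_ddg:.2f} kcal/mol")
```

In a real pipeline, `embed()` would mean-pool ESM-2's per-residue representations and the labels would come from curated stability measurements rather than the invented values above.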
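A related zero-shot variant requires no regression training at all: score a mutation by the change in the model's log-probability for the substituted residue. The sketch below uses a hand-made probability table in place of real ESM-2 output logits; the distribution at the single toy position is invented for illustration.

```python
# Sketch: scoring a point mutation by log-probability change (toy numbers).
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutation_score(position_log_probs, pos, wt_aa, mut_aa):
    """Return log P(mutant residue) - log P(wild-type residue) at `pos`.

    More negative scores mean the model disfavours the substitution,
    which correlates with destabilisation (positive ΔΔG).
    """
    lp = position_log_probs[pos]
    return lp[mut_aa] - lp[wt_aa]


# Toy distribution at one position: the model strongly prefers leucine (L).
probs = {aa: 0.01 for aa in AMINO_ACIDS}
probs["L"] = 1.0 - 0.01 * (len(AMINO_ACIDS) - 1)  # 0.81, so probs sum to 1
log_probs = [{aa: math.log(p) for aa, p in probs.items()}]

score = mutation_score(log_probs, 0, "L", "P")  # L -> P substitution
print(round(score, 2))  # → -4.39, strongly disfavoured
```

With a real model, `position_log_probs` would be the per-position amino-acid log-probabilities from an ESM-2 forward pass over the wild-type sequence.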
There are also studies that directly use the difference in ESM-2 probability scores between the pre- and post-mutation sequences as an indicator of stability change (Zhang et al., 2023). Results show that ESM-2-based models perform comparably to specialized supervised learning methods on public mutation-stability datasets, indicating that the knowledge in the pre-trained model helps capture the impact of mutations on structural stability accurately.

5.3 Comparison with structure-based and supervised learning baselines
Compared with physics-based methods that require a known three-dimensional structure to calculate ΔΔG, ESM-2-based prediction does not rely on an experimental structure and is computationally more efficient; it can therefore be extended to proteins of unknown structure and used to screen large numbers of mutations quickly. Although molecular-mechanics energy calculations may be more accurate when a high-resolution structure is available, in practice ESM-2 predictions often achieve comparable accuracy. Compared with traditional supervised learning models, the advantage of ESM-2 lies in the broad knowledge provided by its pre-training: earlier algorithms required manually designed features and training on limited data, giving limited generalization ability (Chu et al., 2024), whereas the general representations generated by ESM-2 yield reliable results even when training on small datasets, significantly improving model robustness. ESM-2-driven mutation-stability prediction thus combines accuracy with broad applicability, providing an efficient and reliable tool for studying the effects of protein mutations.

6 Challenges and Limitations
6.1 Data sparsity and imbalance in biological corpora
Although biological sequence databases are large in scale, the information within them is unevenly distributed.
On the one hand, data volumes differ greatly across species and protein families, and models tend to learn the dominant patterns in data-rich categories while neglecting rare ones (Ruan et al., 2025). On the other hand, many sequences still lack functional or structural annotations, and the shortage of high-quality labeled data makes it difficult to train models fully on certain specific tasks. Such data sparsity and imbalance may limit the model's ability to characterize rare functions or novel sequence patterns.

6.2 Model interpretability and biological plausibility
The decision-making process inside pre-trained models is often a "black box" that is difficult to understand. At present, it is very difficult to determine clearly what biological characteristics a given neuron or attention
