When the era of large language models truly arrived, the speed of technological evolution could almost be described as a "leap": parameter counts grew from tens of millions to hundreds of billions, and model capacity skyrocketed. Researchers also found that continuing to pre-train on biomedical corpora enables a model to better understand the language of a professional domain, a strategy known as domain-adaptive pre-training (Gururangan et al., 2020). PubMedBERT is a typical example: trained from scratch entirely on PubMed text, it outperformed BioBERT and general BERT on multiple tasks, indicating that the value of domain corpora is far greater than previously imagined.

3 The Current Application Status of the Three Major Language Models in the Field of Biology
3.1 Mainstream large language models and their performance in biological texts
In the current field of biomedical information extraction, the stage is almost dominated by large language models. Some researchers use general-purpose models directly, such as the GPT series; others have developed versions specialized for biomedicine, such as BioBERT, PubMedBERT, and BioGPT. As soon as these models appeared, traditional methods immediately seemed a bit "clumsy".

The GPT series (e.g., GPT-2 and GPT-3) has been trained thoroughly on general text, and its language understanding and generation capabilities are strong. Some researchers have applied these models to biomedical question answering or literature summarization and obtained quite reasonable results even without task-specific training. Problems arise as well, however: their grasp of professional terminology is often inadequate, and their explanations are not precise enough. It is precisely for this reason that the academic community has begun to shift toward "domain-specialized" large models. BioBERT, for instance, takes BERT as its base and is further trained on a large amount of biological literature from PubMed and PMC, enabling the model to truly "understand" professional language. Experiments show that BioBERT scores roughly 1 to 3 percentage points higher than ordinary BERT on tasks such as biomedical NER and relation extraction (Lee et al., 2020). PubMedBERT goes a step further by training directly from scratch on the domain corpus, confirming the importance of a dedicated corpus. Microsoft's BioGPT, by contrast, follows a generative approach: built on the GPT architecture and trained on biomedical text, it can not only generate knowledge descriptions but also automatically continue passages describing molecular interactions, and it outperformed previous discriminative models on most of six biomedical NLP tasks. SciBERT is even more "cross-disciplinary": trained on a large collection of scientific publications spanning biomedicine and other fields, it demonstrates that scientific-domain corpora and vocabularies do improve performance on academic text, especially in long, terminology-dense tasks.
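To make the pattern concrete, the following is a minimal sketch of how a domain-adapted encoder of this kind is typically prepared for biomedical NER fine-tuning with the Hugging Face transformers library. The checkpoint name and the BIO label set are illustrative assumptions for this sketch, not details taken from the studies cited above.

```python
# Minimal sketch: loading a domain-adapted encoder (assumed here to be the
# public PubMedBERT checkpoint) with a token-classification head for
# biomedical NER fine-tuning. The label set is hypothetical.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed checkpoint
LABELS = ["O", "B-Gene", "I-Gene", "B-Disease", "I-Disease"]  # hypothetical BIO tag set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# The classification head is randomly initialized here; it only becomes
# useful after fine-tuning on a labeled NER corpus.
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

# Encode one sentence. During fine-tuning, encoding.word_ids() would be used
# to align word-level BIO labels with the subword tokens produced here.
text = "Mutations in BRCA1 are associated with hereditary breast cancer."
encoding = tokenizer(text, return_tensors="pt")
outputs = model(**encoding)
print(outputs.logits.shape)  # (1, sequence_length, num_labels)
```

The design choice mirrored here is the one the section describes: the domain corpus does its work during pre-training, so the downstream task only needs a small labeled dataset to train the classification head.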
3.2 Applications of the model in different fields of biology
If we turn our attention to the different branches of biology, LLMs are almost everywhere.

In genomics, they are used to extract gene-disease and gene-phenotype associations from vast numbers of papers, helping researchers screen candidate genes. Huang et al. (2024) likewise note that on datasets such as ChemDisGene, the performance of Transformer-based models has approached the optimal level. In proteomics, models are used differently: they identify protein-protein interactions more accurately, and models like SciBERT have surpassed the old rule-based systems on BioCreative's PPI data. More complex tasks, such as identifying protein modifications or binding events, have also begun to rely on the long-range dependency modeling capabilities of LLMs to handle cross-sentence semantic associations.

In pharmacology and clinical medicine, the applications are simply too numerous to count. Adverse drug reactions, drug interactions, and the like previously required manual literature screening; now a model can automatically extract them and update the drug knowledge base. BioGPT achieved an F1 score of 83% on drug-drug interaction extraction, almost comparable to expert-labeled results. General models have not been idle either: ChatGPT achieved a near-passing score on the USMLE, and GPT-4's accuracy is higher still. Google's Med-PaLM has also been used for medical Q&A, with performance reportedly not far from that of clinical experts (Singhal et al., 2023). These results suggest that LLMs are becoming increasingly "doctor-like" in terms of medical knowledge expression and reasoning (Shah et al., 2023).
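As a concrete illustration of the generative route mentioned above, the sketch below loads the public microsoft/biogpt checkpoint and continues a drug-interaction prompt. The prompt is a hypothetical example; the drug-drug interaction results cited above come from fine-tuning on labeled DDI corpora, not from raw prompting like this.

```python
# Minimal sketch: generative continuation with BioGPT, assuming the public
# microsoft/biogpt checkpoint on the Hugging Face hub. The prompt is a
# hypothetical example, not a pipeline from the cited studies.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
model.eval()

prompt = "The interaction between warfarin and aspirin"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(
        **inputs,
        max_new_tokens=40,
        num_beams=5,          # beam search keeps the continuation on-topic
        early_stopping=True,
    )
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```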
However, different fields have different requirements for models. Genomics places more emphasis on the description of biological mechanisms, while