Computational Molecular Biology 2025, Vol.15, No.4, 160-170
http://bioscipublisher.com/index.php/cmb

Review Article  Open Access

Large Language Models for Biological Knowledge Extraction

Hongpeng Wang, Minghua Li
Biotechnology Research Center, Cuixi Academy of Biotechnology, Zhuji, 311800, China
Corresponding author: minghua.li@cuixi.org

Computational Molecular Biology, 2025, Vol.15, No.4, doi: 10.5376/cmb.2025.15.0016
Received: 01 May, 2025  Accepted: 10 Jun., 2025  Published: 01 Jul., 2025
Copyright © 2025 Wang and Li. This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Preferred citation for this article: Wang H.P., and Li M.H., 2025, Large language models for biological knowledge extraction, Computational Molecular Biology, 15(4): 160-170 (doi: 10.5376/cmb.2025.15.0016)

Abstract: The surge in biomedical literature has created severe information overload for researchers, making automated knowledge extraction tools a necessity. Large Language Models (LLMs), which have emerged in recent years, show superior performance in text understanding and generation, offering a new approach to biological knowledge extraction. This study reviews the applications of LLMs in tasks such as named entity recognition, relation extraction, and event extraction, and surveys their latest advances in subfields such as genomics, proteomics, and pharmacology. The advantages of LLMs over traditional methods in contextual understanding and semantic representation are analyzed, along with the gains that domain adaptation, fine-tuning, and prompt engineering bring to model performance.
A case study of extracting gene-disease associations with the BioGPT model demonstrates the application workflow and effectiveness of LLMs, while challenges related to data quality, model hallucination, and privacy protection are also examined. Future directions for integrating LLMs with knowledge graphs, multimodal data, and knowledge verification are discussed, together with the related ethical considerations. These advances are expected to provide new paradigms for biomedical research.

Keywords: Large language model; Biomedical text mining; Knowledge extraction; Knowledge graph; Prompt engineering

1 Introduction
The biomedical literature is growing at an almost overwhelming pace. In 2016 alone, more than 860,000 new papers were added to PubMed, roughly one every minute. Researchers are submerged in this flood of information and find it increasingly difficult to locate and absorb key findings in time, the so-called "information overload" problem (Brown et al., 2020). This has renewed interest in information extraction technology, which aims to automatically identify useful biological entities, relations, and events in vast amounts of unstructured text and turn disordered words into structured knowledge that machines can process. The problem is that, at this scale of data, traditional text mining methods struggle. Early approaches relied mainly on expert-written rules, ontology lookups, or hand-engineered features feeding machine learning models; but in biomedical literature, with its complex terminology and long sentences, these methods often performed poorly. It was not until the emergence of deep learning, and especially the rise of large language models, that hope was reignited in this field (Topol, 2019).
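To make the contrast concrete, the sketch below illustrates the kind of dictionary- and rule-based matching that the early approaches described above relied on; the term lists, labels, and example sentence are illustrative assumptions, not drawn from any real ontology or corpus.

```python
import re

# Toy dictionary-based entity tagger in the style of early rule-based
# biomedical text mining: expert-curated term lists matched against text.
# Term lists here are illustrative, not a real ontology.
GENE_TERMS = {"BRCA1", "TP53", "EGFR"}
DISEASE_TERMS = {"breast cancer", "lung cancer"}

def extract_entities(text):
    """Return sorted (term, label) pairs for dictionary hits in `text`."""
    hits = []
    for term in GENE_TERMS:
        # Gene symbols are case-sensitive, so match them exactly.
        if re.search(rf"\b{re.escape(term)}\b", text):
            hits.append((term, "Gene"))
    for term in DISEASE_TERMS:
        if re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE):
            hits.append((term, "Disease"))
    return sorted(hits)

sentence = "Mutations in BRCA1 are strongly associated with breast cancer."
print(extract_entities(sentence))
```

A matcher like this fails on synonyms, abbreviations, and novel phrasing it has never been told about; closing that gap through learned contextual understanding is precisely what motivates the LLM-based extraction discussed below.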
A large language model is, in essence, a giant neural network trained on massive corpora that "memorizes" a large amount of linguistic knowledge in its enormous parameter count (Wiggins and Tejani, 2021). Since Vaswani et al. (2017) proposed the Transformer architecture in "Attention Is All You Need", the capabilities of language models have grown almost in step with their scale. BERT was a milestone, achieving deep contextual understanding through its bidirectional Transformer encoder. Models such as GPT-3 then broke through task boundaries altogether, completing extraction tasks from only a handful of in-context examples. The knowledge and linguistic regularities learned from general corpora make LLMs "foundation models" that can be adapted to diverse biological information extraction scenarios through fine-tuning. Compared with earlier models that relied on hand-written rules or feature engineering, the strength of large language models lies in their ability to understand context more naturally, generate coherent text, and
RkJQdWJsaXNoZXIy MjQ4ODYzNA==