Computational Molecular Biology 2025, Vol.15, No.4, 160-170 http://bioscipublisher.com/index.php/cmb

symbols or compound names, which can easily lead to word-splitting errors. Training the tokenizer directly on biomedical corpora, as PubMedBERT does, can significantly reduce this problem (Gu et al., 2021). Likewise, continued pre-training on biomedical texts brings the model's representations closer to domain-specific expression. Fine-tuning, in turn, lets the model "practice fine skills" for specific tasks. For instance, by adjusting parameters on only a few thousand labeled samples, an F1 score of nearly 80% can be achieved on chemical-protein relation extraction; traditional models often need several times more data to reach the same level. Lightweight fine-tuning strategies are also available, such as freezing the lower Transformer layers and training only an upper classifier, or using methods like LoRA to cut compute costs. It can be said that domain adaptation makes the model "understand the trade", while fine-tuning enables it to "do the job". Combining the two is the key to truly unleashing the potential of LLMs: the model can not only extract known relations but also generalize flexibly to new types of text.

4.3 Application of prompt engineering and instruction fine-tuning in knowledge extraction
Looking further ahead, the plasticity of large language models is also reflected in "prompts": tell the model what to do, and it will do it. This is the appeal of prompt engineering. For example, if we directly ask ChatGPT to "extract the relationship between genes and diseases from the following text", it can output a structured result without any fine-tuning (Kung et al., 2023). However, how the prompt is written matters greatly; even a slight difference in wording can lead to vastly different results.
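The sensitivity to prompt wording can be made concrete with a small sketch. The two templates and the `build_prompt` helper below are illustrative assumptions, not part of any specific system discussed here; they simply show how alternative phrasings of the same extraction request would be encoded and compared.

```python
# Minimal sketch of prompt templates for gene-disease relation extraction.
# Template wording and the build_prompt helper are hypothetical examples.

def build_prompt(abstract: str, template: str) -> str:
    """Fill a prompt template with a literature abstract."""
    return template.format(text=abstract)

# Two templates that differ in phrasing; in practice such small changes
# can noticeably shift an LLM's output quality.
TEMPLATE_DIRECT = (
    "Extract the relationship between genes and diseases from the "
    "following text. Answer as (GENE, RELATION, DISEASE) triples.\n\n"
    "Text: {text}"
)
TEMPLATE_STEPWISE = (
    "Read the text below. First list all gene mentions, then all disease "
    "mentions, and finally state which gene-disease pairs are related.\n\n"
    "Text: {text}"
)

abstract = "Mutations in BRCA1 are associated with hereditary breast cancer."
prompt = build_prompt(abstract, TEMPLATE_DIRECT)
```

In an actual experiment, each template variant would be run over the same abstracts and the extraction quality compared, which is how prompt sensitivity is usually quantified.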
Providing examples and adjusting the format often help the model grasp the task intent more accurately. For instance, in relation extraction, a question pattern can be used as a prompt: "Entity 1 [GENE], Entity 2 [DISEASE]: Does Entity 1 cause Entity 2?" The model can answer with "Yes/No" or a specific relation type, a format that is often more natural than traditional classification. Beyond manually designed prompts, instruction fine-tuning further improves the model's ability to follow instructions. Models such as InstructGPT are trained on large-scale instruction data to better understand user intent. For knowledge extraction, this means that given an input like "list the diseases mentioned in the text and their related genes", the model can directly generate results in the required form. However, LLMs are not immune to mistakes: sometimes a model "confidently makes things up", a phenomenon known as hallucination. To address this, researchers have introduced verification mechanisms, such as cross-checking the relations extracted by the model against a database, or having the model cite the supporting passage from the original text. Although these practices complicate the pipeline somewhat, they make the extraction results more reliable and give humans more confidence in trusting the model.

5 Case Study: Gene-Disease Relationship Extraction Based on BioGPT
5.1 Case background and data sources (PubMed abstracts and gene databases)
In biomedical research, relationships between genes and diseases are nearly everywhere. The pathogenesis of many diseases can be traced back to mutations or abnormal expression of certain genes, making such relationships some of the most central knowledge in genetics and precision medicine. To demonstrate more concretely the role of large language models in knowledge extraction, we selected a typical task, gene-disease relationship extraction, as an example, using BioGPT.
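The database cross-checking idea mentioned above can be sketched as a simple post-hoc filter: extracted relations are kept only if the (gene, disease) pair also appears in a reference knowledge base, and the rest are flagged for review. All the triples and the reference set below are invented for illustration.

```python
# Sketch of post-hoc verification: model-extracted triples are split into
# those supported by a reference database and those that are not.
# The reference pairs and extracted triples are made-up examples.

def verify_relations(extracted, reference_pairs):
    """Partition extracted (gene, relation, disease) triples by database support."""
    supported, unsupported = [], []
    for gene, relation, disease in extracted:
        if (gene, disease) in reference_pairs:
            supported.append((gene, relation, disease))
        else:
            # Unsupported pairs are candidate hallucinations to review by hand.
            unsupported.append((gene, relation, disease))
    return supported, unsupported

reference = {("BRCA1", "breast cancer"), ("CFTR", "cystic fibrosis")}
extracted = [
    ("BRCA1", "associated_with", "breast cancer"),
    ("TP53", "causes", "hypertension"),  # not in the reference set
]
supported, unsupported = verify_relations(extracted, reference)
```

A real pipeline would also attach the sentence from which each triple was extracted, so that unsupported relations can be checked against their textual evidence rather than discarded outright.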
This model is trained on large-scale biomedical literature and has strong domain-specific semantic understanding and generation capabilities. We want it to automatically identify triples such as "gene A is associated with disease B" from unlabeled literature abstracts, and then compare these results with the associations recorded in authoritative databases to see whether it truly "understands biology". Data selection also requires careful consideration. We use abstracts of biomedical literature from PubMed, which are rich in information such as gene function and disease mechanisms and are well suited as input for relation extraction. As the reference standard, we chose the DisGeNET database, which integrates human gene-disease association data from multiple sources, contains tens of thousands of entries, and is widely recognized as an authoritative knowledge base (Pinero et al., 2020). From it, we selected a batch of high-confidence associations and mapped them to the corresponding PubMed
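Comparing extracted associations against a DisGeNET-style reference typically reduces to set overlap plus precision and recall. The sketch below uses invented gene-disease pairs; a real evaluation would use the curated DisGeNET entries described above.

```python
# Sketch of evaluating extracted gene-disease pairs against a reference
# knowledge base. All pairs here are invented for illustration only.

def precision_recall(predicted: set, reference: set):
    """Precision and recall of predicted pairs against reference pairs."""
    tp = len(predicted & reference)  # pairs confirmed by the database
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall

reference = {("BRCA1", "breast cancer"),
             ("APOE", "Alzheimer disease"),
             ("CFTR", "cystic fibrosis")}
predicted = {("BRCA1", "breast cancer"),
             ("APOE", "Alzheimer disease"),
             ("TP53", "lung cancer")}  # one pair absent from the reference

p, r = precision_recall(predicted, reference)
```

Note that recall against a curated database is only a lower bound on real performance, since a model may extract genuine associations that the database has not yet recorded.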