
Computational Molecular Biology 2025, Vol.15, No.4, 160-170 http://bioscipublisher.com/index.php/cmb

handle long sentences and complex dependencies commonly seen in the biomedical field (Ouyang et al., 2022). The self-attention mechanism of the Transformer enables the model to "focus" directly on the relevant parts of a sentence when decoding an entity or relation, solving the long-standing problem that earlier models struggled to capture long-distance semantics. It is precisely for this reason that LLMs are regarded as a powerful tool for breaking through the current bottleneck in biological knowledge extraction. This article focuses on their application in this field, discussing their advantages, limitations, and future development directions.

2 The Research Background of Biological Knowledge Extraction
2.1 Definition and key tasks of knowledge extraction
In essence, biological knowledge extraction is about automatically identifying, from disorganized sources such as literature and databases, the "knowledge points" that machines can understand and connect. This knowledge is often hidden in unstructured text in the form of gene names, protein names, and disease or compound names. Sorting out this information usually involves several steps: first, identify the key terms (named entity recognition, NER); then determine the relationships between them (relation extraction); and finally capture more complex biological events (event extraction). This sounds straightforward, but it is not easy to do. Take NER as an example: the same protein may have several names, some containing symbols or numbers, so a rule system can easily mislabel or miss them. Relation extraction goes a step further, for example identifying the causal relationship between a gene and a disease from a sentence such as "BRCA1 gene mutation increases the risk of breast cancer".
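To make the relation extraction task concrete, the BRCA1 example could, in a purely rule-based setting, be handled by a single hand-written pattern. The sketch below is illustrative only (the pattern and function name are hypothetical, not from any real system); it also shows why such rules are brittle, since even a simple paraphrase of the same fact fails to match.

```python
import re

# Hypothetical single pattern for sentences of the exact form
# "<GENE> gene mutation increases the risk of <DISEASE>".
PATTERN = re.compile(
    r"(?P<gene>[A-Z][A-Z0-9]*) gene mutation increases the risk of (?P<disease>[\w ]+)"
)

def extract_gene_disease(sentence: str):
    """Return a (gene, disease, relation) triple, or None if the pattern misses."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return (m.group("gene"), m.group("disease").strip(), "increases_risk_of")

# The exact phrasing from the text matches...
print(extract_gene_disease("BRCA1 gene mutation increases the risk of breast cancer"))
# ...but a paraphrase of the same fact is silently missed.
print(extract_gene_disease("Mutations in BRCA1 raise the risk of breast cancer"))
```

The second call returning nothing is precisely the fragility discussed in the next subsection: every new phrasing or synonym requires another hand-written rule.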
Such ability is particularly important for building biomedical knowledge graphs, because it connects scattered entities into a network. Event extraction is the most complex of the three: it must not only establish "who is related to whom" but also understand "what happened". For example, in the sentence "TP53 mutation leads to cell cycle arrest", "mutation" is the trigger word, while "TP53" and "cell cycle arrest" are the event arguments. Such tasks require the model to understand the context more deeply, which is where traditional methods fall short.

2.2 Traditional biological text mining methods and their limitations
Before large language models became popular, biological text mining relied mostly on rules and machine learning. The early approach was more like "writing programs to teach machines to recognize words": compile a dictionary, write matching rules, and manually design a whole set of templates for recognizing terms (Beltagy et al., 2019). This method works acceptably in fields where terminology is relatively stable, but it breaks down when it encounters new concepts and new names (Wang et al., 2018). The synonym problem in biomedicine is particularly troublesome: a drug may have multiple alternative names, and string matching struggles to capture all of them. Later, with the rise of statistical machine learning, support vector machines, hidden Markov models, and similar methods came into use, and the degree of automation improved somewhat. However, features such as context windows, part-of-speech tags, and syntactic paths still had to be designed manually, and once the domain changed, model performance often dropped. More troublesome still, these models could not effectively exploit large unlabeled corpora and were not flexible enough to capture complex sentences and long-distance relationships.
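The trigger-plus-arguments structure of event extraction can be represented with a small data structure. The sketch below manually encodes the TP53 example from the text; the class name and the role labels ("Theme", "Result") are illustrative conventions of our own, not a fixed standard.

```python
from dataclasses import dataclass, field

@dataclass
class BioEvent:
    trigger: str                # word signalling the event, e.g. "mutation"
    event_type: str             # event class, e.g. "Mutation"
    arguments: dict = field(default_factory=dict)  # role label -> entity mention

# Manually encoding "TP53 mutation leads to cell cycle arrest":
event = BioEvent(
    trigger="mutation",
    event_type="Mutation",
    arguments={"Theme": "TP53", "Result": "cell cycle arrest"},
)
print(event.arguments["Theme"])   # TP53
print(event.arguments["Result"])  # cell cycle arrest
```

An event extraction system must fill every field of such a structure from raw text, which is why it demands far deeper contextual understanding than entity or relation extraction alone.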
2.3 Technological evolution from statistical learning to deep learning
The technological turning point came with the rise of deep learning. Around 2010, statistical models were still dominant, but neural networks soon took over. RNNs and CNNs were successively introduced into relation extraction and entity recognition, and models began to learn features automatically instead of relying entirely on manual engineering. CNNs performed well in drug-protein relationship extraction, and bidirectional LSTM architectures combined with CRF layers outperformed traditional CRF models in gene and chemical entity recognition. The year 2018 was a crucial one. ULMFiT introduced the pre-training and fine-tuning paradigm, showing that it is more effective to first learn language from large corpora and then adapt to specific tasks (Howard and Ruder, 2018). Immediately afterwards, the emergence of BERT completely changed the rules of the game: it used the Transformer architecture to model bidirectional context deeply, enabling the model to "understand" the context rather than merely memorize keywords.
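The self-attention mechanism that BERT builds on can be sketched in a few lines: each token's output is a weighted mixture of all tokens' value vectors, with weights derived from scaled dot products between queries and keys. The following is a toy pure-Python version (hand-picked 2-d embeddings, no learned projection matrices), intended only to show the computation, not a real implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over toy token vectors (lists of floats)."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this token's query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted mixture of all value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Three toy 2-d token embeddings; each output row mixes all three tokens,
# which is how attention links distant positions in a single step.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(tokens, tokens, tokens)
print(len(mixed), len(mixed[0]))  # 3 2
```

Because every token attends to every other token directly, the path between two distant words is always one step long, which is exactly why this mechanism handles the long-distance dependencies that RNN-based extractors struggled with.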
