
Computational Molecular Biology 2025, Vol.15, No.4, 160-170 http://bioscipublisher.com/index.php/cmb

pharmacological texts focus on details such as dosage and drug names. Specialized models such as CancerBERT have therefore emerged for oncology tasks, and their performance does exceed that of the general-purpose BioBERT. More models targeting specific subdomains are likely to follow; the trend toward specialization has become a foregone conclusion.

3.3 LLM-driven knowledge graph construction and biological knowledge base update
If extracting from a single piece of literature is the "point", then constructing a knowledge graph is the "surface". LLM capabilities are now automating the generation and updating of large-scale knowledge graphs. A biomedical knowledge graph is essentially a huge network whose nodes are entities such as genes, diseases, and drugs, and whose edges represent the relationships among them. In the past, the literature had to be organized manually; now most of the work can be accomplished by batch extraction with models. For instance, the gene-disease relationship network constructed by Percha and Altman (2018) integrates tens of thousands of association records. The PubMed Knowledge Graph of Xu et al. (2020) is larger still: it extracts entities and relationships from nearly 30 million abstracts, yielding a graph with millions of nodes and hundreds of millions of edges. These can be regarded as early achievements of domain models. Researchers are now further exploring the integration of LLMs and knowledge graphs, letting the "graph" assist the "text" and the "text" complement the "graph". RAG (Retrieval-Augmented Generation) is representative of this idea: it retrieves from a knowledge base while generating answers to improve the accuracy of the output. In biomedical scenarios, this mechanism can help models update knowledge in near real time.
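The node-and-edge structure described above can be sketched as a plain triple store that deduplicates and updates as new relations are extracted. This is a minimal illustration, not the design of the PubMed Knowledge Graph or any cited system; the `KnowledgeGraph` class, entity names, and relation labels are hypothetical examples.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy biomedical knowledge graph built from (subject, relation, object) triples."""

    def __init__(self):
        # adjacency map: subject entity -> set of (relation, object) pairs
        self.edges = defaultdict(set)

    def add_triple(self, subject, relation, obj):
        """Insert an extracted triple; using a set merges duplicate extractions."""
        self.edges[subject].add((relation, obj))

    def neighbors(self, subject):
        """Return all outgoing (relation, object) pairs for an entity, sorted."""
        return sorted(self.edges[subject])

# Triples as a model-based extractor might emit them from abstracts
# (illustrative content only, not real extraction output).
extracted = [
    ("EGFR", "associated_with", "lung cancer"),
    ("BRCA1", "associated_with", "breast cancer"),
    ("BRCA1", "associated_with", "breast cancer"),  # duplicate report, merged away
]

kg = KnowledgeGraph()
for s, r, o in extracted:
    kg.add_triple(s, r, o)

# A newly reported mutation-disease link completes the graph in place,
# mirroring the automatic update step described in the text.
kg.add_triple("EGFR_T790M", "confers_resistance_to", "gefitinib")

print(kg.neighbors("BRCA1"))  # [('associated_with', 'breast cancer')]
```

In a real pipeline the triples would come from an LLM extraction step and entities would be normalized to database identifiers before insertion; the dictionary-of-sets layout simply makes the merge-on-insert behavior explicit.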
For instance, when new literature reports that a certain mutation is related to a disease, the system can automatically extract the relation and complete the knowledge graph. Beyond expanding knowledge, LLMs can also assist with semantic normalization, merging and simplifying repetitive or conflicting relationships, and even generating explanatory text that helps verify the relationships in the knowledge graph, enhancing the readability and credibility of the knowledge base.

4 Knowledge Extraction Mechanisms and Optimization Strategies of Large Language Models
4.1 Advantages of context understanding and semantic representation
A major reason large language models perform well in knowledge extraction is that they "see further". Traditional models are often limited by fixed context windows and can only understand local information, while LLMs, relying on the Transformer's self-attention mechanism, can capture contextual relationships at a much larger scale. This lets them break free of the local binding of words when analyzing sentences. For instance, in the statement "Inhibiting the expression of the EGFR gene can reduce the survival rate of lung cancer cells", the model is not misled by surface word order but matches "reducing the survival rate" with "inhibiting the expression of EGFR" to draw the correct causal relationship. Compared with earlier models that relied only on matching neighboring words, this global modeling ability is clearly more reliable. More interestingly, pre-trained LLMs have in effect "internalized" a considerable amount of biological semantics. In models like BioBERT, the activation of certain neurons can even correspond to protein-protein interaction patterns. This lets them make reasonable inferences through the semantic space when encountering unfamiliar terms in an extraction task. Sentences in biomedical literature are often lengthy and hierarchically complex, with attributive clauses and parentheticals interwoven.
Traditional methods often get "lost" in this kind of sentence structure, while LLMs can bypass interfering words with multi-head attention and grasp the core relationship directly. For instance, in the sentence "We observed that the incidence of breast tumors significantly increased in mouse models carrying BRCA1 mutations", the model automatically connects "BRCA1 mutations" with "tumor incidence", finding the correct causal clue without word-by-word analysis. This ability to capture complex semantics is the true strength of LLMs.

4.2 Domain adaptation and fine-tuning techniques
However, no matter how powerful a general model is, it cannot become immediately proficient when placed in a specialized field. To make it "speak professionally", domain adaptation and fine-tuning are necessary. Domain adaptation means continuing to feed the model biomedical corpora so that it becomes familiar with professional vocabulary and language habits. Many general LLMs' vocabularies do not contain complex genetic
