CMB_2025v15n4

Computational Molecular Biology 2025, Vol.15, No.4, 160-170 http://bioscipublisher.com/index.php/cmb

We also compared it with the traditional scheme of BioBERT plus a classifier, whose F1 score is approximately 0.75, about 4 percentage points lower than that of our generative model. The end-to-end generation approach of BioGPT appears to grasp the text semantics more comprehensively, and it performs particularly well on implicit relations.

The model is not perfect, of course. A careful error analysis showed that most problems lay in recall: the model missed some relations that should have been extracted. These missed detections fall roughly into two categories. In the first, the relation is stated so implicitly that it can only be recognized with background knowledge. In the second, the text uses extremely obscure terms or rare gene names that the model is unfamiliar with. For instance, one abstract mentioned a rare disease and an uncommon gene, and the model simply failed to extract them; even human readers would have to consult the literature to confirm such sentences. From this perspective, the model's performance can still be considered stable.

On interpretability, we conducted a small experiment, asking the model to indicate the "reason", i.e. the supporting sentence, alongside each relation it outputs. The approach is to adjust the prompt so that the model lists the gene-disease pairs while quoting the original sentence from the abstract. In most cases BioGPT identifies the key sentence and outputs it together with the extraction result. For instance, when it extracts "BRCA1-breast cancer", it attaches "BRCA1 gene mutations are significantly enriched in patients, suggesting a correlation with breast cancer susceptibility" as evidence, which is consistent with manual judgment.
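The evidence-attached prompting described above can be sketched as follows. This is a minimal illustration, not the paper's exact protocol: the prompt wording, the `GENE | DISEASE | EVIDENCE:` output format, and the parser are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of an evidence-attached extraction prompt and a parser
# for the model's output. The output line format is an assumption, not the
# authors' exact specification.
import re

PROMPT_TEMPLATE = (
    "Extract all gene-disease relations from the abstract below.\n"
    "For each relation, output one line in the form:\n"
    "GENE | DISEASE | EVIDENCE: <the sentence from the abstract that supports it>\n\n"
    "Abstract:\n{abstract}\n\nRelations:\n"
)

def build_prompt(abstract: str) -> str:
    """Fill the prompt template with one abstract."""
    return PROMPT_TEMPLATE.format(abstract=abstract)

# One relation per line: gene | disease | EVIDENCE: supporting sentence
LINE_RE = re.compile(r"^(?P<gene>[^|]+)\|(?P<disease>[^|]+)\|\s*EVIDENCE:\s*(?P<evidence>.+)$")

def parse_relations(generated: str):
    """Parse generated lines into (gene, disease, evidence) triples."""
    triples = []
    for line in generated.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            triples.append((m.group("gene").strip(),
                            m.group("disease").strip(),
                            m.group("evidence").strip()))
    return triples
```

For example, a generated line such as `BRCA1 | breast cancer | EVIDENCE: BRCA1 mutations are enriched in patients.` parses into one triple whose third element is the verbatim supporting sentence, which can then be shown to the researcher alongside the extracted pair.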
This ability makes the results more trustworthy for researchers, because they can see the evidence directly. There are exceptions, however. Sometimes the model quotes an entire sentence or even an entire paragraph, returning everything that contains even slightly relevant information, which is redundant. This suggests that, alongside the pursuit of interpretability, a post-processing step is needed to compact the long passages the model provides. We also noticed another type of error: the model occasionally misjudges mere co-occurrence as a causal relation, for example when a gene and a disease are only mentioned together without any statement that the former leads to the latter. This type of error requires particular caution when building knowledge graphs. In future work, a causal-reasoning or discrimination module could be added to help the model be more "cautious" during extraction.

6 Challenges and Limitations
6.1 Data quality and standardization of biological terminology
Although large language models have demonstrated considerable potential in extracting biological knowledge, problems arise one after another when they are actually deployed. The first problem encountered is often not the "intelligence" of the model itself, but the inconsistency of data and terminology. Biomedical texts are filled with abbreviations, aliases and non-uniform names; different researchers, and even different databases, use different names for the same gene or the same disease. For instance, a gene may appear under its full name in paper A, as an abbreviation in paper B, and under yet another code in a database. Diseases are especially affected, as scientific names and common names are sometimes used interchangeably.
The result is that, during training or inference, the model recognizes the same entity as several different objects, so the extraction results are fragmented across multiple entries or even contradict one another. What is more troublesome is that the quality of data annotation is often unsatisfactory as well. Manual annotation requires biomedical experts, and such labor is both scarce and expensive (Esteva et al., 2019). With too few high-quality labels, the model can only rely on a small number of samples for training, which easily leads to overfitting to specific expressions; once it encounters rare relations or novel descriptions, its generalization ability proves insufficient. Moreover, if the training data itself is biased or labeled incorrectly, the model absorbs all of these errors and may even amplify them during generation.
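The naming problem described above is commonly addressed by normalizing every extracted mention to a canonical identifier before relations are aggregated. Below is a minimal sketch of dictionary-based normalization; the alias tables are illustrative stand-ins for real resources such as NCBI Gene symbols or MeSH disease headings, which the paper does not specify.

```python
# Minimal sketch of alias-based entity normalization. The tables below are
# illustrative; a real pipeline would load them from NCBI Gene, MeSH, or a
# similar terminology resource.

GENE_ALIASES = {
    "brca1": "BRCA1",
    "breast cancer 1": "BRCA1",  # full name used in some papers
    "breast cancer type 1 susceptibility protein": "BRCA1",
}

DISEASE_ALIASES = {
    "breast cancer": "Breast Neoplasms",    # MeSH-style canonical label
    "breast carcinoma": "Breast Neoplasms",
    "mammary cancer": "Breast Neoplasms",
}

def normalize(mention: str, alias_table: dict) -> str:
    """Map a raw text mention to its canonical name.

    Lookup is case-insensitive; unknown mentions are returned unchanged so
    that no extraction is silently dropped.
    """
    return alias_table.get(mention.strip().lower(), mention.strip())
```

With this step in place, "breast cancer 1", "BRCA1", and the full protein name all collapse to a single node, so extractions from different papers reinforce rather than fragment one another.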
