
Computational Molecular Biology 2025, Vol.15, No.4, 160-170 http://bioscipublisher.com/index.php/cmb

To address inconsistent terminology, researchers usually apply "fixes" before and after the model. One approach is vocabulary mapping at the input stage, for example uniformly replacing common aliases with standard names. Another is comparative correction against a database at the output stage (Pinero et al., 2020). Some work has also attempted to integrate medical vocabularies (such as UMLS) directly into the model's reasoning process, using them to constrain the naming of generated entities. At present, however, LLMs generate text mainly from statistical correlations and do not automatically follow medical naming conventions. This means that even when the model's output reads fluently, additional post-processing steps are still needed to "clean" it; otherwise the constructed knowledge base may remain chaotic.

As for data quality, several compromise approaches are currently being explored. Relying solely on manual annotation is too slow, so some studies adopt "silver standard" data, letting the model generate annotations itself or producing them automatically with rules (Habibi et al., 2017). Although less precise than manual labeling, this compensates in quantity, and subsequent fine-tuning on a small number of "gold standard" samples can partly correct the resulting bias. There is also active learning, in which the model helps pick out the most informative samples for priority labeling, achieving better results at lower annotation cost. In other words, the data problem cannot be solved in the short term, but by using data smartly, the model can perhaps learn more intelligently from "dirty data".
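The input-stage vocabulary mapping described above can be sketched as a simple alias-to-standard-name substitution pass. The alias table below is a tiny illustrative stand-in (the specific entries are assumptions, not from the text); a real system would load mappings from a resource such as UMLS or HGNC.

```python
import re

# Illustrative alias table; a real pipeline would populate this from
# a medical vocabulary such as UMLS rather than hard-coding entries.
ALIAS_TO_STANDARD = {
    "her-2": "ERBB2",
    "her2": "ERBB2",
    "p53": "TP53",
    "asa": "aspirin",
}

def normalize_terms(text: str) -> str:
    """Replace each known alias (case-insensitive, whole word) with its standard name."""
    for alias, standard in ALIAS_TO_STANDARD.items():
        text = re.sub(rf"\b{re.escape(alias)}\b", standard, text, flags=re.IGNORECASE)
    return text

print(normalize_terms("HER-2 overexpression; p53 mutation; ASA treatment"))
```

Running this normalization before extraction means the model only ever sees one canonical surface form per concept, which simplifies the downstream comparison against the database.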
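The active-learning idea mentioned above is commonly implemented as uncertainty sampling: rank unlabeled samples by the entropy of the model's predicted label distribution and send the most uncertain ones to annotators first. The probability vectors below are made-up placeholders for a model's output, not values from the study.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted label distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical model confidence over three relation labels per sentence.
unlabeled = {
    "sent-1": [0.98, 0.01, 0.01],  # model is confident -> low annotation priority
    "sent-2": [0.40, 0.35, 0.25],  # model is unsure    -> high annotation priority
    "sent-3": [0.70, 0.20, 0.10],
}

# Most informative (highest-entropy) samples first.
ranked = sorted(unlabeled, key=lambda sid: entropy(unlabeled[sid]), reverse=True)
print(ranked)
```

The annotation budget is then spent where the model is least certain, which is how active learning achieves "higher results with less labor cost".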
6.2 Model hallucination and the challenge of result verifiability
The "hallucination" problem of large language models is almost an insurmountable hurdle. A model sometimes solemnly generates non-existent content that looks plausible but is actually groundless. This is particularly evident in knowledge extraction tasks: the model may "fabricate" an association between a gene and a disease based on impression, even if it is never mentioned in the original text. The root cause is that LLMs learn the statistical regularities of language rather than the facts themselves. The model guesses along semantic "inertia" and, even in the absence of evidence, may force out a seemingly reasonable answer. This is no small problem in biomedicine: both scientific research and clinical practice emphasize evidence, and once a model outputs an incorrect relationship, the consequences can be serious. For instance, if it fabricates an interaction between a drug and a target, researchers may waste experimental resources in vain; if a gene is wrongly associated with a disease and a clinician takes that association at face value, it may lead to misjudgment (Shah et al., 2023). In other words, a model's "confident error" is more dangerous in scientific research than silence.

To deal with this situation, a common approach is to add an extra "verification" layer. After the model extracts relationships, they are not immediately stored in the database; instead, a retrieval or discrimination module checks the original text for sentences that support each relationship. If none is found, the relationship is marked as low credibility and excluded from the final result (Nori et al., 2023). The RAG model proposed by Lewis et al. (2020) follows this idea: it retrieves relevant literature during generation, making the output more "well-grounded".
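The verification layer described here can be sketched minimally as an evidence lookup: before a relation is stored, scan the source text for a sentence mentioning both entities and otherwise flag the relation as low credibility. This naive substring check is a placeholder assumption; production systems would use retrieval or entailment models instead.

```python
def verify_relation(relation, source_text):
    """Return the relation with a supporting sentence if one exists,
    otherwise mark it as low credibility."""
    head, _, tail = relation
    for sentence in source_text.split("."):
        if head.lower() in sentence.lower() and tail.lower() in sentence.lower():
            return {"relation": relation, "evidence": sentence.strip(), "credible": True}
    return {"relation": relation, "evidence": None, "credible": False}

# Illustrative source text and extracted relations (not from the study).
text = "BRCA1 mutations are associated with breast cancer. TP53 is a tumor suppressor."
print(verify_relation(("BRCA1", "associated_with", "breast cancer"), text))
print(verify_relation(("BRCA1", "interacts_with", "aspirin"), text))
```

Attaching the matched sentence as evidence, as the text notes BioGPT can be prompted to do, makes each stored relation auditable by a human reviewer.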
In our experiment, we also had BioGPT attach the original sentence as supporting evidence when generating relationships. Although not always accurate, this does enhance the verifiability of the results. Another approach is to teach the model to "admit not knowing" during training. For example, samples can be added during instruction fine-tuning so that the model answers "not mentioned" or "unable to judge" in the absence of evidence. This is somewhat like adding a "refusal mechanism": the model learns to stay silent when it is uncertain. While this cannot completely eliminate hallucination, it reduces the tendency to fabricate facts at will. In addition, researchers are attempting to evaluate the reliability of model output more systematically, for instance by using an existing knowledge graph to run consistency checks on generated results and see whether they conflict with known facts. If the model claims that "mitochondrial DNA causes a certain skin disease" and established biological knowledge explicitly denies this causality, the output can be identified as a hallucination and discarded. Some have even let the model "reflect on itself" and assess the credibility of its own output.
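The knowledge-graph consistency check described above can be sketched as comparing each generated triple against a set of known facts and an explicit list of denied claims. The tiny graphs below are illustrative placeholders (the entries are assumptions, not real curated data).

```python
# Illustrative stand-ins for a curated knowledge graph; real systems
# would query resources such as DisGeNET or a local triple store.
KNOWN_FACTS = {
    ("TP53", "associated_with", "Li-Fraumeni syndrome"),
}
KNOWN_FALSE = {
    ("mitochondrial DNA", "causes", "skin disease X"),  # hypothetical denied claim
}

def check_triple(triple):
    """Classify a generated triple against the reference graph."""
    if triple in KNOWN_FALSE:
        return "conflict"      # contradicts known facts -> likely hallucination
    if triple in KNOWN_FACTS:
        return "supported"
    return "unverified"        # neither supported nor contradicted; needs review

print(check_triple(("mitochondrial DNA", "causes", "skin disease X")))
print(check_triple(("TP53", "associated_with", "Li-Fraumeni syndrome")))
```

The useful property of this design is the three-way outcome: conflicts are discarded outright, while merely unverified triples can be routed to the low-credibility queue for human review rather than silently accepted.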
