
Computational Molecular Biology 2025, Vol.15, No.4, 160-170 http://bioscipublisher.com/index.php/cmb

literature, and, after manual review, sorted out a labeled training set. What BioGPT aims to do is learn from these examples so that it can identify similar relationships in new literature. However, this task is not merely about "matching words". Biomedical papers vary widely in expression, and the same meaning is often written in different ways. For instance, "BRCA1 mutations can lead to an increase in breast cancer susceptibility" and "BRCA1 is a susceptibility gene for breast cancer" state the same fact with completely different sentence structures. To enable the model to recognize such semantically equivalent expressions, we deliberately retained a variety of sentence patterns when constructing the training set. This design helps BioGPT understand context better, improving recall, and makes it less likely to be confused by surface wording when processing literature with different writing styles.

5.2 Model design and training process
For this part we adopted the common "pre-trained model + downstream fine-tuning" approach. In simple terms, we start from a model that has already learned to "understand biological language" and then train it for the specific task. BioGPT fits this role well: it has been pre-trained on a large corpus of biomedical literature, has accumulated rich language patterns and domain knowledge (Luo et al., 2022), and already has some grasp of gene and disease concepts. What remains is to make it more "focused": through supervised fine-tuning, the model learns to extract gene-disease relationships from the literature. However, rather than casting this as a standard classification task, we reformulated the problem as a generation task.
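The generation-task reframing can be sketched as follows: each annotated abstract becomes a (prompt, target) pair, and the model is trained to generate the target text. This is a minimal illustration; the exact prompt wording, separator format, and record fields are assumptions, not the paper's actual data format.

```python
# Sketch: reframing relation extraction as text generation.
# An annotated sample (abstract + labeled gene-disease pairs, DisGeNET-style)
# becomes a (prompt, target) pair for supervised fine-tuning.
# The prompt wording and the "gene -> disease" target format are
# illustrative assumptions, not the format used in the paper.

PROMPT = "Please extract the genes mentioned in the text and their related diseases:"

def build_training_pair(abstract: str, relations: list) -> tuple:
    """Return (input_text, target_text) for one annotated abstract.

    relations: list of (gene, disease) pairs labeled for this abstract.
    Because the target lists all pairs at once, the model can emit
    several relationships in a single generation.
    """
    input_text = f"{abstract}\n{PROMPT}"
    target_text = "; ".join(f"{gene} -> {disease}" for gene, disease in relations)
    return input_text, target_text

# Example using a sentence pattern from the text:
pair = build_training_pair(
    "BRCA1 is a susceptibility gene for breast cancer.",
    [("BRCA1", "breast cancer")],
)
```

Keeping all of an abstract's relations in one target string is what allows the fine-tuned model to output multiple relationships per generation, as described below, instead of requiring one sample per pair.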
The concrete procedure is as follows: feed the literature abstract to the model, prefixed with the prompt "Please extract the genes mentioned in the text and their related diseases:". In this way the model knows it should "list" the results rather than continue writing the article. This design lets BioGPT directly generate the output format we need and matches its nature as a generative language model. The training data comes from the DisGeNET annotation set mentioned earlier; each sample consists of an abstract and the corresponding gene-disease pairs. The model learns to "reproduce" these paired contents, and training proceeds by minimizing the difference between its output and the reference answers. Since BioGPT is itself a generative model, we retained its original generation mechanism, so it can output multiple relationships in a single generation without splitting them into multiple independent samples. We made some conservative adjustments to the training hyperparameters. Because BioGPT has a very large number of parameters, too high a learning rate can easily make it forget its pre-trained knowledge; we therefore used a relatively small learning rate of approximately 2e-5 and limited training to three epochs (Devlin et al., 2019). To save computation, we also froze the model's lower-level parameters and fine-tuned only the upper Transformer layers and the output head. This preserves the stability of the model's language generation and helps avoid overfitting. Finally, we evaluated the model on DisGeNET samples held out from training. The evaluation metrics remain Precision, Recall and F1. To count an extraction as correct, we require that the gene and disease names generated by the model match the reference answers and that the relationship type also matches semantically.
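The selective freezing described above (train only the upper Transformer layers and the output head, leaving embeddings and lower layers fixed) amounts to a filter over the model's named parameters. The sketch below uses a hypothetical naming scheme ("layers.<k>.", "output_projection"), which stands in for BioGPT's real module names.

```python
# Sketch: decide which parameters stay trainable when only the top
# Transformer layers and the output head are fine-tuned. The parameter
# naming scheme here is a hypothetical stand-in for BioGPT's real one.

import re

def trainable(name: str, num_layers: int, top_k: int = 2) -> bool:
    """Return True if the named parameter should remain trainable."""
    if name.startswith("output_projection"):
        return True  # output head is always fine-tuned
    m = re.match(r"layers\.(\d+)\.", name)
    if m:
        # Only the top-most `top_k` Transformer layers stay trainable.
        return int(m.group(1)) >= num_layers - top_k
    return False  # embeddings and everything else stay frozen

# With a framework like PyTorch this would typically be applied as:
#   for name, p in model.named_parameters():
#       p.requires_grad = trainable(name, num_layers=24)
# together with a small learning rate (~2e-5) and about three epochs.
```

Freezing the lower layers keeps the pre-trained language behavior stable while the small learning rate limits how far the remaining weights can drift, which is the overfitting safeguard the text describes.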
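The evaluation rule (a prediction counts as correct only if gene and disease both match the reference, with surface variants normalized away) can be sketched as a set comparison over normalized pairs. The synonym table below is an illustrative assumption; the actual normalization resources are not specified in the text.

```python
# Sketch: scoring extracted gene-disease pairs against gold annotations.
# A pair counts as a true positive only if it matches a gold pair after
# normalization. The synonym table is an illustrative assumption.

SYNONYMS = {"breast carcinoma": "breast cancer"}  # assumed example mapping

def normalize(pair):
    """Case-fold and map disease synonyms to a canonical form."""
    gene, disease = pair
    disease = disease.lower().strip()
    return gene.upper().strip(), SYNONYMS.get(disease, disease)

def prf1(predicted, gold):
    """Return (precision, recall, F1) over sets of (gene, disease) pairs."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Scoring over sets means duplicate generations of the same pair are neither rewarded nor penalized, which fits a generative model that may repeat itself.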
Of course, we allow some variation in expression: for instance, synonyms such as "breast carcinoma" and "breast cancer" are treated as equivalent, since in the biomedical context such differences do not affect the correctness of the extracted knowledge.

5.3 Experimental results, performance comparison and analysis of result interpretability
The fine-tuned BioGPT performed well on gene-disease relationship extraction. On the validation set, Precision is approximately 0.80, Recall approximately 0.78, and the F1 value is close to 0.79, a level comparable to the current best supervised models. If the un-fine-tuned BioGPT is used for extraction directly, Precision is only around 0.60, so task-specific fine-tuning is clearly indispensable. We
