answers in the form of dialogue (Ji et al., 2023). Although these methods are not yet perfect, they at least indicate that AI is no longer regarded merely as an "output machine", but is gradually being taught how to "verify itself".

6.3 Privacy protection and intellectual property issues
The application of large language models in the biomedical field is not merely a technological breakthrough; it also raises new concerns about privacy and copyright. Models are often trained on large volumes of medical text, much of which is sensitive, such as patient cases or electronic health records (EHRs). Although most current research is based on publicly available scientific literature, the situation changes completely once a model is applied in the clinic. LLMs sometimes "memorize" training data; if they inadvertently disclose information such as patients' names or case numbers during generation, that would be a serious privacy violation. Such risks blur the ethical boundaries of these models and force a re-examination of the bottom line of data usage.

To prevent privacy leakage, the most direct approach is to clean the data at the source: de-identification must be carried out before training to remove or replace information that could identify an individual (Pinero et al., 2020). Some studies also attempt to strengthen protection through technical means, such as incorporating differential privacy mechanisms during training so that the model learns population-level patterns rather than specific individuals; even a slight loss of accuracy is worth it. For knowledge extraction tasks, focusing on group-level statistical information or public knowledge is a more reliable approach (Mesko and Topol, 2023): for instance, extracting only collective patterns such as gene-disease relationships while avoiding individual case descriptions. If models are eventually applied to sensitive data such as clinical notes, these protective measures will likely be mandatory rather than optional.

Apart from privacy, copyright issues are equally thorny. The training data of large language models often contains protected texts such as full papers and patent descriptions, which turns "whether the model output constitutes infringement" into a gray area. If sentences produced by the model are highly similar to the training text, they may be challenged as improper quotation even when they merely state facts. In our experiment, BioGPT was required to output the supporting sentences from the literature; strictly speaking, this approach might border on copyright infringement, although in academic research it is usually considered fair use. To reduce disputes, models can be taught to "paraphrase" rather than "copy", that is, to summarize facts in their own words instead of reproducing the original text verbatim (Moor et al., 2023). Meanwhile, the source should be clearly indicated in the results, which both respects the original authors and makes verification easier. This approach complies with legal requirements and conforms to the principle of transparency in scientific research. Fundamentally, however, the cleanest solution still lies at the data level.
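Before moving on, a few minimal sketches may help make these safeguards concrete. First, de-identification. The sketch below is a toy version of the idea; the regular expressions, placeholder tags, and the `deidentify` helper are illustrative assumptions, and production pipelines rely on trained clinical NER models and far broader rule sets.

```python
import re

# Illustrative patterns only: a real de-identification pipeline would cover
# names, addresses, ages, and many more date and identifier formats.
PHI_PATTERNS = {
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),              # medical record numbers
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),  # US-style phone numbers
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),      # email addresses
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),       # simple date formats
}

def deidentify(text: str) -> str:
    """Replace spans that match PHI patterns with typed placeholder tags."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Patient seen on 03/14/2024, MRN: 00482913, contact 555-123-4567."
print(deidentify(note))
# -> "Patient seen on [DATE], [MRN], contact [PHONE]."
```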
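Second, differential privacy during training. The sketch below follows the DP-SGD recipe in spirit (per-example gradient clipping plus Gaussian noise) using microbatches of one example and no privacy accounting; `dp_sgd_step`, the batch format, and all hyperparameters are hypothetical, and real training would use a maintained library such as Opacus.

```python
import torch

def dp_sgd_step(model, loss_fn, batch, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One differentially private SGD step, sketched.

    Each example's gradient is clipped to L2 norm `clip_norm`, the clipped
    gradients are summed, and Gaussian noise scaled by
    `noise_multiplier * clip_norm` is added before the parameter update.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in batch:  # one example at a time so each gradient can be clipped
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)

    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p.add_(-(lr / len(batch)) * (s + noise))
```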
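Third, group-level extraction. One simple safeguard is to release only relation counts supported by several distinct documents, never individual passages; the (gene, disease, doc_id) triple format and the `min_support` threshold below are assumptions.

```python
from collections import defaultdict

def aggregate_relations(triples, min_support=5):
    """Collapse per-document extractions into group-level statistics.

    Only gene-disease pairs asserted in at least `min_support` distinct
    documents are kept, so no single record dominates the output.
    """
    docs_per_pair = defaultdict(set)
    for gene, disease, doc_id in triples:
        docs_per_pair[(gene, disease)].add(doc_id)
    return {pair: len(docs)
            for pair, docs in docs_per_pair.items()
            if len(docs) >= min_support}
```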
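Finally, the "paraphrase, don't copy" recommendation can be backed by screening generated text for long verbatim overlaps with the cited passage before release. The function below, the n-gram length, and any flagging threshold are assumptions rather than an established standard.

```python
def ngram_overlap(generated: str, source: str, n: int = 8) -> float:
    """Fraction of word n-grams in `generated` that appear verbatim in `source`.

    Values near 1.0 suggest copying; values near 0.0 suggest paraphrasing.
    """
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    gen = ngrams(generated)
    return len(gen & ngrams(source)) / len(gen) if gen else 0.0

source = "BRCA1 mutations are associated with an increased risk of breast cancer."
output = "Mutations in BRCA1 raise the risk of breast cancer."
print(ngram_overlap(output, source, n=3))
# ~0.29: mostly rewritten, sharing only the tail phrase with the source
```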
Nowadays, more and more openly licensed medical datasets and knowledge graphs have emerged, and researchers can train models entirely on these open resources. An LLM obtained in this way carries no copyright worries and is both legal and safe. If the training of future domain models can be based on such public data, the predicament of privacy and copyright might be alleviated at the source.

7 Future Outlook
The potential of large language models for extracting biological knowledge is far from fully unleashed. The real challenge is not merely to "make the model smarter", but to "make the model more trustworthy". The biggest problem at present still lies in interpretability and verifiability: the model can output results, but it is difficult to explain "why". The focus of the next wave of research is therefore likely to be on making the model "clearly state what it is doing"; only then can researchers truly trust it. In fact, large language models and knowledge graphs are essentially a combination of two ways of thinking. One is good at capturing patterns from jumbled text, while the other excels at organizing knowledge with logic and structure. If the two can be truly integrated, an effect of "1+1>2" may emerge, and we might witness a new generation of biological knowledge extraction systems that are both powerful and trustworthy.
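As a toy illustration of that combination, an LLM-extracted triple can be checked against a curated knowledge graph before being accepted, giving the structured side veto power over the textual side. The triple format, relation names, and `verify_against_kg` helper below are hypothetical; a real system would add entity normalization and a graph database.

```python
def verify_against_kg(triple, kg):
    """Grade an LLM-extracted (subject, relation, object) triple against a
    knowledge graph represented as a set of canonical triples."""
    subj, rel, obj = triple
    if triple in kg:
        return "supported"    # the graph already asserts this exact fact
    if any(s == subj and o == obj for s, _, o in kg):
        return "related"      # entities are linked, but by a different relation
    return "unverified"       # novel claim: route to human review

kg = {("BRCA1", "associated_with", "breast cancer")}
print(verify_against_kg(("BRCA1", "associated_with", "breast cancer"), kg))  # supported
print(verify_against_kg(("BRCA1", "causes", "breast cancer"), kg))           # related
```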
