
Computational Molecular Biology 2025, Vol.15, No.4, 171-182
http://bioscipublisher.com/index.php/cmb

meticulous job: duplicates need to be removed, low-credibility data should be filtered out, and missing attributes should be filled in - even an explicit "unknown" is better than a blank field. Finally, all formats are converted into a unified structure, such as RDF or CSV, to facilitate import into the database (Mavridis et al., 2025). The dataset compiled here contains tens of thousands of protein entities and tens of thousands of interaction relationships, each accompanied by a source identifier. Data processed in this way is clean and uniform, and can support the subsequent construction of the graph (Schulz et al., 2013).

Figure 1 Architecture of the e-TSN web application (Adopted from Feng et al., 2022)
Image caption: The workflow involves several stages: scientific document download, preprocessing, named entity recognition, relation extraction, knowledge discovery and visualization (Adopted from Feng et al., 2022)

2.3 Entity recognition and semantic normalization

When dealing with raw text, the first problem to confront is often not the algorithm but the question of "what to call it". Across papers and databases, a protein may go by several names, which must first be recognized and then unified. Named entity recognition lets the system automatically extract these biological names, and the relationships between them, from the text. For instance, in the sentence "P53 directly interacts with MDM2", two proteins and their interaction can be identified. Deep learning models are particularly useful here, especially those that can handle confusing aliases and irregular formats (Habibi et al., 2017). However, recognition is just the beginning; the real challenge lies in normalization.
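Recognition followed by normalization can be illustrated with a minimal sketch. It deliberately swaps the deep learning recognizer for a plain dictionary-and-regex matcher, and the standard-library lookup (UniProt or NCBI) for a tiny hand-written alias table. The names `ALIASES`, `RELATIONS` and `extract_triples`, the relation labels, and the use of gene symbols as canonical IDs are all illustrative assumptions, not part of the pipeline described in this paper.

```python
import re

# Hypothetical alias table: surface name (lowercase) -> canonical symbol.
# A real pipeline would resolve names against UniProt or NCBI instead.
ALIASES = {
    "p53": "TP53",
    "tp53": "TP53",
    "p53 protein": "TP53",
    "tumor suppressor protein 53": "TP53",
    "mdm2": "MDM2",
}

# Hypothetical relation vocabulary: surface phrase -> unified relation label.
RELATIONS = {
    "interacts with": "interacts_with",
    "inhibits": "negatively_regulates",
    "negatively regulates": "negatively_regulates",
}


def _alternation(keys):
    # Try longer phrases first so "p53 protein" wins over "p53".
    ordered = sorted(keys, key=len, reverse=True)
    return "|".join(re.escape(k) for k in ordered)


ENTITY = _alternation(ALIASES)

# entity, optional single modifier word (e.g. "directly"), relation, entity
TRIPLE_RE = re.compile(
    rf"\b(?P<head>{ENTITY})\b\s+(?:\w+\s+)?"
    rf"(?P<rel>{_alternation(RELATIONS)})\s+"
    rf"\b(?P<tail>{ENTITY})\b",
    re.IGNORECASE,
)


def extract_triples(text):
    """Return normalized (head, relation, tail) triples found in text."""
    triples = []
    for m in TRIPLE_RE.finditer(text):
        head = ALIASES[m.group("head").lower()]
        rel = RELATIONS[m.group("rel").lower()]
        tail = ALIASES[m.group("tail").lower()]
        triples.append((head, rel, tail))
    return triples


print(extract_triples("P53 directly interacts with MDM2"))
```

A lookup table like this cannot resolve names whose meaning depends on context; it only shows how different surface forms collapse onto one identifier before they become nodes and edges.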
Names such as "TP53", "p53 protein" and "tumor suppressor protein 53" all refer to the same entity, TP53. For the graph to recognize this, each name usually has to be looked up in a standard resource, such as UniProt or NCBI, and mapped to a unified ID. The same is true for compounds: one compound may have multiple trade names or chemical names. Relationship names should be unified as well; for example, "inhibition" and "negative regulation" are best grouped into the same category. Ambiguous names such as "APC", which can denote either a gene or a protein complex, can only be resolved by considering the context or the existing structure of the graph. Only after all of this has been done can the data be considered "clean" and be reliably transformed into nodes and edges in the graph (Sung et al., 2022).

3 Knowledge Graph Construction Methods

3.1 Entity and relationship modeling

After the data is sorted out, the "framework" still needs to be set up. In the molecular interaction map, nodes usually include proteins, genes, small molecules, diseases, etc. The coding relationship between genes and
