
Computational Molecular Biology 2025, Vol.15, No.4, 171-182 http://bioscipublisher.com/index.php/cmb

knowledge graphs almost cover all known RNA interactions. They are also used to explore new uses for existing drugs: by connecting drug, target, and disease nodes, algorithms can surface potential new relationships (Zhou et al., 2024). In clinical diagnosis, they can help physicians match symptoms to diseases quickly, improving the efficiency of diagnosis and treatment. A more sophisticated approach weaves multi-layer data such as genomes and proteomes into heterogeneous networks, analyzes the graph for key regulatory relationships, and even absorbs knowledge from the literature automatically through text mining. Although knowledge graphs are already widely applied, their construction efficiency, update speed, and scalability remain insufficient; these are precisely the problems that future work must overcome.

Although knowledge graphs have achieved a great deal in bioinformatics, molecular interactions still pose quite a few problems. Data come from all directions and in various formats: how can this information be made to speak "the same language"? For core entities such as proteins, semantic standardization is frequently misaligned, and when it is done poorly, downstream analyses become a mess. There is an even more intractable question: can the resulting graph really help us infer new molecular relationships? These are precisely the questions this study sets out to explore. First, we designed a construction scheme covering data cleaning, entity modeling, and storage architecture, so that proteins, small molecules, and similar entities each have a clear position in the graph.
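The drug-repurposing idea above can be sketched in a few lines: score a candidate drug-disease link by the intermediate targets the two nodes share. This is a minimal shared-neighbor heuristic, not the algorithm used in the cited work, and all node names and edges are invented for illustration.

```python
# Minimal sketch of knowledge-graph link prediction for drug repurposing.
# All node names and edges are illustrative, not drawn from any real database.
edges = {
    ("DrugA", "Protein1"), ("DrugA", "Protein2"),
    ("DrugB", "Protein2"), ("DrugB", "Protein3"),
    ("Protein1", "DiseaseX"), ("Protein2", "DiseaseX"),
    ("Protein3", "DiseaseY"),
}

def neighbors(node):
    """All nodes directly linked to `node`, ignoring edge direction."""
    return {b for a, b in edges if a == node} | {a for a, b in edges if b == node}

def shared_target_score(drug, disease):
    """Score a candidate drug-disease link by counting shared neighbors
    (here, the targets that connect the drug to the disease)."""
    return len(neighbors(drug) & neighbors(disease))

# DrugA has no direct edge to DiseaseX, but shares two targets with it,
# so the graph suggests a candidate repurposing link.
print(shared_target_score("DrugA", "DiseaseX"))  # -> 2
```

Real systems replace this count with learned scores (e.g., graph embeddings), but the principle of inferring an unobserved edge from graph structure is the same.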
Then we explore methods such as graph representation learning, network analysis, and semantic reasoning to see whether they can help uncover new interaction clues. Taking protein-protein interactions as an example, we build a case graph and validate the inferred results against experiments and the literature. Finally, we set up a prototype system that makes the graph visible and interactive, which is convenient for both drug-mechanism and disease-network research. Compared with prior work, we pay more attention to integration and exploration at the detailed level, connecting theory, algorithms, and applications into a single thread, in the hope of pushing research on molecular interactions one step further.

2 Data Sources and Preprocessing
2.1 Data types and sources
To build a reliable knowledge graph of molecular interactions, the first step is often not modeling but "retrieving data". Genes, proteins, small molecules, pathways, and diseases: these pieces of information are scattered across different places, some in databases and some hidden in papers. For protein-protein interactions, the commonly consulted resources include IntAct, STRING, and BioGRID. Drug-target relationships are mostly derived from DrugBank or ChEMBL. For functional annotation and semantic systems, GO and UMLS are almost unavoidable sources. Of course, these data were never ready to use as-is. Naming conventions differ across databases, and the same protein may go by different names in each, so identifiers must be unified first. Some interaction records reported in the literature have never been reproduced, so their credibility needs to be rated. The interaction types we focus on mainly include protein-protein, protein-small molecule, and gene regulation; therefore, collection means not only harvesting databases but also relying on text mining to extract descriptions of "who interacts with whom" from PubMed (Figure 1) (Feng et al., 2022).
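As a toy illustration of that text-mining step, a crude surface pattern can pull "who interacts with whom" candidates out of abstract-like sentences. The sentences and the regex are invented for this sketch; a real pipeline would use trained biomedical NLP models rather than a pattern match.

```python
import re

# Hypothetical abstract sentences; a real corpus would come from PubMed.
abstracts = [
    "We found that TP53 interacts with MDM2 in stressed cells.",
    "BRCA1 binds to BARD1, forming a stable heterodimer.",
]

# Crude surface pattern: UPPERCASE gene-symbol-like token, an interaction
# verb phrase, then another such token. Purely illustrative.
PATTERN = re.compile(
    r"([A-Z][A-Z0-9]+)\s+(?:interacts with|binds to)\s+([A-Z][A-Z0-9]+)"
)

def extract_pairs(texts):
    """Collect (entity, entity) candidate interaction pairs from free text."""
    pairs = []
    for text in texts:
        pairs.extend(PATTERN.findall(text))
    return pairs

print(extract_pairs(abstracts))  # -> [('TP53', 'MDM2'), ('BRCA1', 'BARD1')]
```

Each extracted pair would then be treated as a low-confidence candidate edge, to be cross-checked against curated databases before entering the graph.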
Structural resources can also come in handy, such as finding clues about protein complexes in the co-crystal structures deposited in the PDB. Overall, the process is more like assembling a jigsaw puzzle: data from different sources vary greatly, and only through repeated comparison and cross-verification can the resulting graph be both comprehensive and reliable (Zhou et al., 2024).

2.2 Data standardization and cleaning
Obtaining the initial data does not mean the work is over; the truly troublesome part often comes later, in cleaning and standardization. Data from different sources are like speakers of different dialects, with distinct names, formats, and symbols, and without unification it is simply impossible to proceed. Usually we first sort out the naming: proteins are labeled with UniProt IDs, genes with NCBI Gene IDs, and small molecules with DrugBank or ChEBI identifiers, so there is no confusion during merging. Next, these concepts need a "home": an ontology such as GO is used to define functions, categories, and hierarchy. Relationships also need a unified definition: protein interactions are normalized to "interacts_with", and relations with a specific direction are labeled explicitly. The cleaning stage is more like a
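The identifier and relation unification just described can be sketched as simple lookup tables. The alias tables below are toy examples (only the p53 to P04637 mapping reflects a real UniProt accession); a production pipeline would query the actual UniProt and NCBI mapping services.

```python
# Toy alias tables for identifier normalization; not real database dumps.
ALIAS_TO_UNIPROT = {
    "p53": "P04637",
    "TP53": "P04637",
    "Cellular tumor antigen p53": "P04637",
}

# Map raw relation labels onto a small controlled vocabulary; relations
# with a specific direction (e.g., "activates") keep their direction.
RELATION_MAP = {
    "binds": "interacts_with",
    "associates with": "interacts_with",
    "activates": "activates",
}

def normalize_triple(subj, rel, obj):
    """Map raw entity names to canonical IDs and raw relation labels to
    the controlled vocabulary, leaving unknown values unchanged."""
    return (
        ALIAS_TO_UNIPROT.get(subj, subj),
        RELATION_MAP.get(rel, rel),
        ALIAS_TO_UNIPROT.get(obj, obj),
    )

print(normalize_triple("p53", "binds", "MDM2"))
# -> ('P04637', 'interacts_with', 'MDM2')
```

Leaving unknown values unchanged, rather than dropping them, lets unresolved names be collected for manual curation later.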
