Computational Molecular Biology 2025, Vol.15, No.4, 171-182
http://bioscipublisher.com/index.php/cmb

pathway. We even found some potential associations in the disease subgraphs, such as the indirect connection between Tau and GSK3β through signaling pathways. Overall, these topological features reveal the hierarchy and patterns within the network and provide many new hypotheses for subsequent biological experiments (Mall et al., 2017; Seoane, 2024).

4.3 Semantic reasoning and relationship prediction
The value of a knowledge graph lies not in how much data it stores but in what it can infer. It is not merely a warehouse; it is closer to a reasoning system. In the molecular interaction atlas, we want it to identify missing connections on its own, for example by predicting protein interactions or drug effects that have not yet been verified. There are generally two approaches: rule-based reasoning and model-based reasoning.

Rule-based reasoning follows explicit logic. For instance, "If X activates Y and Y inhibits Z, then X might indirectly affect Z." Likewise, if A interacts with B and B interacts with C, then A and C might lie on the same path. Such rules are easy to interpret, but they are incomplete and easily disturbed by noise. For this reason, many researchers prefer statistical methods, which embed all molecules as vectors and score which pairs are the most "compatible". We used this approach to predict new interactions on the protein map and compared the predictions with experimental data; some of them were confirmed by the literature. Later, we tried graph neural networks, which let the model aggregate neighbor information within subgraphs. This produced more accurate predictions and could even indicate "why" a link was predicted, for instance because both A and B are connected to the key protein C (Kishan et al., 2020; Liu et al., 2021). Reasoning does not equal discovery.
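As a minimal sketch of the two ideas above, the following Python applies the "X activates Y, Y inhibits Z" rule to a toy set of triples and then lists unlinked pairs that share an interaction partner. The triples, relation names, and function names are hypothetical illustrations, and the shared-neighbor listing is only a simple stand-in for the embedding and graph neural network models discussed in the text, not the method actually used in this study.

```python
from collections import defaultdict
from itertools import combinations

# Toy edges (hypothetical): (source, relation, target)
TRIPLES = [
    ("X", "activates", "Y"),
    ("Y", "inhibits", "Z"),
    ("A", "interacts_with", "C"),
    ("B", "interacts_with", "C"),
]

def infer_indirect_effects(triples):
    """Rule: if X activates Y and Y inhibits Z, X may indirectly affect Z."""
    activates = defaultdict(set)
    inhibits = defaultdict(set)
    for source, relation, target in triples:
        if relation == "activates":
            activates[source].add(target)
        elif relation == "inhibits":
            inhibits[source].add(target)
    inferred = set()
    for x, intermediates in activates.items():
        for y in intermediates:
            for z in inhibits.get(y, ()):
                if z != x:
                    # Keep the intermediate Y as the explanation of the inference.
                    inferred.add((x, "may_indirectly_affect", z, y))
    return inferred

def shared_neighbor_candidates(triples):
    """List unlinked pairs that share at least one interaction partner."""
    neighbors = defaultdict(set)
    for source, relation, target in triples:
        if relation == "interacts_with":
            neighbors[source].add(target)
            neighbors[target].add(source)
    candidates = {}
    for a, b in combinations(sorted(neighbors), 2):
        if b in neighbors[a]:
            continue  # already linked, nothing to predict
        shared = neighbors[a] & neighbors[b]
        if shared:
            # The shared partners serve as the "why" behind the candidate link.
            candidates[(a, b)] = sorted(shared)
    return candidates

print(infer_indirect_effects(TRIPLES))
print(shared_neighbor_candidates(TRIPLES))
```

Both functions return the supporting evidence (the intermediate node or the shared partners) together with each candidate, so every suggested link can be checked by hand before it is treated as a hypothesis.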
Predictions must still be verified experimentally, but reasoning can help us pick out the most promising few candidates from tens of thousands of molecules, greatly reducing experimental effort.

5 Case Study: Knowledge Graph of Protein-Protein Interactions
5.1 Case selection and dataset description
To test whether this atlas construction method is actually effective, we selected protein-protein interactions (PPI) as the test case. The PPI network is the most crucial component of molecular interactions, with dense information and numerous connections. It is involved in disease mechanisms and drug effects, and almost every systems biology study encounters it. The human PPI map is by now quite extensive, but here we aim for accuracy rather than completeness.

The data come mainly from several widely used databases. Human interaction records were extracted from BioGRID, keeping only highly reliable interactions supported by multiple experiments or multiple publications (Oughtred et al., 2020). STRING data were used as a supplement, restricted to interactions with a high combined score (Popik et al., 2014). Protein annotations were retrieved from UniProt and GO, and disease associations from OMIM and DisGeNET. Finally, a small set of literature-mining results was added to supply new interactions not yet included in the databases. After integration, the map contains approximately 15,000 protein nodes, over a thousand disease nodes, and several thousand high-quality interaction relationships. We value credibility over quantity, so interactions with weak evidence are excluded for now. For computational convenience, the analysis here treats the interaction network as undirected and does not yet distinguish the direction of regulation.
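The evidence-based selection described above can be sketched roughly as follows. The field names (`a`, `b`, `pubmed_id`, `combined_score`), the placeholder protein identifiers, and both thresholds are assumptions made for illustration, since the text does not state the exact cut-offs. The sketch also counts only distinct publications, whereas the text additionally accepts support from multiple experiments.

```python
from collections import defaultdict

def canonical_pair(a, b):
    """Order-independent key, so A-B and B-A count as one undirected edge."""
    return tuple(sorted((a, b)))

def select_high_confidence(biogrid_rows, string_rows,
                           min_support=2, min_string_score=700):
    """Keep BioGRID pairs with enough distinct publications, then add
    STRING pairs whose combined score reaches the threshold.
    Both thresholds are assumed values, not taken from the source."""
    support = defaultdict(set)
    for row in biogrid_rows:
        support[canonical_pair(row["a"], row["b"])].add(row["pubmed_id"])
    selected = {
        pair: "BioGRID"
        for pair, publications in support.items()
        if len(publications) >= min_support
    }
    for row in string_rows:
        if row["combined_score"] >= min_string_score:
            # STRING only supplements: it never overwrites a BioGRID entry.
            selected.setdefault(canonical_pair(row["a"], row["b"]), "STRING")
    return selected

# Toy records with placeholder identifiers (hypothetical data).
biogrid = [
    {"a": "PROT_A", "b": "PROT_B", "pubmed_id": "1"},
    {"a": "PROT_B", "b": "PROT_A", "pubmed_id": "2"},
    {"a": "PROT_A", "b": "PROT_C", "pubmed_id": "3"},
]
string_db = [
    {"a": "PROT_C", "b": "PROT_D", "combined_score": 900},
    {"a": "PROT_A", "b": "PROT_C", "combined_score": 400},
]
print(select_high_confidence(biogrid, string_db))
```

Keying each pair on its sorted identifiers matches the undirected treatment of the network, and recording the source of each kept edge makes it possible to audit later which database contributed it.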
All data went through an automated cleaning and standardization pipeline, were uniformly identified by UniProt ID, and were finally imported into the graph database. The entire process is reproducible.

5.2 Analysis process and visualization results
After constructing the protein interaction map, we first computed overall statistics: approximately 15,000 protein nodes, over a thousand disease nodes, and more than 60,000 interactions, with a degree distribution typical of a scale-free network. A few proteins, such as p53 and EGFR, have an extremely large number of connections. The network is almost fully connected, which suggests that the data integration worked well. In the centrality analysis, signaling hubs such as TP53, MYC, and AKT1 rank among the top nodes. Closer examination of the MAPK pathway shows that its core proteins form a tight, clearly structured subnetwork. In the subgraph of the apoptotic pathway, the interaction patterns of the caspase family, the Bcl-2 proteins, and related members are consistent with the classical pathway, but unexpected nodes such as SIRT1 also appear, suggesting potential new regulation. Link prediction also identified several possible new interactions, some of which have been confirmed by the literature (López-Cortés et al., 2018; Kim et al.,