Computational Molecular Biology 2025, Vol.15, No.4, 171-182 http://bioscipublisher.com/index.php/cmb 174

proteins should be marked. Each entity type has its own attributes, such as function and sequence for proteins, and target and application for drugs. The core relationship is "interaction", but directional relationships such as "activation", "inhibition", and "binding" can also be defined. If the relation types are too fine-grained, the data for each type become sparse, so relations are often first unified as "interaction" and subdivided later during reasoning. Cross-layer relationships can also be added, such as "gene mutations cause diseases" and "drugs treat diseases". Some relationships are undirected, while others have a direction. To enrich the graph, an inferred relationship such as "proteins of the same family" can also be added to help the model discover similarities (Zhou et al., 2024). The resulting schema is clear, which facilitates later expansion and analysis (Taneja et al., 2022).

3.2 Graph architecture design and storage model

After determining the entities and relationships, the next step is to decide how to store them. There are generally two approaches: follow the RDF model of the Semantic Web, or use the property graph model of graph databases. RDF emphasizes standardization: facts are expressed as explicit triples, and existing ontologies can be used directly for semantic reasoning. However, once the data volume is large and the relationships are complex, queries slow down. Property graphs are the more pragmatic option. Both nodes and edges can carry properties, and path lookups and centrality calculations are fast. Many databases (such as Neo4j) are built on this model. Because the molecular interaction graph has to handle tens of thousands of relationships, we prefer the property graph. However, we cannot completely abandon semantics either.
Therefore, adding ontology labels to the node properties serves as a compromise. In practice, each protein node carries a type label, such as ':Protein', along with properties such as name and species. Information can also be attached to edges, for example (BRCA1)-[:INTERACTS_WITH {pmid:123456, method:"Y2H"}]->(PALB2), which records not only which molecules interact but also the source of the evidence. Such an architecture is intuitive, flexible, and easy to extend to larger graphs in the future (Figure 2) (Tomaszuk et al., 2020; Alocci et al., 2015).

3.3 Automate the build process

Manually curating tens of thousands of molecular interaction records one by one is almost impossible: it is slow and error-prone. We therefore set up an automated pipeline so that the knowledge graph can "grow" by itself. The pipeline is roughly divided into four steps: acquire the data, clean and transform it, import it into the graph database, and validate the result. Data acquisition is the most complicated step. Scripts call APIs to pull protein data from NCBI and UniProt and to retrieve abstracts from PubMed, and NER models are then run in batches to extract interaction information. The raw data are then converted into nodes and edges and normalized at the same time, for example by merging synonyms into a single entity. The converted results are first saved as CSV files so that they can be inspected at any time. Next, the data are imported into Neo4j: nodes and relationships are loaded with batch tools, and indexes are configured to improve query speed. Once the graph is built, it still has to be validated: are there isolated nodes, are the relationships correct, and are the PMIDs linked accurately? If the results are not up to standard, we revise the rules and rerun the pipeline.
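The transform and validation steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the synonym table, the record layout (source, target, PMID, method), and all function names are assumptions made for illustration, and the acquisition, NER, and Neo4j loading steps are left out.

```python
import csv
import io

# Hypothetical synonym table (alias -> canonical symbol), for illustration only.
SYNONYMS = {"FANCS": "BRCA1", "FANCN": "PALB2"}


def normalize(name, synonyms=SYNONYMS):
    """Map an extracted name to its canonical, upper-cased symbol."""
    key = name.strip().upper()
    return synonyms.get(key, key)


def build_graph(raw_records, known_proteins):
    """Turn raw (source, target, pmid, method) records into node and edge rows."""
    nodes = {p: {"id": p, "label": "Protein"}
             for p in map(normalize, known_proteins)}
    edges = {}
    for src, dst, pmid, method in raw_records:
        s, t = normalize(src), normalize(dst)
        for n in (s, t):
            nodes.setdefault(n, {"id": n, "label": "Protein"})
        # Edges that become identical after synonym merging are kept once.
        edges[(s, t, pmid)] = {"source": s, "target": t,
                               "type": "INTERACTS_WITH",
                               "pmid": pmid, "method": method}
    return list(nodes.values()), list(edges.values())


def to_csv(rows, fieldnames):
    """Serialize rows as CSV text so the intermediate result can be inspected."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def validate(nodes, edges):
    """Report isolated nodes and edges whose PMID is missing or non-numeric."""
    connected = {e["source"] for e in edges} | {e["target"] for e in edges}
    isolated = sorted(n["id"] for n in nodes if n["id"] not in connected)
    bad_pmid = [e for e in edges if not str(e["pmid"]).strip().isdigit()]
    return {"isolated_nodes": isolated, "edges_without_valid_pmid": bad_pmid}


# Toy input: the second record is a synonym duplicate, the third lacks a PMID.
raw = [("BRCA1", "PALB2", "123456", "Y2H"),
       ("FANCS", "FANCN", "123456", "Y2H"),
       ("brca1", "BRCA2", "", "Co-IP")]
known = ["BRCA1", "PALB2", "BRCA2", "TP53"]

nodes, edges = build_graph(raw, known)
report = validate(nodes, edges)
print(to_csv(edges, ["source", "target", "type", "pmid", "method"]))
print(report["isolated_nodes"])
```

In this toy run, the synonym duplicate collapses into a single BRCA1-PALB2 edge, the protein with no interactions is reported as isolated, and the edge without a PMID is flagged. Findings of this kind would prompt a revision of the rules before the next run.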
Repeating this cycle several times makes the graph progressively cleaner. A full run now takes only a few hours from raw data to finished graph. Almost all relationships are traceable, and updates are much more convenient (Clancy et al., 2019; Li et al., 2020).

4 Graph Representation and Analysis Methods

4.1 Knowledge representation learning

Building the graph is just the beginning. The truly interesting part is whether something can be "learned" from it. We want these molecules and relationships to be more than nodes and lines: they should be transformed into numbers that machines can work with. The approach is not complicated. Put simply, entities and relationships are transformed into vectors, a step called embedding. Once they are converted into vectors, a model can use them to calculate similarities and make predictions. Different algorithms take different approaches. Some treat relations as translations (such as TransE), while others use complex-valued vector spaces to represent complex relations (such as ComplEx and RotatE) (Sun et al., 2018). Molecular interaction graphs contain many types of relationships, and these are often directional. Therefore, we tend to choose models that can handle asymmetric
