Computational Molecular Biology 2025, Vol.15, No.5, 218-226
http://bioscipublisher.com/index.php/cmb

information to depict pathogen interaction networks within a unified framework, not only improving prediction efficiency, but also redefining the path of pathogen mechanism research.

2 The Biological Basis of Protein-Protein Interactions Among Pathogenic Bacteria

2.1 Characteristics of the protein interaction network of pathogenic bacteria

Although the protein interaction network of pathogenic bacteria is complex, it follows certain rules. Most proteins interact with only a few partners. A few "hub" proteins, such as RNA polymerase or ribosome components, are densely connected and form the network core. Networks often exhibit "small-world" and modular characteristics: functional modules such as flagella, secretion systems, and membrane synthesis are tightly connected internally, while the connections between modules are sparse. Interactions conserved across species (such as that between DNA polymerase and its sliding clamp) reveal evolutionary stability (Szymborski and Emad, 2024). Identifying these structural patterns helps to discover critical and vulnerable targets for antibacterial intervention. However, the compact genomes of pathogenic bacteria and the frequent reuse of the same proteins across multiple interactions make network modeling more challenging.

2.2 Pathogenicity mechanism and the molecular basis of host-pathogen interaction

Infection is essentially a molecular contest between the pathogen and the host. Bacterial virulence systems, such as the Salmonella type III secretion system or ESX-1 of Mycobacterium tuberculosis, are assembled through protein-protein interactions. If a key interaction is impaired, virulence decreases. Bacteria can also rewire metabolism through interaction networks to resist drugs. For example, when penicillin-binding proteins (PBPs) are inhibited in MRSA, the network reroutes to maintain cell wall synthesis. Cross-species interactions are equally important.
Escherichia coli effector proteins bind to host actin to facilitate invasion. Databases such as HPIDB have integrated such data, supporting the construction of integrated host-pathogen networks (James and Munoz-Munoz, 2022) and promoting machine learning prediction of cross-species interactions.

2.3 Sources of protein interaction data and experimental verification methods

A reliable PPI model depends on high-quality data. Positive samples mainly come from databases (BioGRID, IntAct, STRING) and from experimental evidence reported in the literature. Homology inference is also an important supplement (Li and Ilie, 2017). Negative samples are mostly generated by random pairing or by pairing proteins with different subcellular locations, which is noisy but practical. Predictions still need experimental verification: methods such as yeast two-hybrid, co-immunoprecipitation (Co-IP), surface plasmon resonance (SPR), and isothermal titration calorimetry (ITC) can confirm an interaction at different levels (Zhao et al., 2022). With the development of high-throughput mass spectrometry and protein chips, verification efficiency has continued to improve, which in turn has improved the quality of the data available to prediction models.

3 Principles and Classification of Machine Learning Methods in Protein-Protein Interaction Prediction

3.1 Supervised learning methods

Supervised learning was the earliest machine learning approach used for PPI prediction. It trains a classification model on labeled protein pairs to distinguish "interaction" from "non-interaction". The support vector machine (SVM) is a classic representative. It separates samples in a high-dimensional space and is suitable for small datasets, but it relies on manual feature design (Ding and Kihara, 2018). Random Forest (RF) classifies through the voting of multiple decision trees, can handle high-dimensional features and evaluate feature importance, and its predictive performance is superior to that of SVM. Linear models such as logistic regression are mostly used as baseline references.
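
As a rough illustration of the supervised setup described above, the following sketch trains a logistic regression baseline on hand-designed pair features. It is a minimal example under stated assumptions, not the method of any cited work: the sequences, labels, and function names are hypothetical, and amino-acid composition is used only as a simple stand-in for the manually designed features the text mentions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Fraction of each standard amino acid in a sequence."""
    seq = seq.upper()
    n = max(len(seq), 1)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def pair_features(seq_a, seq_b):
    """Order-independent features for a protein pair:
    element-wise sum and absolute difference of the two compositions."""
    ca, cb = composition(seq_a), composition(seq_b)
    sums = [x + y for x, y in zip(ca, cb)]
    diffs = [abs(x - y) for x, y in zip(ca, cb)]
    return sums + diffs


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Predicted probability that the pair interacts."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical toy data: made-up sequences and made-up labels
# (1 = "interaction", 0 = "non-interaction"), for illustration only.
s1, s2 = "AAAALLLLVVVV", "LLLLAAAAVVVI"
s3, s4 = "KKKKEEEEDDDD", "EEEEKKKKDDDR"
pairs = [(s1, s2), (s3, s4), (s1, s3), (s2, s4)]
y = [1, 1, 0, 0]

X = [pair_features(a, b) for a, b in pairs]
w, b = train_logistic(X, y)

for (seq_a, seq_b), label in zip(pairs, y):
    p = predict_proba(w, b, pair_features(seq_a, seq_b))
    print(f"{seq_a} / {seq_b}  label={label}  p(interaction)={p:.3f}")
```

The pair features are built to be symmetric, so swapping the two proteins does not change the prediction. On real data the same pipeline would need held-out evaluation and far richer features; an SVM or Random Forest would take the place of `train_logistic` without changing the feature step.
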
Traditional methods rely on feature engineering, combining features such as sequence similarity, physicochemical properties, and co-expression to improve accuracy (Zhang et al., 2019), but their performance is limited on complex data, which set the stage for deep learning.

3.2 Unsupervised and semi-supervised learning methods

Unsupervised and semi-supervised methods mine latent structure when the data lack labels (Li and Ilie, 2017). Cluster analysis assumes that functionally related proteins are more likely to interact and can detect modules, but its accuracy depends on the chosen threshold. Network-based link prediction algorithms that evaluate potential connections using common neighbors or random walks (Khemani et al., 2024) have been proven effective in