Computational Molecular Biology 2025, Vol.15, No.5, 218-226
http://bioscipublisher.com/index.php/cmb

5 Construction and Evaluation of Machine Learning Models

5.1 Training data and negative sample construction strategy

Building a high-quality training set is the key to PPI prediction. Positive samples generally come from experimental databases such as BioGRID and IntAct; the difficulty lies in the negative samples. Random pairing is commonly used (Chen et al., 2019), but it is prone to including true interactions that have not yet been discovered. It is therefore recommended to avoid pairing functionally similar proteins, or to exploit differences in subcellular localization. Other strategies select negatives based on functional differences or exclude proteins that share interaction partners, and semi-supervised models can even dispense with explicitly labeled negative samples. To limit class imbalance, the ratio of positive to negative samples is often kept at 1:1 or 1:2, with undersampling or SMOTE used for balancing. Hashemifar et al. (2018) proposed dynamically refreshing the negative samples in the training set. If the negative samples are contaminated with true positives, performance will be underestimated. When data are scarce, cross-species data or transfer learning can compensate.

5.2 Model evaluation metrics

Commonly used metrics for model evaluation include accuracy, precision, recall, F1 and AUC. Accuracy is misleading when the data are imbalanced, so more attention is paid to precision (reducing false positives) and recall (discovering true positives). Drug screening focuses on precision, while network reconstruction emphasizes recall (Zhang et al., 2019). F1 combines the two, and AUC measures the overall discriminatory ability. The PR curve is more reliable when positive samples are scarce. Cross-validation (such as 5-fold or 10-fold) helps guard against overfitting, while validation with protein-level partitioning, in which the test proteins are absent from the training set, is closer to the real scenario of predicting new interactions.
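The points made in Section 5.2 can be illustrated with a small sketch. It is not the evaluation code of any cited study: the toy labels, the function names `basic_metrics` and `protein_level_split`, and the rule that a pair goes to the test set when at least one of its proteins is held out are all assumptions chosen for illustration. The first part shows why accuracy alone is misleading on imbalanced data; the second shows one simple way to partition pairs by protein.

```python
from typing import Dict, List, Sequence, Set, Tuple

Pair = Tuple[str, str]  # hypothetical representation of a protein pair


def basic_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = interacting)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision and recall are reported as 0.0 when their denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def protein_level_split(pairs: Sequence[Pair],
                        held_out: Set[str]) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs so that held-out proteins never appear in training.

    Assumed rule: a pair is a test pair if at least one of its proteins
    is held out; all remaining pairs are training pairs.
    """
    train = [p for p in pairs if p[0] not in held_out and p[1] not in held_out]
    test = [p for p in pairs if p[0] in held_out or p[1] in held_out]
    return train, test


# Toy imbalanced data: 2 interacting pairs among 10 candidates.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_negative = [0] * 10                    # predicts "no interaction" everywhere
mixed = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]     # one hit, one miss, one false alarm

trivial = basic_metrics(y_true, all_negative)
informative = basic_metrics(y_true, mixed)
print("all-negative predictor:", trivial)
print("mixed predictor:       ", informative)

# Toy pairs with made-up protein names; D and E are held out.
pairs = [("A", "B"), ("A", "C"), ("C", "D"), ("D", "E"), ("B", "E")]
train_pairs, test_pairs = protein_level_split(pairs, {"D", "E"})
print("train:", train_pairs)
print("test: ", test_pairs)
```

Both toy predictors reach the same accuracy, yet only precision, recall and F1 reveal that the all-negative predictor finds no interactions at all. In the split, every test pair contains at least one protein the model has never seen, which mimics the prediction of interactions for new proteins more closely than a random split of pairs.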

5.3 Model interpretability and performance optimization methods

Although deep learning is powerful, its interpretability remains a concern. The basis of a prediction can be explained with feature importance, attention weights, SHAP or LIME. Grad-CAM can also highlight key residues (Figure 1) (Jumper et al., 2021). For performance optimization, ensemble learning enhances robustness, while hyperparameter tuning (grid, random or Bayesian search) and regularization (L2, dropout) help prevent overfitting (Jha et al., 2022). Transfer learning can alleviate the scarcity of data for pathogenic bacteria. Active learning improves the model step by step by experimentally verifying high-uncertainty predictions. The ultimate goal is not merely to improve the metrics, but to reveal the interaction patterns of pathogenic bacteria through interpretable, high-performance models, promoting the integration of computation and experiment.

6 Application and Achievements in Predicting Protein-Protein Interactions of Pathogenic Bacteria

6.1 Application in the research of antibiotic resistance mechanisms

Antibiotic resistance has become a global health crisis, and PPI networks provide a system-level perspective for understanding its molecular mechanisms (Maj and Trylska, 2025). In Mycobacterium tuberculosis, predicted networks reveal that DNA repair and stress proteins form drug-resistance modules; in Staphylococcus aureus, β-lactam resistance proteins interact with cell wall enzymes to form a compensatory circuit. Network analysis of this kind shows that drug resistance factors are not isolated phenomena. Interaction prediction can also identify new drug targets. For example, the interaction between Streptococcus pneumoniae MurA and topoisomerase IV is considered a targetable bottleneck node. Furthermore, some resistance mutations confer resistance precisely by altering a protein-protein interaction interface.
Comparing the interaction profiles of mutant and wild-type models can reveal this mechanism. These studies are driving strategies against drug resistance to shift from "inhibiting single targets" to "disrupting interaction networks", an approach that has already shown effectiveness in Acinetobacter baumannii models.

6.2 Role in vaccine target screening and drug discovery

PPI prediction also plays a role in vaccine and drug development. Interaction networks help identify functionally critical and structurally exposed antigens, improving the broad-spectrum efficacy of vaccines (Lian et al., 2019). For instance, Streptococcus pneumoniae PsaA interacts closely with PspC, and the combined immune effect is