Computational Molecular Biology 2025, Vol.15, No.5, 218-226 http://bioscipublisher.com/index.php/cmb

species such as Mycobacterium tuberculosis. The autoencoder learns latent features through compression and reconstruction (Gonzalez-Lopez et al., 2018), and the variational graph autoencoder (VGAE) can perform unsupervised link prediction directly. Semi-supervised models such as GCNs can propagate label information from a small number of labeled samples. Although these methods are less accurate than deeply supervised models, they are particularly valuable in small-sample scenarios.

3.3 Innovative applications of deep learning and graph neural networks in PPI prediction
Deep learning and graph neural networks (GNNs) have become new directions for PPI prediction. Sequence models such as PIPR (an RCNN architecture) and combined LSTM-CNN models significantly improve prediction performance. Pre-trained protein language models (ProtBERT, ESM) further enhance sequence representation (Charih et al., 2025), with F1 scores generally exceeding 0.8. The growing availability of structural information, driven by AlphaFold2, has made structure-based prediction feasible. GNN models such as GraphSAGE and GAT can learn topological features directly on interaction networks and predict missing edges. They can integrate sequence embeddings and network structure simultaneously, and offer stronger generalization and interpretability (Khemani et al., 2024). In the future, the integration of heterogeneous graphs and graph generative models should further improve the accuracy and systematic coverage of pathogen interaction prediction.

4 Data Preprocessing and Feature Engineering
4.1 Sequence feature extraction
Protein sequences are the core information for PPI prediction, but extracting useful features from them is not trivial. The earliest methods computed amino acid composition and dipeptide or tripeptide frequencies, but discarded sequence-order information.
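The composition-based encodings just described can be sketched in a few lines of Python. This is a minimal illustration of the idea, not code from any published tool; the function names are our own:

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """20-dimensional vector: fraction of each amino acid in the sequence."""
    counts = Counter(seq)
    n = len(seq)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

def dipeptide_frequencies(seq):
    """400-dimensional vector: frequency of each ordered amino-acid pair.

    Note how sequence order beyond adjacent pairs is lost, which is the
    limitation the conjoint triad and embedding methods try to address.
    """
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return [pairs.get(a + b, 0) / total
            for a, b in product(AMINO_ACIDS, repeat=2)]

seq = "MKTAYIAKQR"
print(len(aa_composition(seq)))       # 20
print(len(dipeptide_frequencies(seq)))  # 400
```

Concatenating the two vectors gives a simple fixed-length representation regardless of protein length, which is what makes these encodings convenient for classical machine-learning models.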
The later conjoint triad method grouped amino acids by physicochemical properties and retained local sequence context. Physicochemical properties such as hydrophobicity, charge, polarity, and isoelectric point are also often used to distinguish protein types (Ding and Kihara, 2018). Evolutionary information further enhances predictive power: interacting proteins often co-evolve, which can be measured by conservation scores or phylogenetic profile similarity. For encoding, one-hot vectors or learned embeddings such as ProtVec and ProtBERT are commonly used (Charih et al., 2025). Multi-feature fusion (sequence + structure + conservation) usually outperforms any single feature, but the sequence features of different species need to be standardized before modeling.

4.2 Structural and functional characteristics (protein folding, domains, GO annotations)
Structural and functional features reveal the interaction mechanism. Domain pairing is key to interaction, for example SH3 domains with polyproline motifs (Kotlyar et al., 2019). In machine learning, domain co-occurrence can be encoded as binary features. Homology modeling or molecular docking can provide structural features such as interface energy and interface area, and AlphaFold2 has greatly expanded the structural data available for pathogenic bacteria. Functional annotations (GO) reflect biological relationships, and proteins with similar GO semantics are more likely to interact. Combining subcellular localization and pathway information can improve prediction accuracy, but functional similarity does not imply physical interaction. Models integrating sequence, domain, and GO features performed best in pathogenic bacteria (Sun et al., 2017), but feature redundancy needs to be controlled.

4.3 Data standardization and feature selection techniques (PCA, feature embedding, feature importance analysis)
Data standardization and feature selection are key to modeling. Different features vary greatly in scale and require normalization or logarithmic transformation.
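The normalization step just mentioned can be sketched as follows, using z-scoring plus a log transform on an invented feature matrix (the values and column meanings are purely illustrative):

```python
import numpy as np

# Hypothetical feature matrix: rows = protein pairs, columns = features
# on very different scales (e.g. hydrophobicity, interface area, a count).
X = np.array([[0.12, 1500.0,  3.0],
              [0.45,  200.0, 17.0],
              [0.30,  950.0,  8.0]])

# Log-transform the heavy-tailed, count-like column (column 2);
# log1p keeps zero counts well-defined.
X[:, 2] = np.log1p(X[:, 2])

# Z-score normalization: zero mean and unit variance per feature,
# so no single large-scale feature dominates the model.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately zero for every column
```

In practice the transformation parameters (means, standard deviations) must be computed on the training set only and then applied unchanged to validation and test data, to avoid information leakage.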
PCA can reduce dimensionality and denoise, and embedding vectors can represent categorical features. Feature selection can use L1 regularization, feature-importance ranking, or recursive feature elimination to retain the key features. Selecting features in combination with biological knowledge is more effective; for example, membrane proteins should retain their hydrophobicity features. Missing values can be imputed with the mean or flagged explicitly to avoid bias. Overall, in pathogen PPI prediction, sound feature engineering and preprocessing often determine success or failure more than model complexity (Ding and Kihara, 2018).
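The PCA and L1-based selection techniques above can be sketched with scikit-learn on synthetic data. This assumes a standard scikit-learn setup; the data, labels, and parameter choices are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))             # synthetic feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # labels driven by two features

# PCA: keep enough components to explain 90% of the variance.
X_pca = PCA(n_components=0.9).fit_transform(X)

# L1-regularized logistic regression: the sparsity penalty drives most
# coefficients to exactly zero, implicitly selecting features.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_[0])
print(selected)  # indices of the features with nonzero weight
```

Note the two approaches differ in kind: PCA produces new composite axes that are hard to interpret biologically, whereas L1 selection keeps a subset of the original features, which is often preferable when the goal is to identify biologically meaningful predictors.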