International Journal of Molecular Medical Science, 2024, Vol.14, No.5, 274-292
http://medscipublisher.com/index.php/ijmms

4 Integrative Analysis Techniques

4.1 Data preprocessing and quality control

Data preprocessing and quality control are critical steps in multi-omics data integration because they determine the reliability and accuracy of downstream analyses. High-throughput sequencing technologies generate vast amounts of data, which often contain noise and biases. Effective preprocessing involves normalization, imputation of missing values, and batch effect correction. For instance, a denoising autoencoder has been employed to handle data noise and extract robust features from multi-omics data, improving the accuracy of cancer prognosis prediction (Chai et al., 2021).

4.2 Statistical and computational methods for data integration

4.2.1 Correlation-based approaches

Correlation-based approaches are fundamental to identifying relationships between different omics layers. These methods typically calculate correlation coefficients to determine the strength and direction of associations between variables. For example, Zhao et al. (2020) used network analysis to identify genes with broad correlations across various cancers, demonstrating the utility of correlation-based methods in multi-omics data integration. Xu et al. (2019) proposed High-Order Pathway Elucidation Similarity (HOPES), a method that identifies cancer subtypes by querying multi-omics data simultaneously. They applied the method to gene expression (GE), DNA methylation (DM), and mutation (ME) data across five TCGA cancers and further validated its reliability and clinical significance (Xu et al., 2019).

4.2.2 Machine learning techniques

Machine learning techniques have become indispensable in multi-omics data integration due to their ability to handle high-dimensional data and uncover complex patterns.
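
As a rough illustration of how preprocessing feeds into a simple machine learning step, the sketch below mean-imputes and z-scores two synthetic omics blocks, concatenates them, and clusters the samples with a basic k-means. Everything in it is an assumption made for illustration: the data are simulated, the names `impute_and_scale` and `kmeans` are hypothetical helpers, and mean imputation plus k-means stand in for the more sophisticated methods used in the studies cited in this section (for example denoising autoencoders). It is not the procedure of any cited work.

```python
import numpy as np


def impute_and_scale(block):
    """Mean-impute missing values per feature, then z-score each feature."""
    block = np.asarray(block, dtype=float)
    col_mean = np.nanmean(block, axis=0)
    filled = np.where(np.isnan(block), col_mean, block)
    std = filled.std(axis=0)
    std[std == 0] = 1.0  # constant features stay at zero instead of dividing by 0
    return (filled - filled.mean(axis=0)) / std


def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        # Distance of every sample to its nearest already-chosen center.
        nearest = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(nearest)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels


# Synthetic stand-ins for two omics layers measured on the same 20 samples.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 10)  # hypothetical underlying subtype of each sample
expression = rng.normal(size=(20, 20)) + 3.0 * group[:, None]
methylation = rng.normal(size=(20, 15)) + 3.0 * group[:, None]
expression[0, 3] = np.nan    # simulate missing measurements
methylation[12, 5] = np.nan

# Early integration: scale each block separately, then concatenate features.
X = np.hstack([impute_and_scale(expression), impute_and_scale(methylation)])
labels = kmeans(X, k=2)
print(labels)
```

Scaling each block before concatenation keeps one omics layer from dominating the distance calculation simply because it has larger raw values. A real analysis would also need batch effect correction and a more careful choice of imputation and clustering method.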
Various machine learning algorithms, such as clustering, classification, and regression, are employed to integrate and analyze multi-omics data. Reel et al. reviewed machine learning methods for their effectiveness in integrating omics data to discover new biomarkers and improve disease prediction and patient stratification (Reel et al., 2021). Additionally, Chai et al. highlighted the use of deep learning techniques, such as autoencoders, to integrate multi-omics data for accurate cancer prognosis prediction (Chai et al., 2021). Wang et al. used Similarity Network Fusion (SNF) to combine mRNA expression, DNA methylation, and microRNA (miRNA) expression data from five cancer datasets. SNF significantly outperformed both single data type analyses and established integrative methods in identifying cancer subtypes and predicting survival (Wang et al., 2014). Chu et al. employed 10 clustering algorithms to synthesize multi-omics data from patients with muscle-invasive urothelial carcinoma (MUC), and combined this with 10 machine learning algorithms to identify high-resolution molecular subgroups and develop a robust consensus machine learning-driven signature (CMLS) with strong prognostic prediction capabilities (Chu et al., 2023).

4.2.3 Network-based integration

Network-based integration methods leverage the interconnected nature of biological systems to integrate multi-omics data. These approaches construct networks in which nodes represent biomolecules and edges represent interactions or correlations. Such networks can provide a more comprehensive understanding of specific diseases and their biological processes, which helps in identifying cancer subtypes and prognostic biomarkers. For instance, Wang et al. proposed a multiplex network-based approach to integrate heterogeneous omics data, achieving high performance in identifying cancer subtypes (Wang et al., 2016). Similarly, Zhao et al.
used network analysis to identify prognostic biomarkers with broad correlations across different cancers (Zhao et al., 2020).

4.2.4 Multi-omics data fusion

Multi-omics data fusion involves combining different types of omics data to create a comprehensive view of the biological system. This approach can enhance the identification of biomarkers and improve disease prognosis. Kim et al. (2013) conducted a series of studies using the TCGA dataset to identify interactions between multi-omics data and link these interactions to cancer clinical outcomes (Kim et al., 2013; Kim et al., 2014;