
Computational Molecular Biology 2024, Vol.14, No.3, 97-105
http://bioscipublisher.com/index.php/cmb

Figure 1 Overview of how the query engine and the Data Lake work together (Adopted from Bohár et al., 2022)

3.2 Data integration and interoperability
Integrating heterogeneous biological data from multiple sources is a significant challenge due to the diversity in data types and formats. Recent methods, such as non-negative matrix factorization-based approaches, have shown promise in effectively integrating various types of networked biological data, providing more holistic insights into biological systems (Gligorijević and Przulj, 2015). Additionally, frameworks that utilize domain ontology and distributed processing have been proposed to achieve seamless data integration, ensuring logical consistency and facilitating further research and analysis (Almasoud et al., 2020).

3.3 Data cleaning and preprocessing
Data cleaning and preprocessing are critical steps in preparing large-scale biological data for analysis. Tools like TBtools offer user-friendly interfaces and a wide range of functions for bulk sequence processing and interactive data visualization, making it easier for biologists to handle big data without extensive programming knowledge (Chen et al., 2020). Moreover, methodologies such as the MapReduce-based k-nearest neighbor (K-NN) classification approach have been developed to reduce data imbalance and enhance the efficiency of data classification and storage management (Kamal et al., 2016).

4 Analytical Techniques for Big Data
4.1 Machine learning algorithms
4.1.1 Supervised learning
Supervised learning algorithms are a cornerstone of big data analytics in biology, where labeled datasets are used to train models to make predictions or classify data. Common supervised learning techniques include linear regression, logistic regression, support vector machines (SVM), and random forests.
These methods have been effectively applied to various biological datasets, such as protein-coding data for disease identification and treatment (Rahman, 2019). The use of supervised learning in bioinformatics allows for the development of predictive models that can provide insights into complex biological processes and disease mechanisms (Greene et al., 2014).

4.1.2 Unsupervised learning
Unsupervised learning algorithms are essential for analyzing large-scale biological data where labels are not available. Techniques such as clustering, principal component analysis (PCA), and hierarchical clustering help in identifying patterns and structures within the data. These methods are particularly useful in the initial stages of data exploration and for discovering hidden relationships in biological networks (Greene et al., 2014). Unsupervised learning has been applied to various biological datasets to uncover novel insights and generate hypotheses for further investigation (Jan et al., 2017).

4.1.3 Deep learning approaches
Deep learning, a subset of machine learning, has gained significant traction in the field of big data analytics due to its ability to handle large, complex, and heterogeneous datasets. Deep learning models, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and autoencoders, have been successfully applied to biological data for tasks such as image classification, sequence analysis, and network prediction (Najafabadi et al., 2015; Tonidandel et al., 2018; Jin et al., 2020). These models can extract high-level features from raw data,
