Computational Molecular Biology 2024, Vol.14, No.3, 97-105
http://bioscipublisher.com/index.php/cmb

Review Article Open Access

Big Data Analytics in Biology: A Systematic Review of Methods for Large-Scale Data Processing

Weipan Wang, Bing Zhang, Manman Li
Hainan Institute of Biotechnology, Haikou, 570206, Hainan, China
Corresponding author: manman li@hibio.org

doi: 10.5376/cmb.2024.14.0012
Received: 29 Mar., 2024
Accepted: 22 May, 2024
Published: 02 Jun., 2024

Copyright © 2024 Wang et al., This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Preferred citation for this article:
Wang W.P., Zhang B., and Li M.M., 2024, Big data analytics in biology: a systematic review of methods for large-scale data processing, Computational Molecular Biology, 14(3): 97-105 (doi: 10.5376/cmb.2024.14.0012)

Abstract
This study explores various methods and tools developed for large-scale data processing in biological research. We studied comprehensive toolkits such as TBtools, which provide user-friendly interfaces for complex data analysis, as well as distributed computing frameworks such as MapReduce, which solve the problem of imbalance in large DNA datasets. In addition, we discussed the challenges posed by the heterogeneity and complexity of big biological data, emphasizing the need for powerful and scalable analytical frameworks, such as bigSCale for single-cell RNA sequencing, in order to gain a comprehensive understanding of the current status and future directions of big data analysis in the field of biology.

Keywords
Big data analytics; Bioinformatics; High-throughput sequencing; Machine learning; Distributed computing

1 Introduction
The advent of high-throughput technologies has revolutionized the field of biology, ushering in the era of "big data."
This transformation is characterized by the generation of vast amounts of data across various biological domains, including genomics, transcriptomics, proteomics, and metabolomics (Davis-Turak et al., 2017). The Human Genome Project, for instance, exemplifies the scale of data generation, having taken 13 years and over $3 billion to sequence the human genome, a task that can now be accomplished in a few days for a fraction of the cost (Li and Chen, 2014; Goh and Wong, 2020). The rapid accumulation of biological data has necessitated the development of sophisticated tools and techniques to manage, analyze, and interpret these large datasets (Greene et al., 2014; Chen et al., 2020).

The ability to process and analyze large-scale biological data is crucial for advancing our understanding of complex biological systems and translating this knowledge into practical applications. High-dimensional data spaces, such as those generated by genomic and proteomic technologies, present unique challenges in terms of data integration, analysis, and interpretation (Clarke et al., 2008). Effective data processing methods enable researchers to uncover hidden biological regularities, understand cellular processes, and develop predictive models for disease diagnosis and treatment (Ebrahim et al., 2016; Gutierrez et al., 2018). Moreover, the integration of multi-omic data provides a comprehensive view of biological systems, facilitating the discovery of novel insights that would be unattainable through single-omic approaches (Tariq et al., 2020; Juan and Huang, 2023).

This study provides a comprehensive overview of the methods and tools currently used for large-scale data processing in biology. By studying the challenges and opportunities related to big data in life sciences, we emphasize the advancements in data integration, quantitative analysis, and computing technologies that drive the field forward.
In addition, this study discusses the impact of these methods on future research and their potential applications in clinical and translational medicine, identifies gaps in current methods, and proposes directions for future research to improve the scalability and efficiency of biological big data analysis.

2 Overview of Big Data in Biological Research
2.1 Types of biological data
In the "Omics" era of life sciences, biological data are diverse and span multiple levels of biological systems.