Computational Molecular Biology 2024, Vol.14, No.4, 155-162
http://bioscipublisher.com/index.php/cmb

Review Article Open Access

Big Data in Genomics: Overcoming Challenges Through High-Performance Computing

Liting Wang, Haimei Wang
Hainan Institute of Biotechnology, Haikou, 570206, Hainan, China
Corresponding author: liting.wang@hitar.org
Computational Molecular Biology, 2024, Vol.14, No.4 doi: 10.5376/cmb.2024.14.0018
Received: 01 Jun., 2024 Accepted: 15 Jul., 2024 Published: 31 Jul., 2024
Copyright © 2024 Wang and Wang. This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Preferred citation for this article: Wang L.T., and Wang H.M., 2024, Big data in genomics: overcoming challenges through high-performance computing, Computational Molecular Biology, 14(4): 155-162 (doi: 10.5376/cmb.2024.14.0018)

Abstract The rapid development of genomics has created enormous challenges in the storage, management, and processing of massive genomic datasets. High-performance computing (HPC) offers a way to address key problems in genomic big data analysis. By reviewing applications of HPC in data storage, parallel computing, sequence alignment, and assembly, this study explores how HPC can overcome bottlenecks in data analysis. It focuses on the application prospects of HPC in personalized medicine, evolutionary genomics, and population genetics, and considers the future potential of combining quantum computing and artificial intelligence with HPC. Suggestions for further optimizing the application of HPC in genomics are also proposed.
Keywords Genomic big data; High-performance computing; Personalized medicine; Evolutionary genomics; Quantum computing

1 Introduction
The advent of next-generation sequencing (NGS) technologies has revolutionized the field of genomics, leading to an unprecedented explosion of genomic data. These high-throughput technologies can generate billions of short DNA or RNA fragments, resulting in datasets that can exceed several terabytes in a single run. The decreasing cost of sequencing, now around $1,000 per genome, has made large-scale genomic projects feasible, further contributing to the data deluge (Schmidt and Hildebrandt, 2017). This massive influx of data presents significant challenges in terms of storage, management, and analysis (Wong, 2018; Xu, 2020). The complexity and volume of genomic data necessitate the development of sophisticated computational tools and algorithms to extract meaningful insights (Ward et al., 2013).

High-Performance Computing (HPC) has emerged as a critical enabler in addressing the challenges posed by big data in genomics. HPC systems provide the computational power required to process and analyze large-scale genomic datasets efficiently. The integration of HPC with big data technologies, such as Apache Hadoop and cloud computing, allows for distributed and parallelized data processing, making it possible to handle petabyte-scale datasets (O'Driscoll et al., 2013). Moreover, HPC facilitates the development of advanced predictive analytics and deep learning models, which are essential for tasks such as gene prediction and the identification of genomic variants (Koumakis, 2020; Leung et al., 2020). The use of HPC in genomics not only accelerates data analysis but also enhances the accuracy and reliability of the results (Leung et al., 2020).
This review examines the current status of big data in genomics and the crucial role of high-performance computing in overcoming the associated challenges. We explore the computational methods and tools developed for managing and analyzing large genomic datasets, with a focus on their successes and remaining challenges. We also discuss future directions and potential progress in integrating HPC with genomics, emphasizing the importance of collaborative approaches and improved computing infrastructure. The aim is to identify the transformative impact of HPC on genomics research and its potential to drive future discoveries in personalized medicine and related fields.

2 The Challenges of Big Data in Genomics
2.1 Data storage and management
2.1.1 Current storage technologies
The rapid advancement in next-generation sequencing (NGS) technologies has led to an unprecedented increase in the volume of genomic data. This explosion of data presents significant challenges in terms of storage and