
Computational Molecular Biology 2024, Vol.14, No.4, 155-162 http://bioscipublisher.com/index.php/cmb

management. Traditional storage systems are often inadequate to handle the petabytes (PB) of data generated by high-throughput sequencing instruments. For instance, the storage of genomic data in plain text files can quickly become unmanageable due to the sheer size of the datasets, which can reach gigabytes (GB) per genome (Wong, 2018). Additionally, the need for secure storage solutions is paramount, as genomic data often contains sensitive information. Secure storage mechanisms, such as encrypted databases, have been proposed to address these concerns, offering both scalability and data protection (Wong et al., 2018).

2.1.2 Cloud computing and genomic databases

Cloud computing has emerged as a viable solution to the storage and management challenges posed by big data in genomics. Cloud platforms offer scalable storage solutions that can flexibly expand to accommodate growing datasets. For example, cloud-based genomic databases can store and manage petabytes of data, freeing researchers from the burden of maintaining physical storage infrastructure (Yang, 2019). Moreover, cloud computing facilitates the use of distributed and parallelized data processing frameworks, such as Apache Hadoop, which can efficiently handle large-scale genomic data analysis (Yeo and Crawford, 2015). These cloud-based solutions not only provide the necessary computational power but also offer cost-effective and secure storage options, making them an attractive choice for the genomics community (Tariq et al., 2020).

2.2 Data integration across platforms

One of the significant challenges in genomics is the integration of data across various platforms and technologies. Genomic data is often generated from multiple sources, including NGS, third-generation sequencing (TGS), and proteomics (Ellegren, 2014).
The diversity of data types and formats complicates the integration process, making it difficult to perform comprehensive analyses. Effective data integration requires robust bioinformatics tools and platforms that can harmonize disparate datasets. For instance, proteogenomics, which combines proteomics and genomics data, faces scalability issues due to the large size of the integrated datasets. High-performance computing (HPC) solutions have been proposed to address these bottlenecks, ensuring that integrated analyses can be performed efficiently (Godhandaraman et al., 2017). Additionally, big data analytics platforms are being developed to facilitate the seamless integration and analysis of diverse genomic datasets, enabling more comprehensive and accurate genomic research (He et al., 2017).

2.3 Scalability issues in genomic research

Scalability is a critical issue in genomic research, particularly as the volume of data continues to grow exponentially. Traditional bioinformatics tools and computing infrastructures often struggle to keep pace with the increasing data demands. To address these challenges, various computational strategies have been explored, including the use of parallel distributed computing and specialized hardware (Shi and Wang, 2019). For example, the MapReduce framework, implemented on platforms like Apache Hadoop, has shown promise in scaling genomic analysis workflows, such as short read sequence alignment and assembly (Yeo and Crawford, 2015). These frameworks enable the efficient processing of large datasets by distributing the computational load across multiple nodes, thereby enhancing scalability and performance. However, the development and optimization of these scalable solutions require ongoing research and innovation to keep up with the ever-growing data landscape in genomics (Godhandaraman et al., 2017; Xu, 2020).
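
The map, shuffle, and reduce pattern that underlies MapReduce can be illustrated with a minimal sketch. The example below is not taken from the cited studies and does not use the Hadoop API. It assumes a toy task, counting k-mers across short reads, and uses hypothetical function names (`map_kmers`, `shuffle`, `reduce_counts`, `count_kmers`). A local thread pool stands in for the cluster nodes that would run the map step in a real deployment.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_kmers(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(mapped_outputs):
    """Shuffle step: group the emitted values by k-mer across all map outputs."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for kmer, count in pairs:
            groups[kmer].append(count)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each k-mer."""
    return {kmer: sum(counts) for kmer, counts in groups.items()}


def count_kmers(reads, k=3, workers=4):
    """Run the three steps; the thread pool is an assumed stand-in for nodes."""
    # Each read is mapped independently of the others, which is what
    # allows the map step to be spread across many workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(lambda read: map_kmers(read, k), reads))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGT", "ACGT"]  # illustrative reads, not real data
    print(count_kmers(reads, k=3))
```

Because no map call depends on another, adding workers increases throughput for the map step without changing the result. Only the shuffle step needs to bring values with the same key together, which is the part a framework such as Hadoop handles across machines.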
3 High-Performance Computing in Genomics

High-performance computing (HPC) has become indispensable in genomics due to the massive data generated by next-generation sequencing (NGS) technologies. The integration of HPC with genomics enables the efficient processing, analysis, and interpretation of large-scale genomic data, which is crucial for advancements in personalized medicine, evolutionary biology, and other fields.

3.1 Parallel computing for genomic data processing

3.1.1 Distributed algorithms for big data analysis

Distributed algorithms are essential for managing and analyzing the vast amounts of data generated in genomics. These algorithms leverage multiple computing nodes to perform tasks concurrently, enhancing both speed and efficiency. For instance, the use of Message Passing Interface (MPI) in tools like QUARTIC allows for the
