
Computational Molecular Biology 2024, Vol.14, No.3, 97-105 http://bioscipublisher.com/index.php/cmb

This includes genomic data, transcriptomic data, epigenomic data, proteomic data, metabolomic data, molecular imaging, molecular pathways, population data, and clinical/medical records (Li and Chen, 2014). The rapid development of high-throughput sequencing (HTS) techniques has significantly contributed to the generation of large-scale biological data, making it possible to profile biological systems in a cost-efficient manner (Greene et al., 2014). The data generated by these techniques are vast and complex, often requiring sophisticated tools and methodologies for effective analysis and interpretation.

2.2 Sources of big data in biology
The primary sources of big data in biology include next-generation sequencing (NGS) technologies, which have revolutionized the field by enabling the generation of massive datasets that can answer long-standing questions about human diseases and biological processes (Mardis, 2016). Additionally, observational networks and space-based data have facilitated the discovery of emergent mechanisms and phenomena on regional and global scales, further contributing to the pool of big biological data (Xia et al., 2020). The Human Genome Project is a notable example: it required extensive resources and collaboration to sequence the human genome, a task that can now be accomplished far more rapidly and cost-effectively thanks to advances in sequencing technologies (Li and Chen, 2014).

2.3 Challenges in handling biological big data
Handling big biological data presents several challenges. One of the primary issues is the complexity and heterogeneity of the data, which require integration from multiple autonomous sources (Wu et al., 2014). The volume, velocity, variety, and veracity of big data (the four V's) necessitate specialized theories and technologies for effective management and analysis (Li and Chen, 2014; Younas, 2019).
Current data mining techniques often fall short of the space and time requirements posed by big data, highlighting the need for more robust and scalable solutions (Kamal et al., 2016). Moreover, the lack of standardized integration processes complicates the task of combining data from various sources into a unified format for analysis (Almasoud et al., 2020). The scientific community must also address issues related to data quality, security, and privacy to fully harness the potential of big data analytics in biological research (Wu et al., 2014; Chen et al., 2020).

3 Methods for Large-Scale Data Processing
3.1 Data storage and management
3.1.1 Distributed databases
Distributed databases play a crucial role in managing large-scale biological data. Technologies such as Apache Hadoop provide distributed, parallelized data processing, which is essential for handling petabyte-scale datasets in genomics and other biological fields (O'Driscoll et al., 2013). These systems enable efficient storage, retrieval, and processing of vast amounts of data by distributing the workload across multiple nodes, enhancing both performance and scalability.

3.1.2 Cloud computing solutions
Cloud computing offers scalable and flexible solutions for storing and processing large biological datasets. Platforms like Sherlock leverage cloud technologies to provide a comprehensive data management system that supports data storage, conversion, querying, and sharing (Figure 1) (Bohár et al., 2022). Cloud-based solutions facilitate the handling of complex and large datasets by offering tools for distributed analytical queries and optimized storage formats, such as the Optimized Row Columnar (ORC) format, which improves data processing efficiency.

3.1.3 Data security and privacy
As biological data often contain sensitive information, ensuring data security and privacy is paramount.
The HACE theorem and associated data-driven models emphasize the importance of incorporating security and privacy considerations into big data processing frameworks (Wu et al., 2014). These models advocate for robust security measures to protect data integrity and confidentiality while enabling efficient data mining and analysis.
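One common practical measure behind these recommendations is de-identifying records before they enter an analysis pipeline. The sketch below is a minimal illustration (not part of the HACE framework itself, and the field names and salt value are hypothetical): it replaces sample identifiers with a salted SHA-256 pseudonym, so records remain linkable across datasets without exposing the raw IDs.

```python
import hashlib

def pseudonymize(sample_id: str, salt: str) -> str:
    """Derive a stable pseudonym from a sample ID via salted SHA-256.

    The salt must be kept secret: anyone holding it could re-identify
    records by hashing candidate IDs and comparing.
    """
    digest = hashlib.sha256((salt + sample_id).encode("utf-8")).hexdigest()
    return digest[:16]  # a 64-bit hex prefix is enough to link records

# Hypothetical records: drop the direct identifier, keep a linkable pseudonym.
SALT = "replace-with-a-secret-salt"
records = [
    {"sample_id": "PATIENT-0001", "expression": 4.2},
    {"sample_id": "PATIENT-0002", "expression": 7.9},
]
deidentified = [
    {"pseudo_id": pseudonymize(r["sample_id"], SALT),
     "expression": r["expression"]}
    for r in records
]
```

Because the same (ID, salt) pair always yields the same pseudonym, downstream joins across datasets still work, while the released table carries no raw identifiers.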
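Returning to the distributed processing described in section 3.1.1: the essence of the Hadoop model is the map-shuffle-reduce pattern. The following is a single-process Python sketch of that pattern (illustrative only; a real Hadoop or Spark job would scatter the partitions across cluster nodes), counting k-mers over partitions of sequencing reads. The read data and partitioning are invented for the example.

```python
from collections import Counter

def map_kmers(read, k=3):
    """Map step: emit (k-mer, 1) pairs for a single read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def reduce_counts(pairs):
    """Reduce step: sum counts per k-mer.

    In Hadoop, the shuffle phase would first group pairs by key and
    route each key to one reducer; Counter collapses both steps here.
    """
    totals = Counter()
    for kmer, count in pairs:
        totals[kmer] += count
    return totals

# Hypothetical input partitions, as a cluster would split the read set.
partitions = [["ACGTAC", "GTACGT"], ["ACGACG"]]

# Map each read independently (the parallelizable part), then reduce.
mapped = [pair for part in partitions for read in part
          for pair in map_kmers(read)]
kmer_counts = reduce_counts(mapped)
```

The map calls are independent of one another, which is exactly what lets a distributed runtime execute them on separate nodes and merge only the small per-key totals.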
