CMB_2024v14n3

Computational Molecular Biology 2024, Vol.14, No.3
http://bioscipublisher.com/index.php/cmb
© 2024 BioSci Publisher, registered at the publishing platform that is operated by Sophia Publishing Group, founded in British Columbia of Canada. All Rights Reserved.
BioSci Publisher is an international Open Access publishing platform that publishes scientific journals in the field of bioscience.
Publisher: BioSci Publisher
Edited by: Editorial Team of Computational Molecular Biology
Email: edit@cmb.bioscipublisher.com
Website: http://bioscipublisher.com/index.php/cmb
Address: 11388 Stevenston Hwy, PO Box 96016, Richmond, V7A 5J5, British Columbia Canada
Computational Molecular Biology (ISSN 1927-5587) is an open access, peer-reviewed journal published online by BioSci Publisher. The journal publishes the latest and outstanding research articles, letters, methods, and reviews in all areas of computational molecular biology, covering new discoveries in molecular biology, from genes to genomes, using statistical, mathematical, and computational methods, as well as new developments in computational methods and databases in molecular and genome biology. The papers published in the journal are expected to be of interest to computational scientists, biologists, and teachers, students, and researchers engaged in biology.
All the articles published in Computational Molecular Biology are Open Access and are distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. BioSci Publisher uses the CrossCheck service to identify academic plagiarism through the world’s leading plagiarism prevention tool, iParadigms, and to protect the original authors’ copyrights.

Computational Molecular Biology (online), 2024, Vol. 14
ISSN 1927-6648
http://hortherbpublisher.com/index.php/cmb
© 2024 BioSci Publisher, registered at the publishing platform that is operated by Sophia Publishing Group, founded in British Columbia of Canada. All Rights Reserved.

Latest Content
2024, Vol. 14, No.3

【Editorial】
Embracing AI's Role in Structural Biology While Recognizing Its Limits, 95-96
DOI: 10.5376/cmb.2024.14.0011

【Review Article】
Big Data Analytics in Biology: A Systematic Review of Methods for Large-Scale Data Processing, 97-105
Weipan Wang, Bing Zhang, Manman Li
DOI: 10.5376/cmb.2024.14.0012

【Research Insight】
AI in Biology: Transforming Genomic Research with Machine Learning, 106-114
Qiang Zhang, Yu Wang
DOI: 10.5376/cmb.2024.14.0013

【Feature Review】
The Role of Computational Chemistry in Modern Drug Discovery: Challenges and Prospects, 115-124
Xiaohua Zhang, Jianhui Li
DOI: 10.5376/cmb.2024.14.0014

【Research and Progress】
Advances in Biomechanics: Exploring Biophysical Models in Cellular Mechanics, 125-133
Xicheng Yang, Jie Gao
DOI: 10.5376/cmb.2024.14.0015

Computational Molecular Biology 2024 Vol.14, No.3, 95-96
http://bioscipublisher.com/index.php/cmb
Editorial Open Access
Embracing AI's Role in Structural Biology While Recognizing Its Limits
Computational Molecular Biology, 2024, Vol.14, No.3 doi: 10.5376/cmb.2024.14.0011
This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Preferred citation for this article: 2024, Embracing AI's role in structural biology while recognizing its limits, Computational Molecular Biology, 14(3): 95-96 (doi: 10.5376/cmb.2024.14.0011)

The realm of structural biology has witnessed an unprecedented transformation with the advent of artificial intelligence (AI) technologies, most notably through the work of AlphaFold. On May 8, 2024, a pivotal paper by Abramson, Adler, Dunger, et al., published in Nature, demonstrated the impressive capabilities of AlphaFold 3 in accurately predicting biomolecular interactions (Abramson et al., 2024). This breakthrough represents a significant leap in our ability to decipher the intricate architecture of proteins, a fundamental building block of life. Yet, as we celebrate this achievement, it is imperative to consider the diverse perspectives within the scientific community regarding the role and limitations of AI in this field.

Dr. Shi Yigong, President of Westlake University and renowned structural biologist, has been a vocal advocate for the transformative potential of AI in structural biology. He highlights the staggering progress AI has enabled, noting that within just a few years, AI has predicted the structures of 600 to 700 million proteins, vastly expanding our structural database from 220,000 experimentally determined structures to an almost unfathomable scale.
Shi asserts, “AI’s rapid advancements have fundamentally altered our understanding of protein structures, offering a database that is several orders of magnitude larger than what we had before. This scale of change inevitably influences our comprehension of life sciences, drug discovery, and disease treatment” (Credit: Tai Media AGI, Video ID: sphMGEP2FvbOKcq).

Shi's enthusiasm stems from the ability of AI to expedite processes that previously required extensive time and resources. He recounts how tasks that once took his team of doctoral students years to complete can now be accomplished in mere weeks with AI assistance. This acceleration allows researchers to shift their focus from the painstaking process of data collection to the more intellectually stimulating tasks of hypothesis testing and functional exploration.

In contrast, Yan Ning, a distinguished structural biologist and former professor at Princeton University, now serving as President of Shenzhen Medical Academy of Research and Translation, adopts a more cautious stance. While acknowledging AI’s impressive achievements, Yan emphasizes the current limitations of AI, particularly its inability to fully replicate the dynamic nature of biomolecular structures. She underscores the importance of experimental validation and the discovery of multiple conformations of proteins, which are crucial for understanding their functional mechanisms and identifying new drug targets.

Yan remarks, “AI can predict a static structure, but the true beauty and complexity of proteins lie in their dynamic states. To truly understand a protein’s function, we must observe it in various conformations, something AI currently struggles with” (Credit: Tai Media AGI, Video ID: sphMGEP2FvbOKcq). Yan’s reservations are grounded in her extensive experience and the unpredictable nature of experimental science.
She cites instances where experimental techniques have revealed unexpected conformations and interactions that AI predictions missed, highlighting the indispensable role of empirical research. Her insights remind us that while AI provides powerful tools, it cannot yet replace the nuanced understanding gained through hands-on experimentation.

The discourse between these two leading scientists encapsulates the dual nature of AI's impact on structural biology. On one hand, AI represents a monumental leap forward, dramatically enhancing our data acquisition capabilities and opening new avenues for scientific inquiry. On the other, it underscores the continued necessity of traditional experimental approaches to validate and expand upon AI-generated predictions.

As we move forward, the integration of AI in structural biology should be approached with a balanced perspective. Embracing AI’s capabilities while remaining vigilant about its limitations will be key to advancing our understanding of biomolecular structures and their functions. By fostering a collaborative environment where AI and experimental techniques complement each other, we can unlock new levels of scientific discovery and innovation.

In conclusion, the contributions of AI, as showcased by AlphaFold 3, are undeniably transformative. However, the insights from both Professor Shi Yigong and Yan Ning remind us that the path to comprehensive biological understanding is multifaceted, requiring both computational prowess and experimental diligence. Together, these approaches hold the promise of a future where the mysteries of life at the molecular level are unraveled with unprecedented clarity and precision.

To unlock the full potential of scientific discovery, we must leverage the power of AI while maintaining the rigor of experimental validation.
References
Abramson J., Adler J., Dunger J., Evans R., Green T., Pritzel A., Ronneberger O., Willmore L., Ballard A.J., Bambrick J., Bodenstein S.W., Evans D.A., Chia C.H., O’Neill M., Reiman D., Tunyasuvunakool K., Wu Z., Žemgulytė A., Arvaniti E., Beattie C., Bertolli O., Bridgland A., Cherepanov A., Congreve M., Cowen-Rivers A.I., Cowie A., Figurnov M., Fuchs F.B., Gladman H., Jain R., Khan Y.A., Low C.M.R., Perlin K., Potapenko A., Savy P., Singh S., Stecula A., Thillaisundaram A., Tong C., Yakneen S., Zhong E.D., Zielinski M., Žídek A., Bapst V., Kohli P., Jaderberg M., Hassabis D., and Jumper J.M., 2024, Accurate structure prediction of biomolecular interactions with AlphaFold 3, Nature, 1-3. https://doi.org/10.1038/s41586-024-07487-w

Computational Molecular Biology 2024, Vol.14, No.3, 97-105
http://bioscipublisher.com/index.php/cmb
Review Article Open Access
Big Data Analytics in Biology: A Systematic Review of Methods for Large-Scale Data Processing
Weipan Wang, Bing Zhang, Manman Li
Hainan Institute of Biotechnology, Haikou, 570206, Hainan, China
Corresponding author: manman li@hibio.org
Computational Molecular Biology, 2024, Vol.14, No.3 doi: 10.5376/cmb.2024.14.0012
Received: 29 Mar., 2024 Accepted: 22 May, 2024 Published: 02 Jun., 2024
Copyright © 2024 Wang et al., This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Preferred citation for this article: Wang W.P., Zhang B., and Li M.M., 2024, Big data analytics in biology: a systematic review of methods for large-scale data processing, Computational Molecular Biology, 14(3): 97-105 (doi: 10.5376/cmb.2024.14.0012)

Abstract This study explores various methods and tools developed for large-scale data processing in biological research. We studied comprehensive toolkits such as TBtools, which provide user-friendly interfaces for complex data analysis, as well as distributed computing frameworks such as MapReduce, which solve the problem of imbalance in large DNA datasets. In addition, we discussed the challenges posed by the heterogeneity and complexity of big biological data, emphasizing the need for powerful and scalable analytical frameworks, such as bigSCale for single-cell RNA sequencing, in order to gain a comprehensive understanding of the current status and future directions of big data analysis in the field of biology.
Keywords Big data analytics; Bioinformatics; High-throughput sequencing; Machine learning; Distributed computing

1 Introduction
The advent of high-throughput technologies has revolutionized the field of biology, ushering in the era of "big data."
This transformation is characterized by the generation of vast amounts of data across various biological domains, including genomics, transcriptomics, proteomics, and metabolomics (Davis-Turak et al., 2017). The Human Genome Project, for instance, exemplifies the scale of data generation, having taken 13 years and over $3 billion to sequence the human genome, a task that can now be accomplished in a few days for a fraction of the cost (Li and Chen, 2014; Goh and Wong, 2020). The rapid accumulation of biological data has necessitated the development of sophisticated tools and techniques to manage, analyze, and interpret these large datasets (Greene et al., 2014; Chen et al., 2020). The ability to process and analyze large-scale biological data is crucial for advancing our understanding of complex biological systems and translating this knowledge into practical applications. High-dimensional data spaces, such as those generated by genomic and proteomic technologies, present unique challenges in terms of data integration, analysis, and interpretation (Clarke et al., 2008). Effective data processing methods enable researchers to uncover hidden biological regularities, understand cellular processes, and develop predictive models for disease diagnosis and treatment (Ebrahim et al., 2016; Gutierrez et al., 2018). Moreover, the integration of multi-omic data provides a comprehensive view of biological systems, facilitating the discovery of novel insights that would be unattainable through single-omic approaches (Tariq et al., 2020; Juan and Huang, 2023). This study provides a comprehensive overview of the methods and tools currently used for large-scale data processing in biology. By studying the challenges and opportunities related to big data in life sciences, we emphasize the advancements in data integration, quantitative analysis, and computing technologies that drive the field forward. 
In addition, this study will discuss the impact of these methods on future research and their potential applications in clinical and translational medicine, identify gaps in current methods, and propose directions for future research to improve the scalability and efficiency of biological big data analysis.

2 Overview of Big Data in Biological Research
2.1 Types of biological data
In the "Omics" era of life sciences, biological data is diverse and encompasses various levels of biological systems.

This includes genomic data, transcriptomic data, epigenomic data, proteomic data, metabolomic data, molecular imaging, molecular pathways, population data, and clinical/medical records (Li and Chen, 2014). The rapid development of high-throughput sequencing (HTS) techniques has significantly contributed to the generation of large-scale biological data, making it possible to profile biological systems in a cost-efficient manner (Greene et al., 2014). The data generated from these techniques are vast and complex, often requiring sophisticated tools and methodologies for effective analysis and interpretation.

2.2 Sources of big data in biology
The primary sources of big data in biology include next-generation sequencing (NGS) technologies, which have revolutionized the field by enabling the generation of massive datasets that can answer long-standing questions about human diseases and biological processes (Mardis, 2016). Additionally, observational networks and space-based data have facilitated the discovery of emergent mechanisms and phenomena on regional and global scales, further contributing to the pool of big biological data (Xia et al., 2020). The Human Genome Project is a notable example, which utilized extensive resources and collaboration to sequence the human genome, a task that can now be accomplished much more rapidly and cost-effectively due to advancements in sequencing technologies (Li and Chen, 2014).

2.3 Challenges in handling biological big data
Handling big biological data presents several challenges. One of the primary issues is the complexity and heterogeneity of the data, which requires integration from multiple autonomous sources (Wu et al., 2014). The volume, velocity, variety, and veracity of big data (the four V's) necessitate specialized theories and technologies for effective management and analysis (Li and Chen, 2014; Younas, 2019).
Current data mining techniques often fall short in meeting the new space and time requirements posed by big data, highlighting the need for more robust and scalable solutions (Kamal et al., 2016). Moreover, the lack of standardized integration processes complicates the task of combining data from various sources into a unified format for analysis (Almasoud et al., 2020). The scientific community must also address issues related to data quality, security, and privacy to fully harness the potential of big data analytics in biological research (Wu et al., 2014; Chen et al., 2020).

3 Methods for Large-Scale Data Processing
3.1 Data storage and management
3.1.1 Distributed databases
Distributed databases play a crucial role in managing large-scale biological data. Technologies such as Apache Hadoop provide distributed and parallelized data processing capabilities, which are essential for handling petabyte-scale datasets in genomics and other biological fields (O'Driscoll et al., 2013). These systems enable efficient storage, retrieval, and processing of vast amounts of data by distributing the workload across multiple nodes, thus enhancing performance and scalability.

3.1.2 Cloud computing solutions
Cloud computing offers scalable and flexible solutions for storing and processing large biological datasets. Platforms like Sherlock leverage cloud technologies to provide a comprehensive data management system that supports data storage, conversion, querying, and sharing (Figure 1) (Bohár et al., 2022). Cloud-based solutions facilitate the handling of complex and large datasets by offering tools for distributed analytical queries and optimized storage formats, such as the Optimized Row Columnar (ORC) format, which enhances data processing efficiency.

3.1.3 Data security and privacy
As biological data often contain sensitive information, ensuring data security and privacy is paramount.
The HACE theorem and associated data-driven models emphasize the importance of incorporating security and privacy considerations into big data processing frameworks (Wu et al., 2014). These models advocate for robust security measures to protect data integrity and confidentiality while enabling efficient data mining and analysis.
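
The idea in Section 3.1.1 of distributing a workload across multiple nodes can be illustrated with a toy map/shuffle/reduce pass that counts k-mers in DNA reads. This is only a minimal single-process sketch of the pattern: it does not use the Hadoop API, the function names (`map_kmers`, `shuffle`, `reduce_counts`, `count_kmers`) are hypothetical, and each inner list merely stands in for the reads that one node would hold.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_kmers(partitions, k):
    """Run the map step over every partition, then shuffle and reduce.

    Each partition stands in for the share of reads held by one node;
    here everything runs sequentially in one process.
    """
    mapped = chain.from_iterable(
        map_kmers(read, k) for part in partitions for read in part
    )
    return reduce_counts(shuffle(mapped))


# Hypothetical toy input: two "nodes", each holding a few short reads.
partitions = [["ACGTAC", "GTACGT"], ["TACG"]]
kmer_counts = count_kmers(partitions, k=3)
```

In a real distributed framework the map calls would run in parallel on the nodes that store each partition, and only the grouped intermediate pairs would move across the network.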

Figure 1 Overview of how the query engine and the Data Lake work together (Adopted from Bohár et al., 2022)

3.2 Data integration and interoperability
Integrating heterogeneous biological data from multiple sources is a significant challenge due to the diversity in data types and formats. Recent methods, such as non-negative matrix factorization-based approaches, have shown promise in effectively integrating various types of networked biological data, providing more holistic insights into biological systems (Gligorijević and Przulj, 2015). Additionally, frameworks that utilize domain ontology and distributed processing have been proposed to achieve seamless data integration, ensuring logical consistency and facilitating further research and analysis (Almasoud et al., 2020).

3.3 Data cleaning and preprocessing
Data cleaning and preprocessing are critical steps in preparing large-scale biological data for analysis. Tools like TBtools offer user-friendly interfaces and a wide range of functions for bulk sequence processing and interactive data visualization, making it easier for biologists to handle big data without extensive programming knowledge (Chen et al., 2020). Moreover, methodologies such as the MapReduce-based k-nearest neighbor (K-NN) classification approach have been developed to reduce data imbalance and enhance the efficiency of data classification and storage management (Kamal et al., 2016).

4 Analytical Techniques for Big Data
4.1 Machine learning algorithms
4.1.1 Supervised learning
Supervised learning algorithms are a cornerstone of big data analytics in biology, where labeled datasets are used to train models to make predictions or classify data. Common supervised learning techniques include linear regression, logistic regression, support vector machines (SVM), and random forests.
These methods have been effectively applied to various biological datasets, such as protein-coding data for disease identification and treatment (Rahman, 2019). The use of supervised learning in bioinformatics allows for the development of predictive models that can provide insights into complex biological processes and disease mechanisms (Greene et al., 2014).

4.1.2 Unsupervised learning
Unsupervised learning algorithms are essential for analyzing large-scale biological data where labels are not available. Techniques such as clustering, principal component analysis (PCA), and hierarchical clustering help in identifying patterns and structures within the data. These methods are particularly useful in the initial stages of data exploration and for discovering hidden relationships in biological networks (Greene et al., 2014). Unsupervised learning has been applied to various biological datasets to uncover novel insights and generate hypotheses for further investigation (Jan et al., 2017).

4.1.3 Deep learning approaches
Deep learning, a subset of machine learning, has gained significant traction in the field of big data analytics due to its ability to handle large, complex, and heterogeneous datasets. Deep learning models, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and autoencoders, have been successfully applied to biological data for tasks such as image classification, sequence analysis, and network prediction (Najafabadi et al., 2015; Tonidandel et al., 2018; Jin et al., 2020). These models can extract high-level features from raw data,

enabling the discovery of intricate patterns and relationships that traditional methods might miss. Deep learning has shown promise in addressing challenges in big data analytics, including scalability, high-dimensional data, and the integration of diverse data types (Najafabadi et al., 2015; Shukla et al., 2021).

4.2 Statistical methods
Statistical methods play a crucial role in the preprocessing and analysis of big data in biology. Techniques such as data normalization, transformation, and noise reduction are essential for preparing data for further analysis. Methods like the Box-Cox transformation and linear transformation have been shown to improve the performance of machine learning algorithms by making the data more consistent and noise-free (Rahman, 2019). Additionally, statistical models such as the hidden Markov model (HMM) are used for sequence analysis and have demonstrated high accuracy and reliability in biological data analysis (Rahman, 2019).

4.3 Network analysis
Network analysis is a powerful tool for understanding the complex interactions within biological systems. By representing biological entities (e.g., genes, proteins) as nodes and their interactions as edges, network analysis can reveal the underlying structure and dynamics of biological networks. Techniques such as graph-based algorithms and network-based clustering are used to identify key components and modules within these networks (Kashyap et al., 2015; Jin et al., 2020). Deep learning approaches have also been integrated with network analysis to handle large and heterogeneous graph data structures, enabling the extraction of meaningful information from complex biological networks (Jaseena and Kovoor, 2018; Jin et al., 2020).
This integration has facilitated advancements in areas such as disease network analysis, drug discovery, and the identification of therapeutic targets (Kashyap et al., 2015; Jin et al., 2020).

5 Applications of Big Data Analytics in Biology
5.1 Genomics and transcriptomics
Big data analytics has significantly impacted the fields of genomics and transcriptomics, enabling researchers to handle and interpret vast amounts of data generated by high-throughput sequencing technologies. The integration of big data analytics in genomics has facilitated the rapid sequencing of genomes. Sequencing a human genome, which took the Human Genome Project 13 years and over $3 billion, can now be accomplished in just a few days for a fraction of the cost (Li and Chen, 2014). The development of next-generation sequencing (NGS) technologies, such as whole-genome sequencing (WGS) and whole-exome sequencing (WES), has further accelerated the generation of genomic data, allowing for comprehensive studies of genetic variations and their implications in various biological processes and diseases (Hien et al., 2021).

Machine learning algorithms have been particularly useful in the analysis of genomic data, providing tools for the annotation of sequence elements and the integration of epigenetic, proteomic, and metabolomic data (Libbrecht and Noble, 2015). These algorithms help in identifying clinically actionable genetic variants, which are crucial for the development of personalized medicine (He et al., 2017). The integration of genomic data with electronic health records (EHRs) has also opened new avenues for individualized diagnostic and therapeutic strategies, although it presents challenges in data manipulation and management (He et al., 2017).

5.2 Proteomics and metabolomics
Proteomics and metabolomics are other critical areas where big data analytics has made substantial contributions.
The advancements in mass spectrometry and other analytical methods have increased the intersection between proteomics and big data science, enabling the generation of large-scale proteomic and metabolomic datasets (Perez-Riverol and Moreno, 2019). The integration of these datasets with transcriptomic data provides a more comprehensive understanding of biological systems, as it allows for the analysis of gene expression, protein translation, and post-translational modifications in a unified manner (Kumar et al., 2016). High-throughput strategies, such as the sample preparation for multi-omics technologies (SPOT), have been developed to enhance the efficiency of multiomic analyses. These strategies enable the simultaneous analysis of transcriptomic, proteomic, and metabolomic data from a common sample, thereby reducing the resources required

and increasing the throughput of multiomic experiments (Gutierrez et al., 2018). Additionally, bioinformatics tools like Metabox facilitate the deep phenotyping analytics of metabolomic data, supporting its integration with proteomic and transcriptomic contexts (Wanichthanarak et al., 2017).

The use of software containers and workflow environments, such as Galaxy and Nextflow, has further improved the scalability and reproducibility of proteomic and metabolomic data analysis. These tools allow for the distribution of analytics tasks across multiple computational resources, addressing the challenges of handling large and complex datasets (Perez-Riverol and Moreno, 2019). The integration of these high-throughput and scalable approaches is essential for addressing complex clinical and biological questions, ultimately leading to a better understanding of disease mechanisms and the identification of potential therapeutic targets (Gutierrez et al., 2018; Perez-Riverol and Moreno, 2019).

6 Tools and Platforms for Biological Big Data
6.1 Open-source tools
Open-source tools have become indispensable in the realm of biological big data due to their flexibility, cost-effectiveness, and community-driven development. One notable example is TBtools, a comprehensive toolkit designed for interactive analyses of big biological data. TBtools offers over 100 functions for tasks ranging from bulk sequence processing to interactive data visualization, all within a user-friendly interface. This platform-independent software is freely available and supports various operating systems with Java Runtime Environment 1.6 or newer (Figure 2) (Chen et al., 2020).

Figure 2 The Powerful Plotting Engine “JIGplot” in TBtools Displays Great Interactability (Adopted from Chen et al., 2020)
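
For readers who script rather than use a graphical toolkit, the kind of bulk sequence processing mentioned above can be illustrated with a short, dependency-free sketch that summarizes every record in a FASTA stream. It is not TBtools code and does not reflect its API; the helper names (`parse_fasta`, `summarize`) and the example sequences are assumptions made purely for illustration.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def summarize(handle):
    """Return the length and GC fraction of every sequence in the stream."""
    summary = {}
    for header, seq in parse_fasta(handle):
        gc = sum(seq.count(base) for base in "GC")
        summary[header] = {
            "length": len(seq),
            "gc": gc / len(seq) if seq else 0.0,
        }
    return summary


# Hypothetical in-memory FASTA; a real run would pass an open file instead.
example = io.StringIO(">seq1\nACGT\nGGCC\n>seq2\nATAT\n")
stats = summarize(example)
```

Because `parse_fasta` is a generator, records are processed one at a time, so the same pattern applies to files far larger than available memory.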

Another significant open-source platform is Sherlock, which addresses the challenges of data collection, storage, and analysis in computational biology. Sherlock leverages modern big data technologies like Docker and PrestoDB to enable users to manage, query, and share large and complex datasets efficiently. It supports various structured data types and converts them into optimized storage formats, facilitating quick and efficient distributed analytical queries (Bohár et al., 2022).

openBIS is another flexible open-source framework designed for managing and analyzing complex biological data. It allows users to collect, integrate, share, and publish data while connecting to data processing pipelines. openBIS is highly scalable and customizable, making it suitable for a wide range of biological data types and research domains (Bauch et al., 2011).

PipeCraft is a flexible toolkit specifically designed for the bioinformatics analysis of high-throughput amplicon sequencing data. It provides a user-friendly graphical interface that links several public tools, allowing users to customize their analysis pipelines according to their specific needs. PipeCraft supports various sequencing platforms and ensures easy customization and traceability of analytical steps (Anslan et al., 2017).

6.2 Commercial software solutions
Commercial software solutions for biological big data often provide robust, enterprise-level support and advanced features that may not be available in open-source tools. These solutions are designed to handle the vast amounts of data generated by modern biological research and offer comprehensive support for data analysis, storage, and management.
Although this review does not cover specific commercial products, such solutions typically offer enhanced performance, scalability, and integration capabilities. They often come with dedicated customer support, regular updates, and compliance with industry standards, making them suitable for large-scale and mission-critical applications in biological research.

6.3 Customized pipelines
Customized pipelines are essential for addressing the unique requirements of specific biological research projects. These pipelines often integrate multiple software tools and platforms to create tailored workflows that can handle the complexity and scale of big biological data. The use of application containers and workflows, such as those enabled by Docker, has revolutionized the deployment and reproducibility of computational experiments in genomics. By isolating applications and creating secure, scalable platforms, researchers can significantly reduce the time needed for data analysis and improve the reproducibility of their experiments (Schulz et al., 2016).

High-performance computing (HPC) platforms also play a crucial role in customized pipelines for big biological data analysis. These platforms provide the computational power needed to handle the complexity and volume of biological data, enabling researchers to gain deeper insights into biological functions. HPC platforms are particularly useful for tasks such as genomic sequencing data analysis and protein structure analysis, where traditional computing platforms may fall short (Yin et al., 2017; Yeh et al., 2023).

7 Challenges and Future Directions
7.1 Scalability and performance issues
The rapid growth of biological data, driven by advancements in high-throughput sequencing technologies, has outpaced the capabilities of traditional data analysis platforms.
This has necessitated the development of high-performance computing (HPC) platforms and scalable algorithms to handle the massive computational demands of big biological data analytics (Yin et al., 2017). The scalability of bioinformatics software is a critical concern, as it must efficiently manage increasing workloads. Modern cloud computing frameworks like MapReduce and Spark have been employed to implement divide-and-conquer strategies in distributed computing environments, addressing these scalability challenges (Yang et al., 2017). However, ensuring the validity of computational outputs remains a significant issue, requiring robust software testing techniques such as metamorphic testing to ensure the accuracy and reliability of bioinformatics tools (Yang et al., 2017).

7.2 Integration of multimodal data
The integration of multimodal data, particularly in single-cell biology, presents a considerable challenge due to the

complexity and heterogeneity of the data involved. Single-cell techniques now enable the simultaneous measurement of multiple data modalities, providing new insights into biological processes that cannot be inferred from a single mode of assay. However, integrating these complex datasets into coherent biological models requires sophisticated computational methods and data visualization approaches (Miao et al., 2021). Strategies for integrating matched data (measured on the same cell) include joint latent space inference and biological causal modeling, while unmatched data (measured on different cells) require methods like annotated group matching and aligning spaces (Miao et al., 2021). Despite these advancements, visualization methods for integrated multimodal single-cell data are still underdeveloped, and future challenges include accounting for modality-specific noise and improving computing efficiency (Miao et al., 2021).

7.3 Ethical and regulatory considerations
The use of big data in health research introduces novel ethical and regulatory challenges that must be carefully considered. The aggregation and analysis of large-scale, heterogeneous data sources can lead to significant preventive, diagnostic, and therapeutic benefits. However, the methodological novelty and computational complexity of big data health research raise unique challenges for Ethics Review Committees (ERCs) and institutional review boards (Ienca et al., 2018). These challenges include ensuring data privacy, managing sensitive personal health data, and addressing power dynamics in the doctor-patient relationship (Galetsi et al., 2019). ERCs must adapt their evaluation criteria to assess the methodological and ethical viability of health-related big data studies, ensuring that the benefits of big data analytics are realized without compromising ethical standards (Ienca et al., 2018).
Future research should focus on developing standardized systems for securely extracting and processing private healthcare datasets to mitigate these ethical and regulatory concerns (Galetsi et al., 2019).

Acknowledgments
We sincerely thank the two anonymous reviewers for their valuable opinions and suggestions, and thank Ms. W. Huang from our research team for organizing the materials for this study.

Conflict of Interest Disclosure
The authors affirm that this research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.

References
Almasoud A., Al-Khalifa H., Al-Salman A.M., and Lytras M.L., 2020, A framework for enhancing big data integration in biological domain using distributed processing, Applied Sciences, 10(20): 7092. https://doi.org/10.3390/app10207092
Anslan S., Bahram M., Hiiesalu I., and Tedersoo L., 2017, PipeCraft: flexible open-source toolkit for bioinformatics analysis of custom high-throughput amplicon sequencing data, Molecular Ecology Resources, 17: e234-e240. https://doi.org/10.1111/1755-0998.12692
Bauch A., Adamczyk I., Buczek P., Elmer F., Enimanev K., Glyzewski P., Kohler M., Pylak T., Quandt A., Ramakrishnan C., Beisel C., Malmström L., Aebersold R., and Rinn B., 2011, openBIS: a flexible framework for managing and analyzing complex data in biology research, BMC Bioinformatics, 12: 468. https://doi.org/10.1186/1471-2105-12-468
Bohár B., Fazekas D., Madgwick M., Csabai L., Olbei M., Korcsmáros T., and Szalay-Beko M., 2022, Sherlock: an open-source data platform to store, analyze and integrate big data for computational biologists, F1000Research, 10: 409. https://doi.org/10.12688/f1000research.52791.2
Chen C.J., Chen H.R., Zhang Y., Thomas H., Frank M.H., He Y.H., and Xia R., 2020, TBtools-an integrative toolkit developed for interactive analyses of big biological data, Molecular Plant, 13(8): 1194-1202. https://doi.org/10.1016/j.molp.2020.06.009
Clarke R., Ressom H.W., Wang A., Xuan J.H., Liu M.C., Gehan E.A., and Wang Y., 2008, The properties of high-dimensional data spaces: implications for exploring gene and protein expression data, Nature Reviews Cancer, 8(1): 37-49. https://doi.org/10.1038/nrc2294
Davis-Turak J., Courtney S., Hazard E., Glen W., Silveira W., Wesselman T., Harbin L., Wolf B., Chung D., and Hardiman G., 2017, Genomics pipelines and data integration: challenges and opportunities in the research setting, Expert Review of Molecular Diagnostics, 17: 225-237. https://doi.org/10.1080/14737159.2017.1282822

Ebrahim A., Brunk E., Tan J., O'Brien E.J., Kim D., Szubin R., Lerman J., Lechner A., Sastry A., Bordbar A., Feist A., and Palsson B., 2016, Multi-omic data integration enables discovery of hidden biological regularities, Nature Communications, 7(1): 13091. https://doi.org/10.1038/ncomms13091
Galetsi P., Katsaliaki K., and Kumar S., 2019, Values, challenges and future directions of big data analytics in healthcare: a systematic review, Social Science & Medicine, 241: 112533. https://doi.org/10.1016/j.socscimed.2019.112533
Gligorijević V., and Przulj N., 2015, Methods for biological data integration: perspectives and challenges, Journal of The Royal Society Interface, 12(112): 20150571. https://doi.org/10.1098/rsif.2015.0571
Goh W.W.B., and Wong L., 2020, The birth of bio-data science: trends, expectations, and applications, Genomics, Proteomics & Bioinformatics, 18(1): 5-15. https://doi.org/10.1016/j.gpb.2020.01.002
Greene C.S., Tan J.H., Ung M., Moore J., and Cheng C., 2014, Big Data bioinformatics, Journal of Cellular Physiology, 229(12): 1896-1900. https://doi.org/10.1002/jcp.24662
Gutierrez D., Gant-Branum R., Romer C., Farrow M., Allen J., Dahal N., Nei Y., Codreanu S., Jordan A., Palmer L., Sherrod S., McLean J., Skaar E., Norris J., and Caprioli R., 2018, An integrated, high-throughput strategy for multiomic systems level analysis, Journal of Proteome Research, 17(10): 3396-3408. https://doi.org/10.1021/acs.jproteome.8b00302
He K.Y., Ge D., and He M.M., 2017, Big data analytics for genomic medicine, International Journal of Molecular Sciences, 18(2): 412. https://doi.org/10.3390/ijms18020412
Hien L., Van N., Oanh K.T.P., Ton N.D., Hue H.T.T., Duong N.T., Hằng P.L.B., and Ha N.H., 2021, Genomics and big data: research, development and applications, Vietnam Journal of Biotechnology, 19(3): 393-410. https://doi.org/10.15625/1811-4989/16158
Ienca M., Ferretti A., Hurst S., Puhan M., Lovis C., and Vayena E., 2018, Considerations for ethics review of big data health research: a scoping review, PLoS ONE, 13(10): e0204937. https://doi.org/10.1371/journal.pone.0204937
Jan B., Farman H., Khan M., Imran M., Islam I., Ahmad A., Ali S., and Jeon G., 2017, Deep learning in big data analytics: a comparative study, Computers & Electrical Engineering, 75: 275-287. https://doi.org/10.1016/J.COMPELECENG.2017.12.009
Jaseena K.U., and Kovoor B., 2018, A survey on deep learning techniques for big data in biometrics, International Journal of Advanced Research in Computer Science, 9: 12-17. https://doi.org/10.26483/IJARCS.V9I1.5136
Jin S.T., Zeng X.X., Xia F., Huang W., and Liu X.R., 2020, Application of deep learning methods in biological networks, Briefings in Bioinformatics, 22(2): 1902-1917. https://doi.org/10.1093/bib/bbaa043
Juan H.F., and Huang H.C., 2023, Quantitative analysis of high-throughput biological data, Wiley Interdisciplinary Reviews: Computational Molecular Science, 13(4): e1658. https://doi.org/10.1002/wcms.1658
Kamal M., Ripon S., Dey N., Ashour A., and Santhi V., 2016, A MapReduce approach to diminish imbalance parameters for big deoxyribonucleic acid dataset, Computer Methods and Programs in Biomedicine, 131: 191-206. https://doi.org/10.1016/j.cmpb.2016.04.005
Kashyap H., Ahmed H.A., Hoque N., Roy S., and Bhattacharyya D.K., 2015, Big data analytics in bioinformatics: a machine learning perspective, arXiv preprint, arXiv:1506.05101.
Kumar D., Bansal G., Narang A., Basak T., Abbas T., and Dash D., 2016, Integrating transcriptome and proteome profiling: strategies and applications, Proteomics, 16(19): 2533-2544. https://doi.org/10.1002/pmic.201600140
Li Y., and Chen L., 2014, Big biological data: challenges and opportunities, Genomics, Proteomics & Bioinformatics, 12(5): 187-189. https://doi.org/10.1016/j.gpb.2014.10.001
Libbrecht M., and Noble W., 2015, Machine learning applications in genetics and genomics, Nature Reviews Genetics, 16: 321-332. https://doi.org/10.1038/nrg3920
Mardis E., 2016, The challenges of big data, Disease Models & Mechanisms, 1(2): 293-314. https://doi.org/10.1242/dmm.025585
Miao Z., Humphreys B., McMahon A., and Kim J., 2021, Multi-omics integration in the age of million single-cell data, Nature Reviews Nephrology, 17: 710-724. https://doi.org/10.1038/s41581-021-00463-x
Najafabadi M.M., Villanustre F., Khoshgoftaar T.M., Seliya N., Wald R., and Muharemagic E., 2015, Deep learning applications and challenges in big data analytics, Journal of Big Data, 2: 1-21. https://doi.org/10.1186/s40537-014-0007-7
O'Driscoll A., Daugelaite J., and Sleator R., 2013, 'Big data', hadoop and cloud computing in genomics, Journal of Biomedical Informatics, 46(5): 774-781. https://doi.org/10.1016/j.jbi.2013.07.001

Perez-Riverol Y., and Moreno P., 2019, Scalable data analysis in proteomics and metabolomics using biocontainers and workflows engines, Proteomics, 20(9): 1900147. https://doi.org/10.1002/pmic.201900147
Rahman A., 2019, Statistics-based data preprocessing methods and machine learning algorithms for big data analysis, International Journal of Artificial Intelligence, 17: 44-65.
Schulz W.L., Durant T.J.S., Siddon A.J., and Torres R., 2016, Use of application containers and workflows for genomic data analysis, Journal of Pathology Informatics, 7(1): 53. https://doi.org/10.4103/2153-3539.197197
Shukla R., Yadav A.K., and Singh T.R., 2021, Application of deep learning in biological big data analysis, Large-Scale Data Streaming, Processing, and Blockchain Security, 2024: 225-250. https://doi.org/10.4018/978-1-7998-3444-1.ch006
Tariq M., Haseeb M., Aledhari M., Razzak R., Parizi R., and Saeed F., 2020, Methods for proteogenomics data analysis, challenges, and scalability bottlenecks: a survey, IEEE Access: Practical Innovations, Open Solutions, 9: 5497-5516. https://doi.org/10.1109/ACCESS.2020.3047588
Tonidandel S., King E., and Cortina J., 2018, Big data methods, Organizational Research Methods, 21: 525-547. https://doi.org/10.1177/1094428116677299
Wanichthanarak K., Fan S., Grapov D., Barupal D., and Fiehn O., 2017, Metabox: a toolbox for metabolomic data analysis, interpretation and integrative exploration, PLoS ONE, 12(1): e0171046. https://doi.org/10.1371/journal.pone.0171046
Wu X., Zhu X., Wu G., and Ding W., 2014, Data mining with big data, IEEE Transactions on Knowledge and Data Engineering, 26: 97-107. https://doi.org/10.1109/TKDE.2013.109
Xia J., Wang J., and Niu S., 2020, Research challenges and opportunities for using big data in global change biology, Global Change Biology, 26: 6040-6061. https://doi.org/10.1111/gcb.15317
Yang A., Troup M., and Ho J., 2017, Scalability and validation of big data bioinformatics software, Computational and Structural Biotechnology Journal, 15: 379-386. https://doi.org/10.1016/j.csbj.2017.07.002
Yeh C.W., Huang C.W., Yang C.L., and Wang Y.T., 2023, A high performance computing platform for big biological data analysis, 2023 9th International Conference on Applied System Innovation (ICASI), 2023: 68-70. https://doi.org/10.1109/ICASI57738.2023.10179527
Yin Z., Lan H., Tan G., Lu M., Vasilakos A., and Liu W., 2017, Computing platforms for big biological data analytics: perspectives and challenges, Computational and Structural Biotechnology Journal, 15: 403-411. https://doi.org/10.1016/j.csbj.2017.07.004
Younas M., 2019, Research challenges of big data, Service Oriented Computing and Applications, 13: 105-107. https://doi.org/10.1007/s11761-019-00265-x

Research Insight Open Access
AI in Biology: Transforming Genomic Research with Machine Learning
Qiang Zhang, Yu Wang
Biotechnology Research Center, Cuixi Academy of Biotechnology, Zhuji, 311800, Zhejiang, China
Corresponding author: yu.wang@cuixi.org
Computational Molecular Biology, 2024, Vol.14, No.3 doi: 10.5376/cmb.2024.14.0013
Received: 08 Apr., 2024
Accepted: 23 May, 2024
Published: 10 Jun., 2024
Copyright © 2024 Zhang and Wang. This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Preferred citation for this article:
Zhang Q., and Wang Y., 2024, AI in biology: transforming genomic research with machine learning, Computational Molecular Biology, 14(3): 106-114 (doi: 10.5376/cmb.2024.14.0013)

Abstract With the rapid development of artificial intelligence (AI) and machine learning (ML) technologies, the field of biology, particularly genomic research, is undergoing profound transformations. This study explores how AI and ML are redefining genomic data analysis and functional genomics research, while emphasizing the critical role these technologies play in enhancing research efficiency, improving accuracy, and advancing personalized medicine. The application of AI in biology has expanded from basic data processing to complex tasks such as gene function prediction, identification of regulatory elements, and understanding epigenetic modifications. Through an in-depth analysis of key machine learning techniques, including supervised learning, unsupervised learning, and deep learning, this study demonstrates how these methods are revolutionizing traditional genomic data analysis workflows, significantly improving the efficiency of sequence alignment, variant calling, and gene expression profiling. Additionally, it discusses the future prospects of AI-driven genomic tools, cloud computing, big data integration, and open-source platform collaboration, aiming to provide valuable insights for future research and technological development.
Keywords Artificial intelligence (AI); Machine learning (ML); Genomic research; Functional genomics; Personalized medicine

1 Introduction
Artificial Intelligence (AI) and Machine Learning (ML) have revolutionized numerous fields, and biology is no exception. The advent of high-throughput technologies has led to an explosion of biological data, necessitating advanced computational methods to analyze and interpret these vast datasets. Machine learning, which involves developing algorithms that improve through experience, has shown immense potential in handling complex biological data. Techniques such as supervised, semi-supervised, and unsupervised learning, as well as deep learning, are being increasingly applied to genomic data to uncover hidden patterns and make accurate predictions (Angermueller et al., 2016). These methods have been particularly effective in tasks such as annotating sequence elements, predicting gene expression levels, and identifying genomic elements like promoters and enhancers (Wu and Zhao, 2019; Liu et al., 2020).

Genomic research is pivotal in understanding the fundamental mechanisms of life and disease. By studying the genome, researchers can identify genetic variations that contribute to diseases, understand gene function, and develop targeted therapies.
The ability to analyze large-scale genomic data has opened new avenues in precision medicine, where treatments can be tailored to an individual's genetic makeup (Koumakis et al., 2020). The integration of machine learning in genomic research has further accelerated discoveries, enabling the modeling of complex biological networks and the prediction of disease risks based on genetic information (Leung et al., 2016; Camacho et al., 2018).

This study provides a comprehensive overview of the current applications of artificial intelligence and machine learning in genomics research, explores various machine learning techniques and their practical applications in genomics, discusses the challenges and limitations of these methods, and highlights future development directions in this field. It aims to clarify the transformative impact of machine learning on genomic research and its potential to further advance biology and medicine.

2 Overview of Machine Learning Techniques in Genomics
2.1 Supervised learning
Supervised learning involves training a model on a labeled dataset, where the input data is paired with the correct

output. This approach is widely used in genomics for tasks such as variant calling, gene expression prediction, and classification of genomic sequences. For instance, supervised learning techniques have been applied to annotate sequence elements and analyze epigenetic, proteomic, and metabolomic data (Libbrecht and Noble, 2015). These methods are particularly effective in scenarios where a large amount of labeled data is available, allowing the model to learn the mapping from inputs to outputs accurately.

2.2 Unsupervised learning
Unsupervised learning techniques are used to identify patterns and structures in data without the need for labeled outputs. In genomics, these methods are often employed for clustering and dimensionality reduction tasks. Clustering approaches, such as hierarchical, centroid-based, and density-based methods, help in understanding the natural structure inherent in genomic data, such as gene expression profiles and cellular processes (Figure 1) (Karim et al., 2020). Unsupervised learning is crucial for exploratory data analysis, where the goal is to uncover hidden patterns and relationships within the data.

This image illustrates the use of a convolutional autoencoder for unsupervised learning to perform clustering analysis on microscope images. Clustering analysis is conducted after image processing, utilizing clustering algorithms such as K-means to group the feature space. This approach helps uncover hidden patterns and relationships within the data, such as distinct clusters of gene expression or differences between cell types. To enhance clustering performance, the network jointly optimizes both the reconstruction loss and the Cluster Assignment Hardening (CAH) loss, refining the clustering results by continuously adjusting the network parameters.
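The centroid-based clustering step mentioned above can be sketched in a few lines of Python with NumPy. This is only a minimal illustration of plain K-means (Lloyd's algorithm) on a synthetic expression-like matrix. It is not the convolutional autoencoder or the CAH loss described by Karim et al. (2020), and the function name `kmeans`, the toy data, and all parameter values are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct rows of X.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic "expression profiles" (assumed data): 40 samples x 5 genes,
# drawn from two well-separated groups.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=0.3, size=(20, 5))
group_b = rng.normal(loc=5.0, scale=0.3, size=(20, 5))
X = np.vstack([group_a, group_b])

labels, centroids = kmeans(X, k=2)
print(labels)
```

On data this cleanly separated, the two groups should end up in different clusters. Real expression data would normally need normalization and a principled choice of `k` first.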
This application of unsupervised learning in genomics is particularly suited for exploratory data analysis. Through clustering methods, it can help us understand the intrinsic natural structure of genomic data, thereby revealing hidden patterns in gene function and cellular processes.

Figure 1 Schematic representation of a VAE used for clustering GE data (Adopted from Karim et al., 2020)

2.3 Deep learning and neural networks
Deep learning, a subset of machine learning, has revolutionized the analysis of genomic data by leveraging multilayered artificial neural networks (ANNs) to model complex patterns. Deep learning techniques, including convolutional neural networks (CNNs) and deep neural networks (DNNs), have shown remarkable success in various genomic applications. These methods are particularly adept at handling high-dimensional data and have been used to predict the structure and function of genomic elements, such as promoters and enhancers (Li, 2018; Liu et al., 2020; Schmidt and Hildebrandt, 2020). Deep learning models have also been applied to next-generation sequencing (NGS) data for tasks such as variant calling, metagenomic classification, and genomic feature detection. Despite their success, one of the challenges with deep learning models is their interpretability. Efforts are being made to develop methods for interpreting the predictions of DNNs to better understand the underlying molecular and cellular mechanisms (Talukder et al., 2020).
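To make the role of a convolutional filter on sequence data more concrete, the following is a minimal Python/NumPy sketch rather than a trained model from any of the cited works. It one-hot encodes a DNA string and slides a single filter along it, which is the basic operation a CNN layer performs when scanning for promoter-like patterns. The motif "TATA", the example sequence, the bias value, and the helper names `one_hot` and `conv1d_scan` are all illustrative assumptions.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) matrix; unknown bases stay all-zero."""
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            x[i, j] = 1.0
    return x

def conv1d_scan(x, kernel):
    """Slide a (width, 4) filter along the sequence (valid positions only)."""
    width = kernel.shape[0]
    n_pos = x.shape[0] - width + 1
    return np.array([(x[i:i + width] * kernel).sum() for i in range(n_pos)])

# Hypothetical hand-written filter that responds to the motif "TATA".
# In a real CNN the filter weights would be learned from labeled data.
kernel = one_hot("TATA")

seq = "GGCTATAAGC"
scores = conv1d_scan(one_hot(seq), kernel)   # matches per window
activations = np.maximum(scores - 3.0, 0.0)  # ReLU with an assumed bias of -3
best = int(scores.argmax())                  # start index of the best window

print(scores.tolist())
print(best, seq[best:best + 4])
```

Here each score counts how many positions in a window match the filter, so the window starting at index 3 ("TATA") scores highest. A learned filter would hold real-valued weights, and many such filters would be stacked and followed by pooling and dense layers.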
