
2 Evolution of Bioinformatics in the Big Data Era
2.1 Historical background and early bioinformatics tools
The origins of bioinformatics can be traced back over 50 years, long before the advent of next-generation sequencing technologies. The field began in the early 1960s with the application of computational methods to protein sequence analysis, including de novo sequence assembly and the creation of biological sequence databases. Early bioinformatics tools were developed to handle the growing volume of biological data generated by molecular biology methods, aided by the miniaturization of computers. These foundational tools laid the groundwork for integrating computational techniques into biological research, enabling the systematic organization, analysis, and interpretation of biological data (Gauthier et al., 2018; Shoaib et al., 2021).

2.2 Shift towards high-throughput data analysis
The rapid development of high-throughput sequencing (HTS) techniques has profoundly transformed bioinformatics, ushering in the era of big data in biology. High-throughput technologies have expanded the availability and quantity of molecular data, necessitating new computational tools for data analysis. The emergence of next-generation sequencing programs has driven unparalleled growth in whole-genome sequencing projects, such as the 100 000 human genomes and 1 000 plant species initiatives. This shift towards high-throughput data analysis has also seen the rise of machine learning and deep learning methodologies, which are now commonly used to identify patterns, make predictions, and model biological processes (Koumakis, 2020). Tools like TBtools have been developed to provide user-friendly interfaces for wet-lab biologists, facilitating the analysis of large-scale datasets (Figure 1) (Chen et al., 2020). The graphical features of TBtools, combined with the large-scale data generated by HTS technologies, have greatly improved the efficiency of biological research, enabling biologists to better understand complex genomic structures and functional patterns. This also reflects the growing importance of machine learning and deep learning in recognizing patterns, making predictions, and simulating biological processes as biological data continue to accumulate.

2.3 Growth of public biological databases
The exponential growth of biological data has necessitated the creation and expansion of public biological databases. Institutions such as the European Bioinformatics Institute (EMBL-EBI) have played a crucial role in maintaining comprehensive data resources, which stored over 390 petabytes of raw data by the end of 2020 (Khan et al., 2022). Databases such as KEGG have become essential for the biological interpretation of genome sequences and other high-throughput data, providing practical value for researchers. The integration of digital information, biological data, electronic medical records, and clinical information has opened vast opportunities for knowledge discovery, underscoring the need for open data sources, open access to software, and the adoption of machine learning and artificial intelligence. The continuous development and maintenance of these databases are critical for supporting the ever-growing demands of bioinformatics research and applications (Solanki et al., 2020).
3 Key Computational Tools for Big Data in Biology
3.1 Machine learning algorithms in bioinformatics
3.1.1 Application in gene expression and regulation studies
Machine learning algorithms have become indispensable in the analysis of gene expression and regulation. These algorithms facilitate the automatic extraction and selection of features from large datasets, enabling predictive models that can efficiently characterize complex biological systems. For instance, machine learning techniques are integrated with bioinformatics methods to improve training and validation, identify interpretable features, and interrogate the resulting models (Auslander et al., 2021). Probabilistic graphical models have been employed to reconstruct gene regulatory networks from transcriptomics and genomics data, providing a concise representation of complex gene regulatory relationships (Cheng, 2020). As a concrete illustration, a minimal feature-selection and classification workflow of this kind is sketched below.
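The following sketch shows how feature selection and a supervised classifier might be combined on a gene expression matrix, in the spirit of the workflow described above. It is not the method of the cited studies: the scikit-learn pipeline, the synthetic expression matrix, the binary phenotype labels, and the choice of a random forest are all illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Placeholder expression matrix: 200 samples x 5 000 genes, with binary
# phenotype labels. In practice these would come from an RNA-seq or
# microarray dataset, not from random numbers.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Univariate feature selection keeps the most informative genes before the
# predictive model is fitted, mirroring the feature extraction/selection
# step described in the text.
model = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=100)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
model.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Importances of the selected genes point at the expression signatures that
# drive the prediction, i.e. interpretable features.
selected = model.named_steps["select"].get_support(indices=True)
importances = model.named_steps["clf"].feature_importances_
top_genes = selected[np.argsort(importances)[::-1][:10]]
print("top candidate gene indices:", top_genes)

Wrapping selection and classification in a single pipeline keeps the gene-filtering step inside cross-validation, which avoids leaking information from the test samples into feature selection.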
