
Computational Molecular Biology 2025, Vol.15, No.3, 151-159 http://bioscipublisher.com/index.php/cmb

Review Article  Open Access

High-Performance Computing Pipelines for NGS Variant Calling

Wenzhong Huang
Biomass Research Center, Hainan Institute of Tropical Agricultural Resources, Sanya, 572025, Hainan, China
Corresponding author: wenzhong.huang@hitar.org

Computational Molecular Biology, 2025, Vol.15, No.3  doi: 10.5376/cmb.2025.15.0015
Received: 18 Apr., 2025  Accepted: 29 May, 2025  Published: 21 Jun., 2025
Copyright © 2025 Huang. This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Preferred citation for this article: Huang W.Z., 2025, High-performance computing pipelines for NGS variant calling, Computational Molecular Biology, 15(3): 151-159 (doi: 10.5376/cmb.2025.15.0015)

Abstract  With the popularization of next-generation sequencing (NGS) technology, genomic data have grown exponentially, posing severe computational challenges for variant calling. Traditional variant calling pipelines (such as GATK-based workflows) run into compute and I/O bottlenecks when handling large-scale data. This paper reviews high-performance computing (HPC) pipelines for NGS variant calling: it introduces the typical workflow and commonly used algorithms, analyzes the performance bottlenecks of traditional pipelines, and then describes HPC architectures and parallel computing models as applied in bioinformatics. On this basis, it discusses HPC optimization strategies for the variant calling process, including task parallelization, I/O optimization, data locality management, and workflow orchestration using middleware such as SLURM, Nextflow, and Cromwell.
The paper also introduces emerging hardware acceleration technologies such as GPUs and FPGAs for variant calling, and discusses performance evaluation metrics, benchmarking frameworks, and a comparative study of HPC-driven pipelines versus traditional methods.

Keywords  High-performance computing; Variant calling; Next-generation sequencing; Parallel computing; Workflow

1 Introduction
With the decline in sequencing costs, NGS has become the main tool for studying genetic variation in humans, animals, and plants. Research no longer relies on a small number of samples or lengthy experimental steps as it did in the past; massive amounts of genomic data can now be generated in a very short time. In this way, variants such as SNPs and indels can be captured quickly. These variants are often related to disease risk or drug response, which is precisely what precision medicine depends on.

The general workflow is fairly fixed: first, quality control is carried out; then, sequencing reads are aligned to the reference genome; next, variants are called; and finally, annotations are made. Names like GATK, DeepVariant, and FreeBayes appear often in such pipelines. Although they all address the same task, their methods vary greatly: some still use statistical models, while others have shifted to deep learning. The results are also quite interesting; for example, neural-network-based tools like DeepVariant tend to be more accurate than classical methods when detecting SNPs and indels (Pei et al., 2021).

However, problems also arise: the volume of data is simply too large. The growth rate of NGS data is almost beyond imagination. Once analysis rises to the scale of tens of thousands of whole genomes, the computing demand snowballs. Ahmed et al. even compared genomics with astronomy and physics, saying that current research has had to migrate from traditional HPC systems to the cloud in order to keep up.
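The four-stage workflow described above (quality control, alignment, variant calling, annotation) can be sketched as a simple command builder. This is a minimal illustration, not the pipeline from any cited study: the sample names, file paths, thread counts, and the choice of FastQC, BWA-MEM, GATK HaplotypeCaller, and SnpEff as stand-ins for each stage are all assumptions.

```python
import shlex


def build_pipeline(sample: str, reads: str, ref: str) -> list[list[str]]:
    """Build the four canonical variant-calling stages as command lines.

    Stage order follows the standard NGS workflow: quality control ->
    alignment -> variant calling -> annotation. Tools, flags, and paths
    here are illustrative placeholders, not prescriptive choices.
    """
    bam = f"{sample}.bam"
    vcf = f"{sample}.vcf.gz"
    stages = [
        # 1. Quality control of the raw reads
        f"fastqc {reads}",
        # 2. Align reads to the reference genome
        f"bwa mem -t 8 {ref} {reads} -o {bam}",
        # 3. Call variants (HaplotypeCaller, as in GATK-based pipelines)
        f"gatk HaplotypeCaller -R {ref} -I {bam} -O {vcf}",
        # 4. Annotate the resulting variants
        f"snpEff ann GRCh38.99 {vcf}",
    ]
    return [shlex.split(cmd) for cmd in stages]


if __name__ == "__main__":
    # Hypothetical sample and reference names, for illustration only
    for cmd in build_pipeline("NA12878", "NA12878.fastq.gz", "GRCh38.fa"):
        print(" ".join(cmd))
```

In a real deployment each stage would be submitted as a cluster job (e.g. via SLURM) rather than run sequentially on one node, which is exactly the gap the HPC orchestration strategies discussed below address.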
After all, using GATK for SNP calling on 3,000 crop genomes could take half a year, which clearly no one can afford to wait for. Zhou et al. (2023) also noticed this; they developed a new system called HPC-GVCW, which runs on a cluster and is much faster than the traditional pipeline. So it actually makes little sense now to debate whether high-performance computing is "needed": it is necessary. Distributing tasks across multiple nodes, processing them in parallel, and even leveraging hardware acceleration are all ways to make analysis faster. HPC is designed for speed and scale: as the number of nodes increases, performance can grow almost linearly. Cloud platforms take a different approach, but their purpose is similar: you do not need to worry about the hardware, and resources can be expanded as needed. No matter which
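The near-linear scaling mentioned above typically comes from a scatter-gather pattern: the genome is split into intervals, each interval is processed independently, and the results are merged. The toy sketch below illustrates the pattern with `multiprocessing`; the per-interval function is a self-contained stand-in for a real caller invocation, and the chromosome length and chunk counts are illustrative assumptions.

```python
from multiprocessing import Pool


def split_intervals(chrom_len: int, n_chunks: int) -> list[tuple[int, int]]:
    """Scatter: split a chromosome into roughly equal half-open intervals."""
    step = -(-chrom_len // n_chunks)  # ceiling division
    return [(s, min(s + step, chrom_len)) for s in range(0, chrom_len, step)]


def call_variants(interval: tuple[int, int]) -> int:
    """Stand-in for running a variant caller on one genomic interval.

    A real pipeline would invoke something like
    `gatk HaplotypeCaller -L chr1:start-end` here; we just return the
    interval width so the example stays self-contained and runnable.
    """
    start, end = interval
    return end - start


if __name__ == "__main__":
    # Length of human chr1 (GRCh38), split into 16 independent work units
    intervals = split_intervals(chrom_len=248_956_422, n_chunks=16)
    with Pool(processes=4) as pool:
        per_interval = pool.map(call_variants, intervals)  # parallel "calling"
    # Gather: merge per-interval results (a real pipeline would merge VCFs)
    assert sum(per_interval) == 248_956_422
    print(f"{len(intervals)} intervals processed")
```

On a cluster the same pattern is expressed as one job per interval (a SLURM array job, or a Nextflow process scattered over a channel of intervals), which is why throughput grows almost linearly with node count until I/O becomes the bottleneck.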
