
Computational Molecular Biology 2025, Vol.15, No.3, 151-159 http://bioscipublisher.com/index.php/cmb

too large, the memory of a single node cannot handle it. Distributed-memory models such as MPI and Spark therefore emerged, distributing tasks and data across multiple nodes. This model is particularly suitable for extremely large problems that require many machines working together. There is also a compromise: hybrid models such as MPI+OpenMP, which combine parallelism within and between nodes. GPUs and other accelerators play an increasingly important role in this system, breaking tasks into tens of thousands of small pieces for simultaneous processing. Take NGS analysis as an example: chromosomes or samples can be assigned to different nodes, and each node can then accelerate its internal computation through multi-threading or GPUs. This hierarchical parallel approach not only yields higher throughput but also scales more smoothly as data volumes grow.

3.3 High-performance computing environment
In an HPC environment, who decides when and where jobs run? That is determined by the resource management system and the workflow management system (WfMS). WfMS such as Nextflow, Snakemake, and Cromwell (in conjunction with WDL) can break down complex tasks, automatically handle dependencies, and then hand them over to the scheduler (Jha et al., 2022). In large-scale genomic projects, these systems can operate across clusters, grids, and the cloud, keeping cumbersome processes efficient and consistent. Containerization and version control are now often integrated into WfMS as well, making scientific work easier to reproduce and to migrate across computing platforms.
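The hierarchical, chromosome-level decomposition described above can be sketched in a few lines of Python. This is a hypothetical illustration only: all names (`call_variants`, `run_pipeline`) are placeholders, and a real pipeline would launch an aligner and a variant caller per region, typically via a WfMS, rather than a local thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Chromosomes of a human-like reference; in a real HPC setup each entry
# would map to a node-level job, with threads or GPU kernels inside it.
CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

def call_variants(chrom):
    # Placeholder for the per-chromosome work (alignment + variant
    # calling); here it only returns a notional output file name.
    return chrom, f"{chrom}.vcf"

def run_pipeline(chroms, workers=4):
    # Scatter: one independent task per chromosome.
    # Gather: a dict mapping each chromosome to its output.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(call_variants, chroms))
```

Because the per-chromosome tasks are independent, the same scatter/gather shape transfers directly to MPI ranks or WfMS processes across nodes.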
4 Optimization Strategies for the HPC Mutation Detection Process
4.1 Parallelization of workflow components and task decomposition
When running mutation detection on an HPC system, failing to decompose the task first often leads to slow computation and long queues. The HPC-GVCW pipeline is a typical example: instead of processing the entire genome in one pass, it first uses a genome index splitter to slice the reference genome into several large blocks, each handed to an independent GATK instance. Different genomic regions, even different chromosomes, can thus be analyzed simultaneously, with an immediate gain in speed. Later studies verified the effectiveness of this approach. The Hummingbird pipeline designed by Liu et al. (2020) adopts the same decomposition strategy and runs much faster than the traditional BWA-GATK workflow without losing accuracy. Mulone et al. (2023) used StreamFlow to partition NGS tasks across hybrid cloud and HPC environments and likewise achieved nearly linear scaling. The principle is straightforward: time-consuming stages such as alignment, duplicate marking, and variant calling can run in parallel as long as they are broken into independently executable tasks, and efficiency follows naturally. Modern workflow management systems like Nextflow and Snakemake inherently support sample-level and chromosome-level parallelism. Once data dependencies are sorted out, considerable wall-clock time can be saved while keeping results consistent.

4.2 I/O optimization and data locality management
No matter how strong the computing power, if I/O cannot keep up, the pipeline still stalls. The bottleneck often lies not in the CPU but in storage and network.
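The block-splitting step behind HPC-GVCW can be sketched as a simple interval scatter. This is a minimal illustration, not a reproduction of the actual genome index splitter: the function name and the fixed block size are assumptions made for the example.

```python
def split_genome(chrom_lengths, block_size):
    """Split each chromosome into fixed-size 1-based intervals so that
    every interval can be processed by an independent worker (e.g. one
    GATK instance per block, as in the HPC-GVCW-style scatter step)."""
    intervals = []
    for chrom, length in chrom_lengths.items():
        start = 1
        while start <= length:
            end = min(start + block_size - 1, length)
            intervals.append((chrom, start, end))
            start = end + 1
    return intervals
```

Each returned `(chrom, start, end)` tuple becomes one schedulable task; the per-block results are merged afterwards in a gather step.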
When data traffic exceeds the limits of the shared file system, alignment and variant-calling steps are the most likely to run into trouble. Researchers therefore began optimizing at the file-system and caching level. Parallel file systems such as Lustre and GPFS are now widely adopted, and some groups use the local SSDs of compute nodes as caches to minimize data-transfer latency. The CloudGT framework (Xiao et al., 2018) is one example: it executes across multiple nodes in parallel with Apache Spark and stores data in Parquet format, reducing I/O overhead by more than half and increasing speed by 74.9%. Another study found that local caching or in-memory transfer can significantly reduce intermediate file writes, alleviate disk contention, and make the cluster more efficient overall. The old principle of "moving computation close to the data" still works well: scheduling tasks as near as possible to their data sources greatly lightens the network load, as Hadoop and Spark have long demonstrated. Overall, only by combining the file system, memory buffering, and scheduling strategy can an HPC cluster sustain stable I/O performance when processing tens of thousands of samples in parallel.
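The locality principle above can be sketched as a greedy placement rule, in the spirit of Hadoop-style locality-aware scheduling. All names here are illustrative (`replica_map`, `node_load`), and real schedulers use far richer cost models; this only shows the "prefer a node that already holds the data" idea.

```python
def schedule_with_locality(tasks, replica_map, node_load):
    """Greedy locality-aware placement.

    tasks:       list of (task_id, data_block) pairs
    replica_map: data_block -> list of nodes holding a replica
    node_load:   node -> current number of assigned tasks (mutated)

    Prefer the least-loaded node that already holds the task's input
    block; otherwise fall back to the least-loaded node overall.
    """
    placement = {}
    for task, block in tasks:
        local = [n for n in replica_map.get(block, []) if n in node_load]
        if local:
            node = min(local, key=lambda n: node_load[n])
        else:
            node = min(node_load, key=node_load.get)
        placement[task] = node
        node_load[node] += 1
    return placement
```

When a replica is available, the task reads from local disk instead of the network, which is exactly the effect the "computing near the data" principle aims for.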
