with parallel file systems or object storage to facilitate data transfer. With features such as automatic scaling and serverless orchestration, bioinformatics pipelines can move freely between the cloud and HPC. It is fair to say that hybrid high-performance computing is no longer an experimental attempt, but is gradually becoming the norm in large-scale genomic research.

6 Performance Evaluation and Benchmarking
6.1 Evaluation metrics for HPC pipelines
When evaluating high-performance computing pipelines, what matters is not just whether they run fast. Speed is of course important, but throughput, scalability and cost usually need to be considered together. Throughput is the number of samples that can be processed within a given period of time, while scalability reflects how much performance improves as resources are added. The problem is that doubling the resources does not necessarily double the speed, a situation that is quite common in tests (Carrier et al., 2015). Ahmed et al. compared the performance of different workflow management systems on variant calling tasks, and the results showed significant differences among the systems in scalability and stability. As for cost, a great deal of computing power and time is sometimes spent for a gain that is far from proportional, and at that point one has to ask whether it is cost-effective. In an HPC environment, detailed indicators such as scheduling efficiency, resource utilization and I/O bandwidth are also examined; in the cloud the accounting becomes even more complicated, since rental duration, data transfer and storage costs have to be considered on top of compute. Only by comparing all these dimensions together can one see which solution is truly "more worthwhile", rather than simply focusing on speed.

6.2 Benchmark datasets and standard test frameworks
Testing practices vary greatly between teams, but the most commonly used references are still samples with a known truth set. The Genome in a Bottle (GIAB) project is the recognized standard and is best suited for evaluating the accuracy of SNP and indel detection; NA12878 is one of the most frequently used samples and has almost become a touchstone for calling accuracy. In addition, data from the Illumina Platinum Genomes and the 1000 Genomes Project are often used as controls. Some researchers also use simulated data or manually partitioned datasets to test how an algorithm performs at different scales. The key is not where the data come from, but whether the output can be compared against the reference truth set to calculate the missed detection (false negative) rate and the false detection (false positive) rate; only then can the weaknesses of an algorithm be seen clearly.
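To make this concrete, the comparison against a truth set can be reduced to simple set arithmetic over variant records. The sketch below is a simplified illustration only: it matches variants by exact chromosome, position, REF and ALT, whereas formal GIAB evaluations normally rely on haplotype-aware comparison tools (such as hap.py or RTG vcfeval) restricted to the published confident regions. The file names are placeholders, not part of any cited study.

```python
import gzip

def read_variants(vcf_path):
    """Collect (chrom, pos, ref, alt) keys from a VCF file; multi-allelic
    records are split so that each ALT allele is counted separately."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    variants = set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):          # skip meta-information and header lines
                continue
            chrom, pos, _vid, ref, alts = line.rstrip("\n").split("\t")[:5]
            for alt in alts.split(","):
                variants.add((chrom, int(pos), ref, alt))
    return variants

def benchmark(truth_vcf, query_vcf):
    """Compare a pipeline's calls against a truth set and report basic accuracy metrics."""
    truth = read_variants(truth_vcf)
    query = read_variants(query_vcf)
    tp = len(truth & query)                   # calls that match the truth set
    fp = len(query - truth)                   # calls absent from the truth set
    fn = len(truth - query)                   # truth variants the pipeline missed
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "missed_detection_rate": fn / (tp + fn) if tp + fn else 0.0,
        "false_detection_rate": fp / (tp + fp) if tp + fp else 0.0,
    }

if __name__ == "__main__":
    # Placeholder file names: a GIAB truth VCF (e.g. for NA12878) and the
    # pipeline's output, both restricted to the same confident regions.
    print(benchmark("giab_na12878_truth.vcf.gz", "pipeline_calls.vcf.gz"))
```

Note that the missed detection rate is simply 1 − recall and the false detection rate is 1 − precision, which is why benchmarking studies often report only precision and recall.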
6.3 Comparative studies of HPC-enabled and traditional pipelines
When it comes to the advantages of HPC, many comparative results are quite intuitive. For instance, in the HPC-GVCW experiments reported by Zhou et al. (2021) (Figure 1), SNP calling was performed on 25 rice genomes simultaneously in just 120 hours, whereas the traditional pipeline would have taken half a year. GPU-accelerated NGS pipelines have undergone similar benchmarks and run 65 times faster than CPU-based tools. These results further indicate that both FPGA and GPU implementations far exceed traditional CPUs in speed and scalability. Although the hardware investment is certainly not cheap, the trend is already clear: high-performance computing is turning variant calling from a months-long project into a task of hours. This performance gap has made HPC an almost indispensable core technology in large-scale genomic research.

7 Case Study
7.1 Description of computing infrastructure and datasets
In this case study, we did not build the computing environment from scratch but directly used two mature HPC platforms. The data were the high-coverage samples of the 1000 Genomes Project, 98 individuals in total, sequenced to a whole-genome depth of approximately 30× and amounting to about 632 GB after compression. The CloudLab cluster has the higher specification, with 8 nodes, each equipped with a 40-core CPU, 192 GB of memory and an NVIDIA Tesla P100 GPU. Although the Fabric environment also has 8 nodes,