Computational Molecular Biology 2025, Vol.15, No.4, 208-217 http://bioscipublisher.com/index.php/cmb

key technologies and practical strategies in the standardization process, and combines the latest guidelines and consensuses at home and abroad to propose feasible optimization ideas, hoping to provide a reference for improving the consistency and reliability of clinical sequencing analysis.

2 An Overview of the Clinical Genomics Bioinformatics Analysis Process
2.1 Quality control and preprocessing of raw data
Bioinformatics analysis usually begins with the FASTQ files output by the sequencer, which record the base sequence and per-base quality scores of each read. The raw data may appear complete, but not all of it can be used directly for downstream analysis. During sequencing, adapter contamination, base-calling errors, and "noise" reads are inevitable; if left unfiltered, these artifacts may distort the subsequent analyses (Hao et al., 2022; Chen, 2023). Researchers therefore usually perform quality control first: removing reads with excessive "N" bases or a high proportion of low-quality bases, or directly trimming the contaminated sections. Tools like FastQC can quickly visualize quality distributions and guide filtering and trimming decisions (Hao et al., 2022). These cleaned, high-quality reads serve as the foundation for reliable variant detection downstream.

2.2 Sequence alignment
Next comes sequence alignment: the quality-controlled reads must be matched to their best positions in the reference genome. Although it may sound straightforward, alignment is one of the most critical and algorithmically challenging steps in the whole pipeline. Well-known aligners such as BWA-MEM, Bowtie2, and Novoalign can all perform the task, and among them BWA-MEM remains the most widespread owing to its balance of speed and accuracy (Alganmi et al., 2023). It uses the Burrows-Wheeler transform to enable efficient mapping.
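The indexing idea behind BWA-MEM can be illustrated with a toy example. The sketch below is plain Python, not how BWA is actually implemented (BWA builds an FM-index on top of the transform); it simply computes the Burrows-Wheeler transform of a short sequence by sorting all rotations, and inverts it to show that no information is lost:

```python
def bwt(seq: str) -> str:
    """Naive Burrows-Wheeler transform: append a sentinel that sorts
    first, sort all rotations, and read off the last column."""
    s = seq + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last: str) -> str:
    """Invert the transform by repeatedly prepending the last column
    to the sorted table (the classic quadratic reconstruction)."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith("$"))[:-1]

print(bwt("ACGTACGT"))           # TT$AACCGG
print(inverse_bwt("TT$AACCGG"))  # ACGTACGT
```

The transform groups identical characters from similar contexts together, which is what makes the compressed, searchable index used by real aligners possible.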
The alignment output is typically stored in SAM/BAM format, recording each read's genomic location and match status. To reduce systematic bias between batches, many clinical laboratories adopt the same version of the reference genome and fix their alignment parameters.

2.3 Variant detection
Only after alignment are researchers in a position to detect variants. Variant callers infer differences between the sample genome and the reference by analyzing mismatches, insertions, and deletions in the alignment data. Tools such as GATK HaplotypeCaller, FreeBayes, and SAMtools/BCFtools are commonly used; among them, GATK's algorithm, which employs local re-assembly and Bayesian modeling, is considered a "gold standard" for small variants (Wilton and Szalay, 2023). For more complex structural variants (SVs) and copy number variants (CNVs), however, these tools alone are insufficient: specialized callers such as Manta, CNVnator, or BreakDancer are often required (Minoche et al., 2021). Because different algorithms have varying sensitivity, multiple tools are often used for cross-validation in practice. Ultimately, variant information is saved in a VCF (Variant Call Format) file, which lists each variant's coordinates, type, allele frequency, and other annotations in a format widely supported across tools and databases.

3 The Necessity of Standardizing Bioinformatics Processes
3.1 Enhance the accuracy and repeatability of the results
Standardization was first proposed in clinical genomic analysis precisely to make results more accurate and repeatable. False negatives or false positives introduced by NGS analysis, once they occur at the diagnostic stage, may directly affect patients' treatment choices (Roy et al., 2018). In reality, however, analytical practices vary greatly between laboratories: software versions and parameter settings all differ, and the same sample data may yield inconsistent results when analyzed in different places.
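As a concrete illustration of the VCF layout, the sketch below parses the eight fixed, tab-separated columns of a single data record (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, per the VCF specification) and classifies the variant by comparing REF and ALT lengths. It is a simplified reader for single-allele records only; multi-allelic sites and symbolic alleles are ignored:

```python
def parse_vcf_record(line: str) -> dict:
    """Split one VCF data line into its eight fixed columns and
    classify the variant as SNV, insertion, or deletion.
    Handles simple, single-allele records only."""
    chrom, pos, vid, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    if len(ref) == 1 and len(alt) == 1:
        vtype = "SNV"
    elif len(ref) < len(alt):
        vtype = "insertion"
    else:
        vtype = "deletion"
    return {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
            "alt": alt, "filter": filt, "type": vtype}

rec = parse_vcf_record("chr1\t12345\t.\tA\tG\t60\tPASS\tDP=42")
print(rec["type"])  # SNV
```

Production pipelines use dedicated libraries (e.g. pysam or bcftools) rather than hand-rolled parsers, but the column structure they consume is exactly this.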
Studies have pointed out that in multi-laboratory comparisons, SNV calls on the same sample are only 50% to 60% concordant between different pipelines, and the concordance rate for indels is even lower (Samarakoon et al., 2025). This difference arises not only from the algorithms themselves but also from details such as alignment methods, filtering thresholds, and judgment criteria. If the entire industry adopted a unified "best practice" process, these human-introduced differences would be significantly reduced, so that the same data would yield consistent conclusions regardless of which laboratory performs the analysis (Koboldt, 2020). Standardization can also enhance stability within a laboratory. Once a validated process is fixed, the results
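The inter-pipeline concordance rates cited above come down to a set comparison. A minimal sketch, using hypothetical call sets with each variant keyed by chromosome, position, REF, and ALT, of how such a rate can be computed:

```python
def concordance(calls_a, calls_b):
    """Jaccard concordance of two variant call sets: calls shared by
    both pipelines, divided by all distinct calls made by either."""
    a, b = set(calls_a), set(calls_b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Hypothetical calls from two pipelines on the same sample,
# keyed as (chrom, pos, ref, alt).
pipeline_a = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "GA")}
pipeline_b = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr3", 10, "T", "C")}
print(concordance(pipeline_a, pipeline_b))  # 0.5 (2 shared out of 4 distinct calls)
```

Real benchmarking tools additionally normalize variant representation before comparing, since the same indel can be written at different positions; without that step, concordance is underestimated.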