
Legume Genomics and Genetics 2026, Vol.17, No.1, 49-67
http://cropscipublisher.com/index.php/lgg

3 Methods for Analyzing Soybean Genomic Diversity Based on SNP Markers

3.1 Development and screening of SNP markers

High-quality SNP marker sets were developed from whole-genome resequencing or genotyping-by-sequencing (GBS) of large soybean panels spanning wild and cultivated germplasm. Resequencing thousands of accessions enables discovery of millions of raw SNPs, which are then filtered for read depth, base quality, biallelic status, and low missing data to obtain a robust variant catalog distributed across all 20 chromosomes (Niu et al., 2024; Valliyodan et al., 2021). Targeted genotyping arrays, such as the Axiom SoyaSNP array with ~180K SNPs and the widely used SoySNP50K platform, were designed from these catalogs by prioritizing markers that lie in gene-rich regions, are evenly spaced along chromosomes, and have intermediate minor allele frequencies (MAF), so as to maximize information content in diversity analyses (Lee et al., 2015; Valliyodan et al., 2021). More recently, nested SNP assay series (SoySNP50K/6K/3K/1K) and reduced GBTS panels (40K/20K/10K) were assembled as subsets of high-density arrays, allowing researchers to match marker density and cost to specific germplasm characterization or breeding applications while retaining compatibility with legacy datasets (Song et al., 2024).

For global germplasm diversity studies, additional criteria were applied to ensure that the SNP panel discriminates both between wild and cultivated soybean and within each group. Panels were refined by removing monomorphic loci, markers with high missing data, and those with extreme allele frequency skews, and by retaining sites showing differentiation between Glycine soja and Glycine max and among cultivated subgroups (Niu et al., 2024). Quality control steps typically included excluding accessions with excessive missing data, applying MAF thresholds (e.g., ≥0.05), and checking marker performance through concordance with resequencing genotypes or replicate assays (Chander et al., 2021). The resulting datasets often contain tens of thousands of high-quality SNPs with low error rates and broad genomic coverage, suitable for downstream estimation of genomic diversity, identification of large-effect variants, and construction of mutant gene libraries linked to agronomic traits (Niu et al., 2024).

3.2 Metrics for assessing genomic diversity

Genomic diversity based on SNP data was quantified using standard population-genetic metrics computed at both locus and genome levels. Per-marker statistics included polymorphic information content (PIC), gene diversity (expected heterozygosity), observed heterozygosity, major allele frequency, and MAF, which together describe the informativeness and allele frequency spectrum of the SNP set (Chander et al., 2021). In soybean panels genotyped with high-throughput SNP arrays, average gene diversity values around 0.41–0.42 and PIC values near 0.32–0.33 have been reported, with a substantial proportion of markers exhibiting PIC >0.35, indicating adequate polymorphism despite the biallelic nature of SNPs and the relatively narrow genetic base of cultivated soybean (Abebe et al., 2021). Shannon’s diversity index and unbiased diversity estimates were also used at the population level to compare diversity among geographic or breeding groups (Shaibu et al., 2021).

To assess differentiation and structure across global germplasm, fixation indices (F_ST) and analysis of molecular variance (AMOVA) were used to partition genetic variation within and among predefined groups. These analyses consistently showed that the majority of variation (often >90%) resides within populations, with a smaller but significant fraction among regions or breeding pools (Rani et al., 2023).
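
As an illustration of the per-marker statistics and the MAF filter described above, the following minimal Python sketch computes missing rate, MAF, gene diversity, observed heterozygosity, and PIC for biallelic SNPs. It is not taken from the cited studies: the function name `snp_summary`, the 0/1/2 allele-count coding with NaN for missing calls, and the 20% missing-data cutoff are assumptions made for this example; only the MAF ≥0.05 threshold comes from the text. For a biallelic marker with allele frequencies p and q = 1 - p, gene diversity is 2pq and PIC is 2pq - 2p²q².

```python
import numpy as np


def snp_summary(genotypes, max_missing=0.2, min_maf=0.05):
    """Per-marker statistics for a (samples x markers) genotype matrix.

    Genotypes are assumed to be coded as counts of one allele (0, 1, 2),
    with NaN marking a missing call. The max_missing cutoff is a
    hypothetical default chosen for illustration only.
    """
    g = np.asarray(genotypes, dtype=float)
    n_called = (~np.isnan(g)).sum(axis=0)          # non-missing calls per marker
    missing_rate = 1.0 - n_called / g.shape[0]

    # Markers with no calls at all yield NaN statistics and are not kept.
    with np.errstate(invalid="ignore", divide="ignore"):
        p = np.nansum(g, axis=0) / (2.0 * n_called)    # frequency of counted allele
        obs_het = (g == 1).sum(axis=0) / n_called      # observed heterozygosity
        maf = np.minimum(p, 1.0 - p)
        gene_div = 2.0 * p * (1.0 - p)                 # expected heterozygosity
        pic = gene_div - 2.0 * (p * (1.0 - p)) ** 2    # biallelic PIC
        keep = (missing_rate <= max_missing) & (maf >= min_maf)

    return {
        "missing_rate": missing_rate,
        "maf": maf,
        "gene_diversity": gene_div,
        "observed_het": obs_het,
        "pic": pic,
        "keep": keep,
    }
```

For a marker with p = 0.5 this gives a gene diversity of 0.5 and a PIC of 0.375, the upper bound for a biallelic SNP, which is why PIC values above 0.35 mark highly informative markers.
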
Pairwise genetic distances, Nei’s diversity, and clustering-based measures summarized relationships among accessions and populations, while indices such as the number of private alleles provided additional insight into unique diversity that may be underutilized in breeding (Figure 2; Abebe et al., 2021). In large resequenced panels, linkage disequilibrium (LD) decay and the distribution of conserved versus highly polymorphic genomic regions were further examined to infer domestication bottlenecks, selective sweeps, and the effective resolution for association mapping and haplotype-based analyses of soybean diversity (Valliyodan et al., 2021).
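
The differentiation measure discussed in this section can be sketched in a similar way. The example below is a simplified, Nei-type estimate of F_ST, computed as (H_T - H_S)/H_T with numerator and denominator summed over markers and with no sample-size correction. The cited studies do not state which estimator they used, so the choice of estimator, the function name `nei_fst`, and the genotype coding (0/1/2 allele counts, NaN for missing calls) are assumptions for illustration only.

```python
import numpy as np


def nei_fst(genotypes, groups):
    """Simplified multi-locus F_ST = sum(H_T - H_S) / sum(H_T).

    genotypes: (samples x markers) matrix of allele counts 0/1/2, NaN = missing.
    groups:    one label per sample (e.g. "wild" or "cultivated").
    Groups are weighted equally; no correction for sample size is applied.
    """
    g = np.asarray(genotypes, dtype=float)
    groups = np.asarray(groups)

    freqs = []
    with np.errstate(invalid="ignore", divide="ignore"):
        for label in np.unique(groups):
            sub = g[groups == label]
            n_called = (~np.isnan(sub)).sum(axis=0)
            freqs.append(np.nansum(sub, axis=0) / (2.0 * n_called))
    freqs = np.array(freqs)                              # groups x markers

    h_s = (2.0 * freqs * (1.0 - freqs)).mean(axis=0)     # mean within-group diversity
    p_bar = freqs.mean(axis=0)
    h_t = 2.0 * p_bar * (1.0 - p_bar)                    # total diversity

    # Skip markers that are monomorphic overall or uncalled in some group.
    with np.errstate(invalid="ignore"):
        ok = np.isfinite(h_t) & (h_t > 0)
    if not ok.any():
        return float("nan")
    return float((h_t[ok] - h_s[ok]).sum() / h_t[ok].sum())
```

With two groups fixed for alternative alleles the function returns 1, and with identical allele frequencies in both groups it returns 0.
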
