Bt Research 2025, Vol.16, No.4, 136-146 http://microbescipublisher.com/index.php/bt 137

2 Characteristics and Data Sources of the Bt Genome

2.1 Structural and functional characteristics of the Bt genome

Bt strains usually have a relatively large circular chromosome (about 5-6 Mb) and multiple plasmids of varying sizes, which together form their characteristic genomic structure. The chromosome carries essential genes and many virulence-related genes, such as exosporin enzymes and toxin regulators. Plasmids are often the key to Bt insecticidal function: many cry and vip insecticidal toxin genes are located on plasmids (Cheliah et al., 2019).

Another feature of the Bt genome is the presence of a large number of insertion sequences (IS elements) and temperate bacteriophages. These mobile genetic elements promote genome recombination and evolution, so the genetic composition of different Bt strains varies greatly, and Bt is considered to have an "open genome".

A notable difference between Bt and closely related Bacillus species (such as Bacillus cereus and Bacillus anthracis) is that many Bt strains lack a functional CRISPR-Cas system. This may reduce their ability to restrict exogenous DNA and so allow them to acquire diverse plasmids and virulence genes through horizontal gene transfer.

2.2 Public databases and ways to obtain Bt genomic data

With the development of second- and third-generation sequencing technologies, a growing number of Bt genome sequences have been determined and deposited in the major public databases. Researchers can find sequencing projects for Bt strains through NCBI's BioProject and BioSample and download raw sequencing reads from the SRA database for their own assembly or analysis. In addition to NCBI, the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) share data with NCBI.
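As a rough illustration of the NCBI lookup step, the sketch below builds search URLs for NCBI's E-utilities ESearch endpoint, one per database. It only constructs the URLs and sends no request. The function name `build_esearch_url`, the default `retmax` and the `[Organism]` field filter are illustrative assumptions rather than anything prescribed by the studies cited here.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, organism="Bacillus thuringiensis", retmax=20):
    """Return an ESearch URL that lists record IDs for an organism.

    `db` is an Entrez database name such as "bioproject", "biosample" or "sra".
    The helper name and defaults are hypothetical, chosen for this sketch.
    """
    params = {
        "db": db,
        "term": f"{organism}[Organism]",  # restrict the search to the organism field
        "retmax": retmax,                 # maximum number of IDs to return
        "retmode": "json",
    }
    return f"{EUTILS_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    for db in ("bioproject", "biosample", "sra"):
        print(build_esearch_url(db))
```

Fetching one of these URLs returns a list of record identifiers. Downloading the reads themselves is a separate step that this sketch does not cover.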
In China, the National Genomics Data Center (NGDC) has established databases such as the Genome Warehouse, which also hold genomic data for some Bt strains from China. Specialized databases support retrieval and analysis of the major Bt insecticidal proteins and their genes. For example, the Bacterial Pesticidal Protein Resource Center (BPPRC) database is a web platform that collects known Bt insecticidal protein sequences and provides analysis tools (Panneerselvam et al., 2022), making it easy for researchers to look up the sequences, multiple alignments and classification of specific toxin proteins such as Cry and Vip.

2.3 Application of high-throughput sequencing technology in Bt genome research

High-throughput sequencing is the basic tool of Bt genome research. Most early Bt genome studies used second-generation sequencing to generate short-read data and obtained draft genomes through de novo assembly. Short-read sequencing offers high accuracy, high throughput and low cost. However, because read length is limited, assemblies usually consist of multiple discontinuous scaffolds, which makes it difficult to resolve complex repeat regions or large plasmids accurately.

In recent years, third-generation sequencing has been widely used for Bt genomes. Its reads can reach tens of kb and can span repeated sequences, yielding complete assemblies of both the chromosome and the plasmids. However, the per-base error rate of third-generation sequencing is relatively high, so it is often combined with second-generation data to correct errors and improve assembly accuracy (Yılmaz et al., 2022).
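The effect of read length on repeats can be illustrated with a toy de Bruijn graph, the structure that short-read assemblers typically build. The sketch below is a simplified illustration under stated assumptions: the 16-base "genome", the error-free reads and the helper names are all invented for the example, and it ignores reverse complements and sequencing errors. It is not how long-read assemblers work internally. It only shows that sequence spanning a repeat removes the ambiguity the repeat creates.

```python
from collections import defaultdict


def shred(genome, read_len):
    """Return every substring of length read_len (idealised, error-free reads)."""
    return [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, where the assembly path is ambiguous."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Hypothetical genome: the 5-base repeat "ACGTA" occurs twice in different contexts.
genome = "TG" + "ACGTA" + "CC" + "ACGTA" + "GT"

# Reads shorter than the repeat: the graph forks where the repeat ends.
short_graph = de_bruijn(shred(genome, 4), k=4)
print(branching_nodes(short_graph))  # ['GTA']

# Reads (and k-mers) longer than the repeat: every node has a single successor.
long_graph = de_bruijn(shred(genome, 8), k=7)
print(branching_nodes(long_graph))   # []
```

Each branching node forces an assembler to stop extending a contig or to guess, which is one reason short-read assemblies of repeat-rich genomes end up as many separate pieces.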
In practice, a common strategy is hybrid assembly: high-quality Illumina short reads are assembled together with Nanopore or PacBio long reads to balance assembly completeness and accuracy. For example, one Bt industrial strain was sequenced on both MGI (second-generation) and ONT (nanopore) platforms, and hybrid assembly produced a closed genome sequence of the strain. Another study assembled a Bt strain de novo from Illumina data alone, producing a draft genome of about 6.5 Mb in which three toxin gene loci and multiple secondary metabolite biosynthetic gene clusters were annotated (Williams and MacLea, 2020).

3 Genome Assembly and Annotation Tools

3.1 Commonly used genome assembly software

Bt genome assembly usually relies on general-purpose second- or third-generation assemblers to join sequencing reads into genome sequences. For Illumina short-read data, commonly used tools include SPAdes, Velvet and SOAPdenovo. These de novo assemblers use the de Bruijn graph algorithm to join reads efficiently and have been applied successfully to draft genome construction for multiple Bt strains. For example,