
Computational Molecular Biology 2025, Vol.15, No.1, 1-12
http://bioscipublisher.com/index.php/cmb

around 1-2% for most RNA-Seq data, allowing it to accurately map reads even with minor sequencing errors present in the data. The transcript alignment BAM file, together with the annotation GFF file, was used as input for Cufflinks (v2.2.1) (Trapnell et al., 2010). For each project, the transcript GTF (Gene Transfer Format) files generated by Cufflinks from each RNA-seq dataset were merged using the cuffmerge script within the Cufflinks package. The merged GTF files from the individual RNA-seq projects were then further merged using the Cuffcompare script.

Astalavista was used for AS event classification (Foissac and Sammeth, 2007). AS events are generally classified as exon skipping (ExonS), alternative donor site (AltD), alternative acceptor site (AltA), intron retention (IntronR), and complex events. A complex AS event was counted when two or more basic events occurred in the comparison of a pair of isoforms.

RNA-seq data processing was carried out in our facility and at the Ohio Supercomputer Center.

2.3 Functional annotation of transcripts and data availability
The transcript sequences were retrieved using the gtf_to_fasta tool in the TopHat package, based on the GTF file generated by the Cuffcompare program after merging all GTF files. These transcripts were functionally annotated, including open reading frame (ORF) prediction, BLASTX searches against the UniProt-Swiss-Prot database, protein family (Pfam) assignment, and comparison with reference gene models (Min et al., 2005a; Min et al., 2005b). The transcript sequences, detailed information on AS events, and supplementary files are available at our bioinformatics site (http://bioinformatics.ysu.edu/publication/data/Aniger/).
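The order of the assembly, merging, and sequence retrieval steps described above can be summarized as a short script. The following is only a minimal sketch under stated assumptions, not the exact commands used in this study: the function name, the directory layout, the file names, and the choice of the Cufflinks -g option (annotation-guided assembly) are all illustrative, since the text states only that the annotation file was supplied together with the BAM file. The script builds the command lines without executing them.

```python
import shlex
from pathlib import Path


def build_commands(projects, annotation, genome_fasta, workdir="work"):
    """Build argument lists for each step; nothing is executed here.

    projects maps a project name to the list of its alignment BAM files.
    Returns (commands, manifests), where manifests maps each cuffmerge
    list file to the GTF paths that should be written into it.
    All paths and names are hypothetical placeholders.
    """
    work = Path(workdir)
    commands, manifests, merged_gtfs = [], {}, []
    for project, bam_files in projects.items():
        assemblies = []
        for bam in bam_files:
            out_dir = work / project / Path(bam).stem
            # Assumption: -g (annotation-guided assembly). The text only says
            # the annotation file was used as input alongside the BAM file.
            commands.append(["cufflinks", "-g", str(annotation),
                             "-o", str(out_dir), str(bam)])
            assemblies.append(str(out_dir / "transcripts.gtf"))
        # cuffmerge takes a text file listing one assembly GTF per line.
        manifest = work / project / "assemblies.txt"
        manifests[str(manifest)] = assemblies
        merged_dir = work / project / "merged"
        commands.append(["cuffmerge", "-o", str(merged_dir), str(manifest)])
        merged_gtfs.append(str(merged_dir / "merged.gtf"))
    # Combine the per-project GTFs; cuffcompare writes <prefix>.combined.gtf.
    prefix = work / "all_projects"
    commands.append(["cuffcompare", "-o", str(prefix), *merged_gtfs])
    combined = f"{prefix}.combined.gtf"
    # gtf_to_fasta (TopHat package): transcripts GTF, genome FASTA, output FASTA.
    commands.append(["gtf_to_fasta", combined, str(genome_fasta),
                     str(work / "transcripts.fa")])
    return commands, manifests


# Hypothetical example with two projects and three datasets.
example_cmds, example_manifests = build_commands(
    {"I": ["a.bam", "b.bam"], "II": ["c.bam"]}, "annotation.gff", "genome.fa")
for cmd in example_cmds:
    print(shlex.join(cmd))
```

Keeping the commands as argument lists makes it straightforward to inspect them first and then pass each one to subprocess.run once the list files named in the returned manifests have been written.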
3 Results
3.1 Mapping mRNA assembled transcripts and RNA-seq data to the genome
Starting from a total of 78,361 mRNA sequences, the cleaning procedure (trimming poly(A/T) ends and removing contaminants and repetitive sequences) yielded 78,194 sequences, which were further assembled into a non-redundant set of 23,853 transcripts for genome mapping. A total of 19,571 (82.0%) assembled transcripts were mapped to the reference genome.

We mapped a total of 303 RNA-seq datasets to the A. niger reference genome (Table 2). The accession numbers and detailed mapping information of these RNA-seq data can be found in a supplementary file (supplementary Table 1). The mapping rates varied from 70.2% to 90.0%, with 0.4% to 4.0% of reads being mapped to more than one location in the datasets collected from different projects. In total, 12.7 billion reads were collected, with 10.3 billion reads (~81.0%) being mapped to the genome. Among the mapped reads, 2.7% (~0.35 billion) were mapped to two or more genomic loci (Table 2).

Table 2 Mapping summary of RNA-seq datasets obtained from different projects
Data source   Total reads       Mapped reads      MA reads*     Mapped (%)   MA (%)
I             8,773,346,255     7,309,217,470     274,072,924   83.3         3.7
II            278,954,306       207,899,991       8,412,714     74.5         4.0
III           185,792,326       157,657,498       165,537       84.9         0.1
IV            391,730,912       352,553,672       9,121,779     90.0         2.6
V             2,042,840,126     1,501,446,874     38,297,236    73.5         2.6
VI            797,080,340       559,903,468       14,453,144    70.2         2.6
VII           276,043,016       234,248,824       902,089       84.9         0.4
Total         12,745,787,281    10,322,927,797    345,425,423   81.0         2.7
Note: * MA reads: reads mapped to more than one genomic locus

3.2 Identification of AS events
By mapping assembled mRNA transcripts, including ESTs, to the genome, we identified a total of 2,098 AS events, including 74 ExonS, 213 AltD, 397 AltA, 723 IntronR, and 691 complex events (Table 3). These AS events were generated from 1,804 genes involving 3,835 transcripts.
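As a quick consistency check on the counts reported above, the per-category numbers can be summed and converted to proportions. The snippet below is a minimal sketch: the category counts and the total are taken from the text, while the variable and function names are illustrative, and the proportions it prints are simple arithmetic on those counts rather than values reported in this study.

```python
# AS event counts as reported in the text for mRNA/EST-derived transcripts.
as_events = {"ExonS": 74, "AltD": 213, "AltA": 397, "IntronR": 723, "Complex": 691}
reported_total = 2098


def summarize(counts):
    """Return the total number of events and each category's share in percent."""
    total = sum(counts.values())
    shares = {name: 100.0 * n / total for name, n in counts.items()}
    return total, shares


total, shares = summarize(as_events)
# The category counts should add up to the reported total.
assert total == reported_total

# Print categories from most to least frequent.
for name in sorted(shares, key=shares.get, reverse=True):
    print(f"{name:8s}{as_events[name]:6d}{shares[name]:7.1f}%")
```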
