Computational Molecular Biology 2017, Vol.7, No.1, 1-11
PUTs and AS isoforms. The data analyzed along with the GO and Pfam annotations in the study are publicly
available at:
It should be noted that the database also
contains AS data from fruit species including pineapple, apple, grape, orange, and strawberry (Wai et al., 2016;
Sablok et al., 2017).
2 Results and Discussion
2.1 Transcript assembly, annotation, and mapping to the maize genome
By pooling several sources of maize transcripts assembled from ESTs, mRNAs, and RNA-seq libraries, we
obtained a total of 614 201 putative unique transcripts with an average length of 815 bp (Table 1). Compared with
our previously assembled maize transcripts (Min et al., 2015), the number of PUTs increased by ~26%, and the
average length increased substantially, from 466 bp to 815 bp. All the assembled PUTs were structurally and
functionally annotated, including putative open reading frame (ORF) prediction, coding region full-length
prediction, putative function assignment, and Pfam prediction. These basic features are summarized in Table 1. A total of
601 196 ORFs were predicted using OrfPredictor with 247 798 of them having a BLASTX hit against the UniProt
Swiss-Prot dataset (Min et al., 2015a), and 128 505 PUTs were predicted to encode full-length proteins by
TargetIdentifier (Min et al., 2015b). Among the predicted ORFs, 166 174 were annotated with a Pfam match
(Table 1).
Using the strict mapping parameters described in Methods, a total of 320 447 PUTs (52.2%) were mapped to
the maize genome (Table 1). Among the assembled transcripts, a total of 298 248 PUTs matched cDNA sequences
generated from 38 253 unique genes with ≥95% identity of an aligned pair and a minimum of 80 bp of aligned
length (Table 1). It should be noted that some PUTs that matched cDNA sequences generated from gene models
were not mapped to the genome, because strict parameters were applied for mapping using ASFinder (Min, 2013).
Among the 320 447 PUTs mapped to the genome, 206 593 PUTs matched cDNA sequences generated
from a total of 37 751 unique genes. As mentioned above, a total of 39 475 genes were annotated in the
recent release of gene models; thus 95.6% of the gene models were supported by at least one mapped PUT, i.e.,
transcribed in our assembled data. The mapped PUTs and predicted gene models were also visualized using
Generic Genome Browser (GBrowse) (
).
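The matching criteria and percentages reported above can be illustrated with a minimal sketch. It assumes each PUT-to-cDNA alignment is available as a record with a percent identity and an aligned length; the `Hit` record, its field names, and the toy hits are hypothetical and are not part of the original pipeline. Only the two cutoffs (≥95% identity, ≥80 bp aligned) and the reported counts come from the text.

```python
from collections import namedtuple

# Hypothetical record for one PUT-to-cDNA alignment; field names are illustrative.
Hit = namedtuple("Hit", ["put_id", "gene_id", "pct_identity", "aligned_len"])

MIN_IDENTITY = 95.0   # percent identity cutoff stated in the text
MIN_ALIGNED_LEN = 80  # minimum aligned length in bp stated in the text


def passes_thresholds(hit):
    """Return True if a hit meets both cutoffs."""
    return hit.pct_identity >= MIN_IDENTITY and hit.aligned_len >= MIN_ALIGNED_LEN


def supported_genes(hits):
    """Set of gene IDs with at least one PUT hit passing the cutoffs."""
    return {h.gene_id for h in hits if passes_thresholds(h)}


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Toy alignments, made up for illustration only.
hits = [
    Hit("PUT1", "geneA", 98.5, 300),
    Hit("PUT2", "geneA", 96.0, 120),
    Hit("PUT3", "geneB", 94.9, 500),  # fails the identity cutoff
    Hit("PUT4", "geneC", 99.0, 79),   # fails the length cutoff
    Hit("PUT5", "geneD", 95.0, 80),   # passes exactly at both boundaries
]
genes = supported_genes(hits)

# Counts reported in the text.
mapped_pct = percent(320447, 614201)   # PUTs mapped to the genome
supported_pct = percent(37751, 39475)  # gene models with a mapped PUT
```

With the reported counts, `mapped_pct` and `supported_pct` reproduce the 52.2% and 95.6% figures given in the text.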
2.2 Detection and classification of alternative splicing events
ASFinder software was used to identify potential alternatively spliced isoforms based on the SIM4 output of
aligning PUTs to the maize genome (Min, 2013; Florea et al., 1998). The AS events were classified using the
AStalavista server (Foissac and Sammeth, 2007). A total of 192 624 AS events were detected and classified,
including 103 566 (53.8%) basic events and 89 058 (46.2%) complex events, which are formed by combinations
of various types of basic events (Table 2). These AS events were generated from 91 128 PUTs from 26 669
genomic loci. Among 91 128 alternatively spliced PUT isoforms, 81 260 matched to cDNAs of 20 860 gene
models. The isoforms not matching a gene model may represent new gene loci or lie in the untranslated regions
of known gene models. Similar to our previous studies in maize and other plants (Walters et al., 2013; VanBuren et
al., 2013; Min et al., 2015), IR was the major splicing type among the four basic AS types (Table 2). The
abundance of IR as a major AS event is consistent with previous reports in maize and other plant species (Min et
al., 2015; Wang and Brendel, 2006; Baek et al., 2008; Labadorf et al., 2010; Walters et al., 2013; Thatcher et al.,
2014). However, we observed that the proportion of complex events was positively correlated with the average
length of assembled transcripts. In this study, the average length of the PUTs was 815 bp and complex AS
events accounted for 46.2%, while in our previous analysis the average length was 466 bp and complex
events accounted for 20.4% (Min et al., 2015). This trend was also observed with sorghum AS data (Min et al., 2015). AltA
(12.8%) and AltD (9.3%) were less abundant, with AltA occurring at a slightly higher
frequency than AltD (Table 2) (Min et al., 2015). ES (7.5%) was the least frequent event type,
which is in line with results reported for plants in other studies (Min et al., 2015). Because a large number of
transcripts generated using RNA-seq techniques were incorporated in this work, the numbers of AS events in all
subtypes were substantially (about 7-fold) higher than the numbers of AS events previously identified (Table 2).
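As a quick arithmetic check of the basic versus complex proportions reported in this section, the shares can be recomputed from the event counts. This is only an illustrative sketch; the helper name `shares` and the dictionary layout are assumptions, while the two counts are taken from the text.

```python
def shares(counts):
    """Map each category to its percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# Event counts reported in the text.
as_counts = {"basic": 103566, "complex": 89058}
total_events = sum(as_counts.values())
as_shares = shares(as_counts)
```

The two counts sum to the 192 624 events reported, and the computed shares match the stated 53.8% and 46.2%.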