Computational Molecular Biology, 2018, Vol.8, No.1, 1-13
transcripts (PUTs) including 28,316 contigs and 250,734 singlets were obtained for mapping to the genome
sequences.
The assembled PUTs were mapped to their corresponding chromosomes using ASFinder (.ysu.edu/tools/ASFinder.html/)
(Min, 2013). We applied the following thresholds: a minimum of 95% identity, a minimum aligned length of 80 bp,
and > 75% of a PUT sequence aligned to the genome (Walters et al., 2013).
ASFinder uses the SIM4 program (Florea et al., 1998) to align PUTs to the genome and then identifies PUTs that
map to the same genomic location but have variable exon-intron boundaries as AS isoforms. To avoid chimeric PUT
assemblies, mapped PUTs with an intron size > 100 kb were excluded from AS identification. The output file
(AS.gtf) from ASFinder was submitted to the AStalavista server (.crg.es/astalavista/) for AS event
classification (Foissac and Sammeth, 2007). The proportion of alternatively spliced genes was estimated as the
number of genomic loci with AS PUT isoforms divided by the total number of genomic loci with at least one
mapped PUT sequence.
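The filtering and rate calculation described above can be sketched in a few lines of Python. This is only an illustration, not ASFinder's implementation: the `Mapping` record and its field names are hypothetical, and the sketch assumes that PUTs removed by the intron-size filter are excluded from both the numerator and the denominator, which the text does not state explicitly.

```python
from dataclasses import dataclass

# Thresholds taken from the text above.
MIN_IDENTITY = 95.0       # minimum percent identity
MIN_ALIGNED_BP = 80       # minimum aligned length in bp
MIN_COVERAGE = 0.75       # fraction of the PUT aligned must exceed this
MAX_INTRON_BP = 100_000   # PUTs with a longer intron are treated as possible chimeras


@dataclass
class Mapping:
    """Hypothetical summary of one PUT-to-genome alignment."""
    put_id: str
    locus_id: str
    identity: float       # percent identity of the alignment
    aligned_bp: int       # number of aligned bases
    put_length: int       # full length of the PUT
    max_intron_bp: int    # largest inferred intron
    is_as_isoform: bool   # flagged as an AS isoform at its locus


def passes_thresholds(m: Mapping) -> bool:
    """Apply the identity, length, coverage, and intron-size filters."""
    coverage = m.aligned_bp / m.put_length
    return (
        m.identity >= MIN_IDENTITY
        and m.aligned_bp >= MIN_ALIGNED_BP
        and coverage > MIN_COVERAGE
        and m.max_intron_bp <= MAX_INTRON_BP
    )


def as_rate(mappings: list[Mapping]) -> float:
    """Loci with AS isoforms divided by loci with at least one mapped PUT."""
    kept = [m for m in mappings if passes_thresholds(m)]
    mapped_loci = {m.locus_id for m in kept}
    as_loci = {m.locus_id for m in kept if m.is_as_isoform}
    return len(as_loci) / len(mapped_loci) if mapped_loci else 0.0
```

Counting loci through sets means that a locus with several AS isoforms contributes only once to the numerator, which matches the per-locus definition of the rate.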
1.3 Functional annotation of PUTs and data availability
The PUT sequences were functionally annotated, including protein-coding region prediction and domain searches.
The coding region of each PUT was predicted using OrfPredictor (Min et al., 2005a), and full-length
transcript coverage was assessed using TargetIdentifier (Min et al., 2005b). Functional classification was based on
a BLASTX search with an E-value threshold of 1e-5 against UniProtKB/Swiss-Prot. In addition, predicted
protein sequences from OrfPredictor were annotated for functional domains using rpsBLAST against the
Pfam database (
/). The assembled PUTs were further compared with transcripts of predicted
gene models using BLASTN with a cutoff E-value of 1e-10, ≥ 95% identity, and a minimum aligned length of
80 bp. Gene Ontology (GO) information was extracted from the UniProt ID mapping table based on the BLASTX
search of PUT sequences against UniProtKB/Swiss-Prot (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/).
The GO categories were further analyzed with GO Slim Viewer using plant-specific GO terms (McCarthy et al., 2006).
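As an illustration of the BLASTN cutoffs above, the following sketch filters hits from tab-separated BLAST output. It assumes the hits were written in the default 12-column tabular layout of BLAST+ (`-outfmt 6`), which the text does not specify, and it treats the E-value cutoff as inclusive. The `Hit` type and both function names are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Subset of the default BLAST+ tabular columns used for filtering."""
    query: str
    subject: str
    pident: float   # percent identity (column 3)
    length: int     # alignment length (column 4)
    evalue: float   # expect value (column 11)


def parse_hit(line: str) -> Hit:
    """Parse one line of 12-column tabular BLAST output (assumed format)."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]))


def passes_blastn_cutoffs(
    hit: Hit,
    max_evalue: float = 1e-10,   # E-value cutoff from the text
    min_identity: float = 95.0,  # percent identity cutoff from the text
    min_length: int = 80,        # minimum aligned length in bp from the text
) -> bool:
    """Return True if the hit satisfies all three cutoffs."""
    return (
        hit.evalue <= max_evalue
        and hit.pident >= min_identity
        and hit.length >= min_length
    )
```

The same function could serve for the BLASTX annotation step by passing `max_evalue=1e-5` and relaxing the identity and length cutoffs, since only an E-value threshold is given for that search.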
1.4 Availability of data
The assembled PUTs and AS events identified in this study, along with the predicted gene models and data
previously reported by our group, are available from the Plant Alternative Splicing Database
(.ysu.edu/altsplice/) (VanBuren et al., 2013; Walters et al., 2013; Min et al., 2015; Wai et al., 2016; Min, 2017;
Sablok et al., 2017). A BLAST search is also available for querying the PUTs and AS isoforms. The datasets
supporting the conclusions of this article, including the data used for database construction and the supplementary
data, are publicly available at:
/.
2 Results and Discussion
2.1 Transcript assembly and annotation
After removing contaminant and low-complexity sequences from the combined ESTs and mRNAs of cotton (G. hirsutum
L.), we used the CAP3 program to assemble the cleaned data. A total of 279,050 putative unique transcripts
(PUTs), including 28,316 contigs and 250,734 singlets, were obtained for further annotation and mapping to the
genome sequences. The PUTs ranged in length from 100 bp to 20,499 bp, with an average length of 975 bp
(Table 1). All PUTs were structurally and functionally annotated, including ORF prediction, assessment of coding
region completeness, putative function assignment, and Pfam domain prediction. These basic features are summarized in
Table 1. A total of 278,650 ORFs were predicted using OrfPredictor, including 201,924 predicted
using the frame values obtained from the BLASTX search against the UniProt Swiss-Prot dataset and 72,726
predicted based on intrinsic signals in the sequences (Min et al., 2005a). Among them, 128,505
PUTs were predicted to encode full-length proteins by TargetIdentifier (Min et al., 2005b). Among the predicted
ORFs, 166,174 were annotated with a protein family (Pfam) match (Table 1). Further, using a BLASTN search with
a cutoff of 95% identity, 247,871 (88.8%) PUTs matched predicted mRNA sequences of predicted protein-coding
gene models (Li et al., 2015).
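The PUT total and the matched percentage reported above follow directly from the stated counts, as this short check shows. The variable names are illustrative only.

```python
# Counts as reported in the text.
contigs = 28_316
singlets = 250_734
matched_to_gene_models = 247_871

# Total PUTs is the sum of contigs and singlets.
total_puts = contigs + singlets

# Share of PUTs matching predicted mRNA sequences of gene models.
pct_matched = 100 * matched_to_gene_models / total_puts

print(total_puts)             # 279050
print(f"{pct_matched:.1f}%")  # 88.8%
```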