Computational Molecular Biology, 2018, Vol.8, No.1, 1-13

http://cmb.biopublisher.ca

transcripts (PUTs) including 28,316 contigs and 250,734 singlets were obtained for mapping to the genome

sequences.

The assembled PUTs were mapped to their corresponding chromosomes using ASFinder (http://proteomics.ysu.edu/tools/ASFinder.html/) (Min, 2013). We applied the following thresholds: a minimum of 95% identity, a minimum aligned length of 80 bp, and > 75% of a PUT sequence aligned to the genome (Walters et al., 2013). ASFinder uses the SIM4 program (Florea et al., 1998) to align PUTs to the genome and then identifies as AS isoforms those PUTs that map to the same genomic location but have variable exon-intron boundaries. To avoid chimeric PUT assemblies, mapped PUTs having an intron size > 100 kb were excluded from AS identification. The output file (AS.gtf) from ASFinder was submitted to the AStalavista server (http://genome.crg.es/astalavista/) for AS event classification (Foissac and Sammeth, 2007). The rate of alternatively spliced genes was estimated as the ratio of genomic loci having alternatively spliced PUT isoforms to the total number of genomic loci having at least one mapped PUT sequence.
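The filtering thresholds and the AS rate described above can be illustrated with a minimal Python sketch. This is not ASFinder's code: the `Alignment` record, its field names, and the helper functions are hypothetical, and 100 kb is taken to mean 100,000 bp. Only the threshold values come from the text.

```python
from dataclasses import dataclass
from typing import Set

# Threshold values stated in the text.
MIN_IDENTITY = 95.0       # minimum percent identity
MIN_ALIGNED_BP = 80       # minimum aligned length in bp
MIN_COVERAGE = 0.75       # more than 75% of the PUT must align
MAX_INTRON_BP = 100_000   # PUTs with an intron > 100 kb are excluded


@dataclass
class Alignment:
    """Hypothetical summary of one PUT-to-genome alignment."""
    put_id: str
    locus: str
    identity: float       # percent identity of the alignment
    aligned_bp: int       # number of PUT bases aligned
    put_length: int       # full length of the PUT
    max_intron_bp: int    # largest inferred intron (0 if unspliced)


def passes_filters(a: Alignment) -> bool:
    """Return True if the alignment meets all mapping thresholds."""
    coverage = a.aligned_bp / a.put_length
    return (
        a.identity >= MIN_IDENTITY
        and a.aligned_bp >= MIN_ALIGNED_BP
        and coverage > MIN_COVERAGE
        and a.max_intron_bp <= MAX_INTRON_BP
    )


def as_rate(mapped_loci: Set[str], as_loci: Set[str]) -> float:
    """Loci with AS isoforms divided by loci with at least one mapped PUT."""
    if not mapped_loci:
        return 0.0
    return len(as_loci & mapped_loci) / len(mapped_loci)
```

For example, `as_rate({"locus1", "locus2", "locus3", "locus4"}, {"locus2"})` gives 0.25, meaning one of four mapped loci carries AS isoforms.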

1.3 Functional annotation of PUTs and data availability

The PUT sequences were functionally annotated, including prediction of protein-coding regions and domain searches. The coding region of each PUT was predicted using OrfPredictor (Min et al., 2005a), and full-length transcript coverage was assessed using TargetIdentifier (Min et al., 2005b). Functional classification was based on a BLASTX search against UniProtKB/Swiss-Prot with an E-value threshold of 1e-5. In addition, the protein sequences predicted by OrfPredictor were annotated for functional domains using rpsBLAST against the Pfam database (http://pfam.xfam.org/). The assembled PUTs were further compared with transcripts of predicted gene models using BLASTN with a cutoff E-value of 1e-10, ≥ 95% identity, and a minimum aligned length of 80 bp. Gene Ontology (GO) information was extracted from the UniProt ID mapping table based on the BLASTX search of PUT sequences against UniProtKB/Swiss-Prot (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/). The GO categories were further analyzed using GO Slim Viewer with plant-specific GO terms (http://www.agbase.msstate.edu/cgi-bin/tools/goslimviewer_select.pl) (McCarthy et al., 2006).
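As an illustration of the BLASTN cutoffs described above, the following Python sketch filters hits by E-value, percent identity, and aligned length. It assumes the hits are in the default 12-column BLAST+ tabular layout (`-outfmt 6`), which the text does not specify. The function name `filter_blastn_hits` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def filter_blastn_hits(
    lines: Iterable[str],
    max_evalue: float = 1e-10,   # cutoff E-value from the text
    min_identity: float = 95.0,  # >= 95% identity
    min_length: int = 80,        # minimum aligned length in bp
) -> Iterator[Tuple[str, str]]:
    """Yield (query_id, subject_id) for hits that pass all three cutoffs.

    Assumes the default BLAST+ tabular columns:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows
        pident = float(fields[2])
        length = int(fields[3])
        evalue = float(fields[10])
        if evalue <= max_evalue and pident >= min_identity and length >= min_length:
            yield fields[0], fields[1]
```

The same function could be reused for the BLASTX step by passing `max_evalue=1e-5` and relaxing the identity and length cutoffs, since the text gives only an E-value threshold for that search.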

1.4 Availability of data

The assembled PUTs and AS events identified in this study, along with the predicted gene models and data reported previously by our group, are available from the Plant Alternative Splicing Database (http://proteomics.ysu.edu/altsplice/) (VanBuren et al., 2013; Walters et al., 2013; Min et al., 2015; Wai et al., 2016; Min, 2017; Sablok et al., 2017). A BLAST search is also available for searching the PUTs and AS isoforms. The datasets supporting the conclusions of this article, including the data used for database construction and the supplementary data, are publicly available at: http://bioinformatics.ysu.edu/publication/data/Cotton

2 Results and Discussion

2.1 Transcript assembly and annotation

After removing contaminant and low-complexity sequences from the combined ESTs and mRNAs of cotton (Gossypium hirsutum L.), we used the CAP3 program to assemble the cleaned data. A total of 279,050 putative unique transcripts (PUTs), including 28,316 contigs and 250,734 singlets, were obtained for further annotation and mapping to the genome sequences. The PUTs ranged in length from 100 bp to 20,499 bp, with an average length of 975 bp (Table 1). All PUTs were structurally and functionally annotated, including ORF prediction, coding region completeness assessment, putative function assignment, and Pfam domain prediction. These basic features are summarized in Table 1. A total of 278,650 ORFs were predicted using OrfPredictor, of which 201,924 were predicted using the frame values obtained from the BLASTX search against the UniProt Swiss-Prot dataset and 72,726 were predicted based on the intrinsic signals in the sequences (Min et al., 2005a). Among them, 128,505 PUTs were predicted by TargetIdentifier to encode full-length proteins (Min et al., 2005b). Among the predicted ORFs, 166,174 were annotated with a protein family (Pfam) match (Table 1). Further, in a BLASTN search with a cutoff of 95% identity, 247,871 (88.8%) PUTs matched predicted mRNA sequences of predicted protein-coding gene models (Li et al., 2015).
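The PUT total and the 88.8% match rate follow directly from the counts reported above. The short Python sketch below reproduces that arithmetic. The variable names are illustrative and the figures are taken from the text.

```python
# Counts reported in the text.
contigs = 28_316
singlets = 250_734
matched_puts = 247_871  # PUTs matching predicted gene-model transcripts

# Total PUTs is the sum of contigs and singlets.
total_puts = contigs + singlets

# Percentage of PUTs that matched predicted mRNA sequences.
pct_matched = 100 * matched_puts / total_puts

print(f"Total PUTs: {total_puts:,}")
print(f"Matched to gene models: {pct_matched:.1f}%")
```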
