Computational Molecular Biology, 2018, Vol.8, No.1, 1-13
2.2 Mapping transcripts to cotton genome
We used relatively strict mapping parameters to map PUTs to the genome, as described in the Methods section. The
identity threshold of 95% prevented PUTs from being mapped to genome segments of lower similarity that arose from
ancient genome or gene duplications. On the other hand, it still allowed accurate mapping by tolerating errors in
PUT sequences that might have been introduced in the original ESTs or during assembly, as well as sequence errors
in the assembled genome and variation among varieties or ecotypes of the same species. We have used the same
procedure to map PUTs to the genomes of other plant species, including cereal plants and fruit plants (Min et al.,
2015; Min, 2017; Sablok et al., 2017). A total of 196,098 PUTs (70.3% of the total assembled PUTs) were mapped to
the G. hirsutum genome, including 113,180 PUTs mapped to a single genomic locus and 82,918 PUTs mapped to two or
more genomic loci (Table 2). The relatively large proportion of PUTs with more than one mapping locus (42.3% of
mapped PUTs) was apparently due to the G. hirsutum AtDt genome consisting of both an A subgenome and a D subgenome
(Li et al., 2015; Zhang et al., 2015). The diploid genomes of G. arboreum (AA) and G. raimondii (DD), which
diverged about 5-10 million years ago (MYA), have been sequenced (Wang et al., 2012; Li et al., 2014a). Our
analysis showed that the homologous mRNA sequences of diploid G. arboreum (AA) and G. raimondii (DD) still share
97-100% identity.
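The effect of the 95% identity threshold on single-locus versus multi-locus assignment can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout `(put_id, locus_id, percent_identity)`, the function name `classify_puts`, and the toy alignments are hypothetical and serve only to show the counting logic.

```python
from collections import defaultdict

IDENTITY_THRESHOLD = 95.0  # percent identity, the cutoff stated in the text


def classify_puts(alignments, threshold=IDENTITY_THRESHOLD):
    """Split PUTs into single-locus and multi-locus sets.

    `alignments` is an iterable of (put_id, locus_id, percent_identity)
    tuples. This layout is an assumption made for illustration only.
    """
    loci_by_put = defaultdict(set)
    for put_id, locus_id, identity in alignments:
        # Keep only alignments at or above the identity cutoff.
        if identity >= threshold:
            loci_by_put[put_id].add(locus_id)
    single = {p for p, loci in loci_by_put.items() if len(loci) == 1}
    multi = {p for p, loci in loci_by_put.items() if len(loci) > 1}
    return single, multi


# Toy data (hypothetical): PUT1 matches both subgenomes, PUT2 passes the
# cutoff at one locus only, PUT3 passes nowhere and stays unmapped.
toy = [
    ("PUT1", "A01:100", 99.2),
    ("PUT1", "D01:120", 97.5),
    ("PUT2", "A05:300", 98.0),
    ("PUT2", "D05:310", 91.0),
    ("PUT3", "A09:50", 90.0),
]
single, multi = classify_puts(toy)
```

With homoeologous A and D copies sharing 97-100% identity, many transcripts would be expected to behave like PUT1 in this toy example and pass the cutoff at two loci.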
Table 1 Basic features of the assembled putative unique transcripts (PUTs) of cotton plant
| Total PUTs | Average length of PUTs (bp) | BLASTX matches against Swiss-Prot data (%) | Total ORFs (%) | Full-length PUTs (%) | Pfam matches (%) | PUTs match with predicted gene models (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 279050 | 975 | 201924 (72.4) | 278650 (99.9) | 115043 (41.2) | 155446 (55.7) | 247871 (88.8) |
Table 2 Mapping of putative unique transcripts (PUTs) to cotton genome
| PUTs mapped to genome (% of total PUTs) | PUTs mapped to single locus (% of mapped PUTs) | PUTs mapped to two or more loci (% of mapped PUTs) | Total genomic loci with mapped PUTs | Genomic loci with alternative splicing (AS) | AS rate (%) |
| --- | --- | --- | --- | --- | --- |
| 196098 (70.3) | 113180 (57.7) | 82918 (42.3) | 88420 | 23930 | 27.1 |
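The percentages in Table 2 follow directly from the counts in Tables 1 and 2. The short check below recomputes them; the variable names and the helper `pct` are ours, introduced only for illustration.

```python
# Counts taken from Table 1 (total PUTs) and Table 2 (all others).
total_puts = 279_050
mapped = 196_098
single_locus = 113_180
multi_locus = 82_918
total_loci = 88_420
as_loci = 23_930


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


mapped_pct = pct(mapped, total_puts)                 # 70.3
unmapped_pct = pct(total_puts - mapped, total_puts)  # 29.7
single_pct = pct(single_locus, mapped)               # 57.7
multi_pct = pct(multi_locus, mapped)                 # 42.3
as_rate = pct(as_loci, total_loci)                   # 27.1

print(mapped_pct, unmapped_pct, single_pct, multi_pct, as_rate)
```

The single-locus and multi-locus counts sum to the number of mapped PUTs, so the two percentages of mapped PUTs add up to 100%.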
The PUTs were mapped to a total of 88,420 genomic loci (Table 2). This number was higher than the number of
genes reported by the genome sequencing projects: 76,943 genes were reported by Li et al. (2015) and 66,434
genes by Zhang et al. (2015). The mapped PUTs located in regions outside of the predicted genes may represent
genes that remain to be annotated.
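One simple way to flag mapped loci that fall outside predicted genes is an interval-overlap test. The text does not describe how this comparison was made, so the sketch below is an assumption: the coordinates are hypothetical, intervals are treated as 1-based and inclusive, strand is ignored, and `loci_outside_genes` is an illustrative name.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def loci_outside_genes(loci, genes):
    """Return the loci that overlap no predicted gene on the same sequence.

    `loci` and `genes` are lists of (chrom, start, end) tuples; this
    layout is assumed for illustration, not taken from the paper.
    """
    genes_by_chrom = {}
    for chrom, start, end in genes:
        genes_by_chrom.setdefault(chrom, []).append((start, end))

    outside = []
    for chrom, start, end in loci:
        candidates = genes_by_chrom.get(chrom, [])
        if not any(overlaps(start, end, gs, ge) for gs, ge in candidates):
            outside.append((chrom, start, end))
    return outside


# Toy coordinates (hypothetical).
genes = [("A01", 1000, 2000), ("D01", 500, 900)]
loci = [("A01", 1500, 2500), ("A01", 3000, 3500), ("D01", 950, 1200)]
novel = loci_outside_genes(loci, genes)
```

The linear scan per locus is adequate for a toy example; at the scale of tens of thousands of loci, sorted intervals or an interval tree would be a more practical choice.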
It should be noted that 29.7% of the PUTs were not mapped to the draft genome sequences (Table 2). Possible
reasons include incompleteness of the genome sequences and errors in the PUTs or genomic sequences, including
sequencing errors and misassembly. However, these unmapped PUTs were annotated and are available from our
database; this information might be useful for identifying new genes in cotton species.
2.3 Detection and classification of alternative splicing events
The GTF (gene transfer format) file of PUT-to-genome mappings generated by ASFinder was submitted to the
AStalavista server for identification and classification of AS events (Foissac and Sammeth, 2007; Min, 2013). A
total of 56,080 AS events were detected and classified, including 41,150 (73.4%) basic events and 14,930 (26.6%)
complex events, each of which comprised more than one basic event (Figure 1). These AS events were generated from
23,930 genomic loci (clusters) with 44,239 unique transcripts (Table 2). Because a total of 88,420 genomic loci
had at least one PUT mapped, the estimated AS rate of genes generating AS isoforms (AS genes) was 27.1% in cotton
(Table 2). However, based on the PUT mapping data, there were 25,427 genomic loci whose mapped PUTs contained no
intron. Thus, considering only the genomic loci whose mapped PUTs contained one or more introns, the AS rate was
40.0% in this cotton dataset. The AS rate in cotton is lower than the rates in Arabidopsis (~60%) and maize (55%)
reported previously (Marquez et al., 2012; Mei et al., 2017; Min, 2017); this is apparently due to the relatively
low number of EST and mRNA sequences available for the current analysis. Recently, RNA-seq analysis in
G. raimondii and G. davidsonii revealed 31.6% and 32.0% AS rates, respectively, in intron-containing genes (Li et