Genomics and Applied Biology 2015, Vol. 6, No. 6, 1-7
Khobondo et al. (2015) confirmed the existence of
codon usage bias in the porcine genome, which might
suggest weak selection of preferred codons for
translation accuracy. The codon usage bias was
subtly influenced by nucleotide composition factors
(GC, GC3, CDS length) among others. In the study,
there was a negative correlation between the genomic
codon adaptation index (gCAI), a proxy of codon
usage bias, and GC content or GC3s. However, this
finding contradicted other findings (Hershberg and
Petrov, 2010) and was attributed to differences in
genome isochore structure, to the variability (in
space and time) of gene expression in mammals,
or to differences in the methodology for calculating
codon adaptation index (CAI) variants. A negative
correlation was also reported between pig gCAI and
gene length, consistent with reports in other
organisms such as yeast, Caenorhabditis elegans,
Drosophila melanogaster, Arabidopsis thaliana, Populus
tremula and Silene latifolia (Qiu et al., 2011). This
correlation suggests that metabolic systems prefer to
express genes that are less costly (Hahn and
Kern, 2005). Despite this evidence on CUB, it is not
known how codon usage bias may
affect gene functionality and paucity. Therefore, this
study was done to relate pig CUB (the 5% of genes
showing the highest and the lowest bias) to gene
ontologies and functional genomics.
2 Materials and Methods
2.1 Sequence data
The genome sequence used for analysis was
downloaded from Ensembl v68 (Sus scrofa build
10.2) using BioMart. A total of
23,269 coding sequences (CDS) were extracted from the
female Duroc pig reference genome.
Only the 21,550 CDS with more than 50 amino acids
(150 bp) were included in the analysis. Gene ontology
(GO) terms were also downloaded from the Ensembl
genome browser.
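The length filter described above can be sketched as
follows. This is a minimal illustration, not the
authors' pipeline: the function name filter_cds, the
dictionary input and the toy sequences are hypothetical,
and whether the stop codon counts towards the 150 bp
threshold is an assumption.

```python
def filter_cds(cds_by_id, min_codons=50):
    """Keep CDS longer than min_codons codons (here, more than 150 bp).

    Assumption: length is measured on the raw nucleotide sequence,
    including any stop codon; the source does not specify this.
    """
    min_bp = 3 * min_codons
    return {gid: seq for gid, seq in cds_by_id.items() if len(seq) > min_bp}


# Toy input with hypothetical identifiers.
example = {
    "geneA": "ATG" + "GCT" * 60 + "TAA",  # 186 bp, passes the filter
    "geneB": "ATG" + "GCT" * 10 + "TAA",  # 36 bp, removed
}
kept = filter_cds(example)
```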
2.2 Codon indices: Genomic Codon Adaptation Index (gCAI)
The genomic codon adaptation index (gCAI) used in this
study was computed earlier (Khobondo et al., 2015) as
the geometric mean of the relative synonymous codon
usage (RSCU) divided by the highest possible geometric
mean of RSCU given the same amino acid (AA)
sequence, using an in-house Perl script.
The gCAI value is therefore a proxy for codon bias:
because values are normalized using codon frequencies
at equilibrium, there is no assumption of
expression bias (Khobondo et al., 2015).
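The definition above can be illustrated with a short
Python sketch. It is not the in-house Perl script: the
function names, the choice to pool codon counts over a
supplied reference set of CDS, and the decision to skip
stop codons, single-codon amino acids and codons absent
from the reference are all assumptions made for
illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Amino acid -> list of its synonymous codons.
FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    FAMILIES[aa].append(codon)


def split_codons(seq):
    """Split a CDS into complete codons, ignoring a trailing partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def rscu_table(reference_cds):
    """RSCU = observed codon count / mean count within its synonymous family."""
    counts = Counter(c for s in reference_cds for c in split_codons(s))
    table = {}
    for aa, family in FAMILIES.items():
        total = sum(counts[c] for c in family)
        for c in family:
            table[c] = counts[c] * len(family) / total if total else 1.0
    return table


def gcai(seq, rscu):
    """Geometric mean RSCU of the gene over the maximum attainable for its AA sequence."""
    log_ratios = []
    for c in split_codons(seq):
        aa = CODON_TABLE.get(c)
        # Assumption: stops, Met/Trp and ambiguous codons carry no information.
        if aa is None or aa == "*" or len(FAMILIES[aa]) == 1:
            continue
        # Assumption: codons never seen in the reference are skipped.
        if rscu[c] == 0:
            continue
        best = max(rscu[s] for s in FAMILIES[aa])
        log_ratios.append(math.log(rscu[c] / best))
    if not log_ratios:
        return None
    return math.exp(sum(log_ratios) / len(log_ratios))


# Toy reference: alanine is encoded three times by GCT and once by GCC.
reference = ["GCTGCTGCTGCC"]
rscu = rscu_table(reference)
preferred = gcai("GCTGCT", rscu)  # uses only the most frequent codon
rare = gcai("GCCGCC", rscu)       # uses only the less frequent codon
```

In this toy case the gene built from the most frequent
codon scores 1, and the gene built from the rarer codon
scores lower, which is the behaviour expected from the
ratio of geometric means.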
2.3 Analysis tools
An in-house Perl script was used to derive codon
indices as described by Khobondo et al. (2015). The five
percent (5%) most and least biased genes according to
gCAI were extracted and grouped into two categories
(low and high bias). Because not all pig genes have
associated gene names, the genes without gene names
were blasted against the human RefSeq mRNAs and
human reference protein sequences (blastn and blastp,
respectively), and the best human hit was assigned as
the gene name. Human orthologs of porcine genes were
used to perform gene ontology (GO) analysis. BiNGO
v2.44 (Maere et al., 2005), a plugin of Cytoscape
v2.8.3 (Shannon et al., 2003), was used to identify
enriched GO terms using the human gene annotation as
background. A hypergeometric test was used to assess
the significance of the enriched terms, and the Benjamini
and Hochberg correction was implemented for
multiple comparisons. Validation of over-represented
GO terms from BiNGO was done using a Perl script
that compared the GO terms between the two files
(selected highest or lowest biased) and all GO terms
downloaded from the Ensembl genome browser. Statistical
significance was computed using a chi-square test. To
guard against false enrichment, a P-value
threshold of 0.0001 was used as the significance
cutoff for GO analysis.
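The selection of extreme genes and the enrichment test
described above can be sketched as below. This is a
plain Python illustration under stated assumptions and
does not reproduce BiNGO or the validation script: all
function names, gene identifiers and term labels are
hypothetical, and applying the 0.0001 threshold to the
Benjamini-Hochberg adjusted values (rather than to raw
P-values) is an assumption.

```python
from math import comb


def select_extremes(gcai_by_gene, fraction=0.05):
    """Return the lowest- and highest-gCAI gene lists (each `fraction` of all genes)."""
    ranked = sorted(gcai_by_gene, key=gcai_by_gene.get)
    n = max(1, round(len(ranked) * fraction))
    return ranked[:n], ranked[-n:]


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enriched_terms(selected, background, genes_by_term, alpha=0.0001):
    """Terms over-represented in `selected` relative to `background`."""
    background = set(background)
    selected = set(selected) & background
    terms, pvals = [], []
    for term, genes in genes_by_term.items():
        annotated = set(genes) & background
        k = len(annotated & selected)
        terms.append(term)
        pvals.append(hypergeom_sf(k, len(background), len(annotated), len(selected)))
    adjusted = benjamini_hochberg(pvals)
    # Assumption: the threshold is applied to the adjusted values.
    return {t: q for t, q in zip(terms, adjusted) if q < alpha}


# Toy data: 100 hypothetical genes with increasing gCAI.
scores = {f"g{i}": i / 100 for i in range(100)}
low, high = select_extremes(scores)
annotation = {
    "term_A": {f"g{i}" for i in range(95, 100)},  # only the most biased genes
    "term_B": {f"g{i}" for i in range(0, 50)},    # none of the most biased genes
}
hits = enriched_terms(high, scores, annotation)
```

In the toy data only term_A passes the threshold for
the high-bias group, because all of its annotated genes
fall within the selected 5%.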
3 Results
3.1 High codon usage bias and Gene Ontology terms
Gene ontology analysis of the 5% high- and low-CUB
genes, using BiNGO and validated by the in-house Perl
script, found 28 and 71 GO terms to be significantly
enriched in the high- and low-CUB genes, respectively.
The significant GO terms covered all three gene
ontology domains: cellular components, biological
processes and molecular functions. Notable associated
GO terms such as cell surface, plasma membrane, nucleolus,
nucleoplasm and nucleus, which denote anatomical structures,
are cellular components related to biological processes.
The over-representation of ribosome and actin binding, for
translation and for holding the cellular matrix (mentioned
above), was expected in highly biased genes. The
same applies to heme binding for oxygen supply in all