GC2 Biology Dictates Gene Expressivity in
Camellia sinensis
coding region. We also assessed the heterogeneity of
codon usage by analyzing the effective number of
codons (Nc). The mean and standard deviation of Nc
are 15.2 and 0.42637, respectively, indicating
that codon usage bias varies widely among the genes
of Camellia sinensis. This variation is further
confirmed by the distribution of (G+C) at the third
synonymous codon position. These results indicate
that, apart from compositional constraints, other
factors might influence the overall codon usage
variation among the genes in Camellia sinensis.
Each gene has evolved a codon usage pattern
accommodating its expression level, and an RCBS
value >0.5 and a CAI value >0.5 indicate favorable
codon usage. We calculated the CAI and RCBS values
for the genes and found that six out of ten genes
of Camellia sinensis qualify as highly expressed genes.
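The full RCBS formula is not given in this section, so the following is only a minimal sketch under an assumed, commonly cited formulation: for each codon xyz, d = (f(xyz) - f1(x) f2(y) f3(z)) / (f1(x) f2(y) f3(z)), and RCBS is the geometric mean of (1 + d) over all codons of the gene, minus 1. The function name `rcbs` is hypothetical and is not the authors' program.

```python
from collections import Counter
from math import exp, log


def rcbs(cds: str) -> float:
    """RCBS of one coding sequence under the ASSUMED formulation above."""
    cds = cds.upper()
    # Split into complete codons; trailing bases that do not form a codon are ignored.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    n = len(codons)
    if n == 0:
        raise ValueError("sequence contains no complete codon")

    codon_counts = Counter(codons)
    # Base counts at codon positions 1, 2 and 3.
    pos_counts = [Counter(c[k] for c in codons) for k in range(3)]

    log_sum = 0.0
    for c in codons:
        observed = codon_counts[c] / n
        expected = (
            (pos_counts[0][c[0]] / n)
            * (pos_counts[1][c[1]] / n)
            * (pos_counts[2][c[2]] / n)
        )
        # 1 + d equals observed / expected when d = (observed - expected) / expected.
        log_sum += log(observed / expected)

    # Geometric mean of (1 + d) over all codons, minus 1.
    return exp(log_sum / n) - 1.0
```

Under this assumed definition, a gene would be flagged as having favorable codon usage when `rcbs(seq) > 0.5`, matching the threshold quoted above.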
We also analyzed the GC content distribution across
the relative positions of codons; the results revealed
that, except for the PPO gene, all genes have an ideal
GC percentage.
We analyzed the normalized AT and GC frequencies at
each codon site. The correlations between gene
expression, as measured by CAI, and GC content at the
codon sites were generally very weak; only GC2s
showed a moderate positive correlation (0.604) with
gene expression. We also measured the correlations
between CAI and AT content at each codon site. AT2s
showed a moderate negative correlation (-0.604) with
gene expression.
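The per-site composition described above can be sketched as follows. This is a simplified illustration, not the authors' procedure: the hypothetical helper `gc_by_codon_position` counts G and C over all codons of a sequence, whereas the GC1s, GC2s and GC3s values in the text refer to synonymous codon positions. AT content at each site is taken as one minus the GC content.

```python
from typing import Tuple


def gc_by_codon_position(cds: str) -> Tuple[float, float, float]:
    """Return the GC fraction at codon positions 1, 2 and 3.

    Simplifying assumption: every complete codon is counted, including
    Met, Trp and stop codons.
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons:
        raise ValueError("sequence contains no complete codon")
    n = len(codons)
    gc1, gc2, gc3 = (sum(c[k] in "GC" for c in codons) / n for k in range(3))
    return gc1, gc2, gc3


def at_by_codon_position(cds: str) -> Tuple[float, float, float]:
    """AT fraction at each codon position, assuming only A, C, G, T occur."""
    gc1, gc2, gc3 = gc_by_codon_position(cds)
    return 1.0 - gc1, 1.0 - gc2, 1.0 - gc3
```

For example, `gc_by_codon_position("ATGGCCTAA")` splits the sequence into ATG, GCC and TAA and reports one third GC at the first and second positions and two thirds at the third.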
Moreover, our analysis revealed that the second
position of synonymous codons in Camellia sinensis
played a more prominent role than the third position
in determining the level of gene expression, as
indicated by the positive correlation coefficient
between CAI and GC2s (0.604) compared with the
correlation coefficients of CAI with GC1s and GC3s
(0.069 and 0.187). This contrasts with the reported
major role of the third codon position in determining
gene expression in E. coli, although both Camellia
sinensis and E. coli are AT rich. The prominence of
the second position was further confirmed by the
strongest negative correlation between CAI and AT2s
(-0.604), in comparison with the correlation
coefficients of CAI with AT1s and AT3s (-0.069 and
-0.172). This result might reflect the small number
of coding sequences analyzed and the fact that only
genes with high CAI and RCBS were taken for the
present investigation.
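The comparison above rests on correlation coefficients between CAI and the composition at each codon site. The section does not state which coefficient was used, so the sketch below assumes Pearson's r. The numbers are hypothetical placeholders, not data from this study. It also illustrates why the GC and AT coefficients mirror each other: when AT is computed as 1 - GC, the correlation with CAI only changes sign.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


# Hypothetical values for illustration only; NOT taken from the study.
cai = [0.52, 0.61, 0.48, 0.70, 0.66]
gc2 = [0.40, 0.45, 0.38, 0.47, 0.50]
at2 = [1.0 - g for g in gc2]  # AT content as the complement of GC content

r_gc2 = pearson(cai, gc2)
r_at2 = pearson(cai, at2)  # equals -r_gc2 up to floating-point rounding
```

With only ten genes, coefficients of this kind carry wide uncertainty, which is consistent with the caution expressed above about the small sample.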
The compositional bias of coding sequences (cds) plays
a crucial role in shaping codon usage. GC content has
a major influence on codon usage bias, resulting in a
close association between codon usage bias and GC% at
the third codon position, also called GC3 biology. As
all amino acids (except
methionine and tryptophan) allow GC-changing
synonymous substitutions in the codon third position,
this has led to a common belief that the use of
synonymous G/C-ending codons could increase the
expressivity of genes, while the usage of A/T-ending
codons could decrease the level of gene expression.
For this analysis, we initially downloaded 350 coding
sequences of Camellia sinensis, of which only ten cds
were found to begin with the initiator codon ATG, to
have a length that is an exact multiple of three
bases, and to be devoid of N (any unknown base). As
evident from the codon usage bias (CUB) analysis of
the cds and the correlation analysis between GC/AT
content at the three codon sites and the CAI value,
our results suggest that the 2nd position of
synonymous codons in Camellia sinensis possibly plays
a more prominent role than the 3rd position of codons
in determining gene expressivity.
3 Materials and Methods
3.1 Datasets
The coding sequences (cds) of Camellia sinensis were
downloaded from NCBI (www.ncbi.nlm.nih.gov). To
minimize sampling errors, we retained only those cds
that are at least 1000 bp long, have correct
initiation and termination codons, and are devoid of
N (any unknown base). The qualifying coding sequences
were retrieved using a Perl program developed by us.
Finally, ten (10) sequences were selected for CUB
analysis.
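The authors' Perl program is not shown. The selection criteria listed above can be sketched in Python as follows. The function name `is_valid_cds`, the use of the standard stop codons (TAA, TAG, TGA), and the multiple-of-three check (mentioned elsewhere in the paper) are assumptions of this sketch, not details confirmed for the original program.

```python
# Assumption: termination codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_cds(seq: str, min_length: int = 1000) -> bool:
    """Check one coding sequence against the selection criteria in the text."""
    seq = seq.upper()
    return (
        len(seq) >= min_length          # at least 1000 bp by default
        and len(seq) % 3 == 0           # whole number of codons
        and seq.startswith("ATG")       # correct initiation codon
        and seq[-3:] in STOP_CODONS     # correct termination codon
        and "N" not in seq              # devoid of unknown bases
    )


def filter_cds(records: dict) -> dict:
    """Keep only the records (name -> sequence) that pass is_valid_cds."""
    return {name: seq for name, seq in records.items() if is_valid_cds(seq)}
```

A call such as `filter_cds({"gene1": seq1, "gene2": seq2})` returns the subset of sequences that satisfy every criterion. Internal stop codons are not checked in this sketch.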
3.2 Models
Relative codon usage bias and the codon adaptation
index were used to study the overall codon usage
variation among the genes. RCBS is the difference
between the observed frequency of a codon and its
expected frequency under the hypothesis of random
codon usage where
Computational
Molecular Biology