Computational Molecular Biology
Each of the previously predicted and annotated E. coli
K-12 protein sequences was manually re-analyzed using
several complementary approaches: similarity-based
search (BLASTP), pattern-based search (ScanProsite),
phylogenetic classification-based search (COG),
domain-based search (ProDom) and protein family-based
search (Pfam). Each approach has a different emphasis
and collects a different set of information related to
the function of the gene products. Further, each of
these databases was designed for specific problems and
therefore has its own inherent strengths and weaknesses
(Rust et al., 2002).
Our re-annotation strategy helped in postulating
functions for almost 29% of the protein sequences. For
example (Supplementary Table S1), the hypothetical
protein encoded by the sequence ec0903 has now been
assigned as formate dehydrogenase. Importantly, ec0903
(RefSeq ID YP_001165334.1), to which we have now
assigned a function, was previously reported as a
hypothetical protein and was used in vaccine prediction
for infectious disease in humans (Xiang and He, 2009).
Re-annotation also helped in revising previously
annotated proteins. In several cases, the original
annotation was too broad and imprecise. While analyzing
such sequences, we found that they were poorly
annotated and had been assigned a false-positive
function (Aravindhan et al., 2009).
As a result of our manual re-annotation, almost 29% of
the genes whose functions had not been determined
earlier are now assigned a known function. The data
presented in REC-DB should be useful for the analysis of
E. coli gene products as well as gene products encoded
by other genomes. Hence, we believe that our
re-annotation should be useful for the scientific
community engaged in E. coli research.
3 Materials and Methods
3.1 Protein Sequences
The complete protein sequences of E. coli K-12 were
downloaded from the EcoCyc database (Keseler et al.,
2005) and analyzed. Previous functional genomics
analysis of E. coli showed a total of 4,290 protein
sequences, of which only 60% had clear/known functions.
The remaining 40% were left as unclear/unknown genes
(without functions). The functions of unknown genes,
such as those encoding hypothetical and conserved
hypothetical proteins, must be predicted because they
might play a significant role in the cellular physiology
of microorganisms. Hypothetical proteins are proteins of
unknown function with no homology or experimental
evidence, whereas conserved hypothetical proteins are
proteins of unknown function with phylogenetic
distribution and homology (Tao et al., 1999).
3.2 Functional Annotation of Sequences
The complete genome encodes a total of 4,290 protein
sequences, of which 2,560 have a clear/known function
and 1,730 have an unclear/unknown function (Table 1).
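The split reported above can be checked with simple
arithmetic. The short Python sketch below uses only the
counts quoted in the text (4,290, 2,560 and 1,730); the
variable names are illustrative.

```python
# Counts quoted in the text (Table 1); variable names are illustrative.
total_proteins = 4290
known_function = 2560
unknown_function = 1730

# The two classes should partition the whole set of sequences.
assert known_function + unknown_function == total_proteins

known_pct = 100 * known_function / total_proteins
unknown_pct = 100 * unknown_function / total_proteins

print(f"known:   {known_pct:.1f}%")    # known:   59.7%
print(f"unknown: {unknown_pct:.1f}%")  # unknown: 40.3%
```

These values round to the 60% and 40% figures quoted in
Section 3.1.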
Computers play a significant role in sequence analysis
and re-annotation because they reduce the time needed to
process large amounts of data and allow several
approaches to be integrated (Nascimento and Bazzan,
2005). Here, the functional re-annotation of the entire
E. coli proteome was carried out using an advanced
re-annotation strategy in which several sequence
analysis methods were incorporated into a coherent and
efficient annotation schema (Figure 2).
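One way to picture how results from several methods might
be brought together is a per-protein evidence record. The
Python sketch below is only an illustration and does not
reproduce the schema of Figure 2: the `Evidence` class,
the `summarize` function, the rule of reporting the most
frequently supported function, and the example hits are
all hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    protein_id: str  # identifier of the query protein
    method: str      # e.g. "BLASTP", "Pfam", "COG"
    function: str    # function suggested by that method


def summarize(evidence):
    """Group evidence by protein and report the most supported function.

    Returns {protein_id: (function, supporting_methods)}. Ties are broken
    by the order in which functions were first seen (Counter.most_common).
    """
    by_protein = defaultdict(list)
    for item in evidence:
        by_protein[item.protein_id].append(item)

    summary = {}
    for protein_id, items in by_protein.items():
        votes = Counter(item.function for item in items)
        best_function, _ = votes.most_common(1)[0]
        methods = sorted(i.method for i in items if i.function == best_function)
        summary[protein_id] = (best_function, methods)
    return summary


# Hypothetical hits for illustration only; not data from the study.
hits = [
    Evidence("prot_001", "BLASTP", "function A"),
    Evidence("prot_001", "Pfam", "function A"),
    Evidence("prot_001", "COG", "function B"),
]

for protein_id, (function, methods) in summarize(hits).items():
    print(protein_id, function, methods)
```

A simple vote is used here only because it is easy to
follow; a manual re-annotation would weigh the evidence
from each method case by case.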
3.2.1 Similarity Search Approach using AIM BLAST
Proteins that are evolutionarily related are commonly
referred to as homologues, and close homologues often
have similar functions (Ofran et al., 2005). Based on
this concept, homology-based or similarity-based
transfer of functional annotations remains a native
prediction method for assigning functions to unknown
proteins that have not been previously annotated. In
turn, because new biological information accumulates
rapidly in the protein databases, this similarity search
approach also helps in revising or updating previously
annotated functions.
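As a rough illustration of homology-based annotation
transfer, the Python sketch below reads BLAST tabular
output (the standard 12-column format produced by
`-outfmt 6`) and copies the annotation of the
best-scoring acceptable hit to each query. The E-value
and identity cutoffs, the function name
`transfer_annotations`, and the sample rows and
annotations are assumptions made for this example, not
values or data from the study.

```python
import csv
import io

# Column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def transfer_annotations(blast_tsv, subject_annotations,
                         max_evalue=1e-5, min_identity=30.0):
    """Return {query_id: annotation} taken from the best acceptable hit.

    A hit is acceptable if it passes both cutoffs and its subject has a
    known annotation; the best hit is the one with the highest bit score.
    The cutoffs are illustrative defaults, not values from the paper.
    """
    best = {}  # query_id -> (bitscore, annotation)
    reader = csv.DictReader(io.StringIO(blast_tsv), fieldnames=COLUMNS,
                            delimiter="\t")
    for row in reader:
        if float(row["evalue"]) > max_evalue:
            continue
        if float(row["pident"]) < min_identity:
            continue
        annotation = subject_annotations.get(row["sseqid"])
        if annotation is None:
            continue
        score = float(row["bitscore"])
        if row["qseqid"] not in best or score > best[row["qseqid"]][0]:
            best[row["qseqid"]] = (score, annotation)
    return {query: annotation for query, (_, annotation) in best.items()}


# Made-up rows for illustration only.
sample = "\n".join([
    "query1\tsubjA\t62.5\t200\t75\t0\t1\t200\t1\t200\t1e-80\t250",
    "query1\tsubjB\t35.0\t180\t117\t0\t5\t184\t3\t182\t1e-20\t90.5",
    "query2\tsubjC\t25.0\t100\t75\t0\t1\t100\t1\t100\t0.5\t30.1",
])
known = {"subjA": "function A", "subjB": "function B", "subjC": "function C"}

print(transfer_annotations(sample, known))  # {'query1': 'function A'}
```

With these cutoffs, query2 receives no annotation because
its only hit is too weak, which mirrors the case of a
protein that stays "hypothetical" after a similarity
search.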
BLAST (Basic Local Alignment Search Tool) is one of the
most popular and widely used bioinformatics programs
for identifying the similarity between the