Computational Molecular Biology
hypothetical and conserved hypothetical proteins must
be predicted as they might play a vital role in cellular
physiology of microorganisms. Hypothetical proteins
are proteins of unknown function with no homology
or experimental evidence, whereas conserved
hypothetical proteins are proteins of unknown function
that show phylogenetic distribution and homology
(Wood et al., 2001; Riley et al., 2006). These
uncharacterized proteins might be
involved in regulation of gene expression, cell signal
transduction, host–parasite interaction and complex
secondary metabolism (including antibiotic and
biologically active compounds synthesis) and
therefore biochemical investigation of conserved
hypothetical proteins makes it possible to discover new
biomolecules with pharmacological and
biotechnological significance (Galperin and Koonin,
2010; Roberts et al., 2001). Predicting protein function
with different bioinformatics tools makes the
annotation process easier and more efficient
(Altschul et al., 1999).
The wealth of biological information on E. coli is
increasing rapidly (Serres et al., 2001) and is
contributing to a better understanding of this organism
as well as functions encoded in other organisms (Karp
et al., 2007). It is therefore important that the most
up-to-date and accurate information on E. coli
functions is made available to the scientific
community. Functional re-annotation, the process of
annotating a previously annotated genome, would
help provide deeper insight into the genome
(Rajadurai et al., 2011). This process generally
involves a variety of computational techniques for
functional prediction. Such functional assignments
could also be achieved using more advanced
high-throughput technologies; however, such
techniques are highly laborious and expensive
(Valencia, 2005). Hence, in silico functional
re-analysis would assist in making quick and reliable
functional predictions. Functional re-annotation
can potentially provide answers regarding higher
levels of cellular processes, such as metabolism,
transport, pathogenicity and regulation, thereby
facilitating the elucidation of individual proteins in a
proteome (Zheng et al., 2002). Moreover, the results
of re-annotation would also be helpful in
understanding the dynamic interactions of the proteins
and the underlying mechanisms of metabolic processes,
since all of these processes are accomplished by large
ordered complexes or cascades of proteins. It also helps
in identifying new protein functions that offer real
promise of new therapies for communicable
and genetic diseases. Previously, the genomes
of several organisms, including Mycoplasma pneumoniae
(Dandekar et al., 2000), Mycobacterium tuberculosis
H37Rv (Camus et al., 2002), Campylobacter jejuni
(Gundogdu et al., 2007), Geobacter sulfurreducens
(Ashok et al., 2014) and Saccharomyces cerevisiae
(Wood et al., 2001), were successfully re-annotated
using various computational strategies, and they now
serve as useful sources of information in biological
research.
Similarly, earlier genome analysis work analyzed all
the available annotations of E. coli and provided a
snapshot of their functional information. However,
those analyses still reported ~14% of the sequences as
unknown, and their results were released as text files
and Excel sheets. Moreover, these analyses were
carried out in 1997 and 2005, and a fresh set of
annotations has appeared since then (Blattner et al.,
1997; Riley et al., 2006).
Hence, the aim of this work is to perform a manual
re-annotation of the complete proteome sequences of
E. coli strain K-12 using a multiple and dynamic
biological data fusion strategy, in which information
from several protein databases is carefully compared
and analyzed before functions are assigned, and to
make the results available as a public database for
the scientific community working on E. coli. In this
re-annotation work, the dynamic biological data fusion
strategy is used to predict sequence function. This
strategy deals with the ability to dynamically form
integrated data sets by combining heterogeneous data
from different databases to maximize knowledge
sharing (Elmore et al., 2003).
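
The comparison step of such a data fusion strategy can be illustrated with a minimal sketch. It is not the procedure used in this work: the agreement rule, the list of uninformative labels, the function name fuse_annotations and all example identifiers, database names and labels are assumptions made for illustration only. Each protein receives a function only when every informative source agrees, and conflicting cases are left for manual curation.

```python
# Labels treated as "no functional information" (an assumed list).
UNINFORMATIVE = {"", "unknown", "hypothetical protein"}


def fuse_annotations(sources):
    """Merge per-database annotations into one record per protein.

    `sources` maps a database name to a dict of protein ID -> label.
    """
    fused = {}
    protein_ids = sorted({pid for ann in sources.values() for pid in ann})
    for pid in protein_ids:
        # Normalise the label from every source that lists this protein.
        evidence = {
            name: ann[pid].strip().lower()
            for name, ann in sources.items()
            if pid in ann
        }
        informative = {lab for lab in evidence.values() if lab not in UNINFORMATIVE}
        if not informative:
            status, function = "unknown", None
        elif len(informative) == 1:
            status, function = "assigned", next(iter(informative))
        else:
            # Sources disagree: leave the call to manual curation.
            status, function = "conflict", None
        fused[pid] = {"function": function, "status": status, "evidence": evidence}
    return fused


# Made-up example data; IDs, database names and labels are placeholders.
EXAMPLE_SOURCES = {
    "db_a": {"prot_1": "Transporter", "prot_2": "hypothetical protein", "prot_3": "kinase"},
    "db_b": {"prot_1": "transporter", "prot_2": "unknown", "prot_3": "phosphatase"},
    "db_c": {"prot_1": "transporter", "prot_4": "regulator"},
}

if __name__ == "__main__":
    for pid, record in fuse_annotations(EXAMPLE_SOURCES).items():
        print(pid, record["status"], record["function"])
```

Keeping the per-source evidence in each record means a curator can see which databases disagreed before a function is assigned by hand.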
In recent years, the accumulation of complete genome
sequences and related protein databases (Pearson and
Lipman, 1988) provides useful comparisons with
close relatives among other organisms and facilitates