Computational Molecular Biology

hypothetical and conserved hypothetical proteins must

be predicted as they might play a vital role in cellular

physiology of microorganisms. Hypothetical proteins

are proteins of unknown function with no homology or experimental evidence, whereas conserved hypothetical proteins are proteins of unknown function that do show phylogenetic distribution and homology (Wood et al., 2001; Riley et

al., 2006). These uncharacterized proteins might be

involved in regulation of gene expression, cell signal

transduction, host–parasite interaction and complex

secondary metabolism (including the synthesis of antibiotics and other biologically active compounds). Biochemical investigation of conserved hypothetical proteins therefore makes it possible to discover new biomolecules with pharmacological and biotechnological significance (Galperin and Koonin,

2010; Roberts et al., 2001). Predicting protein function with a range of bioinformatics tools makes the annotation process easier and more efficient

(Altschul et al., 1999).

The wealth of biological information on E. coli is increasing rapidly (Serres et al., 2001) and is

contributing to a better understanding of this organism

as well as functions encoded in other organisms (Karp

et al., 2007). It is therefore important that the most up-to-date and accurate information on E. coli functions is made available to the scientific community. Functional re-annotation, the process of annotating a previously annotated genome afresh, helps provide deeper insight into the genome (Rajadurai et al., 2011). This process generally

involves a variety of computational techniques for

functional prediction. Such functional assignments

could also be achieved using more advanced high-throughput technologies; however, such techniques are highly laborious and expensive (Valencia, 2005). Hence, in silico functional re-analysis would assist in making quick and reliable

functional predictions. Functional re-annotation can potentially provide answers regarding higher levels of cellular processes, such as metabolism, transport, pathogenicity and regulation, thereby facilitating the elucidation of individual proteins in a

proteome (Zheng et al., 2002). Moreover, the results

of re-annotation would also be helpful in

understanding the dynamic interactions of the proteins

and the underlying mechanisms of metabolic processes, since all of these processes are accomplished by large ordered complexes or cascades of proteins. Re-annotation also helps in identifying new protein functions that offer real promise of new therapies for communicable

diseases and genetic diseases. Previously, the genomes of several organisms, including Mycoplasma pneumoniae (Dandekar et al., 2000), Mycobacterium tuberculosis H37Rv (Camus et al., 2002), Campylobacter jejuni (Gundogdu et al., 2007), Geobacter sulfurreducens (Ashok et al., 2014) and Saccharomyces cerevisiae (Wood et al., 2001), were successfully re-annotated using various computational strategies, and they now serve as useful sources of information in biological research.

Similarly, previous genome analysis work on E. coli analyzed all the available annotations and provided a snapshot of their functional information. However, at the end of these analyses, the authors reported ~14% of the sequences as unknown. The results were reported as text files and Excel sheets. Moreover, these analyses were carried out in 1997 and 2005, and since then a fresh set of annotations has appeared (Blattner et al., 1997; Riley et al., 2006).

Hence, the aim of this work is to perform a manual re-annotation of the complete proteome sequences of E. coli strain K-12 using a multiple and dynamic biological data fusion strategy, in which information from several protein databases is carefully compared and analyzed before functions are assigned, and to make the results available as a public database useful to the scientific community engaged in E. coli research. In this

re-annotation work, a dynamic biological data fusion strategy has been implemented to predict sequence function. This strategy dynamically forms integrated data sets by combining heterogeneous data from multiple databases to maximize knowledge sharing (Elmore et al., 2003).
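
As an illustration only, the comparison step of such a data fusion strategy can be sketched in Python. The sketch assumes a simple rule that is not taken from this work: a function is accepted when at least two databases agree on it after text normalization, and every other case is left for manual inspection. The database labels (`db_a`, `db_b`, `db_c`), the `Annotation` record, the list of uninformative labels and the `min_sources` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    source: str    # database the annotation came from (hypothetical label)
    function: str  # free-text function description


# Labels treated as "no functional information" (an assumption of this sketch).
UNINFORMATIVE = {
    "",
    "unknown",
    "hypothetical protein",
    "conserved hypothetical protein",
    "uncharacterized protein",
}


def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def fuse_annotations(annotations, min_sources=2):
    """Combine annotations of one protein from several databases.

    Returns a (status, function, sources) tuple where status is
    "consensus", "needs_review" or "unknown".
    """
    support = {}  # normalized function -> set of supporting databases
    for ann in annotations:
        label = normalize(ann.function)
        if label not in UNINFORMATIVE:
            support.setdefault(label, set()).add(ann.source)

    if not support:
        return ("unknown", None, [])

    # Rank by number of supporting databases, then alphabetically.
    ranked = sorted(support.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    label, sources = ranked[0]
    tied = len(ranked) > 1 and len(ranked[1][1]) == len(sources)

    if len(sources) >= min_sources and not tied:
        return ("consensus", label, sorted(sources))
    # Too little support, or conflicting calls: leave for manual curation.
    return ("needs_review", label, sorted(sources))


records = [
    Annotation("db_a", "Thiamine kinase"),
    Annotation("db_b", "thiamine  kinase"),
    Annotation("db_c", "hypothetical protein"),
]
status, function, sources = fuse_annotations(records)
print(status, function, sources)
```

Exact string matching after normalization is a deliberately crude stand-in for the careful comparison described above. A real pipeline would more likely compare controlled vocabulary terms or identifiers, and the "needs_review" cases are where manual curation would take over.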

In recent years, the accumulation of complete genome sequences and related protein databases (Pearson and Lipman, 1988) provides useful comparisons with close relatives among other organisms and facilitates
