
In silico Proteomic Functional Re-annotation of Escherichia coli K-12 using Dynamic Biological Data Fusion Strategy


such types of proteins can be viewed in the REC-DB database. The updated functional information of the proteins will, in turn, help researchers to develop deeper insights into the molecular systems. The complete re-annotated functional information is available as Supplementary Table S2. A REC-DB search returns three kinds of hits: genes with a clearly annotated function obtained from the re-annotation results (Supplementary Table S2, A); hypothetical gene hits, meaning that no functional prediction was obtained for the unknown gene sequence (Supplementary Table S2, B); and hits with only a predicted function, which need to be annotated further (Supplementary Table S2, C).

1.4 Inconvenient Outcomes

Although our re-annotation was efficient in updating the functional information of the E. coli genome, a few difficulties arose in the results; these are listed below.

1.4.1 Transitive Catastrophe

Transitive catastrophe is a phenomenon whereby a functional annotation is transferred from one sequence to another on the basis of sequence similarity searches although the original annotation is incorrect (Salzberg, 2007). As more genomes are annotated and more BLAST searches are carried out, the functional label of a protein sequence is passed on from one sequence to the next. It is well known that thousands of such transitive errors have propagated through the sequence databases. Thus, if incorrectly annotated information had propagated through the sequence databases against which the re-annotation was carried out, transitive catastrophes leading to false-positive functional predictions could have appeared. These cases remained difficult to handle, and we were unable to make critical decisions on them. Hence, they are left open to the scientific community or any expert for handling and suggestions.
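
The propagation mechanism described above can be illustrated with a toy simulation. This is a minimal sketch and not part of the re-annotation pipeline: the sequence names, the similarity chain and the function labels are all hypothetical, and annotation transfer is reduced to copying the label of the best-matching, already annotated sequence.

```python
# Toy model of a transitive annotation error (all names and labels are hypothetical).

def propagate(annotations, best_hit_of, order):
    """Annotate each sequence in `order` by copying the label of its best hit.

    annotations : dict mapping sequence id -> function label (already annotated)
    best_hit_of : dict mapping sequence id -> id of its most similar sequence
    order       : sequence ids in the order in which they are annotated
    """
    result = dict(annotations)
    for seq in order:
        source = best_hit_of[seq]
        if source in result:  # transfer only from an annotated sequence
            result[seq] = result[source]
    return result


# seqB is annotated from seqA, and seqC is annotated from seqB.
best_hit_of = {"seqB": "seqA", "seqC": "seqB"}
order = ["seqB", "seqC"]

# Same chain, once with a wrong starting label and once with a correct one.
wrong = propagate({"seqA": "kinase (incorrect)"}, best_hit_of, order)
right = propagate({"seqA": "phosphatase"}, best_hit_of, order)

for seq in ("seqA", "seqB", "seqC"):
    print(seq, "|", wrong[seq], "|", right[seq])
```

In this sketch a single wrong label on seqA reaches seqC although seqC was never compared with seqA directly, which is why such errors are hard to detect from the final annotation alone.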

1.5 REC-DB

The outcome of this research work has been published online as a public database named “REC-DB – A Re-annotated Escherichia coli Database”. Several enhanced features have been incorporated within this database for searching functions. In this database, the user can retrieve the re-annotated E. coli genome data by querying a REC-DB accession number (e.g. ec001), by choosing a GenBank id (GI.No. 90111633) or by giving a Gene id (Gene-ID. 948195). While querying, the user may find “Null”, “No GI” and “No Gene id” among the search options: “Null” means that there is no REC-DB function, “No GI” means that there is no GenBank id, and “No Gene id” means that no gene id occurs in REC-DB.
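
The three query modes and the placeholder values described above can be sketched as a small in-memory lookup. This is only an illustration under stated assumptions: the record layout, the function names and the function strings are hypothetical, the pairing of identifiers within a record is invented for the example, and nothing here reflects how REC-DB is actually implemented.

```python
# Sketch of the query modes described in the text. The record layout and the
# function strings are hypothetical; this is not the REC-DB implementation.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    accession: str            # REC-DB accession number, e.g. "ec001"
    gi: Optional[str]         # GenBank id, None when absent
    gene_id: Optional[str]    # Gene id, None when absent
    function: Optional[str]   # re-annotated function, None when absent


# Placeholder records; the pairing of identifiers is illustrative only.
RECORDS = [
    Record("ec001", "90111633", None, "placeholder function 1"),
    Record("ec002", None, "948195", "placeholder function 2"),
    Record("ec003", None, None, None),
]


def by_accession(acc: str) -> List[Record]:
    return [r for r in RECORDS if r.accession == acc]


def by_gi(query: str) -> List[Record]:
    # "No GI" selects the records that have no GenBank id.
    if query == "No GI":
        return [r for r in RECORDS if r.gi is None]
    return [r for r in RECORDS if r.gi == query]


def by_gene_id(query: str) -> List[Record]:
    # "No Gene id" selects the records that have no gene id.
    if query == "No Gene id":
        return [r for r in RECORDS if r.gene_id is None]
    return [r for r in RECORDS if r.gene_id == query]


def by_function(query: str) -> List[Record]:
    # "Null" selects the records that have no REC-DB function.
    if query == "Null":
        return [r for r in RECORDS if r.function is None]
    return [r for r in RECORDS if r.function == query]


print([r.accession for r in by_gi("90111633")])
print([r.accession for r in by_gene_id("No Gene id")])
print([r.accession for r in by_function("Null")])
```

Treating the placeholder strings as ordinary query values keeps the three search paths uniform: a missing field is stored once as `None` and only translated to “Null”, “No GI” or “No Gene id” at the query boundary.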

2 Discussion

Although genome projects have the potential to provide a better understanding of organisms, the lack of updated and accurate functional annotation for a genome hampers the ability to exploit these data for further research on the organism. Hence, in this in silico functional proteomic re-annotation, an attempt has been made to substantially update the functions of all the sequences of E. coli K-12, incorporating the vast amount of research carried out since the original annotation in 1997. Much knowledge has been gained about the molecular functions encoded by the E. coli K-12 genome.

Analyzing a single sequence using a regular BLAST program (http://www.ebi.ac.uk/Tools/BLAST/) will by itself generate a large number of hits, each accompanied by parameters such as E-value, percentage of identity, percentage of similarity, BLAST score and sequence length. Hits with a maximum alignment score and an optimal E-value between 1×10⁻⁶ and 1×10⁻⁵² can be taken as result hits (Gabriel et al., 2008). Interpreting these results and choosing the best positive hit requires a lot of human intervention. Thus, analyzing the entire proteome of E. coli using a regular BLAST program will be tedious (Aravindhan et al., 2009; Hulo et al., 2004). AIM-BLAST, working in a well-structured and concise manner, supported us greatly in performing sequence comparisons of the complete genome of E. coli efficiently and in a very short span of time (Aravindhan et al., 2009).
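
The manual step of choosing the best positive hit can be sketched in a few lines. This is a minimal illustration, not AIM-BLAST and not the procedure used in this work: the field names, the sample values and the selection rule (keep hits with an E-value of at most 1×10⁻⁶, then take the highest score) are assumptions made for the example.

```python
# Sketch: pick the best hit for one query from a list of BLAST-style hits.
# Field names, sample values and the selection rule are illustrative assumptions.
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Hit:
    subject: str      # identifier of the matched sequence
    evalue: float     # expectation value (smaller is better)
    identity: float   # percentage of identity
    score: float      # BLAST score (larger is better)


def best_hit(hits: Iterable[Hit], max_evalue: float = 1e-6) -> Optional[Hit]:
    """Return the highest-scoring hit whose E-value is at most max_evalue."""
    kept = [h for h in hits if h.evalue <= max_evalue]
    if not kept:
        return None  # no significant hit: the query stays unannotated
    # Highest score first; ties are broken by the smaller E-value.
    return max(kept, key=lambda h: (h.score, -h.evalue))


# Invented sample hits for a single query sequence.
hits = [
    Hit("subject_1", 1e-52, 98.0, 410.0),
    Hit("subject_2", 1e-6, 41.0, 55.0),
    Hit("subject_3", 0.5, 30.0, 28.0),   # rejected: E-value above the cutoff
]

chosen = best_hit(hits)
print(chosen.subject if chosen else "no significant hit")
```

Applied to every protein in turn, a rule of this kind removes the per-query inspection, although borderline hits near the cutoff would still deserve manual review.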

Computational Molecular Biology