
Computational Molecular Biology


Each of the previously predicted and annotated E. coli K-12 protein sequences was manually re-analyzed using several complementary approaches: similarity-based search (BLASTP), pattern-based search (ScanProsite), phylogenetic classification-based search (COG), domain-based search (ProDom), and protein family-based search (Pfam). Each approach has a different emphasis and collects a different set of information related to the function of the gene products. Further, each of these databases was designed for specific problems and therefore has its own inherent strengths and weaknesses (Rust et al., 2002).
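Because the five approaches can agree or disagree for a given protein, it may help to see how their outputs could be collated before a curator makes a decision. The following is a minimal sketch, not the procedure used in this study: the function name `summarize_evidence`, the dictionary layout, and the example values are all assumptions made for illustration.

```python
from collections import Counter

# The five evidence sources named in the text.
APPROACHES = ("BLASTP", "ScanProsite", "COG", "ProDom", "Pfam")


def summarize_evidence(predictions):
    """Tally the functions proposed by each approach for one protein.

    `predictions` maps an approach name to the function it suggests,
    or to None when that approach returned nothing. This layout is a
    hypothetical one chosen for the sketch.
    """
    votes = Counter(
        predictions[name]
        for name in APPROACHES
        if predictions.get(name) is not None
    )
    if not votes:
        return {"consensus": None, "support": 0, "sources": []}
    # Most frequently proposed function and the number of approaches behind it.
    consensus, support = votes.most_common(1)[0]
    sources = [n for n in APPROACHES if predictions.get(n) == consensus]
    return {"consensus": consensus, "support": support, "sources": sources}


# Invented example values, for illustration only.
example = {
    "BLASTP": "formate dehydrogenase",
    "ScanProsite": None,
    "COG": "formate dehydrogenase",
    "ProDom": None,
    "Pfam": "formate dehydrogenase",
}
result = summarize_evidence(example)
print(result)
```

A tally of this kind only organizes the evidence; the final assignment described here was made manually.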

Our re-annotation strategy helped in postulating functions for almost 29% of the protein sequences. For example (Supplementary Table S1), the hypothetical protein encoded by the sequence ec0903 has now been assigned as formate dehydrogenase. Importantly, ec0903 (RefSeq ID YP_001165334.1) was previously reported as a hypothetical protein and was used in vaccine prediction for human infectious disease (Xiang and He, 2009); it has now been assigned a new function.

Re-annotation also helped in revising previously annotated proteins. In several cases, the original annotation was very broad and imprecise. While analyzing such sequences, we found that they were poorly annotated and had been assigned a false-positive function (Aravindhan et al., 2009).

As a result of our manual re-annotation, almost 29% of the genes whose functions had not been determined earlier are now assigned a known function. The data presented in REC-DB should be useful for the analysis of E. coli gene products as well as gene products encoded by other genomes. Hence, we believe that our re-annotation should be useful to the scientific community engaged in E. coli research.

3 Material and Methods

3.1 Protein Sequences

The complete protein sequences of E. coli K-12 were downloaded from the EcoCyc database (Keseler et al., 2005) and analyzed. Previous functional genomics analysis of E. coli showed a total of 4,290 protein sequences, of which only 60% had a clear/known function. The remaining 40% were left as unclear/unknown genes (without assigned functions). The functions of unknown genes, such as those encoding hypothetical and conserved hypothetical proteins, must be predicted because they might play a significant role in the cellular physiology of microorganisms. Hypothetical proteins are proteins of unknown function with no homology or experimental evidence, whereas conserved hypothetical proteins are proteins of unknown function with phylogenetic distribution and homology (Tao et al., 1999).
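The two definitions above can be read as a simple decision rule. The sketch below expresses that rule in Python; the function name `classify_protein` and its two boolean inputs are assumptions introduced for illustration, and in practice deciding whether a protein "has homologs" is itself the outcome of the searches described in this section.

```python
def classify_protein(has_known_function, has_homologs):
    """Return the category of a protein under the definitions in the text.

    Both arguments are hypothetical flags: whether a function has been
    assigned, and whether homologs are found in other genomes.
    """
    if has_known_function:
        return "known"
    if has_homologs:
        # Unknown function, but conserved across genomes.
        return "conserved hypothetical"
    # Unknown function and no detectable homology.
    return "hypothetical"


labels = [
    classify_protein(True, True),
    classify_protein(False, True),
    classify_protein(False, False),
]
print(labels)
```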

3.2 Functional annotation of Sequences

The complete genome comprises a total of 4,290 protein sequences, of which 2,560 have a clear/known function and 1,730 have an unclear/unknown function (Table 1). Computers play a significant role in sequence analysis and re-annotation, as they reduce the time taken to process large amounts of data and allow several approaches to be integrated (Nascimento and Bazzan, 2005). Here, the functional re-annotation of the entire E. coli proteome was carried out using advanced re-annotation strategies in which several sequence analysis methods were incorporated into a coherent and efficient annotation schema (Figure 2).
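As a quick consistency check, the counts given at the start of this subsection can be compared with the approximate 60% and 40% shares quoted in Section 3.1. The short sketch below uses only the three numbers stated in the text; the variable names are chosen for illustration.

```python
# Counts as stated in the text (Table 1).
TOTAL = 4290
KNOWN = 2560
UNKNOWN = 1730

# The two categories should account for every sequence.
assert KNOWN + UNKNOWN == TOTAL

known_pct = round(100 * KNOWN / TOTAL, 1)
unknown_pct = round(100 * UNKNOWN / TOTAL, 1)
print(known_pct, unknown_pct)  # 59.7 40.3
```

The shares of 59.7% and 40.3% agree with the rounded figures of 60% and 40%.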

3.2.1 Similarity Search Approach using AIM BLAST

Proteins that are evolutionarily related are commonly referred to as homologues, and close homologues often have similar functions (Ofran et al., 2005). Based on this concept, homology-based (similarity-based) transfer of functional annotations remains a basic prediction method for assigning functions to unknown proteins that have not been previously annotated. In turn, because new biological information accumulates rapidly in the protein databases, this similarity search approach also helps in revising or updating previously annotated functions.
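To make the idea of homology-based annotation transfer concrete, the following minimal sketch picks the best-scoring hit for each query and copies over the function of the matched sequence. It is not the procedure used in this study. It assumes the search results are available in the standard 12-column tabular format produced by NCBI BLAST+ (`-outfmt 6`), which may differ from the output of the tool used here; the thresholds (E-value of 1e-5, 30% identity), the function names, and the example identifiers are all assumptions made for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=1e-5, min_identity=30.0):
    """Return {query_id: subject_id} for the highest bit-score hit
    that passes the (assumed) E-value and identity thresholds."""
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        if float(row["evalue"]) > max_evalue:
            continue
        if float(row["pident"]) < min_identity:
            continue
        score = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or score > current[1]:
            best[row["qseqid"]] = (row["sseqid"], score)
    return {query: subject for query, (subject, _) in best.items()}


def transfer_annotations(hits, subject_functions):
    """Copy the known function of each best hit onto its query.
    Queries whose hit has no recorded function map to None."""
    return {q: subject_functions.get(s) for q, s in hits.items()}


# Made-up rows for illustration; these are not real search results.
rows = [
    ("query1", "subjA", "85.0", "200", "30", "0", "1", "200", "1", "200", "1e-80", "350"),
    ("query1", "subjB", "40.0", "180", "100", "8", "5", "184", "3", "182", "1e-20", "95"),
    ("query2", "subjC", "25.0", "90", "60", "7", "1", "90", "10", "99", "0.5", "28"),
]
text = "\n".join("\t".join(r) for r in rows)

hits = best_hits(text)
annotations = transfer_annotations(hits, {"subjA": "example function A"})
print(hits, annotations)
```

In this example the second query receives no annotation because its only hit fails both thresholds, which mirrors the case where a sequence remains unknown after the similarity search and must be examined with the other approaches.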

BLAST (Basic Local Alignment Search Tool) is one of the most popular and widely used bioinformatics programs for identifying the similarity between the
