
In silico Proteomic Functional Re-annotation of Escherichia coli K-12 using Dynamic Biological Data Fusion Strategy


powerful re-annotation. In silico re-annotation permits uniform quality control, systematic updates, easy data parsing and more comprehensive comparative analysis, providing a valuable resource to the whole research community (Rajadurai et al., 2011). Although several genome and proteome databases are publicly available to facilitate life science research, each of them has a number of pitfalls. The most common problems in these biological sequence databases are lack of reliability, redundancy and erroneous annotation. Recently, the misannotation levels in four widely used public protein sequence databases, UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL and KEGG, were investigated, and misannotated enzyme superfamilies were identified; such misannotation remains a major problem in the annotation process (Schnoes et al., 2009).

To alleviate the problem of erroneous annotation and redundancy, genes and proteins need to be re-annotated using a common, controlled context to describe gene or protein function. Deciphering the precise functions encoded by all gene products of this genome thus remains a great challenge in this genomic era (Bock and Gough, 2004; Altschul et al., 1990). To overcome such challenges, the E. coli K-12 proteome has been re-annotated using a strategy known as “dynamic biological data fusion”, in which biological data from various available databases are integrated into a unique information source. Further, a confidence-level assessment has been carried out to assign functions to unknown proteins, thereby providing more accurate functional information to the research community.
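The idea of fusing annotations from several databases into one record per protein, with an attached confidence level, can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the source names (`db_a`, `db_b`, `db_c`), the protein identifiers, the descriptions and the confidence rule (the share of reporting sources that agree on the chosen function) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical records: (source database, protein id, functional description).
# All names and descriptions are invented placeholders, not data from the study.
RECORDS = [
    ("db_a", "P1", "transcriptional regulator"),
    ("db_b", "P1", "transcriptional regulator"),
    ("db_c", "P1", "hypothetical protein"),
    ("db_a", "P2", "hypothetical protein"),
    ("db_b", "P2", "ABC transporter permease"),
]

# Labels treated as "no clear function" (taken from the categories in the text).
UNINFORMATIVE = ("hypothetical", "unknown", "predicted", "conserved", "putative")


def is_informative(description):
    """Return True if the description carries none of the uninformative labels."""
    text = description.lower()
    return not any(word in text for word in UNINFORMATIVE)


def fuse(records):
    """Merge per-source annotations into one (function, confidence) per protein.

    Assumed rule: pick the most frequent informative description and report
    confidence as the fraction of all reporting sources that support it.
    """
    by_protein = defaultdict(list)
    for _source, protein_id, description in records:
        by_protein[protein_id].append(description.strip().lower())

    fused = {}
    for protein_id, descriptions in by_protein.items():
        informative = [d for d in descriptions if is_informative(d)]
        if not informative:
            fused[protein_id] = ("uncharacterized", 0.0)
            continue
        function, votes = Counter(informative).most_common(1)[0]
        fused[protein_id] = (function, votes / len(descriptions))
    return fused


if __name__ == "__main__":
    for protein_id, (function, confidence) in sorted(fuse(RECORDS).items()):
        print(protein_id, function, round(confidence, 2))
```

A simple agreement ratio is used here only because it is easy to inspect; a real pipeline would likely weight sources by their reliability and by the kind of evidence behind each annotation.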

1 Results

The original sequence annotations of the E. coli K-12 strain were downloaded from the EcoCyc database, which was found to contain 4,290 protein sequences. Of these, 2,560 sequences had clear annotations and the remaining 1,730 were uncharacterized, being labelled as hypothetical, unknown, predicted, conserved or putative (Table 1, Figure 1a). Following the in silico functional re-annotation, several categories of changes were made to the previously annotated E. coli genome: i) assigning functions to uncharacterized proteins, ii) transfer of functions (revision of an already annotated function) and iii) updating the functions.
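The split into clearly annotated and uncharacterized sequences can be sketched as a keyword tally over annotation strings. The snippet below is only an illustration under stated assumptions: the keyword list is taken from the labels named in the text, the example descriptions are invented, and whole-word matching is an assumed rule rather than the criterion applied to the EcoCyc data.

```python
import re
from collections import Counter

# Labels that mark a sequence as uncharacterized (from the text above).
KEYWORDS = ("conserved", "hypothetical", "putative", "predicted", "unknown")


def category(description):
    """Return the uncharacterized label(s) found in a description, or 'clear'."""
    text = description.lower()
    hits = [k for k in KEYWORDS if re.search(r"\b" + k + r"\b", text)]
    return " + ".join(hits) if hits else "clear"


def tally(descriptions):
    """Count how many descriptions fall into each category."""
    return Counter(category(d) for d in descriptions)


if __name__ == "__main__":
    # Invented example descriptions, not records from EcoCyc.
    examples = [
        "conserved hypothetical protein",
        "predicted transporter",
        "DNA polymerase III subunit alpha",
        "putative kinase",
    ]
    for name, count in tally(examples).items():
        print(name, count)
```

Descriptions carrying two labels fall into a combined category such as "conserved + hypothetical", which mirrors the combined rows of Table 1.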

Figure 1 EcoCyc Genome Data

Note: a. A pie chart describing the percentage of known and unknown sequences in the original data downloaded from EcoCyc. b. Genome data after re-annotation. A pie chart representing the percentage of known and unknown sequences after re-annotation (Rec-DB data)

Table 1 Genome features of E. coli K-12

| Sequence Category | Number of Sequences | Percentage (%) |
|---|---|---|
| Total No. of protein sequences | 4290 | 100 |
| i. Sequences with unknown functions: | | |
| Predicted | 1033 | 24.08 |
| Conserved | 401 | 9.35 |
| Putative | 212 | 4.94 |
| Conserved + Hypothetical | 18 | 0.42 |
| Hypothetical | 59 | 1.38 |
| Conserved + Predicted | 2 | 0.05 |
| Conserved + Putative | 1 | 0.02 |
| Hypothetical + Predicted | 2 | 0.05 |
| Putative + Predicted | 2 | 0.05 |
| Total sequences with unknown functions | 1730 | 40 |
| ii. Sequences with clear functions | 2560 | 60 |
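The percentages in Table 1 follow directly from the counts, and the short Python check below reproduces them. The counts are those reported in the table; the helper name `percentage` is introduced here for illustration. Computed to two decimals, the two totals come to 40.33 % and 59.67 %, which the table reports rounded to 40 and 60.

```python
TOTAL_SEQUENCES = 4290

# Counts of sequences with unknown functions, as listed in Table 1.
UNKNOWN_COUNTS = {
    "Predicted": 1033,
    "Conserved": 401,
    "Putative": 212,
    "Conserved + Hypothetical": 18,
    "Hypothetical": 59,
    "Conserved + Predicted": 2,
    "Conserved + Putative": 1,
    "Hypothetical + Predicted": 2,
    "Putative + Predicted": 2,
}


def percentage(count, total=TOTAL_SEQUENCES):
    """Share of the proteome, in percent, rounded to two decimals."""
    return round(100 * count / total, 2)


if __name__ == "__main__":
    for name, count in UNKNOWN_COUNTS.items():
        print(f"{name}: {count} ({percentage(count)} %)")

    unknown_total = sum(UNKNOWN_COUNTS.values())
    clear_total = TOTAL_SEQUENCES - unknown_total
    print("Unknown:", unknown_total, percentage(unknown_total))
    print("Clear:", clear_total, percentage(clear_total))
```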

Computational Molecular Biology