In silico Proteomic Functional Re-annotation of Escherichia coli K-12 using Dynamic Biological Data Fusion Strategy
            
            
            
            
powerful re-annotation. In silico re-annotation permits uniform quality control, systemic updates, easy data parsing and more comprehensive comparative analysis, providing a valuable resource to the whole research community (Rajadurai et al., 2011). Although several genome and proteome databases are publicly available to facilitate life science research, each has a number of pitfalls. The most common problems in these biological sequence databases are lack of reliability, redundancy and erroneous annotation. Recently, the misannotation levels in four widely used public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL and KEGG) were investigated across enzyme superfamilies, and misannotation was found to remain a large problem in the annotation process (Schnoes et al., 2009).
            
            
To alleviate the problem of erroneous annotation and redundancy, genes and proteins need to be re-annotated using a common, controlled set of terms to describe gene or protein function. Deciphering the precise functions encoded by all gene products of this genome thus remains a great challenge in the genomic era (Bock and Gough, 2004; Altschul et al., 1990). To overcome such challenges, the E. coli K-12 proteome has been re-annotated using a strategy known as "dynamic biological data fusion", in which biological data from various available databases are integrated into a single information source. Further, confidence-level scoring was carried out to assign functions to unknown proteins, thereby providing more accurate functional information to the research community.
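The idea of fusing annotations from several databases into one record with a confidence level can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `fuse_annotations`, the source names, the protein identifiers and the scoring rule (fraction of sources that agree on the most frequent function) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fuse_annotations(sources):
    """Merge per-source annotations into one record per protein.

    `sources` maps a (hypothetical) source name to a dict of
    protein_id -> function description. The confidence used here is an
    assumed rule: the share of sources that agree on the top function.
    """
    by_protein = defaultdict(dict)
    for source_name, records in sources.items():
        for protein_id, function in records.items():
            # Normalise case and whitespace so trivial differences do not split votes.
            by_protein[protein_id][source_name] = function.strip().lower()

    fused = {}
    for protein_id, per_source in by_protein.items():
        votes = Counter(per_source.values())
        # most_common breaks ties by first-seen order of the descriptions.
        function, support = votes.most_common(1)[0]
        fused[protein_id] = {
            "function": function,
            "support": support,
            "sources": len(per_source),
            "confidence": support / len(per_source),
        }
    return fused


if __name__ == "__main__":
    # Illustrative inputs only; these are not records from any real database.
    example = {
        "source_a": {"prot_1": "Kinase", "prot_2": "hypothetical protein"},
        "source_b": {"prot_1": "kinase ", "prot_2": "transporter"},
        "source_c": {"prot_1": "transporter"},
    }
    for protein_id, record in fuse_annotations(example).items():
        print(protein_id, record)
```

A real fusion strategy would also need to map synonymous descriptions onto shared terms before voting; exact string matching is used here only to keep the sketch short.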
            
            
              
                1 Results
              
            
            
The original sequence annotations of the E. coli K-12 strain were downloaded from the EcoCyc database, and the strain was found to possess 4,290 protein sequences. Of these, 2,560 sequences had clear annotations and the remaining 1,730 were uncharacterized, being labelled as hypothetical, unknown, predicted, conserved or putative (Table 1, Figure 1a). Following the in silico functional re-annotation, several categories of changes were made to the previously annotated E. coli genome: i) assigning functions to uncharacterized proteins; ii) transfer of functions (revision of an already annotated function); and iii) updating the functions.
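The split into clearly annotated and uncharacterized sequences can be approximated by a keyword scan of each annotation string. The sketch below assumes that a description counts as uncharacterized when it contains any of the five labels named in the text; the helper names and the example descriptions are hypothetical, and the authors' actual criteria may differ.

```python
import re

# The five labels named in the text; treating them as whole-word flags is an assumption.
UNCHARACTERIZED_TERMS = ("hypothetical", "unknown", "predicted", "conserved", "putative")


def uncharacterized_labels(description):
    """Return the flag terms found in an annotation string, in the order listed above."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return [term for term in UNCHARACTERIZED_TERMS if term in words]


def is_uncharacterized(description):
    """True when the description carries at least one flag term."""
    return bool(uncharacterized_labels(description))


if __name__ == "__main__":
    # Illustrative descriptions, not taken from the EcoCyc download.
    for text in ("conserved hypothetical protein",
                 "putative transporter",
                 "DNA polymerase III subunit alpha"):
        print(f"{text!r}: {uncharacterized_labels(text)}")
```

Returning the list of matched terms, rather than a single flag, makes it possible to count combined categories such as "Conserved + Hypothetical" separately from single-label ones.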
            
            
Figure 1 EcoCyc Genome Data
Note: a. A pie chart describing the percentage of known and unknown sequences in the original data downloaded from EcoCyc. b. Genome data after re-annotation. A pie chart representing the percentage of known and unknown sequences after re-annotation (Rec-DB data)
            
            
Table 1 Genome features of E. coli K-12

Sequence Category                           Number of Sequences   Percentage (%)
Total No. of protein sequences              4290                  100
i. Sequences with unknown functions:
   Predicted                                1033                  24.08
   Conserved                                401                   9.35
   Putative                                 212                   4.94
   Conserved + Hypothetical                 18                    0.42
   Hypothetical                             59                    1.38
   Conserved + Predicted                    2                     0.05
   Conserved + Putative                     1                     0.02
   Hypothetical + Predicted                 2                     0.05
   Putative + Predicted                     2                     0.05
   Total sequences with unknown functions   1730                  40
ii. Sequences with clear functions          2560                  60
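The percentages in Table 1 can be re-derived from the counts as a consistency check. The short sketch below uses only the numbers reported in the table; the variable and function names are illustrative. Computed to two decimals, the two summary rows come to about 40.33 % and 59.67 %, which the table appears to report rounded to whole percent.

```python
TOTAL_SEQUENCES = 4290

# Counts of uncharacterized sequences per label, as listed in Table 1.
UNKNOWN_COUNTS = {
    "Predicted": 1033,
    "Conserved": 401,
    "Putative": 212,
    "Conserved + Hypothetical": 18,
    "Hypothetical": 59,
    "Conserved + Predicted": 2,
    "Conserved + Putative": 1,
    "Hypothetical + Predicted": 2,
    "Putative + Predicted": 2,
}


def percentage(count, total=TOTAL_SEQUENCES):
    """Share of `total` represented by `count`, in percent, rounded to two decimals."""
    return round(100 * count / total, 2)


unknown_total = sum(UNKNOWN_COUNTS.values())
known_total = TOTAL_SEQUENCES - unknown_total

if __name__ == "__main__":
    for label, count in UNKNOWN_COUNTS.items():
        print(f"{label:<28}{count:>6}{percentage(count):>8.2f}")
    print(f"{'Unknown functions (total)':<28}{unknown_total:>6}{percentage(unknown_total):>8.2f}")
    print(f"{'Clear functions':<28}{known_total:>6}{percentage(known_total):>8.2f}")
```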
            
            
Computational Molecular Biology