In silico Proteomic Functional Re-annotation of Escherichia coli K-12 using Dynamic Biological Data Fusion Strategy
biological sequences (Altschul et al., 1990). However,
this tool remains computationally intensive and time
consuming, as it involves voluminous data transfer
and requires manual intervention to parse the BLAST
sequence analysis results. The E-value threshold for
the BLAST search is 1 × 10^-6 to 1 × 10^-52 (Gabriel et al.,
2008). To overcome these difficulties, we have
developed a program, AIM-BLAST, that has been
interfaced with the AJAX and SOAP services of the EBI
(European Bioinformatics Institute) to support
multiple sequence searches at a stretch during
re-annotation. Further, AIM-BLAST has enhanced
features for automatically parsing the large
BLAST results of individual sequences and presenting
them in a “one sequence-one function” manner with
manual curation.
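The "one sequence-one function" reduction described above can be illustrated with a minimal sketch. It is not the AIM-BLAST implementation, which is not shown in the text. It assumes the hits are already available as tab-separated text in the 12-column layout produced by NCBI BLAST+ with `-outfmt 6` (E-value in column 11, bit score in column 12), and it uses only the upper bound of the quoted E-value range as the cutoff. The function name and the example identifiers are hypothetical.

```python
import csv
import io

# Assumption: only the least stringent bound of the quoted range is applied.
E_VALUE_CUTOFF = 1e-6


def best_hit_per_query(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Keep one hit per query: the highest bit score among hits passing the cutoff."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > cutoff:
            continue  # hit is too weak to support an annotation
        current = best.get(query)
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits: two for q1, one weak hit for q2.
EXAMPLE = "\n".join([
    "q1\tsubjA\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-80\t350",
    "q1\tsubjB\t60.0\t180\t72\t2\t5\t184\t3\t182\t1e-20\t120",
    "q2\tsubjC\t30.0\t90\t63\t3\t10\t99\t12\t101\t0.5\t30",
])

if __name__ == "__main__":
    for query, (subject, evalue, bitscore) in best_hit_per_query(EXAMPLE).items():
        print(query, subject, evalue, bitscore)
```

In this sketch, queries with no hit under the cutoff (q2 in the example) are simply absent from the result, which marks them as candidates for the motif, domain and phylogeny based searches or for manual curation.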
Figure 2 Schematic representation of the re-annotation strategy “Dynamic biological data fusion”. This flow chart describes the step-by-step procedure for carrying out the re-annotation of the E. coli K-12 strain.
3.2.2 Sequence motifs and patterns search for controlled annotation
Genome annotations that describe the functions of
sequences are important to researchers during
laboratory investigations and when making
computational inferences. Re-annotation based on
BLAST analysis alone would not be sufficient for
accurate functional re-annotation and prediction, as
in some cases these annotations may be inconsistent,
incomplete or erroneous (Karp et al., 2007). Hence,
sequence analyses based on several approaches,
such as motif and pattern searches, phylogeny based
search, domain based search and family based search,
were carried out to efficiently annotate the
E. coli K-12 proteome. Protein sequences that do not
appear very similar overall may nevertheless share a
short sequence region, a “motif”, that is specific to a
particular function. Thus, identifying such distinctive
motif patterns in protein sequences could help in
predicting the functions of un-annotated proteins that
contain similar motifs. A few databases dedicated to
identifying such motif patterns are available, of which
ScanProsite (http://expasy.org/tools/scanprosite/) was
chosen for this work to search for hits by specific
motifs in the protein databases. This tool makes use of
ProRules, context-dependent annotation templates, to
discover functional and structural intra-domain
residues by scanning the protein sequences for the
occurrence of possible motifs, and predicts their
function (Hulo et al., 2004).
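The idea of scanning a sequence for a short functional motif can be sketched as follows. This is only an illustration of pattern matching, not ScanProsite itself and not the ProRules logic. It translates a pattern written in PROSITE syntax into a regular expression and reports every match. The pattern used is the PROSITE ATP/GTP-binding site motif A (P-loop), `[AG]-x(4)-G-K-[ST]`. The function names and the short example peptide are assumptions made for illustration, and anchors placed inside brackets are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ABC} means "any residue except A, B or C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) is a repeat count.
        element = element.replace("(", "{").replace(")", "}")
        # Lowercase x stands for any residue.
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- and C-terminus.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def scan_motif(sequence, pattern):
    """Return (1-based start, matched residues) for every match, overlaps included."""
    # The lookahead lets matches that overlap each other all be reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


P_LOOP = "[AG]-x(4)-G-K-[ST]"
PEPTIDE = "MTEYKLVVVGAGGVGKSALTIQ"  # illustrative sequence only

if __name__ == "__main__":
    print(prosite_to_regex(P_LOOP))
    print(scan_motif(PEPTIDE, P_LOOP))
```

A hit of this kind only suggests a function. In the workflow described here it would be weighed against the BLAST, domain and phylogeny based evidence before an annotation is assigned.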
3.2.3 Phylogenetic Classification based approach
The database of Clusters of Orthologous Groups of
proteins (COGs) contains information on the
classification of protein sequences based on their
phylogenies (Tatusov et al., 2000). The COGs
database serves as a valuable portal for carrying out
functional annotation of protein sequences based
on their genome evolution. To facilitate functional
studies, the COGs have been classified into 17 broad
functional categories, including a class for which only
a general functional prediction, usually that of
biochemical activity, was feasible and a class of
uncharacterized COGs. Additionally, COGs with
known functions are organized to represent specific
cellular systems and biochemical pathways. Thus,
sequence analysis using the COGNITOR program of
the COG database would provide deeper insights into the
Computational Molecular Biology