
In silico Proteomic Functional Re-annotation of Escherichia coli K-12 using Dynamic Biological Data Fusion Strategy


biological sequences (Altschul et al., 1990). However, this tool remains computationally intensive and time-consuming, as it involves voluminous data transfer and requires manual intervention for parsing the BLAST sequence analysis results. The E-value threshold for BLAST searches ranges from 1×10⁻⁶ to 1×10⁻⁵² (Gabriel et al., 2008). To overcome these difficulties, we developed a program, AIM-BLAST, which is interfaced with the AJAX and SOAP services of the EBI (European Bioinformatics Institute) to support multiple sequence searches in a single run during re-annotation. Further, AIM-BLAST provides enhanced features for automated parsing of the large BLAST results of individual sequences and presents them in a “one sequence-one function” manner, with manual curation.
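The parsing step can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the hits are available in the standard 12-column tabular BLAST format (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The names `Hit` and `best_hit_per_query` and the sample identifiers are hypothetical, and the sketch does not reproduce the internals of AIM-BLAST, which are not described at this level of detail. For each query it keeps the single hit with the lowest E-value at or below the 1×10⁻⁶ bound, which approximates a “one sequence-one function” listing that would still require manual curation.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for the retained hit of one query sequence."""
    subject: str
    evalue: float
    bitscore: float


def best_hit_per_query(rows: Iterable[str], max_evalue: float = 1e-6) -> Dict[str, Hit]:
    """Keep one hit per query: lowest E-value, ties broken by highest bit score.

    Assumes tab-separated rows in the 12-column tabular BLAST layout, where
    column 11 is the E-value and column 12 is the bit score.
    """
    best: Dict[str, Hit] = {}
    for row in rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is weaker than the chosen E-value threshold
        current = best.get(query)
        if current is None or (evalue, -bitscore) < (current.evalue, -current.bitscore):
            best[query] = Hit(subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    # Illustrative rows only; identifiers and scores are made up.
    sample = [
        "q1\tsA\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-52\t410",
        "q1\tsB\t60.0\t180\t70\t2\t1\t180\t5\t184\t1e-10\t95",
        "q2\tsC\t30.0\t90\t60\t3\t1\t90\t1\t90\t0.5\t28",
    ]
    for query, hit in best_hit_per_query(sample).items():
        print(query, hit.subject, hit.evalue, hit.bitscore)
```

In this sketch a query whose hits all exceed the threshold (such as `q2` above) is simply absent from the result, so it can be routed to the other annotation approaches rather than being assigned a weakly supported function.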

Figure 2 Schematic representation of the re-annotation strategy “Dynamic biological data fusion”

This flow chart describes the step-by-step procedure for carrying out the re-annotation of the E. coli K-12 strain.

3.2.2 Sequence motif and pattern search for controlled annotation

Genome annotations that describe the functions of sequences are important to researchers during laboratory investigations and when making computational inferences. Re-annotation based on BLAST analysis alone would not be sufficient for accurate functional re-annotation and prediction, as in some cases these annotations may be inconsistent, incomplete or erroneous (Karp et al., 2007). Hence, sequence analysis based on several approaches, such as motif and pattern searches, phylogeny-based search, domain-based search and family-based search, was carried out to annotate the E. coli K-12 proteome efficiently. Protein sequences that do not appear very similar overall may still share a short region of sequence, a “motif”, that is specific to a particular function. Thus, identifying such distinctive motif patterns in protein sequences could help in predicting the functions of un-annotated proteins that contain similar motifs. Of the few databases dedicated to identifying such motif patterns, ScanProsite (http://expasy.org/tools/scanprosite/) was chosen for this work to search for hits by specific motifs in the protein databases. This tool makes use of ProRules, which are context-dependent annotation templates, to discover functional and structural intra-domain residues by scanning protein sequences for the occurrence of possible motifs, and predicts their function (Hulo et al., 2004).
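The idea of scanning a sequence for a short motif can be sketched in Python by translating a PROSITE-style pattern into a regular expression. This is only an illustration of pattern matching under stated assumptions: the helper names `prosite_to_regex` and `scan` and the test sequence are hypothetical, only the basic pattern syntax is handled (`x`, `[...]`, `{...}`, repetition counts, `<` and `>` anchors at the ends of the pattern), and ProRule-based annotation as performed by ScanProsite is not modelled. The example pattern `[AG]-x(4)-G-K-[ST]` is the PROSITE P-loop (ATP/GTP-binding site motif A) signature.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchor to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchor to the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m) count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core in ("x", "X"):  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # forbidden residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(sequence: str, pattern: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched text) for every hit, 1-based and inclusive.

    A lookahead is used so that overlapping occurrences are also reported.
    """
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [
        (m.start() + 1, m.start() + len(m.group(1)), m.group(1))
        for m in regex.finditer(sequence)
    ]


if __name__ == "__main__":
    # Made-up test sequence containing one P-loop-like segment.
    print(prosite_to_regex("[AG]-x(4)-G-K-[ST]"))       # [AG].{4}GK[ST]
    print(scan("MKLAGPSGSGKSTLL", "[AG]-x(4)-G-K-[ST]"))  # [(5, 12, 'GPSGSGKS')]
```

A hit from such a scan only indicates that a sequence contains the motif; assigning a function from it would still depend on the context-dependent rules and curation described above.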

3.2.3 Phylogenetic classification-based approach

The database of Clusters of Orthologous Groups of proteins (COGs) contains classifications of protein sequences based on their phylogeny (Tatusov et al., 2000). The COGs database serves as a valuable resource for carrying out functional annotation of protein sequences based on genome evolution. To facilitate functional studies, the COGs have been classified into 17 broad functional categories, including a class for which only a general functional prediction, usually that of biochemical activity, was feasible, and a class of uncharacterized COGs. Additionally, COGs with known functions are organized to represent specific cellular systems and biochemical pathways. Thus, sequence analysis using the COGNITOR program of the COG database would produce deeper insights into the

Computational Molecular Biology