
Computational Molecular Biology

protein functions based on their genome evolution.

3.2.4 Domain-based search

Domains are the structural, functional and evolutionary units of proteins. Domains of common lineage are clustered into superfamilies, so functional annotation based on such domain superfamilies adds to what is known about protein function. ProDom, a database dedicated to protein domain families, was also used to support our analysis of the domain arrangements of proteins (Servant et al., 2002). ProDom provides functional information by carrying out a global comparison of the submitted sequence against all available protein sequences. For ProDom, the default cutoff E-value of 0.01 was kept, and searches were run with the ncbi-blastp program and the multiple alignment method.
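The E-value cutoff described above can be illustrated with a minimal sketch. It assumes the search hits are available in the standard 12-column BLAST tabular layout; the function name `filter_hits`, the example rows and the family identifiers are made up for illustration and are not taken from the ProDom server or from the authors' pipeline.

```python
import csv
import io

# Column names of the standard 12-column BLAST tabular layout (assumed input format).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(tabular_text, max_evalue=0.01):
    """Return hits whose E-value does not exceed max_evalue, best hit first."""
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    hits = []
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    # Lower E-value first; ties are broken by the higher bit score.
    hits.sort(key=lambda h: (h["evalue"], -h["bitscore"]))
    return hits


# Made-up example rows, for illustration only.
EXAMPLE = (
    "query1\tPD000001\t45.0\t120\t60\t2\t1\t120\t5\t124\t1e-20\t95.0\n"
    "query1\tPD000002\t30.0\t80\t50\t3\t10\t90\t1\t80\t0.5\t25.0\n"
    "query1\tPD000003\t38.0\t100\t58\t1\t5\t104\t3\t102\t0.004\t40.0\n"
)

kept = filter_hits(EXAMPLE)
print([h["sseqid"] for h in kept])
```

With the cutoff of 0.01, the second example row (E-value 0.5) is discarded and the remaining two hits are returned in order of increasing E-value.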

3.2.5 Protein family search using Pfam

Family-based classification is another important means of functionally annotating biological sequences. Pfam (Finn et al., 2008) is a collection of multiple sequence alignments and profile Hidden Markov Models (HMMs) that represent protein families. Information on protein function can thus be obtained by comparing sequences against the Pfam library of HMMs (Wu et al., 2003). Using the Pfam database, the complete proteome of E. coli was analyzed and functions were predicted on the basis of these families, with the cutoff E-value set to the default of 0.001. Although Pfam was integrated with the FGT program, Pfam analyses were carried out separately. This is because Pfam analysis is comparatively time-consuming, and running Pfam for a large number of protein sequences within FGT would slow down the overall process. At the same time, submitting every sequence individually to the Pfam server would be monotonous. Hence, the Pfam FTP files were downloaded and installed on a local system, and a stand-alone Pfam was set up to support large-scale sequence analysis during the re-annotation.
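A local installation makes it practical to process a whole proteome in batches rather than one sequence at a time. The following is a minimal sketch of that idea only: it reads FASTA records and groups them into fixed-size batches that could then be handed to a local search. The helper names `read_fasta` and `batches`, the batch size and the example sequences are hypothetical, and the search step itself is deliberately left out because the source does not describe how the stand-alone Pfam was invoked.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def batches(records, size):
    """Group records into lists of at most `size` items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


# Made-up sequences, for illustration only.
FASTA = """>protA hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>protB
MSLNEQ
>protC
MAGWT
""".splitlines()

for number, batch in enumerate(batches(read_fasta(FASTA), size=2), start=1):
    print(number, [header.split()[0] for header, _ in batch])
```

Because both helpers are generators, a large proteome file can be streamed without loading every sequence into memory at once.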

3.3 Hectic process of annotation

ScanProsite, COG and ProDom were selected for re-analyzing the complete E. coli proteome because they use different strategies to explore the biological roles of proteins. These tools, however, have some inconveniences. They do not allow multiple sequences to be searched at once. For every single-sequence search they also produce many hits, which users have to interpret carefully to choose the best one. After choosing the best hit, users have to copy the appropriate function to a local database for final interpretation. Further, users have to keep multiple browser windows open at the same time when working with these tools. This in turn consumes considerable manpower and man-hours. Hence, searching the entire E. coli proteome with all these tools would be an immensely tiresome process.
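The manual routine described above (many hits per search, choosing the best one, and copying its function into a local record) can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it assumes each hit has already been reduced to a dictionary with hypothetical keys `sequence`, `tool`, `evalue` and `function`, and that the hit with the lowest E-value counts as the best. The example hits are made up.

```python
def best_hits(hits):
    """Keep the lowest-E-value hit for each (sequence, tool) pair."""
    best = {}
    for hit in hits:
        key = (hit["sequence"], hit["tool"])
        if key not in best or hit["evalue"] < best[key]["evalue"]:
            best[key] = hit
    return best


def summary_table(best, tools):
    """Format one row per sequence and one column per tool as tab-separated text."""
    sequences = sorted({seq for seq, _ in best})
    lines = ["\t".join(["sequence"] + tools)]
    for seq in sequences:
        # A dash marks tools that returned no hit for this sequence.
        cells = [
            best[(seq, tool)]["function"] if (seq, tool) in best else "-"
            for tool in tools
        ]
        lines.append("\t".join([seq] + cells))
    return "\n".join(lines)


# Made-up hits, for illustration only.
HITS = [
    {"sequence": "seq1", "tool": "ProDom", "evalue": 1e-12, "function": "function A"},
    {"sequence": "seq1", "tool": "ProDom", "evalue": 1e-30, "function": "function B"},
    {"sequence": "seq1", "tool": "COG", "evalue": 1e-8, "function": "function C"},
    {"sequence": "seq2", "tool": "COG", "evalue": 1e-5, "function": "function D"},
]

print(summary_table(best_hits(HITS), ["ProDom", "COG"]))
```

Tools that report matches without an E-value would need their own ranking rule; the sketch does not cover that case.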

3.4 In-house Functional Genomics Tool

Given the complexity of dealing with many tools simultaneously, a simple but novel system, Bioinfotracker (a Functional Genomics Tool, FGT), was developed for performing controlled annotation of protein sequences locally by concurrently using different online functional prediction tools (Kumar et al., 2009). FGT is a well-structured, flexible and highly systematic functional analysis program developed by us to carry out large-scale protein annotation. Different online tools that operate on diverse research strategies, namely ScanProsite, ProDom, COG and Pfam, have been integrated into it. Once a sequence is submitted to FGT, it is forwarded to the servers of the integrated tools and processed at the individual servers in tandem. In ScanProsite, the first option, scanning against the PROSITE collection of motifs, was used. For ProDom, the default cutoff E-value of 0.01 was kept. The cutoff E-values for COG and Pfam were set to 0.001. On completion, when the results are available, the tool automatically parses them to choose the best function and fetches it from the corresponding servers. Further, the results for the submitted sequences are provided in a simple table format that is easier to interpret. Hence, FGT was very supportive in carrying out the re-annotation of the complete
