Computational Molecular Biology
protein functions based on their genome evolution.
3.2.4 Domain based search
Domains are the structural, functional and
evolutionary units of proteins. Domains of common
lineage are clustered into superfamilies. Thus,
functional annotation based on such domain
superfamilies improves knowledge of protein
functions. ProDom, a database devoted exclusively to
protein domain families, was also used to support our
analysis based on the domain arrangements of
proteins (Servant et al., 2002). ProDom can provide
functional information on a protein by carrying out
a global comparison of the submitted sequence
against all the available protein sequences. For
ProDom, the default E-value cutoff of 0.01 was kept,
and searches were run using the ncbi-blastp program
and the multiple alignment method.
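The effect of an E-value cutoff such as the 0.01 used for ProDom can be illustrated with a minimal sketch. It assumes, purely for illustration, that the blastp hits are available in the standard 12-column tabular layout (E-value in column 11, bit score in column 12); the text does not state the output format actually returned by the ProDom server, and the hit lines and identifiers below are made up.

```python
from typing import Iterable, List, NamedTuple


class BlastHit(NamedTuple):
    query: str
    subject: str
    evalue: float
    bitscore: float


def parse_tabular(lines: Iterable[str]) -> List[BlastHit]:
    """Parse 12-column tab-separated blastp hits (assumed layout)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append(
            BlastHit(fields[0], fields[1], float(fields[10]), float(fields[11]))
        )
    return hits


def filter_by_evalue(hits: List[BlastHit], cutoff: float = 0.01) -> List[BlastHit]:
    """Keep only hits whose E-value is at or below the cutoff."""
    return [h for h in hits if h.evalue <= cutoff]


# Illustrative (made-up) hit lines; identifiers are hypothetical.
SAMPLE = [
    "query1\tPD000001\t45.0\t120\t60\t2\t1\t120\t5\t124\t1e-30\t110",
    "query1\tPD000002\t28.0\t80\t55\t3\t10\t90\t2\t81\t0.5\t25.1",
]

kept = filter_by_evalue(parse_tabular(SAMPLE))
print([h.subject for h in kept])
```

Only the first sample hit passes the 0.01 cutoff; the second, with an E-value of 0.5, is discarded as likely to have arisen by chance.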
3.2.5 Protein family search using Pfam
Family-based classification also remains an
important means of providing functional annotation
for biological sequences. Pfam (Finn et al., 2008)
is a collection of multiple sequence alignments and
profile Hidden Markov Models (HMMs) that represent
protein families. Information on protein functions
can thus be obtained by comparing the sequences
against the Pfam library of HMMs (Wu et al., 2003).
Using the Pfam database, the complete proteome of
E. coli was analyzed and functions were predicted
based on these families. The cutoff E-value was left
at the default of 0.001. Although Pfam was integrated
with the FGT program, Pfam analyses were carried out
separately. This is because Pfam analysis is
comparatively time-consuming, and running Pfam for a
large number of protein sequences in FGT would slow
down the overall process. At the same time,
submitting each sequence separately to the Pfam
server would be a monotonous process. Hence, the
Pfam FTP files were downloaded and installed on a
local system, and a stand-alone Pfam was set up to
support large-scale sequence analysis during the
re-annotation.
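The batch-style use of a local Pfam installation can be sketched as follows. The source does not say which search program the stand-alone installation used, so the search step is left as a placeholder callable (`search`, hypothetical); the sketch only shows how a multi-sequence FASTA file could be read and handed over in fixed-size batches rather than one sequence at a time. All function names and the batch size are assumptions.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (sequence identifier, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Identifier is the first word after '>'.
            header, chunks = (line[1:].split() or [""])[0], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batches(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Group records into lists of at most `size` entries."""
    batch: List[Record] = []
    for rec in records:
        batch.append(rec)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def annotate_all(
    lines: Iterable[str],
    search: Callable[[List[Record]], Dict[str, str]],
    batch_size: int = 100,
) -> Dict[str, str]:
    """Run the (hypothetical) local search on each batch and merge results."""
    results: Dict[str, str] = {}
    for batch in batches(read_fasta(lines), batch_size):
        results.update(search(batch))
    return results
```

In use, `search` would wrap whatever local Pfam search is installed and return one predicted family per sequence identifier; here it is deliberately left abstract.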
3.3 Hectic process of annotation
ScanProsite, COG and ProDom were selected for
re-analyzing the complete E. coli proteome because
they use different strategies to explore the
biological roles of proteins. However, these tools
have some inconveniences. They do not allow multiple
sequences to be searched at once. Also, for every
single sequence search, they produce many hits, and
users have to interpret them carefully and choose
the best hit. After choosing the best hit, users
have to copy the appropriate function to a local
database for final interpretation. Further, users
have to open and work with multiple browser windows
simultaneously when analyzing with these tools. This
in turn consumes considerable manpower and time.
Hence, searching the entire E. coli proteome with
all these tools would be an immensely tiresome
process.
3.4 In-house Functional Genomics Tool
Recognizing the complexity of dealing with many
tools simultaneously, a simple but novel system,
Bioinfotracker (a Functional Genomics Tool, FGT),
was developed for performing controlled annotation
of protein sequences locally by concurrently using
different online functional prediction tools (Kumar
et al., 2009). FGT is a well-structured, flexible
and highly systematic functional analysis program
developed by us to carry out large-scale protein
annotation. Different online tools that operate on
diverse research strategies, namely ScanProsite,
ProDom, COG and Pfam, have been integrated in this
tool. Once a sequence is submitted to FGT, it is
forwarded to the servers of the integrated tools,
and the searches are carried out at the individual
servers in tandem. In ScanProsite, the first option,
scanning against the PROSITE collection of motifs,
was used. For ProDom, the default E-value cutoff of
0.01 was kept. The cutoff E-values for COG and Pfam
were set to 0.001. When the results are available,
the tool fetches them from the corresponding servers
and automatically parses them to choose the best
function.
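The parsing step can be pictured with a minimal sketch. It is not the FGT implementation: the source does not describe how the best function is chosen, so the sketch assumes that "best" means the lowest E-value at or below each tool's cutoff, and that ScanProsite pattern matches carry no E-value and are taken in the order returned. The `Hit` record, the function names and the "-" placeholder are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class Hit(NamedTuple):
    tool: str
    description: str
    evalue: Optional[float]  # None for matches that carry no E-value


# Cutoffs quoted in the text; none is stated for ScanProsite.
CUTOFFS: Dict[str, Optional[float]] = {
    "ProDom": 0.01,
    "COG": 0.001,
    "Pfam": 0.001,
    "ScanProsite": None,
}


def best_hit(hits: List[Hit], tool: str) -> Optional[Hit]:
    """Return the assumed best hit for one tool, or None if nothing passes."""
    cutoff = CUTOFFS[tool]
    candidates = [h for h in hits if h.tool == tool]
    if cutoff is None:
        # No E-value to rank by: take the first reported match (assumption).
        return candidates[0] if candidates else None
    passing = [h for h in candidates if h.evalue is not None and h.evalue <= cutoff]
    return min(passing, key=lambda h: h.evalue) if passing else None


def summary_row(seq_id: str, hits: List[Hit]) -> Dict[str, str]:
    """Build one table row: the chosen function per tool for a sequence."""
    row = {"sequence": seq_id}
    for tool in CUTOFFS:
        hit = best_hit(hits, tool)
        row[tool] = hit.description if hit else "-"
    return row
```

Each submitted sequence then yields a single row with one column per tool, which is far easier to scan than the raw hit lists from four separate servers.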
Further, the results for the submitted sequences are
provided in a simple table format that is easier to
interpret. Hence, FGT was very helpful in carrying
out the re-annotation of the complete