Computational Molecular Biology
            
            
            
            
              protein functions based on their genome evolution.
            
            
              
                3.2.4 Domain based search
              
            
            
              Domains are the structural,
            
            
              functional and
            
            
              evolutionary units of proteins. Domains of common
            
            
              lineage are clustered into superfamilies. Thus,
              functional annotation based on such domain
              superfamilies improves knowledge of protein
              functions. ProDom, a database dedicated to protein
              domain families, was also used to support our analysis
              based on the domain arrangements of proteins
              (Servant et al., 2002). ProDom provides functional
              information on proteins by carrying out a global
              comparison of the submitted sequence against all the
              available protein sequences. For ProDom, the default
              cutoff E-value of 0.01 was kept, and searches were
              performed using the ncbi-blastp program and the
              multiple alignment method.
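As an illustration of the E-value cutoff described above, the following minimal sketch filters BLAST hits at 0.01. It assumes the hits are available in the standard 12-column tabular BLAST format (E-value in column 11, bit score in column 12); whether ProDom returns results in this form is an assumption, and the function names and sequence identifiers are hypothetical.

```python
from typing import List, NamedTuple

PRODOM_EVALUE_CUTOFF = 0.01  # cutoff quoted in the text


class BlastHit(NamedTuple):
    query: str
    subject: str
    identity: float
    evalue: float
    bitscore: float


def parse_blast_tabular(text: str) -> List[BlastHit]:
    """Parse 12-column tabular BLAST output (assumed input format)."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(BlastHit(f[0], f[1], float(f[2]), float(f[10]), float(f[11])))
    return hits


def significant_hits(hits: List[BlastHit],
                     cutoff: float = PRODOM_EVALUE_CUTOFF) -> List[BlastHit]:
    """Keep hits at or below the cutoff, most significant first."""
    return sorted((h for h in hits if h.evalue <= cutoff), key=lambda h: h.evalue)


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    sample = (
        "q1\ts1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-50\t380\n"
        "q1\ts2\t30.0\t80\t50\t2\t5\t84\t10\t90\t0.5\t25.1\n"
    )
    for hit in significant_hits(parse_blast_tabular(sample)):
        print(hit.subject, hit.evalue)
```

Sorting by E-value puts the most significant domain family first, which is convenient when only the top hit is carried forward.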
            
            
              
                3.2.5 Protein family search using Pfam
              
            
            
              Family-based classification also remains an
              important means of providing functional annotation
              for biological sequences. Pfam (Finn et al., 2008)
              is a collection of multiple sequence alignments and
              profile Hidden Markov Models (HMMs) that represent
              protein families. Information on protein functions
              can thus be obtained by comparing the sequences
              against the Pfam library of HMMs (Wu et al., 2003).
            
            
              Using the Pfam database, the complete proteome of
              E. coli was analyzed and functions were predicted
              based on these families. The cutoff E-value was set
              to the default of 0.001. Although Pfam was integrated
              with the FGT program, Pfam analyses were carried out
              separately. This is because Pfam analysis is
              comparatively time-consuming, and running Pfam for a
              large number of protein sequences in FGT would slow
              down the overall process. At the same time,
              submitting every sequence individually to the Pfam
              server would be a monotonous process. Hence, the
              Pfam FTP files were downloaded and installed on a
              local system, and a stand-alone Pfam was set up to
              support large-scale sequence analysis during the
              re-annotation.
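A stand-alone Pfam search over many sequences produces a large result file that has to be reduced to one family per protein. The sketch below shows one way this could be done. It assumes the local installation is queried with HMMER3's hmmscan and its `--tblout` table (the source does not name the program used). The function name and the rows in the test data are hypothetical, and taking the lowest E-value as the best family is an assumption.

```python
from typing import Dict, Tuple

PFAM_EVALUE_CUTOFF = 0.001  # cutoff quoted in the text


def best_pfam_family(tblout_text: str,
                     cutoff: float = PFAM_EVALUE_CUTOFF) -> Dict[str, Tuple[str, float]]:
    """Return {protein id: (family name, E-value)} for the best hit per protein.

    Assumes the whitespace-delimited hmmscan --tblout layout: 18 fixed
    columns followed by a free-text description. Column 1 is the family
    name, column 3 the query name, column 5 the full-sequence E-value.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split(None, 18)  # description (last field) may contain spaces
        family, query, evalue = fields[0], fields[2], float(fields[4])
        if evalue > cutoff:
            continue  # not significant at the chosen cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (family, evalue)
    return best
```

Proteins with no hit at or below the cutoff are simply absent from the returned dictionary, so they can be reported as unannotated by this method.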
            
            
              
                3.3 Hectic process of annotation
              
            
            
              ScanProsite, COG and ProDom were selected for
              re-analyzing the complete E. coli proteome because
              they use different strategies to explore the
              biological roles of proteins. However, these tools
              have some inconveniences. They do not allow multiple
              sequences to be searched at a time. Also, for every
              single sequence search, they produce many hits, and
              users have to interpret them carefully and choose
              the best hit. After choosing the best hit, users
              have to copy the appropriate function to a local
              database for final interpretation. Further, users
              have to open and work with multiple browser windows
              simultaneously when using these tools. This in turn
              consumes considerable manpower and man-hours.
              Hence, performing searches for the entire E. coli
              proteome using all these tools would be an immensely
              tiresome process.
            
            
              
                3.4 In-house Functional Genomics Tool
              
            
            
              Recognizing the complexity of dealing with many
              tools simultaneously, a simple but novel system,
              Bioinfotracker (a Functional Genomics Tool, FGT),
              was developed for performing controlled annotation
              of protein sequences locally, by concurrently using
              different online functional prediction tools
              (Kumar et al., 2009). FGT is a well-structured,
              flexible and highly systematic functional analysis
              program developed by us to carry out large-scale
              protein annotation. Different online tools operating
              on diverse research strategies, such as ScanProsite,
              ProDom, COG and Pfam, have been integrated in this
              tool. Once a sequence is submitted to FGT, it is
              forwarded to the servers of the integrated tools,
              and the searches are carried out at the individual
              servers in tandem. In ScanProsite, the first option,
              scanning against the PROSITE collection of motifs,
              was used. For ProDom, the default cutoff E-value of
              0.01 was kept. The cutoff E-values for COG and Pfam
              were set to 0.001. Once the searches are complete and
              the results are available, the tool fetches them from
              the corresponding servers and automatically parses
              them to choose the best function.
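The automated parsing step can be pictured with the following minimal sketch. It is not the FGT implementation. It assumes each tool's results have already been reduced to (function description, E-value) pairs, and that the "best" function is the hit with the lowest E-value at or below that tool's cutoff (the source does not state the selection criterion). ScanProsite is left out because no E-value cutoff is given for it in the text. All names and sample data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Per-tool E-value cutoffs quoted in the text.
CUTOFFS: Dict[str, float] = {"ProDom": 0.01, "COG": 0.001, "Pfam": 0.001}

Hit = Tuple[str, float]  # (function description, E-value)


def best_function(tool: str, hits: List[Hit]) -> Optional[str]:
    """Pick the lowest-E-value hit that passes the tool's cutoff, if any."""
    passing = [h for h in hits if h[1] <= CUTOFFS[tool]]
    if not passing:
        return None
    return min(passing, key=lambda h: h[1])[0]


def summarize(seq_id: str, results: Dict[str, List[Hit]]) -> str:
    """Build one tab-separated row: sequence id, then one column per tool."""
    cells = [best_function(tool, results.get(tool, [])) or "-" for tool in CUTOFFS]
    return "\t".join([seq_id] + cells)


if __name__ == "__main__":
    # Hypothetical hits for a single sequence.
    results = {
        "ProDom": [("kinase domain", 0.005), ("weak hit", 0.5)],
        "COG": [("transport", 0.002)],
        "Pfam": [("ABC transporter", 1e-20), ("other", 1e-5)],
    }
    print("\t".join(["sequence"] + list(CUTOFFS)))
    print(summarize("seq1", results))
```

A tool with no hit passing its cutoff contributes "-" to the row, so every sequence yields a row with the same number of columns.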
            
            
              Further, the results for the submitted sequences are
              presented in a simple table format that is easier to
              interpret. Hence, FGT was very helpful
              in carrying out the re-annotation of the complete
            
            