 
          Computational Molecular Biology 2016, Vol.6, No.4, 1-12
        
        
        
        
        
          Note: Abbreviations: HLS: highly likely secreted; LS: likely secreted; Cyt: cytoplasm (or cytosol); Plasm: plasma membrane;
          Mt mem: mitochondrial membrane; Mt non-m: mitochondrial non-membrane; Nuc mem: nuclear membrane;
          Nuc non-m: nuclear non-membrane; Sec: secretome
        
        
          For example, D. discoideum had 52 secreted proteins with a DUF3430 domain (unknown function) and 44 secreted
          proteins with the carbohydrate-binding domain CBM49, while the other two species had no such protein families at all.
        
        
          As expected, P. infestans had a large number of secreted elicitins, RXLR phytopathogen effector proteins,
          necrosis-inducing proteins (NPP1), phytotoxin PcF proteins, trypsins, etc., which may be related to its lifestyle
          as a plant pathogen (Meijer et al., 2014).
        
        
          T. cruzi, not surprisingly for a human parasitic pathogen, had 345 mucin-like glycoproteins, 198 BNR repeat-like
          domain proteins, and 102 Peptidase_M8 (leishmanolysin) proteins in its secretome, while the other two species had
          none in those categories. These secreted proteins may play an important role in the ability of T. cruzi to invade
          and infect humans and cause Chagas' disease (Costa et al., 2016).
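          The comparison above amounts to finding protein families that occur in the secretome of only one species. A
          minimal sketch of that step follows. It is an illustration, not the analysis code used for ProtSecKB. The
          function name and data layout are hypothetical. The counts are the ones quoted in the text; no counts are given
          there for P. infestans, so its entry is left empty.

```python
from collections import defaultdict


def species_specific_families(counts):
    """Return families found in the secretome of exactly one species.

    counts: {species: {family: number_of_secreted_proteins}}
    (hypothetical layout chosen for this sketch)
    """
    # Record which species have at least one secreted member of each family.
    owners = defaultdict(list)
    for species, families in counts.items():
        for family, n in families.items():
            if n > 0:
                owners[family].append(species)

    # Keep only the families owned by a single species.
    specific = defaultdict(dict)
    for family, species_list in owners.items():
        if len(species_list) == 1:
            species = species_list[0]
            specific[species][family] = counts[species][family]
    return dict(specific)


# Counts quoted in the text; P. infestans counts are not given there.
counts = {
    "D. discoideum": {"DUF3430": 52, "CBM49": 44},
    "P. infestans": {},
    "T. cruzi": {
        "Mucin-like glycoprotein": 345,
        "BNR repeat-like domain": 198,
        "Peptidase_M8": 102,
    },
}
specific = species_specific_families(counts)
```

          With real data, each inner dictionary would hold the full family counts for a species, so that families shared
          by two or more species are filtered out.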
        
        
          4 Discussion
        
        
          We constructed ProtSecKB to provide a resource of curated and predicted subcellular locations of protist
          proteins. Because none of the tools we selected were specifically trained for protists, the prediction accuracies
          were lower than those in other eukaryotes, including fungi, plants and animals (Lum and Min,
          2011; Lum et al., 2014; Meinken et al., 2014; Meinken et al., 2015). However, our evaluation using curated protein
          subcellular locations showed that the prediction specificities for nearly all subcellular locations except the nucleus
          were > 90%; in particular, prediction of secreted proteins had an MCC value of 0.71 with 89.0% sensitivity
          and 96.2% specificity (Table 1). We therefore concluded that the prediction of secreted proteins was relatively reliable.
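          For readers unfamiliar with these measures, the sketch below shows how sensitivity, specificity and the Matthews
          correlation coefficient (MCC) are computed from a binary confusion matrix using their standard definitions. The
          function name is hypothetical and the counts in the example are invented for illustration; they are not the
          evaluation data behind Table 1.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, mcc) from confusion-matrix counts."""
    # Sensitivity: fraction of true positives (e.g. secreted proteins) recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of true negatives correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    # MCC: correlation between predicted and actual labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Illustrative counts only (not taken from the paper).
sens, spec, mcc = binary_metrics(tp=80, fp=10, tn=890, fn=20)
```

          Because MCC uses all four cells of the confusion matrix, it is less sensitive to class imbalance than accuracy,
          which matters when secreted proteins are a small fraction of a proteome.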
        
        
          Other tools are also available as webservers including the Cell-PLoc servers (Chou and Shen, 2008) and some
        
        
          others (Meinken and Min, 2012). These tools and their related publications can be found at our website
        
        
        
           (Meinken and Min, 2012). Because standalone versions are not available
          for some of these tools, such as Cell-PLoc, and others are too slow to process large datasets, we were not able to
          use them for our data processing. However, we suggest that users apply these tools to obtain a second prediction
          for proteins of interest, as our experience showed that using multiple tools improves prediction specificity.
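          One simple way to combine several predictors is a vote, in which a protein is called secreted only when enough
          tools agree. The sketch below illustrates the idea under stated assumptions: the function name, the tool labels
          and the default of requiring unanimous agreement are all hypothetical, and this is not the pipeline used to
          build ProtSecKB.

```python
def consensus_secreted(calls, min_votes=None):
    """Combine per-tool secretion calls into one consensus call.

    calls: {tool_name: bool}, True if that tool predicts "secreted".
    min_votes: number of positive calls required; defaults to all tools.
    """
    if not calls:
        raise ValueError("at least one tool call is required")
    if min_votes is None:
        # Unanimity: the strictest rule, favouring specificity.
        min_votes = len(calls)
    positive = sum(1 for call in calls.values() if call)
    return positive >= min_votes


# Hypothetical tool labels, for illustration only.
agreeing = consensus_secreted({"tool_a": True, "tool_b": True})
disagreeing = consensus_secreted({"tool_a": True, "tool_b": False})
lenient = consensus_secreted({"tool_a": True, "tool_b": False}, min_votes=1)
```

          Requiring more votes tends to reduce false positives at the cost of missing some true secreted proteins, so the
          threshold should be chosen according to whether specificity or sensitivity matters more for the question at hand.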
        
        
          Recently, our research group has made efforts to improve the prediction accuracies of subcellular
          locations for plant proteins (Neizer-Ashun et al., 2015), fungal proteins (Munyon et al., 2015), and animal/human
          proteins (Khavari, 2016) using various statistical algorithms. The results were mixed across subcellular
          locations, methods and groups of eukaryotic proteins. However, some of the algorithms were
          promising in improving the prediction accuracy. When enough experimental data on protist protein subcellular
          locations become available, a dedicated tool will need to be implemented for protist protein subcellular location prediction.
        
        
          ProtSecKB contains 101 unique protist species, some of which have multiple strains, resulting in a total of
          127 organisms with complete proteomes. The database allows each subcellular proteome in each species
          to be searched and downloaded for detailed comparative analysis. As an example of the usage of the database,
          our analysis of protein families in three species with different lifestyles demonstrated that the secretome of
          each species may play an important role in determining its lifestyle (Table 3). We have also implemented a
          curation tool, accessible through ProtSecKB, for the community to manually curate subcellular locations of protist
          proteins that have experimental evidence. We anticipate that the database will help the protist research
          community design further experiments to characterize protist proteins and understand protist biology,
          particularly that of protist pathogens of plants, humans and animals.
        
        
          Authors' contributions
        
        
          XM and CC conceived the work; BP, VA and JM implemented the database; GK curated proteins; XM, BP and FY
          analyzed the data; XM, BP, JM and CC prepared the manuscript. All authors read and approved the final
          manuscript.