 
          Computational Molecular Biology 2016, Vol.6, No.4, 1-12
        
        
        
        
        
          construct secretome databases for fungi, plants, and animals (Lum and Min, 2011; Lum et al., 2014; Meinken et
        
        
          al., 2014; Meinken et al., 2015). In this work, we describe the Protist Secretome and Subcellular Proteome
        
        
          Knowledgebase (ProtSecKB,
        
        
        
          ). The database will serve as
        
        
          a useful resource for the community working with protist organisms for biomedical research.
        
        
          2 Methods of Database Construction
        
        
          2.1 Data collection
        
        
          The protist protein sequences were retrieved from the UniProtKB/Swiss-Prot dataset and the UniProtKB/TrEMBL
        
        
          dataset (release 2016-02) 
        
        
        
        
        
          using our in-house script. Because proteins in the


          Kingdom Protista are not labeled as “Protist” or “Protista”, we retrieved all entries belonging to
        
        
          “Eukaryota” but not further classified as “Fungi”, “Metazoa”, or “Viridiplantae”. The UniProtKB/Swiss-Prot
        
        
          dataset contains manually annotated and reviewed protein sequences. The UniProtKB/TrEMBL dataset contains
        
        
          computationally analyzed protein sequences. The combined protist dataset consisted of a total of 1,970,022
        
        
          protein entries with 8,661 and 1,961,361 entries retrieved from the Swiss-Prot dataset and the TrEMBL dataset,
        
        
          respectively. The identifier mapping data including UniProt accession number (AC), UniProt ID, RefSeq
        
        
          accession number, and gi number were retrieved from the UniProt ID mapping data file. All data used in the
        
        
          database construction and analysis can be downloaded from the website at
        
        
        
          /.
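The taxonomy-based selection described above can be sketched as follows. This is a minimal illustration and not the in-house script itself: it assumes that each entry's lineage is read from the OC lines of a UniProtKB flat file, and the function names `parse_oc_lines` and `is_protist_entry` are hypothetical.

```python
# Hypothetical sketch of the lineage filter; not the authors' in-house script.
# Assumption: lineage comes from the OC lines of a UniProtKB flat-file entry.

EXCLUDED_KINGDOMS = {"Fungi", "Metazoa", "Viridiplantae"}


def parse_oc_lines(oc_lines):
    """Join OC lines and split them into a list of taxon names."""
    text = " ".join(line[2:].strip() for line in oc_lines if line.startswith("OC"))
    return [taxon.strip() for taxon in text.rstrip(".").split(";") if taxon.strip()]


def is_protist_entry(lineage):
    """True for a eukaryotic entry not classified as fungus, animal, or plant."""
    taxa = set(lineage)
    return "Eukaryota" in taxa and not (taxa & EXCLUDED_KINGDOMS)


# Illustrative lineage only; real entries may list different intermediate taxa.
example = parse_oc_lines([
    "OC   Eukaryota; Alveolata; Apicomplexa; Aconoidasida; Haemosporida;",
    "OC   Plasmodiidae; Plasmodium.",
])
keep = is_protist_entry(example)
```

Defining the protist set by exclusion mirrors the text: an entry is kept when it is eukaryotic and none of the three excluded kingdom names appears anywhere in its lineage.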
        
        
          2.2 Prediction of protein subcellular locations
        
        
          As similar approaches using the same set of prediction tools have been employed in the construction of FunSecKB
        
        
          (Lum and Min, 2011), FunSecKB2 (Meinken et al., 2014), PlantSecKB (Lum et al., 2014), and MetazSecKB
        
        
          (Meinken et al., 2015) in our group, we only briefly introduce these tools in this work. For detailed information,
        
        
          the relevant references for each tool or the exemplar introduction by Lum and Min (2013) can be consulted. The
        
        
          software tools used in this work include SignalP (version 4.0), TargetP, Phobius, WoLF PSORT, TMHMM, and
        
        
          PS-Scan. In brief, SignalP 4.0 was used for secretory signal peptide prediction (Petersen et al., 2011). However,
        
        
          we also included prediction information from SignalP 3.0 (Bendtsen et al., 2004) as it provides more accurate
        
        
          cleavage site prediction than SignalP 4.0 (Petersen et al., 2011). Phobius is a combined signal peptide and
        
        
          transmembrane topology predictor (Käll et al., 2007). TargetP predicts the presence of any signal sequences such
        
        
          as signal peptide (SP), chloroplast transit peptide (cTP), or mitochondrial targeting peptide (mTP) in the
        
        
          N-terminus (Emanuelsson et al., 2007). TMHMM predicts the presence and topology of transmembrane helices
        
        
          and their orientation to the membrane (in/out) (Krogh et al., 2001). PS-Scan was used to scan the PROSITE
        
        
          database 
        
        
        
          for identifying ER targeting proteins (Prosite: PS00014)
        
        
          (Sigrist et al., 2010). WoLF PSORT predicts multiple subcellular locations including cytosol, cytoskeleton, ER,
        
        
          extracellular (secreted), Golgi apparatus, lysosome, mitochondria, nuclear, peroxisome, plasma membrane, and
        
        
          vacuolar membrane (Horton et al., 2007). As no protist-specific parameters were yet available for any of


          these programs, the default parameters for eukaryotes or fungi, where available, were used, based on our previous


          evaluation (Min, 2010). We used the following procedure to assign protein subcellular locations. The annotated
        
        
          subcellular location in UniProtKB and our manual curation take precedence over computational prediction. Thus,
        
        
          only proteins not having an annotated subcellular location are subjected to computational assignment. However,
        
        
          the prediction information generated by all the tools is available for all proteins. It should be noted that some of
        
        
          the proteins may have more than one subcellular location.
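The precedence rule above can be expressed as a short sketch. It is an illustration under stated assumptions rather than the pipeline used for the database: the function name `assign_locations` is hypothetical, and merging curated and annotated locations with curated ones listed first is an assumption, since the text does not rank the two against each other.

```python
# Hypothetical sketch of the precedence rule; names and merge order are assumptions.

def assign_locations(annotated, curated, predicted):
    """Return the locations for one protein.

    annotated: locations annotated in UniProtKB (may be empty)
    curated:   locations from manual curation (may be empty)
    predicted: locations from the computational tools (may be empty)
    """
    # Annotation and curation take precedence over any prediction.
    known = list(curated) + [loc for loc in annotated if loc not in curated]
    if known:
        return known
    # Only proteins without an annotated location get a predicted one.
    return list(predicted)


locations = assign_locations(
    annotated=["mitochondrion"], curated=[], predicted=["cytosol"]
)
```

A list is returned because a protein may have more than one subcellular location; the predictions themselves remain available for every protein regardless of which source supplies the assignment.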
        
        
          Membrane proteins:
        
        
          A membrane protein is a protein having one or more transmembrane domains predicted by
        
        
          TMHMM. However, if only one transmembrane domain is predicted, it is located within the N-terminal 70


          amino acids, and a signal peptide is also predicted by SignalP 4.0, then the protein is not counted as a membrane
        
        
          protein.
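The membrane protein rule can be sketched as below. This is a minimal illustration, not the code used to build the database. It assumes that TMHMM helices are supplied as 1-based (start, end) residue positions and that "within the N-terminal 70 amino acids" means the whole helix ends at or before residue 70; the function name `is_membrane_protein` is hypothetical.

```python
# Hypothetical sketch of the membrane protein rule described in the text.
# Assumption: a helix counts as N-terminal when it ends at or before residue 70.

N_TERMINAL_WINDOW = 70


def is_membrane_protein(tm_helices, has_signal_peptide):
    """Classify one protein.

    tm_helices:         list of (start, end) positions predicted by TMHMM
    has_signal_peptide: True if SignalP 4.0 predicts a signal peptide
    """
    if not tm_helices:
        return False
    if len(tm_helices) == 1 and has_signal_peptide:
        _start, end = tm_helices[0]
        # A lone N-terminal helix plus a signal peptide is treated as the
        # signal peptide itself, so the protein is not counted as membrane.
        if end <= N_TERMINAL_WINDOW:
            return False
    return True


lone_n_terminal = is_membrane_protein([(5, 27)], has_signal_peptide=True)
```

The exception exists because the hydrophobic core of a signal peptide can be reported as a transmembrane helix, so a single N-terminal helix is discounted only when a signal peptide is also predicted.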
        
        
          Mitochondrial proteins:
        
        
          Assignment of mitochondrial proteins was based on WoLF PSORT prediction. If it is