Computational Molecular Biology 2016, Vol.6, No.4, 1-12
construct secretome databases for fungi, plants, and animals (Lum and Min, 2011; Lum et al., 2014; Meinken et
al., 2014; Meinken et al., 2015). In this work, we describe the Protist Secretome and Subcellular Proteome
Knowledgebase (ProtSecKB). The database will serve as a useful resource for the community working with
protist organisms for biomedical research.
2 Methods of Database Construction
2.1 Data collection
The protist protein sequences were retrieved from the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL datasets
(release 2016-02) using our in-house script. Because proteins from the Kingdom Protista are not labeled as
“Protist” or “Protista” in UniProtKB, we retrieved all entries belonging to “Eukaryota” but not further
classified as “Fungi”, “Metazoa”, or “Viridiplantae”. The UniProtKB/Swiss-Prot
dataset contains manually annotated and reviewed protein sequences. The UniProtKB/TrEMBL dataset contains
computationally analyzed protein sequences. The combined protist dataset consisted of a total of 1,970,022
protein entries with 8,661 and 1,961,361 entries retrieved from the Swiss-Prot dataset and the TrEMBL dataset,
respectively. The identifier mapping data including UniProt accession number (AC), UniProt ID, RefSeq
accession number, and gi number were retrieved from the UniProt ID mapping data file. All data used in the
database construction and analysis can be downloaded from the website.
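The taxonomic selection rule described above can be sketched in a few lines of Python. This is only an illustration and not the in-house script itself, which is not described in this work. It assumes the entries are read from UniProtKB flat files, where the lineage is given on `OC` lines; the function names `parse_oc_lines` and `is_protist_lineage` are hypothetical.

```python
# Kingdoms excluded from the protist set, as stated in the text.
EXCLUDED_KINGDOMS = {"Fungi", "Metazoa", "Viridiplantae"}


def parse_oc_lines(record_text):
    """Collect the taxa listed on the OC (organism classification) lines
    of one UniProtKB flat-file entry."""
    taxa = []
    for line in record_text.splitlines():
        if line.startswith("OC   "):
            for taxon in line[5:].split(";"):
                taxon = taxon.strip().rstrip(".")  # last OC line ends with "."
                if taxon:
                    taxa.append(taxon)
    return taxa


def is_protist_lineage(lineage):
    """True for a eukaryote not classified as a fungus, animal, or green plant."""
    taxa = set(lineage)
    return "Eukaryota" in taxa and not (taxa & EXCLUDED_KINGDOMS)
```

An entry is kept when `is_protist_lineage(parse_oc_lines(entry_text))` is true, so "protist" is defined here purely by exclusion, as in the text.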
2.2 Prediction of protein subcellular locations
Because similar approaches using the same set of prediction tools were employed by our group in the
construction of FunSecKB (Lum and Min, 2011), FunSecKB2 (Meinken et al., 2014), PlantSecKB (Lum et al., 2014),
and MetazSecKB (Meinken et al., 2015), we only briefly introduce these tools in this work. For detailed
information, readers can consult the relevant references for each tool or the introduction by Lum and Min (2013). The
software tools used in this work include SignalP (version 4.0), TargetP, Phobius, WoLF PSORT, TMHMM, and
PS-Scan. In brief, SignalP 4.0 was used for secretory signal peptide prediction (Petersen et al., 2011). We also
included prediction information from SignalP 3.0 (Bendtsen et al., 2004), as it provides more accurate
cleavage site prediction than SignalP 4.0 (Petersen et al., 2011). Phobius is a combined signal peptide and
transmembrane topology predictor (Käll et al., 2007). TargetP predicts the presence of any signal sequences such
as signal peptide (SP), chloroplast transit peptide (cTP), or mitochondrial targeting peptide (mTP) in the
N-terminus (Emanuelsson et al., 2007). TMHMM predicts the presence and topology of transmembrane helices
and their orientation to the membrane (in/out) (Krogh et al., 2001). PS-Scan was used to scan the PROSITE
database to identify ER-targeting proteins (Prosite: PS00014)
(Sigrist et al., 2010). WoLF PSORT predicts multiple subcellular locations including cytosol, cytoskeleton, ER,
extracellular (secreted), Golgi apparatus, lysosome, mitochondria, nuclear, peroxisome, plasma membrane, and
vacuolar membrane (Horton et al., 2007). Because none of these programs yet offers parameters specific to
protists, the default parameters for eukaryotes or fungi, if available, were used, based on our previous
evaluation (Min, 2010). We used the following procedure to assign a protein subcellular location. The annotated
subcellular location in UniProtKB and our manual curation take precedence over computational prediction. Thus,
only proteins not having an annotated subcellular location are subjected to computational assignment. However,
the prediction information generated by all the tools is available for all proteins. It should be noted that some of
the proteins may have more than one subcellular location.
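The precedence rule above can be illustrated with a short Python sketch. It is not the authors' implementation. The function name `assign_locations` and the returned source labels are hypothetical, and pooling UniProtKB annotation with manual curation (curated locations listed first) is an assumption, since the text does not state how the two are ordered relative to each other.

```python
def assign_locations(annotated, curated, predicted):
    """Return (locations, source) for one protein.

    annotated: locations annotated in UniProtKB
    curated:   locations from manual curation
    predicted: locations assigned by the prediction tools
    """
    # Pool curated and annotated locations, dropping duplicates while
    # keeping order (assumption: both are treated as equally authoritative).
    known = list(dict.fromkeys(list(curated) + list(annotated)))
    if known:
        return known, "annotation"
    # Prediction is used only when no annotated location exists.
    return list(dict.fromkeys(predicted)), "prediction"
```

The function returns a list rather than a single value because a protein may have more than one subcellular location.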
Membrane proteins: A membrane protein is a protein having one or more transmembrane domains predicted by
TMHMM. However, if only one transmembrane domain is predicted, it is located within the N-terminal 70
amino acids, and a signal peptide is also predicted by SignalP 4.0, then the protein is not counted as a
membrane protein.
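The membrane protein rule can be written as a small decision function. The sketch below is a minimal illustration under stated assumptions: the `Prediction` container and its field names are hypothetical, and "located within the N-terminus 70 amino acids" is interpreted as the predicted helix ending at or before residue 70. The exception exists because the hydrophobic core of a signal peptide is often predicted as a transmembrane helix by TMHMM.

```python
from dataclasses import dataclass
from typing import List, Tuple

N_TERMINAL_WINDOW = 70  # residues, as stated in the text


@dataclass
class Prediction:
    # (start, end) residue positions of helices predicted by TMHMM, 1-based
    tm_helices: List[Tuple[int, int]]
    # True if SignalP 4.0 predicts a signal peptide
    signalp4_positive: bool


def is_membrane_protein(p: Prediction) -> bool:
    """Apply the membrane protein rule to one protein's predictions."""
    if not p.tm_helices:
        return False
    if len(p.tm_helices) == 1:
        _start, end = p.tm_helices[0]
        # A lone N-terminal helix together with a predicted signal peptide
        # is treated as the signal peptide, not as a membrane anchor.
        if end <= N_TERMINAL_WINDOW and p.signalp4_positive:
            return False
    return True
```

For example, `Prediction([(5, 27)], True)` is not counted as a membrane protein, whereas the same helix without a SignalP 4.0 signal peptide, or accompanied by a second helix, is.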
Mitochondrial proteins: Assignment of mitochondrial proteins was based on WoLF PSORT prediction. If it is