Computational Molecular Biology 2016, Vol.6, No.4, 1-12

http://cmb.biopublisher.ca

construct secretome databases for fungi, plants, and animals (Lum and Min, 2011; Lum et al., 2014; Meinken et

al., 2014; Meinken et al., 2015). In this work, we describe the Protist Secretome and Subcellular Proteome

Knowledgebase (ProtSecKB, http://bioinformatics.ysu.edu/secretomes/protist/index.php). The database will serve as a useful resource for the community working with protist organisms for biomedical research.

2 Methods of Database Construction

2.1 Data collection

The protist protein sequences were retrieved from the UniProtKB/Swiss-Prot dataset and the UniProtKB/TrEMBL

dataset (release 2016-02) (http://www.uniprot.org/downloads) using our in-house script. As proteins in the

Kingdom Protista are not explicitly labeled as “Protist” or “Protista”, we retrieved all entries belonging to

“Eukaryota” but not further classified as “Fungi”, “Metazoa”, or “Viridiplantae”. The UniProtKB/Swiss-Prot

dataset contains manually annotated and reviewed protein sequences. The UniProtKB/TrEMBL dataset contains

computationally analyzed protein sequences. The combined protist dataset consisted of a total of 1,970,022

protein entries with 8,661 and 1,961,361 entries retrieved from the Swiss-Prot dataset and the TrEMBL dataset,

respectively. The identifier mapping data, including UniProt accession number (AC), UniProt ID, RefSeq accession number, and gi number, were retrieved from the UniProt ID mapping data file. All data used in the database construction and analysis can be downloaded from the website http://proteomics.ysu.edu/publication/data/ProtSecKB.
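The taxonomic selection rule described above can be illustrated with a minimal sketch. It is not the in-house script itself; the function names, and the assumption that each entry's lineage is available as a semicolon-separated string in the style of a UniProt OC line, are ours and serve for illustration only.

```python
# Hypothetical sketch of the protist selection rule: keep entries whose
# lineage contains "Eukaryota" but none of the three excluded kingdoms.
EXCLUDED_KINGDOMS = {"Fungi", "Metazoa", "Viridiplantae"}


def parse_lineage(oc_text):
    """Split a semicolon-separated lineage string into taxon names."""
    cleaned = oc_text.strip().rstrip(".")
    return [taxon.strip() for taxon in cleaned.split(";") if taxon.strip()]


def is_protist(lineage):
    """Return True for a eukaryote not classified in an excluded kingdom."""
    taxa = set(lineage)
    return "Eukaryota" in taxa and not (taxa & EXCLUDED_KINGDOMS)


if __name__ == "__main__":
    # Illustrative lineage strings, not records taken from the database.
    examples = [
        "Eukaryota; Alveolata; Apicomplexa.",
        "Eukaryota; Fungi; Dikarya; Ascomycota.",
        "Bacteria; Proteobacteria.",
    ]
    for text in examples:
        print(text, is_protist(parse_lineage(text)))
```

Matching on whole taxon names, rather than on substrings of the lineage text, avoids accidental hits on taxa whose names merely contain one of the excluded terms.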

2.2 Prediction of protein subcellular locations

Because similar approaches using the same set of prediction tools were employed by our group in the construction of FunSecKB (Lum and Min, 2011), FunSecKB2 (Meinken et al., 2014), PlantSecKB (Lum et al., 2014), and MetazSecKB (Meinken et al., 2015), we introduce these tools only briefly here. For detailed information, the relevant references for each tool or the exemplar introduction by Lum and Min (2013) can be consulted. The

software tools used in this work include SignalP (version 4.0), TargetP, Phobius, WoLF PSORT, TMHMM, and

PS-Scan. In brief, SignalP 4.0 was used for secretory signal peptide prediction (Petersen et al., 2011). In addition,

we included prediction information from SignalP 3.0 (Bendtsen et al., 2004), as it provides more accurate

cleavage site prediction than SignalP 4.0 (Petersen et al., 2011). Phobius is a combined signal peptide and

transmembrane topology predictor (Käll et al., 2007). TargetP predicts the presence of any signal sequences such

as signal peptide (SP), chloroplast transit peptide (cTP), or mitochondrial targeting peptide (mTP) in the

N-terminus (Emanuelsson et al., 2007). TMHMM predicts the presence and topology of transmembrane helices

and their orientation to the membrane (in/out) (Krogh et al., 2001). PS-Scan was used to scan the PROSITE

database (http://www.expasy.org/tools/scanprosite/) to identify proteins carrying the ER targeting sequence (PROSITE: PS00014)

(Sigrist et al., 2010). WoLF PSORT predicts multiple subcellular locations including cytosol, cytoskeleton, ER,

extracellular (secreted), Golgi apparatus, lysosome, mitochondria, nuclear, peroxisome, plasma membrane, and

vacuolar membrane (Horton et al., 2007). Because none of these programs yet offers parameters specific to protists, the default parameters for eukaryotes or fungi, where available, were used, based on our previous evaluation (Min, 2010). We used the following procedure to assign a protein subcellular location. The annotated

subcellular location in UniProtKB and our manual curation take precedence over computational prediction. Thus,

only proteins not having an annotated subcellular location are subjected to computational assignment. However,

the prediction information generated by all the tools is available for all proteins. It should be noted that some of

the proteins may have more than one subcellular location.
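The precedence rule stated above can be sketched as follows. This is an illustration under stated assumptions rather than the code used to build the database: the function name and the source labels are hypothetical, and because the relative priority of UniProtKB annotation and manual curation is not specified, the sketch simply pools the two.

```python
# Hypothetical sketch: annotated or curated locations override predictions.
def assign_locations(annotated, curated, predicted):
    """Return (locations, source) for one protein.

    annotated: locations annotated in UniProtKB
    curated:   locations from manual curation
    predicted: locations proposed by the prediction tools
    A protein may carry more than one location, so lists are returned.
    """
    # Assumption: annotation and curation are pooled, since their
    # relative priority is not stated.
    evidence = set(annotated) | set(curated)
    if evidence:
        return sorted(evidence), "annotation/curation"
    # Only proteins without any annotated location fall back to prediction.
    return sorted(set(predicted)), "prediction"


if __name__ == "__main__":
    print(assign_locations(["nucleus"], [], ["cytosol"]))
    print(assign_locations([], [], ["mitochondria", "cytosol"]))
```

In this sketch the predictions are discarded only for the final assignment; they would still be stored alongside each protein, consistent with the prediction information being available for all entries.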

Membrane proteins: A membrane protein is defined as a protein having one or more transmembrane domains predicted by TMHMM. However, if only one transmembrane domain is predicted, it is located within the N-terminal 70 amino acids, and a signal peptide is also predicted by SignalP 4.0, the protein is not counted as a membrane protein.
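The membrane protein rule can be expressed as a short decision function. The sketch below is illustrative only: the function and parameter names are hypothetical, helices are assumed to be given as 1-based (start, end) residue positions parsed from TMHMM output, and "located within the N-terminal 70 amino acids" is interpreted here as the helix ending at or before residue 70.

```python
# Hypothetical sketch of the membrane protein rule.
def is_membrane_protein(tm_helices, has_signalp4_peptide, n_term_window=70):
    """Decide whether a protein is counted as a membrane protein.

    tm_helices: list of (start, end) positions of predicted TM helices
    has_signalp4_peptide: True if SignalP 4.0 predicts a signal peptide
    n_term_window: length of the N-terminal region considered (70 residues)
    """
    if not tm_helices:
        return False  # no transmembrane domain predicted
    if len(tm_helices) == 1 and has_signalp4_peptide:
        _start, end = tm_helices[0]
        # Assumption: "within the N-terminal 70 residues" means the whole
        # helix lies inside the window, so it is probably a signal
        # peptide mistaken for a transmembrane helix.
        if end <= n_term_window:
            return False
    return True


if __name__ == "__main__":
    print(is_membrane_protein([(5, 27)], True))              # single N-terminal helix plus signal peptide
    print(is_membrane_protein([(5, 27), (100, 122)], True))  # two helices
```

The exception exists because the hydrophobic core of a signal peptide resembles a transmembrane helix, so a single N-terminal helix that coincides with a predicted signal peptide is not taken as evidence of membrane insertion.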

Mitochondrial proteins: Assignment of mitochondrial proteins was based on WoLF PSORT prediction. If it is
