Computational Molecular Biology
however, is available for all plant proteins. Some of
the proteins may have more than one subcellular
location. The following criteria are applied for
computational classification of protein subcellular
locations:
Membrane proteins: A protein predicted to contain one
or more transmembrane domains by TMHMM is
classified as a membrane protein. However, if only one
transmembrane domain is predicted, it is located within
the N-terminal 70 amino acids, and a signal peptide is
also predicted by SignalP 4.0, the protein is not
counted as a membrane protein.
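The membrane-protein rule can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Prediction` container and its field names are hypothetical and stand in for already-parsed TMHMM and SignalP 4.0 output, and "located within the N-terminal 70 amino acids" is assumed to mean that the helix ends at or before residue 70.

```python
from dataclasses import dataclass
from typing import List, Tuple

N_TERMINAL_WINDOW = 70  # residues, taken from the criterion in the text


@dataclass
class Prediction:
    """Hypothetical container for parsed tool output for one protein."""
    tm_helices: List[Tuple[int, int]]  # (start, end), 1-based, as reported by TMHMM
    has_signal_peptide: bool           # SignalP 4.0 call


def is_membrane_protein(p: Prediction) -> bool:
    """Apply the membrane-protein criterion described above."""
    if not p.tm_helices:
        return False
    if len(p.tm_helices) == 1:
        _start, end = p.tm_helices[0]
        # Assumption: "within the N-terminal 70 amino acids" means the whole
        # helix lies in residues 1..70. A lone helix there, together with a
        # signal peptide call, is not counted as a transmembrane domain.
        if end <= N_TERMINAL_WINDOW and p.has_signal_peptide:
            return False
    return True
```

For example, a protein with a single predicted helix at residues 5-27 and a signal peptide call is not counted as a membrane protein, whereas the same helix without a signal peptide call is.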
Chloroplast proteins: A protein predicted as “C” (for
chloroplast) for subcellular location by TargetP is
classified as a chloroplast protein. If it is also
classified as a membrane protein, then it is further
classified as chloroplast membrane protein.
Mitochondrial proteins: A protein predicted as “M”
(for mitochondrial) for subcellular location by TargetP
is classified as a mitochondrial protein. If it is also
classified as a membrane protein, then it is further
classified as mitochondrial membrane protein.
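The two organelle rules share one shape, which the sketch below makes explicit. It assumes the TargetP location code has already been parsed into a one-letter string and that the membrane call comes from the membrane-protein criterion; the function and dictionary names are hypothetical.

```python
from typing import Optional

# TargetP location codes used in the text: "C" (chloroplast), "M" (mitochondrial).
ORGANELLE_BY_TARGETP_CODE = {"C": "chloroplast", "M": "mitochondrial"}


def organelle_class(targetp_code: str, is_membrane: bool) -> Optional[str]:
    """Return the organelle class label, or None if TargetP predicts neither."""
    organelle = ORGANELLE_BY_TARGETP_CODE.get(targetp_code)
    if organelle is None:
        return None
    # A protein that is also classified as a membrane protein is placed
    # in the corresponding organelle membrane subclass.
    if is_membrane:
        return f"{organelle} membrane protein"
    return f"{organelle} protein"
```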
ER proteins: Proteins predicted to contain a signal
peptide by SignalP 4.0 and an ER target signal
(Prosite: PS00014) by PS-Scan were treated as
luminal ER proteins.
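As a rough illustration of the ER rule, the sketch below replaces the PS-Scan step with a regular expression. It assumes that Prosite PS00014 corresponds to the C-terminal pattern [KRHQSA]-[DENQ]-E-L>; that pattern should be checked against the current Prosite release before use. The function name is hypothetical, and the signal peptide call is assumed to come from SignalP 4.0.

```python
import re

# Assumed transcription of Prosite PS00014 (ER targeting sequence):
# [KRHQSA]-[DENQ]-E-L>, where ">" anchors the match at the C-terminus.
ER_TARGET_RE = re.compile(r"[KRHQSA][DENQ]EL$")


def is_luminal_er(sequence: str, has_signal_peptide: bool) -> bool:
    """True if the protein has a signal peptide call and a C-terminal ER motif."""
    # Normalise the sequence: drop whitespace and a trailing stop symbol, if any.
    seq = sequence.strip().rstrip("*").upper()
    return has_signal_peptide and ER_TARGET_RE.search(seq) is not None
```

A regular expression is used here only to keep the example self-contained; it does not reproduce any post-processing that PS-Scan itself may apply.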
Complete secretomes: A secretome is the complete set of
proteins secreted by a species. Only proteins that are
predicted to have a secretory signal peptide by all
three predictors - SignalP 4.0, Phobius, and TargetP -
and that are not classified in any of the above
categories are included in the secretome. However,
proteins that are not classified in any of the above
categories and are predicted to have a signal peptide
by only one or two of the predictors are assigned as
“weakly likely secreted” or “likely secreted”,
respectively, because our previous evaluation revealed
that the signal peptide in some annotated secreted
proteins can be detected by only one or two predictors
(Lum and Min, 2011a).
Requiring agreement among all three predictors
increases the specificity of secretome prediction and
thereby improves prediction accuracy (Min, 2010;
Melhem et al., 2013).
All manually curated secreted and extracellular
proteins are included in the complete secretomes.
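The secretome rule amounts to counting predictor votes, as the sketch below shows. Several points are assumptions rather than statements from the text: one positive call is mapped to “weakly likely secreted” and two to “likely secreted”, curated status is checked before the exclusion by other categories, and the labels "curated secreted" and "not secreted" are hypothetical placeholders. Inputs are assumed to be already-parsed boolean calls.

```python
from typing import Dict

PREDICTORS = ("SignalP", "Phobius", "TargetP")

# Assumed mapping from number of positive signal peptide calls to label.
LABEL_BY_VOTES = {
    3: "secreted",                # included in the secretome
    2: "likely secreted",
    1: "weakly likely secreted",
    0: "not secreted",            # hypothetical label for no positive call
}


def secretion_class(calls: Dict[str, bool],
                    in_other_category: bool,
                    curated_secreted: bool = False) -> str:
    """Classify one protein from the three signal peptide predictions."""
    # Curated secreted proteins are always part of the complete secretome.
    if curated_secreted:
        return "curated secreted"
    # Proteins already classified as membrane, chloroplast, mitochondrial,
    # or ER proteins are excluded from the predicted classes.
    if in_other_category:
        return "not secreted"
    votes = sum(bool(calls.get(name, False)) for name in PREDICTORS)
    return LABEL_BY_VOTES[votes]
```

For instance, `secretion_class({"SignalP": True, "Phobius": True, "TargetP": False}, in_other_category=False)` yields "likely secreted" under these assumptions.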
Curated secreted proteins: This category includes
proteins annotated as “secreted”, “extracellular”, or
“cell wall” in the subcellular location field of
“reviewed” entries in the UniProtKB/Swiss-Prot data
set. It also includes secreted proteins manually
collected from recent literature by our curators.
GPI-anchored proteins: Signal peptide-containing
proteins that were predicted to have a GPI anchor by
FragAnchor were further classified as GPI-anchored
proteins. Proteins predicted to have a signal
peptide and a GPI anchor may attach to the outer
leaflet of the plasma membrane or be secreted and
become components of the cell wall. These proteins
are involved in signaling, adhesion, stress response,
and cell wall remodeling or play other roles in growth
and development (Borner et al., 2002; Borner et al.,
2003; Gillmor et al., 2005; Simpson et al., 2009).
Proteins in other subcellular locations: Other
subcellular locations, including cytosol (cytoplasm),
cytoskeleton, Golgi apparatus, lysosome, nucleus,
peroxisome, plasma membrane, and vacuole, were
predicted by WoLF PSORT.
1.3 Computational prediction accuracies of protein
subcellular locations
The prediction methods we used above were
developed based on our previous evaluation of
computational tools (Min, 2010; Meinken and Min,
2012; Melhem et al., 2013). To estimate the prediction
accuracies of our methods for each subcellular
location, we used two datasets (Table 1). Dataset A
consists of 15 028 proteins. This dataset contains
proteins from the UniProtKB/Swiss-Prot dataset with
a curated subcellular location. Proteins having
multiple subcellular locations or labeled as “fragment”
were excluded. Dataset B consists of 6 908 proteins
and was generated from Dataset A by excluding
entries containing the terms “by similarity”, “probable”,
or “predicted” in their subcellular location annotation.
Compared with other methods using a single tool, our
method - i.e. using a combination of multiple tools