Computational Molecular Biology

however, is available for all plant proteins. Some of

the proteins may have more than one subcellular

location. The following criteria are applied for

computational classification of protein subcellular

locations:

Membrane proteins: A protein predicted to contain one

or more transmembrane domains by TMHMM is

classified as a membrane protein. However, if there is

only one transmembrane domain predicted and that is

located within the N-terminal 70 amino acids, and

also a signal peptide is predicted by SignalP 4.0, this

protein is not counted as a membrane protein.
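The membrane-protein rule above can be sketched as a small function. This is a minimal illustration, not the authors' pipeline: the function name, the `(start, end)` tuple representation of TMHMM helices, and the reading of "within the N-terminal 70 amino acids" as "the helix ends at or before residue 70" are all assumptions.

```python
from typing import List, Tuple


def is_membrane_protein(
    tm_helices: List[Tuple[int, int]],
    has_signal_peptide: bool,
    n_term_window: int = 70,
) -> bool:
    """Apply the membrane-protein rule to parsed predictor output.

    tm_helices: hypothetical list of (start, end) residue positions (1-based)
        for each transmembrane helix predicted by TMHMM.
    has_signal_peptide: True if SignalP 4.0 predicts a signal peptide.
    n_term_window: size of the N-terminal window (70 residues in the text).
    """
    if not tm_helices:
        # No predicted transmembrane domain: not a membrane protein.
        return False
    if len(tm_helices) == 1 and has_signal_peptide:
        _start, end = tm_helices[0]
        # Assumption: "located within" means the whole helix lies in the window.
        if end <= n_term_window:
            # A single N-terminal helix plus a signal peptide is treated as
            # the signal peptide itself, not a true transmembrane domain.
            return False
    return True
```

The exclusion reflects the fact that the hydrophobic core of a signal peptide can be reported as a transmembrane helix, so a lone N-terminal helix is discounted only when a signal peptide is independently predicted.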

Chloroplast proteins: A protein predicted as “C” (for

chloroplast) for subcellular location by TargetP is

classified as a chloroplast protein. If it is also

classified as a membrane protein, then it is further

classified as a chloroplast membrane protein.

Mitochondrial proteins: A protein predicted as “M”

(for mitochondrial) for subcellular location by TargetP

is classified as a mitochondrial protein. If it is also

classified as a membrane protein, then it is further

classified as a mitochondrial membrane protein.
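The chloroplast and mitochondrial rules share one shape, which can be sketched as a lookup on the TargetP location code. The function name and the returned label strings are hypothetical; only the codes "C" and "M" come from the text.

```python
from typing import Optional


def classify_organelle(targetp_location: str, is_membrane: bool) -> Optional[str]:
    """Map a TargetP location code to an organelle class.

    targetp_location: the location code reported by TargetP ("C" or "M"
        are handled here; any other code returns None).
    is_membrane: result of the separate membrane-protein classification.
    """
    organelles = {"C": "chloroplast", "M": "mitochondrial"}
    organelle = organelles.get(targetp_location)
    if organelle is None:
        return None
    # A membrane protein targeted to the organelle gets the refined label.
    if is_membrane:
        return f"{organelle} membrane protein"
    return f"{organelle} protein"
```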

ER proteins: Proteins predicted to contain a signal

peptide by SignalP 4.0 and an ER target signal

(Prosite: PS00014) by PS-Scan are treated as

luminal ER proteins.
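As a rough illustration of the ER rule, the sketch below combines a signal-peptide flag with a C-terminal motif check. It does not call PS-Scan. The regular expression is a hand transcription of the PS00014 pattern as `[KRHQSA]-[DENQ]-E-L` anchored at the C-terminus, which is an assumption and should be checked against the current Prosite entry. The function name is hypothetical.

```python
import re

# Assumed transcription of Prosite PS00014 (ER targeting sequence):
# [KRHQSA]-[DENQ]-E-L at the C-terminal end of the protein.
ER_TARGET_RE = re.compile(r"[KRHQSA][DENQ]EL$")


def is_luminal_er_protein(sequence: str, has_signal_peptide: bool) -> bool:
    """Return True if the protein has a signal peptide and an ER target motif.

    sequence: one-letter amino acid sequence; a trailing "*" (stop symbol)
        and surrounding whitespace are ignored.
    has_signal_peptide: True if SignalP 4.0 predicts a signal peptide.
    """
    seq = sequence.strip().upper().rstrip("*")
    return has_signal_peptide and ER_TARGET_RE.search(seq) is not None
```

For example, a sequence ending in KDEL or HDEL passes the motif check, but it is only classified as a luminal ER protein when a signal peptide is also predicted.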

Complete secretomes: A secretome is all secreted

proteins from a species. Only proteins that are

predicted to have a secretory signal peptide by all

three predictors - SignalP 4.0, Phobius, and TargetP -

and that are not classified as any of the above

categories are included in the secretome. However,

proteins that are not classified as any of the above

categories and are predicted to have a signal peptide

by one or two of the predictors are assigned as

“weakly likely secreted” or “likely secreted”, respectively, because our

previous evaluation revealed that a signal peptide in

some annotated secreted proteins can only be detected

by one or two predictors (Lum and Min, 2011a).

Requiring agreement among all three predictors

increases the specificity of secretome prediction and

thereby improves prediction accuracy (Min, 2010;

Melhem et al., 2013).

All manually curated secreted and extracellular

proteins are included in the complete secretomes.
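The secretome rule amounts to counting predictor votes, which can be sketched as follows. The function name, the parameter names, and the label "secreted" for the three-predictor case are hypothetical. Mapping two votes to "likely secreted" and one vote to "weakly likely secreted" is an assumption based on the order in which the text lists them.

```python
from typing import Optional


def classify_secretion(
    signalp: bool,
    phobius: bool,
    targetp: bool,
    in_other_category: bool = False,
    curated_secreted: bool = False,
) -> Optional[str]:
    """Assign a secretion label from three signal-peptide predictions.

    signalp, phobius, targetp: True if that tool predicts a secretory
        signal peptide.
    in_other_category: True if the protein was already classified as a
        membrane, chloroplast, mitochondrial, or ER protein.
    curated_secreted: True if the protein is manually curated as secreted
        or extracellular.
    """
    if curated_secreted:
        # Curated secreted proteins are always part of the complete secretome.
        return "secreted"
    if in_other_category:
        return None
    votes = sum(bool(v) for v in (signalp, phobius, targetp))
    labels = {
        3: "secreted",                 # all three predictors agree
        2: "likely secreted",          # assumption: two of three
        1: "weakly likely secreted",   # assumption: one of three
    }
    return labels.get(votes)
```

Checking the curated flag first follows the statement that all manually curated secreted and extracellular proteins are included, whatever the predictors report.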

Curated secreted proteins: This category includes

“reviewed” UniProtKB/Swiss-Prot entries whose

subcellular location is annotated as “secreted”,

“extracellular”, or “cell wall”. It also includes secreted

proteins manually collected from recent literature by

our curators.

GPI-anchored proteins: Signal peptide-containing

proteins that were predicted to have a GPI anchor by

FragAnchor were further classified as GPI-anchored

proteins. Protein sequences predicted to have a signal

peptide and a GPI anchor may attach to the outer

leaflet of the plasma membrane or be secreted,

becoming components of the cell wall. These proteins

are involved in signaling, adhesion, stress response,

and cell wall remodeling or play other roles in growth

and development (Borner et al., 2002; Borner et al.,

2003; Gillmor et al., 2005; Simpson et al., 2009).

Proteins in other subcellular locations: Other

subcellular locations, including cytosol (cytoplasm),

cytoskeleton, Golgi apparatus, lysosome, nucleus,

peroxisome, plasma membrane, and vacuole, were

predicted by WoLF PSORT.

1.3 Computational prediction accuracies of protein subcellular locations

The prediction methods we used above were

developed based on our previous evaluation of

computational tools (Min, 2010; Meinken and Min,

2012; Melhem et al., 2013). To estimate the prediction

accuracies of our methods for each subcellular

location we used two datasets (Table 1). Dataset A

consists of 15 028 proteins. This dataset contains

proteins from the UniProtKB/Swiss-Prot dataset with

a curated subcellular location. Proteins having

multiple subcellular locations or labeled as “fragment”

were excluded. Dataset B consists of 6 908 proteins

and was generated from Dataset A by excluding

entries whose subcellular location annotation contains

the term “by similarity”, “probable”, or “predicted”.

Compared with other methods using a single tool, our

method - i.e. using a combination of multiple tools
