Computational Molecular Biology
            
            
            
            
              however, is available for all plant proteins. Some of
            
            
              the proteins may have more than one subcellular
            
            
              location. The following criteria are applied for
            
            
              computational classification of protein subcellular
            
            
              locations:
            
            
              Membrane proteins: A protein predicted to contain one
            
            
              or more transmembrane domains by TMHMM is
            
            
              classified as a membrane protein. However, if there is
            
            
              only one transmembrane domain predicted and that is
            
            
              located within the N-terminal 70 amino acids, and
            
            
              also a signal peptide is predicted by SignalP 4.0, this
            
            
              protein is not counted as a membrane protein.
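The membrane-protein rule above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Prediction` container and the function name are hypothetical, and it assumes the TMHMM and SignalP 4.0 outputs have already been parsed. It also assumes that "located within the N-terminal 70 amino acids" means the predicted helix ends at or before residue 70.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    """Hypothetical container for parsed tool output (not a TMHMM/SignalP API)."""
    tm_helices: List[Tuple[int, int]]  # (start, end) of each predicted TM helix, 1-based
    has_signal_peptide: bool           # signal peptide call from SignalP 4.0


def is_membrane_protein(pred: Prediction, n_term_window: int = 70) -> bool:
    """Apply the membrane-protein criterion described in the text."""
    if not pred.tm_helices:
        return False  # no transmembrane domain predicted
    if len(pred.tm_helices) == 1 and pred.has_signal_peptide:
        _start, end = pred.tm_helices[0]
        # Assumption: "within the N-terminal 70 residues" means the helix ends by residue 70.
        if end <= n_term_window:
            return False  # the single helix is treated as the signal peptide itself
    return True
```

The exception exists because the hydrophobic core of a signal peptide is often reported as a transmembrane helix, so a single N-terminal helix together with a signal peptide call is not taken as evidence of membrane integration.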
            
            
              Chloroplast proteins: A protein predicted as “C” (for
            
            
              chloroplast) for subcellular location by TargetP is
            
            
              classified as a chloroplast protein. If it is also
            
            
              classified as a membrane protein, then it is further
            
            
              classified as chloroplast membrane protein.
            
            
              Mitochondrial proteins: A protein predicted as “M”
            
            
              (for mitochondrial) for subcellular location by TargetP
            
            
              is classified as a mitochondrial protein. If it is also
            
            
              classified as a membrane protein, then it is further
            
            
              classified as mitochondrial membrane protein.
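The chloroplast and mitochondrial rules share the same shape, which the sketch below makes explicit. The function name and the returned labels are illustrative assumptions. The location codes "C" and "M" are those quoted in the text, and the membrane flag is assumed to come from the membrane-protein criterion described earlier.

```python
from typing import Optional


def organelle_class(targetp_loc: str, is_membrane: bool) -> Optional[str]:
    """Map a TargetP location code plus a membrane flag to a category label."""
    names = {"C": "chloroplast", "M": "mitochondrial"}
    base = names.get(targetp_loc)
    if base is None:
        return None  # neither chloroplast nor mitochondrial; handled by other rules
    if is_membrane:
        return f"{base} membrane protein"
    return f"{base} protein"
```

For example, `organelle_class("M", True)` yields the "mitochondrial membrane protein" label, while any code other than "C" or "M" falls through to the remaining categories.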
            
            
              ER proteins: Proteins predicted to contain a signal
            
            
              peptide by SignalP 4.0 and an ER target signal
            
            
              (Prosite: PS00014) by PS-Scan were treated as
            
            
              luminal ER proteins.
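As a rough illustration of the ER criterion, the sketch below replaces the PS-Scan step with a plain regular expression. The regex is our transcription of the PROSITE PS00014 pattern, written in PROSITE syntax as `[KRHQSA]-[DENQ]-E-L>`, and should be checked against the current PROSITE entry. The function name is hypothetical, and the signal peptide flag is assumed to come from SignalP 4.0.

```python
import re

# Assumed transcription of PROSITE PS00014; ">" in PROSITE syntax anchors the
# match at the C-terminus, which corresponds to "$" here.
ER_TARGET = re.compile(r"[KRHQSA][DENQ]EL$")


def is_luminal_er(sequence: str, has_signal_peptide: bool) -> bool:
    """Return True if the protein meets both conditions given in the text."""
    has_motif = ER_TARGET.search(sequence.strip().upper()) is not None
    return has_signal_peptide and has_motif
```

Because the pattern is anchored at the C-terminus, an internal KDEL-like tetrapeptide does not count, and a sequence carrying a trailing stop symbol such as `*` would need that symbol removed first.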
            
            
              Complete secretomes: A secretome comprises all
              proteins secreted by a species. Only proteins that are
            
            
              predicted to have a secretory signal peptide by all
            
            
              three predictors - SignalP 4.0, Phobius, and TargetP -
            
            
              and that are not classified as any of the above
            
            
              categories are included in the secretome. However,
            
            
              proteins that are not classified as any of the above
            
            
              categories and are predicted to have a signal peptide
            
            
              by one or two of the predictors are assigned as
            
            
              “weakly likely secreted” or “likely secreted” as our
            
            
              previous evaluation revealed that a signal peptide in
            
            
              some annotated secreted proteins can only be detected
            
            
              by one or two predictors (Lum and Min, 2011a).
            
            
              Using all three predictors increases the specificity of
              secretome prediction and thereby improves
              prediction accuracy (Min, 2010; Melhem et al., 2013).
            
            
              All manually curated secreted and extracellular
            
            
              proteins are included in the complete secretomes.
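The tiered secretome assignment can be sketched as a simple vote count over the three predictors. This is an illustration only. The function and parameter names are hypothetical, and the mapping of one positive predictor to "weakly likely secreted" and two to "likely secreted" is our assumption, since the text does not state which label goes with which count.

```python
from typing import Optional


def secretion_tier(signalp: bool, phobius: bool, targetp: bool,
                   in_other_category: bool, curated: bool = False) -> Optional[str]:
    """Assign a secretion label from three signal peptide predictions."""
    if curated:
        return "secretome"  # manually curated entries are always included
    if in_other_category:
        return None  # already membrane, chloroplast, mitochondrial or ER
    votes = sum([signalp, phobius, targetp])
    # Assumed mapping: 2 votes -> "likely", 1 vote -> "weakly likely".
    tiers = {3: "secretome", 2: "likely secreted", 1: "weakly likely secreted"}
    return tiers.get(votes)
```

Proteins with no positive prediction, or already assigned to one of the earlier categories, receive no secretion label unless they are curated.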
            
            
              Curated secreted proteins: This category includes
            
            
              proteins which are annotated to be “secreted” or
            
            
              “extracellular” or “cell wall” in the subcellular
            
            
              location from the UniProtKB/Swiss-Prot data set
            
            
              which are “reviewed”. It also includes manually
            
            
              collected secreted proteins from recent literature by
            
            
              our curators.
            
            
              GPI-anchored proteins: Signal peptide-containing
            
            
              proteins that were predicted to have a GPI anchor by
            
            
              FragAnchor were further classified as GPI-anchored
            
            
              proteins. Protein sequences predicted to have a signal
            
            
              peptide and a GPI anchor may attach to the outer
            
            
              leaflet of the plasma membrane or be secreted
            
            
              becoming components of the cell wall. These proteins
            
            
              are involved in signaling, adhesion, stress response,
            
            
              and cell wall remodeling or play other roles in growth
            
            
              and development (Borner et al., 2002; Borner et al.,
            
            
              2003; Gillmor et al., 2005; Simpson et al., 2009).
            
            
              Proteins in other subcellular locations: Other
            
            
              subcellular locations including cytosol (cytoplasm),
            
            
              cytoskeleton, Golgi apparatus, lysosome, nucleus,
            
            
              peroxisome, plasma membrane and vacuole were
            
            
              predicted by WoLF PSORT.
            
            
              
                1.3 Computational prediction accuracies of protein subcellular locations
              
            
            
              The prediction methods we used above were
            
            
              developed based on our previous evaluation of
            
            
              computational tools (Min, 2010; Meinken and Min,
            
            
              2012; Melhem et al., 2013). To estimate the prediction
            
            
              accuracies of our methods for each subcellular
            
            
              location we used two datasets (Table 1). Dataset A
            
            
              consists of 15 028 proteins. This dataset contains
            
            
              proteins from the UniProtKB/Swiss-Prot dataset with
            
            
              a curated subcellular location. Proteins having
            
            
              multiple subcellular locations or labeled as “fragment”
            
            
              were excluded. Dataset B consists of 6 908 proteins
              and was generated from Dataset A by excluding
            
            
              entries having a term of “by similarity” or “probable”
            
            
              or “predicted” in subcellular location annotation. In
            
            
              comparison with other methods that use a single tool, our
            
            
              method - i.e. using a combination of multiple tools
            
            