Computational Molecular Biology 2016, Vol.6, No.4, 1-12
nucleus, peroxisome, plasma membrane, and vacuole are shown in Table 1c. Proteins for the cytoplasm subset
also include cytosol as these two terms are used interchangeably in the UniProtKB annotation. The annotated
cytoskeleton entries are also annotated as cytoplasm in UniProtKB. However, in our evaluation cytoskeleton
proteins were not counted in the subset of cytoplasm. Note also that plasma membrane proteins are annotated as
“cell membrane” in UniProtKB, so cell membrane proteins were retrieved to evaluate the plasma membrane
category. The prediction accuracies for these subcellular locations varied significantly. Predictions for proteins
located in the cytoplasm, cytoskeleton, and nucleus were relatively accurate, with MCC values of 0.54, 0.46, and
0.49, respectively. The specificities for cytoskeleton, ER, and peroxisome predictions were high (> 98%), but the
sensitivities were low (< 50%). No positives were predicted for proteins localized in the Golgi, lysosome, or
vacuole. These results show a need to train the predictors with protist-specific proteins for protist protein
subcellular location prediction.
3.2 Overview of subcellular proteome distribution in different species
ProtSecKB contains a total of 1.97 million protein sequences from 7,024 protist species. These include 101 unique
species, some with multiple strains, totaling 127 organisms with complete proteomes. The main categories of
subcellular proteomes (highly likely secreted and likely secreted, cytoplasm, plasma membrane, mitochondrial, and
nuclear proteins) for species with complete proteomes are summarized in Table 2. Curated secreted proteins, ER
proteins, etc. are not included but can be obtained from the website mentioned above in the Data section
(Supplementary Table 1). Few proteins in protist species have curated subcellular locations. The curated secreted
proteins were mainly from D. discoideum, with 113 proteins. D. discoideum is a soil-living amoeba belonging to
the phylum Amoebozoa and commonly referred to as a slime mold (Bakthavatsalam and Gomer, 2010). We also
curated 29 secreted proteins in P. falciparum, a protozoan parasite causing malaria in humans (Singh et al., 2009;
Soni et al., 2016).
Species in the kingdom Protista have highly variable proteome sizes, from about 5,000 proteins in P. falciparum
to over 50,000 in Trypanosoma cruzi, a parasitic euglenoid protozoan causing Chagas' disease in humans (Bern et
al., 2011) (Table 2). The distribution of subcellular proteomes varied tremendously among species, with the
nucleus, cytoplasm, and mitochondria representing the larger subcellular compartments. Between 14.4% and 77.0%
of proteins were located in the nucleus, 7.4% to 40.3% in mitochondria, 4.6% to 35.4% in the cytoplasm, and 0.8%
to 15.2% were secreted. On average across all protist species with complete proteomes, approximately 44% of
proteins were located in the nuclear compartment, 22% in mitochondria, 17% in the cytoplasm, and 6% were
secreted outside the plasma membrane of the cell (Table 2).
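The per-species percentages and cross-species averages summarized above can be reproduced from raw counts along the following lines. This is a sketch under stated assumptions: the species labels and counts are hypothetical placeholders, and the average is taken as an unweighted mean of per-species percentages, which may or may not match how the averages in Table 2 were derived.

```python
from statistics import mean


def compartment_percentages(counts, proteome_size):
    """Convert per-compartment protein counts into percentages of a proteome."""
    return {comp: 100.0 * n / proteome_size for comp, n in counts.items()}


def average_across_species(per_species):
    """Unweighted mean of each compartment's percentage across species."""
    compartments = sorted({c for pct in per_species.values() for c in pct})
    return {
        c: mean(pct.get(c, 0.0) for pct in per_species.values())
        for c in compartments
    }


if __name__ == "__main__":
    # Hypothetical toy data: (proteome size, counts per compartment).
    toy = {
        "species_a": (1000, {"nucleus": 500, "mitochondria": 200,
                             "cytoplasm": 150, "secreted": 50}),
        "species_b": (2000, {"nucleus": 800, "mitochondria": 500,
                             "cytoplasm": 400, "secreted": 100}),
    }
    per_species = {
        name: compartment_percentages(counts, size)
        for name, (size, counts) in toy.items()
    }
    for comp, pct in average_across_species(per_species).items():
        print(f"{comp}: {pct:.1f}%")
```

An unweighted mean gives every organism equal influence; weighting by proteome size would instead let large proteomes such as that of T. cruzi dominate the average.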
3.3 Comparative protein family analysis of protist secretomes
A complete comparative evolutionary analysis of protist secretomes or other sub-proteomes was beyond the scope
of this study. Because complete secretome and other sub-proteome sequences can be downloaded directly from our
database, researchers can carry out more detailed comparative studies of these sub-proteomes in the species of
interest to them. We did, however, perform an rpsBLAST search against the Pfam database for all curated secreted,
highly likely secreted, and likely secreted proteins (Supplementary Table 2). Here we include only the Pfams of
the secretomes from the highly likely secreted and curated secreted protein sets of three species to demonstrate
the functional diversity of secreted proteins in protists (Table 3).
The three species were D. discoideum, a soil-living amoeba; P. infestans, a plant pathogen; and T. cruzi, a human
parasite. D. discoideum had 832 secreted proteins, 388 of which were assigned to a Pfam; P. infestans had 1,748
secreted proteins, 583 of which were assigned to a Pfam; and T. cruzi had 4,122 secreted proteins, 1,599 of which
were assigned to a Pfam (Table 3). Protein families with at least 6 members are listed in Table 3, and the
complete data can be downloaded (Supplementary Table 3). Among the protist species, not only the total number of
secreted proteins but also the categories of protein families and the number of members in each family differed
vastly (Table 3).
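As a rough illustration of the summary above, the sketch below computes the share of secreted proteins assigned to a Pfam for the three species, using the counts quoted in the text, and shows one way to keep only families with at least 6 members. The function names, the `(protein, family)` input layout, and the family labels `FAM_A` and `FAM_B` are hypothetical; the actual tabulation behind Table 3 may differ.

```python
from collections import Counter

# Counts quoted in the text: (secreted proteins, secreted proteins with a Pfam).
SECRETOME_COUNTS = {
    "D. discoideum": (832, 388),
    "P. infestans": (1748, 583),
    "T. cruzi": (4122, 1599),
}


def pfam_coverage(counts):
    """Percentage of secreted proteins assigned to a Pfam, per species."""
    return {
        species: 100.0 * with_pfam / total
        for species, (total, with_pfam) in counts.items()
    }


def families_with_min_members(assignments, min_members=6):
    """Keep families with at least `min_members` proteins.

    `assignments` is an iterable of (protein_id, family_id) pairs; this
    input layout is an assumption made for illustration.
    """
    family_sizes = Counter(family for _protein, family in assignments)
    return {fam: n for fam, n in family_sizes.items() if n >= min_members}


if __name__ == "__main__":
    for species, pct in pfam_coverage(SECRETOME_COUNTS).items():
        print(f"{species}: {pct:.1f}% of secreted proteins with a Pfam")

    # Hypothetical assignments: 7 proteins in FAM_A, 3 in FAM_B.
    toy = [(f"prot{i}", "FAM_A") for i in range(7)]
    toy += [(f"prot{i}", "FAM_B") for i in range(7, 10)]
    print(families_with_min_members(toy))
```

By these counts, fewer than half of the secreted proteins in each of the three species carry a Pfam assignment, so the family distribution describes only the annotated portion of each secretome.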