In the input form, select the button that matches the source organism of your sequence and paste the sequence into the field. Click the "Submit" button to obtain the PSORT output. The first part summarizes your input for confirmation. The rest reports the analysis of various sequence features related to protein sorting signals; the calculations are divided into two reasoning steps. Finally, the conclusive prediction is given: the five most probable localization sites with their certainty factors.
The form-fill feature is used to specify input information. Thus, a WWW browser that supports this feature, such as NCSA Mosaic 2.x, is needed.
Select one of the radio buttons to specify the source origin of the input sequence. This selection determines the candidate localization-sites for prediction as listed below:
If specified, the sequence ID will be embedded in the result. Only the first
word of the input is used, and there is no restriction on its format. The default ID is
"MYSEQ".
Enter the sequence here by typing it directly or by using the copy & paste feature of your
window system. Characters other than the standard one-letter codes for the 20 amino
acids (e.g., spaces, digits, and carriage returns) are removed by the
system. Lower-case letters are converted to upper case.
The input sequence is expected to be a direct translation of the
genetic information and to contain all the information needed for sorting. Thus, a
warning message is issued if it starts with an amino acid other than M
(methionine).
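The clean-up described above can be sketched in a few lines of Python. This is only an illustration of the stated behaviour, not the code PSORT itself runs; the function name `clean_sequence` and the warning text are assumptions.

```python
import warnings

# The 20 standard one-letter amino acid codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Upper-case the input and drop every character that is not a standard residue."""
    seq = "".join(ch for ch in raw.upper() if ch in AMINO_ACIDS)
    # A direct translation is expected to start with methionine.
    if seq and not seq.startswith("M"):
        warnings.warn("sequence does not start with M (methionine)")
    return seq
```

For instance, `clean_sequence("mk t 12\nayx")` keeps only the residues M, K, T, A, and Y.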
Sequence Field
Output for Bacterial Sequences
In the current version, programs and parameters are the same for both kinds
of bacteria. The inner membrane in Gram-negative bacteria is thought to be
equivalent to the membrane of Gram-positive bacteria. What counts as the outside in
Gram-positive bacteria is further divided into the periplasm and the outer
membrane in Gram-negative bacteria.
In Gram-negative bacteria, most periplasmic and outer membrane proteins
have a signal sequence (also called a leader peptide) in the N-terminus,
which is cleaved off after translocation across the cytoplasmic membrane.
Some of the cytoplasmic membrane proteins also have cleavable signal
sequences but some N-terminal signal sequences in the cytoplasmic
membrane proteins are not cleaved off, remaining as transmembrane
segments.
PSORT first predicts the presence of signal sequences by McGeoch's
method (D. J. McGeoch, Virus Research, 3, 271, 1985) as modified by
Nakai and Kanehisa, 1991.
It considers the N-terminal positively charged region (CR)
and the central hydrophobic region (UR) of signal sequences. A discriminant
score is calculated from three values: the length of UR, the peak value of UR, and
the net charge of CR. These results are summarized in "McG". A large positive
discriminant score indicates a high probability that the protein has a signal
sequence, whether or not it is cleaved off.
Next, PSORT applies von Heijne's method of signal sequence
recognition (G. von Heijne, Nucl. Acids Res., 14, 4683, 1986). It is a
weight-matrix method that incorporates information on the consensus pattern
around the cleavage sites (the (-3,-1)-rule), and thus it can be used to detect
uncleavable signal sequences. The output score of this "GvH" is the original
weight-matrix score (for prokaryotes) minus 7.5. A large positive
output indicates a high probability that the protein has a cleavable signal
sequence. The position of the possible cleavage site, i.e., the most C-terminal
position of the signal sequence, is also reported.
In general, hydrophobic transmembrane segments exist in the cytoplasmic
membrane proteins only. Thus, these segments can be regarded as the
sorting signal into the cytoplasmic membrane.
PSORT employs Klein et al.'s method ("ALOM", also called KKD) to
detect potential transmembrane segments (P. Klein, M. Kanehisa, and C.
DeLisi, Biochim. Biophys. Acta, 815, 468, 1985). It attempts to identify the
most probable transmembrane segment, if any, from the average hydrophobicity
values of 17-residue segments. It predicts whether the segment is a
transmembrane segment (INTEGRAL) or not (PERIPHERAL) by comparing
the discriminant score (reported as 'value') with a threshold parameter
('threshold'), which is predefined as 0.0 for bacteria. For an integral membrane
protein, the position(s) of the transmembrane segment(s) are also reported. Their
length is fixed at 17, but their extension, i.e., the maximal range that satisfies the
discriminant criterion, is also given in parentheses. The discrimination step
described above is repeated, leaving out each detected segment, until no
predicted transmembrane segment remains. The item 'count' is the number of
predicted transmembrane segments.
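The iterative search over 17-residue windows can be illustrated roughly as follows. This is a simplified sketch, not ALOM itself: it uses the Kyte-Doolittle hydropathy scale and a hypothetical mean-hydropathy cut-off (`cutoff=1.6`) in place of ALOM's discriminant function and threshold, neither of which is given in this document. The function name `find_tm_segments` is also an assumption.

```python
# Kyte-Doolittle hydropathy values; an assumption used only for illustration.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
WINDOW = 17


def find_tm_segments(seq, cutoff=1.6):
    """Repeatedly pick the most hydrophobic 17-residue window above `cutoff`.

    `cutoff` is a hypothetical mean-hydropathy threshold, not ALOM's
    discriminant threshold. Returns sorted 1-based (start, end) pairs.
    """
    free = [True] * len(seq)  # residues not yet assigned to a segment
    segments = []
    while True:
        best_start, best_mean = None, cutoff
        for start in range(len(seq) - WINDOW + 1):
            if not all(free[start:start + WINDOW]):
                continue  # window overlaps a segment that was already found
            mean = sum(KD[aa] for aa in seq[start:start + WINDOW]) / WINDOW
            if mean > best_mean:
                best_start, best_mean = start, mean
        if best_start is None:
            break  # no remaining window passes the cut-off
        segments.append((best_start + 1, best_start + WINDOW))
        for i in range(best_start, best_start + WINDOW):
            free[i] = False
    return sorted(segments)
```

The length of the returned list plays the role of the 'count' item; an empty list corresponds to a PERIPHERAL call in this simplified picture.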
The signal sequences of lipoproteins, i.e., proteins with a covalently attached
lipid molecule at their mature N-terminus, are essentially the same as those
of ordinary proteins except for the region around their cleavage sites. Thus, they
can be recognized by combining McGeoch's method with the
consensus motif around the cleavage site formulated by von Heijne (G. von
Heijne, Protein Eng., 2, 531, 1989). The program is named "Lipop" here.
For a probable lipoprotein, it gives the possible modification site near the end
of the preceding CR region defined in McGeoch's method; otherwise, it
returns a dummy modification site, -1.
Since the N-terminal lipid moieties of lipoproteins are thought to be
integrated into membranes, lipoproteins are predicted to be membrane-associated
proteins. Further discrimination between the cytoplasmic membrane and the
outer membrane is based on the experiment of Yamaguchi et
al. (K. Yamaguchi, F. Yu, and M. Inoue, Cell, 53, 423, 1988): if a lipoprotein
has a negatively charged residue at the second or third position of the mature
part, it is sorted to the inner membrane; otherwise, it is sorted to the outer
membrane.
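The sorting rule for lipoproteins is simple enough to write down directly. The sketch below assumes the mature sequence (after signal cleavage) is already known and that "negatively charged" means D or E; the function name is hypothetical.

```python
def lipoprotein_membrane(mature):
    """Classify a predicted lipoprotein from its mature (post-cleavage) sequence."""
    # Positions 2 and 3 of the mature part are indices 1 and 2 (0-based).
    if any(aa in "DE" for aa in mature[1:3]):
        return "inner membrane"
    return "outer membrane"
```

With this rule, a mature sequence beginning `CDSG` is assigned to the inner membrane, and one beginning `CSSN` to the outer membrane.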
Although outer membrane proteins are integrated into the membrane, they
do not have any hydrophobic segments which characterize usual integral
membrane proteins. This is probably because their membrane-spanning parts
consist of β-strands. In addition, the sorting signal which discriminates
outer membrane proteins from periplasmic proteins is not well
characterized. Therefore, taking the N-terminal signal sequence into account,
PSORT uses the amino acid composition of the predicted mature portion for
their discrimination (Nakai and Kanehisa, 1991). That is, a
discriminant score is calculated from a linear combination of the
percentages of 10 amino acids. A large positive value indicates a tendency to
be an outer membrane protein.
In this version of PSORT, the parameters for analyzing yeast or plant sequences
are almost the same as those for animal sequences. Yeast and plants
have the vacuole as a candidate site instead of the lysosome in animals. In yeast, the
consensus sequence for ER-lumen retention is HDEL rather than KDEL as in
the others. Lastly, plants have chloroplasts (stroma etc.) as extra candidates.
In eukaryotes, proteins sorted through the so-called vesicular pathway (bulk
flow) usually have a signal sequence (also called a leader peptide) at the
N-terminus, which is cleaved off after translocation across the ER
membrane. Some N-terminal signal sequences are not cleaved off and
remain as transmembrane segments, but this does not mean that these proteins
are retained in the ER; they can be sorted further, carried in vesicles.
PSORT first predicts the presence of signal sequences by McGeoch's
method (D. J. McGeoch, Virus Research, 3, 271, 1985) as modified by
Nakai and Kanehisa, 1991.
It considers the N-terminal positively charged region (CR)
and the central hydrophobic region (UR) of signal sequences. A discriminant
score is calculated from three values: the length of UR, the peak value of UR, and
the net charge of CR. These results are summarized in "McG". A large positive
discriminant score indicates a high probability that the protein has a signal
sequence, whether or not it is cleaved off.
Next, PSORT applies von Heijne's method of signal sequence
recognition (G. von Heijne, Nucl. Acids Res., 14, 4683, 1986). It is a
weight-matrix method that incorporates information on the consensus pattern
around the cleavage sites (the (-3,-1)-rule), and thus it can be used to detect
uncleavable signal sequences. The output score of this "GvH" is the original
weight-matrix score (for eukaryotes) minus 3.5. A large positive output
indicates a high probability that the protein has a cleavable signal sequence.
The position of the possible cleavage site, i.e., the most C-terminal position of
the signal sequence, is also reported.
The current version of PSORT assumes that all integral membrane proteins
have hydrophobic transmembrane segment(s) which are thought to be alpha-
helices in membranes.
PSORT employs Klein et al.'s method ("ALOM", also called KKD) to
detect potential transmembrane segments (P. Klein, M. Kanehisa, and C.
DeLisi, Biochim. Biophys. Acta, 815, 468, 1985). It attempts to identify the
most probable transmembrane segment, if any, from the average hydrophobicity
values of 17-residue segments. It predicts whether the segment is a
transmembrane segment (INTEGRAL) or not (PERIPHERAL) by comparing
the discriminant score (reported as 'value') with a threshold parameter
('threshold'), which is predefined as 0.0 for bacteria. For an integral membrane
protein, the position(s) of the transmembrane segment(s) are also reported. Their
length is fixed at 17, but their extension, i.e., the maximal range that satisfies the
discriminant criterion, is also given in parentheses. The discrimination step
described above is repeated, leaving out each detected segment, until no
predicted transmembrane segment remains. The item 'count' is the number of
predicted transmembrane segments.
However, the ALOM program, although ranked as one of the best
methods in evaluations, is not sufficient to predict the exact number of
transmembrane segments of polytopic, i.e., multiple membrane-spanning,
proteins. Thus, we use two threshold values for more precise prediction of
eukaryotic membrane proteins: when a protein is predicted to be polytopic, a
less stringent value is employed so that a more realistic number
of transmembrane segments is predicted. It seems probable that, once a protein
is integrated into the membrane, its less hydrophobic segments are also integrated.
Membrane proteins integrate into the membrane in a specific orientation with
respect to its two sides (cytoplasmic or exo-cytoplasmic), which is called
membrane topology. We use Singer's classification of membrane topology
(S. J. Singer, Ann. Rev. Cell Biol., 6, 247, 1990). Prediction of membrane
topology is important because some sorting signals reside at specific positions
in specific topologies, e.g., in the cytoplasmic tail (see below).
PSORT uses Hartmann et al.'s method (E. Hartmann, T. A. Rapoport,
and H. F. Lodish, Proc. Natl. Acad. Sci. USA, 86, 5786, 1989; called "MTOP"
here) for the prediction of membrane topology. It assumes that the
overall topology is determined by the difference in net charge between the two
15-residue regions flanking the most N-terminal transmembrane segment. In the
output, 'I(middle)' means the central position of the most N-terminal
segment.
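The charge-difference idea behind MTOP can be sketched as follows. Several details are assumptions made for the example and are not stated in this document: K and R count as +1, D and E as -1, histidine and the chain termini are ignored, and the difference is taken as the C-terminal flank minus the N-terminal flank. The names are hypothetical.

```python
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}  # assumed charge assignment
FLANK = 15  # residues considered on each side of the segment


def net_charge(segment):
    """Sum of the assumed residue charges in a stretch of sequence."""
    return sum(CHARGE.get(aa, 0) for aa in segment)


def flank_charge_difference(seq, tm_start, tm_end):
    """Net charge of the C-terminal flank minus that of the N-terminal flank.

    tm_start and tm_end are the 1-based, inclusive positions of the most
    N-terminal transmembrane segment. Flanks are truncated at the chain ends.
    """
    n_flank = seq[max(0, tm_start - 1 - FLANK):tm_start - 1]
    c_flank = seq[tm_end:tm_end + FLANK]
    return net_charge(c_flank) - net_charge(n_flank)
```

A strongly negative value under this sign convention means that the N-terminal flank carries more positive charge than the C-terminal one; how the sign maps onto Singer's topology classes depends on the threshold that PSORT actually uses, which is not reproduced here.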
Since the N-terminal transmembrane segments of type Ib proteins
were often wrongly predicted to be cleaved off by von Heijne's method, we
introduced the hypothesis that if the charge difference of the most N-terminal
transmembrane segment is reversed to that of usual ER-transferons, it is not
cleaved. Since some cleavable ER-transferons had a reversed charge
difference, we had to change the originally reported threshold value. PSORT
also uses a heuristic that transmembrane segments of many type II proteins
reside apart from the N-terminus to some degree.
In addition, there seems to be a preference of membrane topology in
each localization site. For example, type Ib proteins are favored at the ER
while type II tend towards the Golgi complex and the plasma membrane.
PSORT uses such empirical knowledge for prediction.
In mitochondria, many proteins are sorted through a 'conservative' pathway
while others are sorted through 'nonconservative' pathways from the
cytoplasm. The proteins sorted through the former have mitochondrial
matrix targeting signals in their N-terminus. On the contrary, sequence
features of protein sorting signals with 'nonconservative' pathways are
hardly recognizable.
PSORT employs a simple method to recognize mitochondrial targeting
signals using the discriminant analysis from values of partial amino acid
composition
(Nakai and Kanehisa, 1992).
For example, the arginine content
turned out to be effective for prediction. PSORT also reports some consensus
sequence patterns around cleavage sites (the item "Gavel" from Y. Gavel and
G. von Heijne, Prot. Eng., 4, 33, 1990). However, the result is not used in our
prediction.
Proteins targeted to the mitochondrial intermembrane space via the
'conservative' pathway have an N-terminal signal of bipartite structure: its
N-terminal half appears to be essentially a mitochondrial targeting signal
and its C-terminal half is the signal for the translocation from the matrix to
the intermembrane space. PSORT recognizes the N-terminal halves by the
above-mentioned discriminant analysis. As for the C-terminal halves,
PSORT uses an original method for the detection of apolar segments
("APOLAR").
Since only a few mitochondrial outer membrane proteins have been
sequenced, their prediction cannot be expected to be generally applicable. Many
proteins localized at the mitochondrial inner membrane are likely to be
peripheral membrane proteins which exist as members of large membrane
complexes and their degree of hydrophobicity is relatively low compared with
membrane proteins in the vesicular pathway. Thus, although PSORT uses
the ALOM program for detecting them, it awaits further improvement.
Although it seems possible that a protein without its own nuclear targeting
signal enters the nucleus via cotransport with a protein that has one, many
nuclear proteins have their own targeting signals. Their most common type
is that of SV40 large T antigen. PSORT uses the following two rules to detect
it: (1) a 4-residue pattern composed of basic amino acids (K or R), or of
three basic amino acids (K or R) and either H or P; (2) a pattern starting with P and
followed within 3 residues by a basic segment containing 3 K or R residues
out of 4 residues.
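The two rules can be written as window scans. The sketch below makes two interpretive assumptions that the text leaves open: "within 3 residues" is read as the basic segment starting 1 to 3 positions after the P, and "3 K or R residues out of 4" is read as at least 3. The function names are hypothetical.

```python
def is_basic(aa):
    return aa in "KR"


def four_residue_hits(seq):
    """1-based starts of 4-residue windows matching the first rule."""
    hits = []
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        n_basic = sum(is_basic(aa) for aa in window)
        n_hp = sum(aa in "HP" for aa in window)
        # Either four basic residues, or three basic residues plus one H or P.
        if n_basic == 4 or (n_basic == 3 and n_hp == 1):
            hits.append(i + 1)
    return hits


def proline_rule_hits(seq, max_offset=3):
    """1-based positions of P residues matching the second rule."""
    hits = []
    for i, aa in enumerate(seq):
        if aa != "P":
            continue
        for offset in range(1, max_offset + 1):
            window = seq[i + offset:i + offset + 4]
            if len(window) == 4 and sum(is_basic(x) for x in window) >= 3:
                hits.append(i + 1)
                break  # report each P at most once
    return hits
```

Applied to the SV40 large T antigen signal `PKKKRKV`, both scans report hits starting at the first residue.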
Another type of nuclear targeting signal is the type of Xenopus
nucleoplasmin proposed by Robbins et al. (J. Robbins, S. M. Dilworth, R. A.
Laskey, and C. Dingwall, Cell, 64, 615, 1991). The pattern is: 2 basic residues,
10 residue spacer, and another basic region consisting of at least 3 basic
residues out of 5 residues.
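The bipartite pattern spans 17 residues (2 + 10 + 5) and can be checked with a sliding window. This is a minimal sketch assuming that "basic" means K or R, as in the rules for the SV40-type signal; the function name is hypothetical.

```python
def bipartite_hits(seq):
    """1-based starts of the pattern: 2 basic, 10-residue spacer, >= 3 basic out of 5."""
    hits = []
    for i in range(len(seq) - 16):  # the whole pattern covers 17 residues
        head = seq[i:i + 2]
        tail = seq[i + 12:i + 17]
        if all(aa in "KR" for aa in head) and sum(aa in "KR" for aa in tail) >= 3:
            hits.append(i + 1)
    return hits
```

For the test string `AAKRPAATKKAGQAKKKKLD`, the scan reports a single hit starting at position 3.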
PSORT uses a heuristic that nuclear proteins are generally rich in
basic residues: if the sum of the K and R compositions is higher than 20%, then
the protein is considered more likely to be nuclear than
cytoplasmic. In addition, it examines the presence of the RNP
(ribonucleoprotein) consensus motif because some RNPs are transported to
the nucleus by signals existing in the bound RNAs. However, this is apparently
insufficient for actual prediction.
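The basic-residue heuristic amounts to a single composition check. A minimal sketch, with hypothetical function names and the 20% cut-off taken from the text:

```python
def basic_fraction(seq):
    """Fraction of residues that are K or R (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("K") + seq.count("R")) / len(seq)


def favours_nuclear(seq, cutoff=0.20):
    """True if the K + R content exceeds the cut-off."""
    return basic_fraction(seq) > cutoff
```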
Note that we classify ribosomal proteins as nuclear proteins because
they have nuclear targeting signals and are once transported into the
nucleus.
Peroxisomes, sometimes called glyoxysomes, glycosomes, or microbodies,
are organelles found in almost every eukaryotic cell. As a sorting signal, the
importance of the C-terminal three residues, (S/A(/C))(K/R/H)L, has been
indicated (the SKL motif). However, since many peroxisomal proteins do not
have this motif at the appropriate position, PSORT uses a heuristic that the
presence of this motif at other positions also implicates the peroxisomal
localization.
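The motif search can be expressed with a regular expression. The sketch below reports separately whether the motif sits at the C-terminus and where it occurs internally; how PSORT weighs the two cases is not described here, and the function name is hypothetical.

```python
import re

# (S/A/C)(K/R/H)L, the SKL-type motif.
SKL = re.compile(r"[SAC][KRH]L")


def skl_motifs(seq):
    """Return (c_terminal, internal_positions) for the SKL-type motif.

    internal_positions are 1-based starts of occurrences that are not
    the C-terminal three residues.
    """
    c_terminal = SKL.fullmatch(seq[-3:]) is not None
    last_start = len(seq) - 3
    internal = [m.start() + 1 for m in SKL.finditer(seq) if m.start() < last_start]
    return c_terminal, internal
```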
Although some peroxisomal proteins have N-terminal presequences
which are cleaved off after translocation, it is not clear whether they are
sorting signals. According to our preliminary analysis, the amino acid
composition of the N-terminal 20 residues was not very effective as a variable
for discriminant analysis. Instead, the amino acid composition of the entire
sequence is used as supplemental information for prediction.
The sorting signal of peroxisomal membrane proteins is not known.
Our training data of peroxisomal proteins contained a 70 K membrane
protein. It was unclear whether our rule could also be applied to this protein,
but it had three internal SKL motifs and was positive with the discriminant
score although this protein was not included in the derivation of the function.
Proteins targeted to chloroplasts have cleavable signals in the N-terminus,
the chloroplast (stroma) targeting signals. PSORT postulates that all stromal
proteins and thylakoid membrane proteins have this kind of signal. It uses a
discriminant score calculated from partial amino acid compositions
(positions 3-10 and 1-30) and from the amplitude of maximum hydrophobic
moment of 165 degrees (potential beta-structure) for residues 25 to 70
(Nakai and Kanehisa, 1992).
The form of the discriminant function reflects the abundance of
alanine and serine residues in the N-terminal 30 residues. In addition, the
observation that the second residue is often alanine is also used.
Like some mitochondrial proteins, proteins of chloroplast thylakoid
lumen have a bipartite signal in their N-terminus. Its N-terminal half is
essentially the same as a stroma targeting signal and the C-terminal half is
used for the translocation from the stroma to the thylakoid lumen. To
detect the latter signal, PSORT uses both the result of the
APOLAR algorithm applied to the limited region of residues 40 to 90 and a
weight matrix score around the cleavage sites (C. J. Howe and T. P. Wallace,
Nucl. Acids Res., 18, 3417, 1990).
Thylakoid membrane proteins were discriminated by ALOM. The
remainder of chloroplast proteins are tentatively regarded as stromal
proteins.
PSORT postulates that the proteins with N-terminal signal sequence will be
transported to the cell surface by default unless they have any other signals
for specific retention or commitment; a luminal protein will be secreted
constitutively to the extracellular space and a membrane protein will reside
at the plasma membrane.
The retention signal of ER luminal proteins from the bulk flow is the
existence of the sequence motif KDEL in the C-terminus. In yeast and some
plants, the consensus motif is HDEL. Although some variations of this motif
are allowed in some organisms and cell types, they were not required for the
discrimination of our current data.
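Checking for the C-terminal retention motif is a suffix test. The sketch below is a simplification: it accepts only HDEL for yeast, both KDEL and HDEL for plants (since the text says only "some plants" use HDEL), and only KDEL for animals. The table and function name are assumptions, not PSORT's actual rule set.

```python
# Assumed motif table for illustration.
ER_RETENTION_MOTIFS = {
    "yeast": ("HDEL",),
    "plant": ("KDEL", "HDEL"),
    "animal": ("KDEL",),
}


def has_er_retention_signal(seq, organism):
    """True if the sequence ends with a retention motif accepted for `organism`."""
    return seq.endswith(ER_RETENTION_MOTIFS[organism])
```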
Compared with the KDEL motif, retention signals for ER membrane
proteins seem less evident as sequence motifs: in one analysis using
mutagenesis, two lysines positioned three and four or five residues from the
C-terminus turned out to be important in some type Ia proteins (M. R.
Jackson, T. Nilsson, and P. A. Peterson, EMBO J., 9, 3153, 1990). This is one
example of various comparton signals existing in cytoplasmic tails (see
below). However, many ER membrane proteins do not have this kind of
sequence motif. The preference of membrane topology was a rather useful
clue.
As already exemplified above, many sorting signals in the membrane
proteins have been found in cytoplasmic tails which are short terminal
segments exposed to the cytoplasm in type Ia, Ib, and II proteins (in Singer's
terminology).
In relation to the default pathway of secretion, there is a pathway for
protein internalization through coated-pit mediated endocytosis. Two
sequence motifs, NPXY and YXRF, have been identified as signals for this
rapid internalization process. PSORT uses these sequence motifs as clues
for identifying plasma membrane proteins.
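A scan for the two internalization motifs might look like the sketch below, where X is taken to mean any residue. It assumes the caller passes the region of interest (for example a cytoplasmic tail); the dictionary and function names are hypothetical.

```python
import re

# X in the motif names stands for any residue.
ENDOCYTOSIS_MOTIFS = {
    "NPXY": re.compile(r"NP.Y"),
    "YXRF": re.compile(r"Y.RF"),
}


def endocytosis_motifs(region):
    """Map each motif name to the 1-based start positions of its occurrences."""
    return {
        name: [m.start() + 1 for m in pattern.finditer(region)]
        for name, pattern in ENDOCYTOSIS_MOTIFS.items()
    }
```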
PSORT also uses a proposed consensus motif, (S/T)X(E/Q)(R/K), near
the probable transmembrane domain of all Golgi-localized
glycosyltransferases (B. Bendiak, Biochem. Biophys. Res. Comm., 170, 879, 1990) in addition to the above-mentioned heuristic on membrane topology.
The protein modification reactions which bind lipid molecules to proteins are
important because a linked lipid moiety can be integrated into various
membranes and can anchor the bound protein.
For example, myristoylations occur at the consensus sequence in the
N-terminal 9 residues. However, recent studies suggest that many of them
may not take part in the direct anchoring. Thus, PSORT does not use the
result for further reasoning although the observation will be reported.
In contrast, all proteins linked to the glycosyl-phosphatidylinositol
(GPI) molecules are thought to be anchored at the extracellular surface of the
plasma membrane. PSORT recognizes GPI-anchored proteins using the
knowledge that most of them are predicted to be type Ia membrane proteins
with a very short cytoplasmic tail (within 10 residues), and it uses the result for
the prediction of the localization site (plasma membrane) of the modified
protein.
Lastly, there is a lipid modification known as isoprenylation or
farnesylation. This modification requires a CaaX motif in the C-terminus,
where 'a' denotes an aliphatic amino acid. Isoprenylated proteins have been
found in the plasma membrane and the nuclear envelope. PSORT recognizes
isoprenylated proteins by the motif and an additional rule that they
have neither transmembrane segments nor signal sequences.
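The CaaX rule combines a C-terminal motif test with the results of the earlier analyses. In the sketch below, the set of aliphatic residues (A, V, L, I) is an assumption, since the text does not list them, and the transmembrane count and signal-sequence flag are assumed to come from the preceding steps. The names are hypothetical.

```python
ALIPHATIC = set("AVLI")  # assumption: the document does not list these residues


def has_caax(seq):
    """True if the last four residues are C, aliphatic, aliphatic, any residue."""
    tail = seq[-4:]
    return (
        len(tail) == 4
        and tail[0] == "C"
        and tail[1] in ALIPHATIC
        and tail[2] in ALIPHATIC
    )


def predicted_isoprenylated(seq, n_tm_segments, has_signal_sequence):
    """CaaX motif plus the rule of no transmembrane segment and no signal sequence."""
    return has_caax(seq) and n_tm_segments == 0 and not has_signal_sequence
```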
Lysosomes are acidic organelles that contain numerous hydrolytic enzymes.
In yeast and plant cells, similar functions are recognized in vacuoles, which
have diverse functions.
For soluble lysosomal proteins, the pathway which utilizes the post-
translational modification of mannose 6-phosphate has been clarified.
However, there are no clear consensus patterns except for the NX(S/T)
pattern necessary for N-glycosylation, likely because the modification is
conformation-dependent. Since the prediction of protein conformation is very
difficult, PSORT uses the discriminant score based on amino acid
composition.
It is likely that yeast and most plant cells share part of their sorting
mechanism. Many vacuolar proteins have signal sequences at their N-terminus and
have pro regions that are cleaved off after translocation. Nevertheless, no
common sequence features have been observed. Again, PSORT uses
amino acid composition for discrimination. The amino acid
compositions of lysosomal and vacuolar soluble proteins turned out to be totally
different.
The sorting mechanism of lysosomal membrane proteins seems
different from that of lysosomal luminal proteins. The existence of a GY
motif within 17 residues from the membrane boundary in the cytoplasmic
tails of type Ia proteins is used as a rule for discrimination.
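The GY rule can be sketched as a search in the part of the tail nearest the membrane. The example assumes that the cytoplasmic tail of a type Ia protein is the stretch following the transmembrane segment and that the whole motif must lie within the first 17 tail residues; both readings, and the function name, are assumptions.

```python
def has_gy_near_membrane(seq, tm_end, max_distance=17):
    """True if GY occurs within `max_distance` residues after the membrane boundary.

    tm_end is the 1-based position of the last residue of the
    transmembrane segment; the tail is assumed to follow it.
    """
    tail = seq[tm_end:tm_end + max_distance]
    return "GY" in tail
```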
The whole system is organized as an expert system with a knowledge-base
which is a collection of 'if-then'-type rules. An expert system is an artificial
intelligence technique in which computers are equipped with domain specific
knowledge. The core part of our system is written in the programming
language OPS83; calculations involving sequence data are written in
C and are called from rules when necessary. Currently, about 100
core rules are stored in the knowledge base.
The current version of PSORT correctly classifies 83% of the 106 Gram-negative
proteins into one of the four localization sites. However, the prediction
accuracy when applied to unknown sequences has not been estimated.
Of the 295 eukaryotic proteins used for the tuning of our system, 66%
were correctly discriminated. Moreover, of the 106 proteins selected randomly
from the localization sites including more than 10 members for testing, 59%
were correctly predicted. Many falsely predicted proteins seemed to be
transported by specific pathways.
The prediction accuracy will certainly improve as our knowledge accumulates
and is incorporated into the system.
Recognition of Signal Sequence
Recognition of Transmembrane Segments
Analysis of Lipoproteins
Analysis of Amino Acid Composition
Output for Eukaryotic Information
Yeast, Animal, or Plant
Recognition of Signal Sequence
Recognition of Transmembrane Segments
Prediction of Membrane Topology
Recognition of Mitochondrial Proteins
Recognition of Nuclear Proteins
Recognition of Peroxisomal Proteins
Recognition of Chloroplast Proteins
Recognition of ER (endoplasmic reticulum) Proteins
Analysis of Proteins in Vesicular Pathway
Lipid Anchors
Lysosomal and Vacuolar Proteins
Notes on the Knowledge-based System
OPS83
Reliability of Prediction Result
nakai@imcb.osaka-u.ac.jp