Caution:

Some of the knowledge on protein sorting described in this document is now obsolete. See the help file for PSORT II. K. Nakai

                    PSORT Users' Manual

                        Kenta Nakai
      Human Genome Center, Institute of Medical Science
                    University of Tokyo
    4-6-1 Shirokane-dai, Minato-ku, Tokyo 108-8639, Japan
                   Tel.: +81-3-5449-5619
                   Fax.: +81-3-5449-5434
                   e-mail: knakai@ims.u-tokyo.ac.jp

E-Mail Server version (date of last revision: Dec. 1, 1994)

I. INTRODUCTION -------------------------------------------

What Is PSORT?

PSORT is a computer program (expert system) for the prediction
of protein localization sites in cells.  It receives an amino
acid sequence and its source origin, e.g., Gram-negative
bacteria, as inputs.  The system then analyzes the input
sequence by applying stored rules for various sequence features
of known protein sorting signals.  Finally, it reports the
likelihood that the input protein is localized at each
candidate site, together with additional information.


Quick Start

To access the server, prepare an electronic mail message 
containing a properly formatted request such as:

     SOURCE animal
     BEGIN
     >MYSEQ
     nakainakainakainakainakainakainakainakai
     kentakentakentakenta

Send it to the following Internet address:

     psort@nibb.ac.jp

You will then receive the result by e-mail.  Its first part is
a summary of your input for confirmation.  The rest is the
result of analyzing various sequence features related to
protein sorting signals.  The calculations are divided into two
reasoning steps.  The final prediction, i.e., the five most
probable localization sites with their certainty factors, is
given at the end.


Obtaining Help

To obtain the help document on using the PSORT e-mail server,
send a mail message containing the word "help" on a single line
to the address above.  This document is then returned to you in
a mail message.

Further information can be obtained by accessing the PSORT WWW 
server.  The URL is http://psort.nibb.ac.jp.

I also welcome bug reports and comments (nakai@nibb.ac.jp).


Citation

Please cite one of the following references when you make use 
of the result of this server:

For eukaryotic data:
     Nakai, K. and Kanehisa, M.,
     A knowledge base for predicting protein localization
     sites in eukaryotic cells, Genomics 14, 897-911 (1992).

For prokaryotic data:
     Nakai, K. and Kanehisa, M.,
     Expert system for predicting protein localization sites
     in Gram-negative bacteria, PROTEINS: Structure,
     Function, and Genetics 11, 95-110 (1991).


II. INPUT INFORMATION --------------------------------------

Formatting a Query

A query is a mail message containing the category of the source
origin of the query sequence and the sequence itself.  Thus,
the body of the mail message has two mandatory lines in a fixed
order: the SOURCE line, which specifies the category of the
source origin, and the BEGIN line, which is followed on the
next line by the query sequence, as explained in the next
sections.
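
The layout described above can also be assembled by a program.
The following Python sketch is only an illustration: the function
name build_psort_request is hypothetical, and the only SOURCE
keyword confirmed by the example in this manual is "animal".

```python
import textwrap


def build_psort_request(source, name, sequence, note=""):
    """Return the body of a request message as one string.

    Hypothetical helper: it only mirrors the layout shown in this
    manual (SOURCE line, BEGIN line, FASTA-style comment line,
    sequence lines).
    """
    header = ">" + name + (" " + note if note else "")
    # Break the sequence into lines well under 80 characters.
    seq_lines = textwrap.wrap(sequence, width=60)
    return "\n".join(["SOURCE " + source, "BEGIN", header] + seq_lines) + "\n"


if __name__ == "__main__":
    # The toy sequence is the one used in the Quick Start example.
    print(build_psort_request(
        "animal", "MYSEQ",
        "nakainakainakainakainakainakainakainakaikentakentakentakenta"))
```

The resulting text would be sent as the body of a mail message to
the server address given in the Quick Start section.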


The SOURCE line

Select one of five categories to specify the source origin of 
the input sequence.  This selection determines the candidate 
localization-sites for prediction as listed below:

Gram-positive (bacterium):
     (cytoplasmic) membrane, cytoplasm, and outside, i.e., the
     protein will be secreted;

Gram-negative (bacterium):
     cytoplasm, inner membrane, periplasm, and outer membrane;

yeast:
     cytoplasm, mitochondria (outer membrane, intermembrane
     space, inner membrane, and matrix space), microbody
     (peroxisome), nucleus, endoplasmic reticulum, abbreviated
     as ER, (lumen and membrane), Golgi body, vacuole, plasma
     membrane, and outside;

animal:
     cytoplasm, mitochondria (outer membrane, intermembrane
     space, inner membrane, and matrix space), microbody
     (peroxisome), nucleus, endoplasmic reticulum (lumen and
     membrane), Golgi body, lysosome (lumen and membrane),
     plasma membrane, and outside;

plant:
     cytoplasm, mitochondria (outer membrane, intermembrane
     space, inner membrane, and matrix space), microbody
     (peroxisome), nucleus, endoplasmic reticulum (lumen and
     membrane), Golgi body, vacuole, plasma membrane, outside,
     and chloroplast (stroma, thylakoid membrane, and thylakoid
     space).


The Format of the Query Sequence

Only one query sequence is allowed per mail message, and your
sequence must be in the so-called FASTA/Pearson format.  That
is, it includes a mandatory comment line beginning with a
greater-than sign ">" followed by the name of the sequence, a
space, and an optional note about the sequence.  The sequence
data begin on the next line without the greater-than sign.
Characters other than the standard one-letter codes for the 20
amino acids, e.g., spaces, digits, carriage returns, and even
X, are removed by the system.  The system is case-insensitive
(lower-case letters are converted to upper case).  All lines of
the sequence (including the description line) should be kept to
80 characters or fewer.  Be careful not to include a signature
in the query mail.

The input sequence is expected to be a direct translation of
the genetic information and to contain all information for
sorting.  Thus, a warning message is issued if it starts with
an amino acid other than M (methionine).
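
The clean-up of the input sequence described above can be
sketched as follows.  This is not the server's code; the function
name and the wording of the warning are assumptions made for
illustration.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Upper-case the input and keep only the 20 standard residues.

    Returns (cleaned_sequence, warning), where warning is None
    unless the cleaned sequence starts with a residue other than M.
    """
    cleaned = "".join(ch for ch in raw.upper() if ch in STANDARD_AA)
    warning = None
    if cleaned and not cleaned.startswith("M"):
        # The warning text is invented for this sketch.
        warning = "sequence does not start with M (methionine)"
    return cleaned, warning


if __name__ == "__main__":
    print(clean_sequence("mk x12\nLLv"))
```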


III. OUTPUT FOR BACTERIAL SEQUENCES --------------------------

Gram-positive or Gram-negative

In the current version, programs and parameters are the same
for both kinds of bacteria.  The inner membrane of
Gram-negative bacteria is regarded as equivalent to the
membrane of Gram-positive bacteria, and the outside of
Gram-positive bacteria corresponds to either the periplasm or
the outer membrane of Gram-negative bacteria.


Recognition of Signal Sequence

In Gram-negative bacteria, most periplasmic and outer membrane
proteins have a signal sequence (also called a leader peptide)
at the N-terminus, which is cleaved off after translocation
across the cytoplasmic membrane.  Some cytoplasmic membrane
proteins also have cleavable signal sequences, but in others
the N-terminal signal sequence is not cleaved off and remains
as a transmembrane segment.

PSORT first predicts the presence of signal sequences by
McGeoch's method (D. J. McGeoch, Virus Research, 3, 271 (1985))
as modified by Nakai and Kanehisa (1991).  It considers the
N-terminal positively charged region (CR) and the central
hydrophobic region (UR) of signal sequences.  A discriminant
score is calculated from three values: the length of UR, the
peak value of UR, and the net charge of CR.  These results are
summarized in "McG".  A large positive discriminant score means
a high probability that the protein has a signal sequence,
whether it is cleaved off or not.

Next, PSORT applies von Heijne's method of signal sequence
recognition (G. von Heijne, Nucl. Acids Res., 14, 4683 (1986)).
It is a weight-matrix method that incorporates the consensus
pattern around cleavage sites (the (-3,-1)-rule), and thus it
can be used to detect uncleavable signal sequences.  The output
score of this "GvH" is the original weight-matrix score (for
prokaryotes) minus 7.5.  A large positive output means a high
probability that the protein has a cleavable signal sequence.
The position of the possible cleavage site, i.e., the most
C-terminal position of the signal sequence, is also reported.
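
To illustrate how a weight-matrix scan of this kind works, here
is a minimal Python sketch.  The weights in TOY_MATRIX are
invented for the example and are not von Heijne's published
values; the toy matrix scores only positions -3 and -1, whereas a
real matrix covers more positions.  Only the offset of 7.5 is
taken from the text above.

```python
# Invented weights: small residues are rewarded at -3 and -1.
TOY_MATRIX = {
    -3: {"A": 2.0, "S": 1.0, "G": 1.0, "V": 1.0},
    -1: {"A": 2.0, "S": 1.0, "G": 1.0},
}


def best_cleavage_site(seq, matrix=TOY_MATRIX, offset=7.5, default=-1.0):
    """Return (score, site) for the best-scoring candidate site.

    site is the 1-based position of the last residue of the signal
    sequence (relative position -1).  The matrix may only contain
    negative relative positions in this sketch.  Residues missing
    from a column receive the (assumed) default weight.
    """
    depth = -min(matrix)  # residues needed upstream of the site
    best_score, best_site = None, None
    for site in range(depth, len(seq) + 1):
        raw = sum(matrix[p].get(seq[site + p], default) for p in matrix)
        if best_score is None or raw > best_score:
            best_score, best_site = raw, site
    if best_score is None:
        return None, None  # sequence too short to score
    return best_score - offset, best_site


if __name__ == "__main__":
    print(best_cleavage_site("MKLLLAQAEE"))
```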


Recognition of Transmembrane Segments

In general, hydrophobic transmembrane segments exist in the 
cytoplasmic membrane proteins only.  Thus, these segments can 
be regarded as the sorting signal into the cytoplasmic 
membrane.

PSORT employs the method of Klein et al. (ALOM, also called
KKD) to detect potential transmembrane segments (P. Klein, M.
Kanehisa, and C. DeLisi, Biochim. Biophys. Acta, 815, 468
(1985)).  It attempts to identify the most probable
transmembrane segment, if any, from the average hydrophobicity
values of 17-residue segments.  It predicts whether that
segment is a transmembrane segment (INTEGRAL) or not
(PERIPHERAL) by comparing the discriminant score (reported as
'value') with a threshold parameter ('threshold'), which is
pre-defined as 0.0 for bacteria.  For an integral membrane
protein, the position(s) of the transmembrane segment(s) are
also reported.  Their length is fixed at 17 residues, but their
extension, i.e., the maximal range that satisfies the
discriminant criterion, is also given in parentheses.  The
discrimination step is then repeated with the detected segment
left out, until no predicted transmembrane segment remains.
The item 'count' is the number of predicted transmembrane
segments.
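
The repeated search over 17-residue windows can be sketched as
follows.  This is not the ALOM discriminant function: as a
stand-in, the sketch averages the Kyte-Doolittle hydropathy scale
and uses an arbitrary cutoff of 1.6, both of which are
assumptions made only for illustration.

```python
# Kyte-Doolittle hydropathy values (stand-in for ALOM's own scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def find_segments(seq, cutoff=1.6, width=17):
    """Return predicted segments as 1-based (start, end) pairs.

    The most hydrophobic window is taken first; it is then left
    out and the search is repeated until no window reaches the
    cutoff.  seq must contain only the 20 standard residues.
    """
    values = [KD[a] for a in seq]
    used = [False] * len(seq)
    segments = []
    while True:
        best_avg, best_start = None, None
        for i in range(len(seq) - width + 1):
            if any(used[i:i + width]):
                continue  # overlaps a segment that was already taken out
            avg = sum(values[i:i + width]) / width
            if best_avg is None or avg > best_avg:
                best_avg, best_start = avg, i
        if best_avg is None or best_avg < cutoff:
            break
        segments.append((best_start + 1, best_start + width))
        for j in range(best_start, best_start + width):
            used[j] = True
    return sorted(segments)


if __name__ == "__main__":
    segs = find_segments("MKKDE" + "L" * 17 + "DEKKRDEK")
    print("count:", len(segs), "segments:", segs)
```

The length of the returned list plays the role of the item
'count' in the output.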


Analysis of Lipoproteins

The signal sequences of lipoproteins, i.e., proteins with a
lipid molecule covalently attached to their mature N-terminus,
are essentially the same as those of other proteins except for
the region around their cleavage sites.  Thus, they can be
recognized by the combination of McGeoch's method and the
consensus motif around the cleavage site formulated by von
Heijne (G. von Heijne, Protein Eng., 2, 531 (1989)).  The
program is named "Lipop" here.  For a probable lipoprotein, it
gives the possible modification site, located around the end of
the preceding CR region defined in McGeoch's method; otherwise,
it returns a dummy modification site, -1.

Since the N-terminal lipid moieties of lipoproteins are thought
to be integrated into membranes, lipoproteins are predicted to
be membrane-associated proteins.  Further discrimination
between the cytoplasmic membrane and the outer membrane is
based on the experiment of Yamaguchi et al. (K. Yamaguchi, F.
Yu, and M. Inouye, Cell, 53, 423 (1988)):  if a lipoprotein has
a negatively charged residue at the second or third position of
the mature part, it is sorted to the inner membrane; otherwise,
it is sorted to the outer membrane.
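
The rule above is simple enough to write down directly.  In this
Python sketch the function name is hypothetical, and the argument
is assumed to be the mature sequence, i.e., the sequence that
starts at the lipid-modified residue.

```python
def lipoprotein_membrane(mature):
    """Apply the second/third-residue rule to a mature lipoprotein.

    mature: sequence of the mature part (position 1 is the
    modified N-terminal residue).
    """
    if len(mature) < 3:
        raise ValueError("mature part must have at least 3 residues")
    # D and E are the negatively charged residues.
    if mature[1] in "DE" or mature[2] in "DE":
        return "inner membrane"
    return "outer membrane"


if __name__ == "__main__":
    print(lipoprotein_membrane("CDSSN"))
    print(lipoprotein_membrane("CSSNA"))
```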


Analysis of Amino Acid Composition

Although outer membrane proteins are integrated into the
membrane, they do not have the hydrophobic segments that
characterize usual integral membrane proteins.  This is
probably because their membrane-spanning parts consist of
beta-strands.  In addition, the sorting signal that
discriminates outer membrane proteins from periplasmic proteins
is not well characterized.  Therefore, PSORT uses the amino
acid composition of the predicted mature portion, i.e., the
sequence without the N-terminal signal sequence, for their
discrimination (Nakai and Kanehisa, 1991).  That is, a
discriminant score is calculated as a linear combination of the
percentages of 10 amino acids.  A large positive value
indicates a tendency to be an outer membrane protein.
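
A linear discriminant on amino acid composition has the general
form sketched below.  The 10 amino acids and their coefficients
used by PSORT are not given in this manual, so the weights in the
usage example are placeholders chosen only to show the
arithmetic.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def percent_composition(seq):
    """Return the percentage of each standard amino acid in seq."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: 100.0 * counts[aa] / len(seq) for aa in AMINO_ACIDS}


def linear_score(seq, weights, constant=0.0):
    """constant + sum of weight * percentage over the listed residues."""
    comp = percent_composition(seq)
    return constant + sum(w * comp[aa] for aa, w in weights.items())


if __name__ == "__main__":
    # Placeholder weights, not the published coefficients.
    print(linear_score("AAGG", {"A": 1.0, "G": -0.5}))
```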


IV. OUTPUT FOR EUKARYOTIC SEQUENCES  -------------------------

Yeast, Animal, or Plant

In this version of PSORT, the parameters for analyzing yeast or
plant sequences are almost the same as those for animal
sequences.  Yeast and plants have the vacuole as a candidate
site instead of the lysosome in animals.  In yeast, the
consensus sequence for ER-lumen retention is HDEL rather than
the KDEL used in the others.  Lastly, plants have chloroplasts
(stroma etc.) as extra candidates.


Recognition of Signal Sequence

In eukaryotes, proteins sorted through the so-called vesicular
pathway (bulk flow) usually have a signal sequence (also called
a leader peptide) at the N-terminus, which is cleaved off after
translocation through the ER membrane.  Some N-terminal signal
sequences are not cleaved off and remain as transmembrane
segments, but this does not mean that these proteins are
retained in the ER; they can be sorted further in vesicles.

PSORT first predicts the presence of signal sequences by
McGeoch's method (D. J. McGeoch, Virus Research, 3, 271 (1985))
as modified by Nakai and Kanehisa (1991).  It considers the
N-terminal positively charged region (CR) and the central
hydrophobic region (UR) of signal sequences.  A discriminant
score is calculated from three values: the length of UR, the
peak value of UR, and the net charge of CR.  These results are
summarized in "McG".  A large positive discriminant score means
a high probability that the protein has a signal sequence,
whether it is cleaved off or not.

Next, PSORT applies von Heijne's method of signal sequence
recognition (G. von Heijne, Nucl. Acids Res., 14, 4683 (1986)).
It is a weight-matrix method that incorporates the consensus
pattern around cleavage sites (the (-3,-1)-rule), and thus it
can be used to detect uncleavable signal sequences.  The output
score of this "GvH" is the original weight-matrix score (for
eukaryotes) minus 3.5.  A large positive output means a high
probability that the protein has a cleavable signal sequence.
The position of the possible cleavage site, i.e., the most
C-terminal position of the signal sequence, is also reported.


Recognition of Transmembrane Segments

The current version of PSORT assumes that all integral membrane
proteins have hydrophobic transmembrane segment(s), which are
thought to be alpha-helices in membranes.

PSORT employs the method of Klein et al. (ALOM, also called
KKD) to detect potential transmembrane segments (P. Klein, M.
Kanehisa, and C. DeLisi, Biochim. Biophys. Acta, 815, 468
(1985)).  It attempts to identify the most probable
transmembrane segment, if any, from the average hydrophobicity
values of 17-residue segments.  It predicts whether that
segment is a transmembrane segment (INTEGRAL) or not
(PERIPHERAL) by comparing the discriminant score (reported as
'value') with a threshold parameter ('threshold'), which is
pre-defined as 0.0 for bacteria.  For an integral membrane
protein, the position(s) of the transmembrane segment(s) are
also reported.  Their length is fixed at 17 residues, but their
extension, i.e., the maximal range that satisfies the
discriminant criterion, is also given in parentheses.  The
discrimination step is then repeated with the detected segment
left out, until no predicted transmembrane segment remains.
The item 'count' is the number of predicted transmembrane
segments.

However, the ALOM program, although it has been ranked as one
of the best methods in evaluations, is not sufficient to
predict the exact number of transmembrane segments of
polytopic, i.e., multiple membrane-spanning, proteins.  Thus,
we use two threshold values for more precise prediction of
eukaryotic membrane proteins: when a protein is predicted to be
polytopic, a less stringent value is employed to obtain a more
realistic number of transmembrane segments.  It seems probable
that once a protein is integrated into the membrane, its less
hydrophobic segments are also integrated.


Prediction of Membrane Topology

Membrane proteins integrate into the membrane in a specific
orientation with respect to its two sides (cytoplasmic or
exo-cytoplasmic), which is called membrane topology.  We use
Singer's classification of membrane topology (S. J. Singer,
Ann. Rev. Cell Biol., 6, 247 (1990)).  Prediction of membrane
topology is important because some sorting signals reside at
specific positions in specific topologies, e.g., in the
cytoplasmic tail (see below).

PSORT uses the method of Hartmann et al. (E. Hartmann, T. A.
Rapoport, and H. F. Lodish, Proc. Natl. Acad. Sci. USA, 86,
5786 (1989); called "MTOP" here) for the prediction of membrane
topology.  It assumes that the overall topology is determined
by the difference in net charge between the 15-residue regions
flanking each side of the most N-terminal transmembrane
segment.  In the output, 'I(middle)' means the central position
of the most N-terminal segment.
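
The charge difference across the first transmembrane segment can
be computed as in the following Python sketch.  Several details
are assumptions made for illustration: K and R are counted as +1,
D and E as -1, and the difference is reported as C-terminal flank
minus N-terminal flank.  How the sign is mapped to a topology
class is not shown here.

```python
def net_charge(segment):
    """Number of K/R residues minus number of D/E residues."""
    positive = sum(1 for a in segment if a in "KR")
    negative = sum(1 for a in segment if a in "DE")
    return positive - negative


def charge_difference(seq, tm_start, tm_end, flank=15):
    """Charge of the C-terminal flank minus that of the N-terminal flank.

    tm_start and tm_end are the 1-based, inclusive positions of the
    most N-terminal transmembrane segment.  Flanks shorter than
    `flank` residues (near the sequence ends) are used as they are.
    """
    n_flank = seq[max(0, tm_start - 1 - flank):tm_start - 1]
    c_flank = seq[tm_end:tm_end + flank]
    return net_charge(c_flank) - net_charge(n_flank)


if __name__ == "__main__":
    toy = "MKKRK" + "L" * 17 + "DDEEA"
    print(charge_difference(toy, 6, 22))
```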

Since the N-terminal transmembrane segments of type Ib proteins
were often wrongly predicted to be cleaved off by von Heijne's
method, we introduced the hypothesis that if the charge
difference of the most N-terminal transmembrane segment is the
reverse of that of usual ER-transferons, the segment is not
cleaved.  Since some cleavable ER-transferons had a reversed
charge difference, we had to change the originally reported
threshold value.  PSORT also uses the heuristic that the
transmembrane segments of many type II proteins lie some
distance from the N-terminus.

In addition, there seems to be a preference of membrane 
topology in each localization site.  For example, type Ib 
proteins are favored at the ER while type II tend towards the 
Golgi complex and the plasma membrane.  PSORT uses such 
empirical knowledge for prediction.


Recognition of Mitochondrial Proteins

In mitochondria, many proteins are sorted through a
'conservative' pathway while others are sorted through
'nonconservative' pathways from the cytoplasm.  The proteins
sorted through the former have mitochondrial matrix targeting
signals at their N-terminus.  In contrast, the sequence
features of sorting signals for the 'nonconservative' pathways
are hardly recognizable.

PSORT employs a simple method to recognize mitochondrial
targeting signals: discriminant analysis on values of partial
amino acid composition (Nakai and Kanehisa, 1992).  For
example, the arginine content turned out to be effective for
prediction.  PSORT also reports some consensus sequence
patterns around cleavage sites (the item "Gavel", from Y. Gavel
and G. von Heijne, Prot. Eng., 4, 33 (1990)).  However, that
result is not used in our prediction.

Proteins targeted to the mitochondrial intermembrane space via
the 'conservative' pathway have an N-terminal signal with a
bipartite structure:  its N-terminal half appears to be
essentially a mitochondrial targeting signal, and its
C-terminal half is the signal for translocation from the matrix
to the intermembrane space.  PSORT recognizes the N-terminal
halves by the above-mentioned discriminant analysis.  For the
C-terminal halves, PSORT uses an original method for the
detection of apolar segments ("APOLAR").

Since only a few mitochondrial outer membrane proteins have
been sequenced, the prediction for this site cannot be assumed
to be generally applicable.  Many proteins localized at the
mitochondrial inner membrane are likely to be peripheral
membrane proteins that exist as members of large membrane
complexes, and their degree of hydrophobicity is relatively low
compared with membrane proteins in the vesicular pathway.
Thus, although PSORT uses the ALOM program for detecting them,
this step awaits further improvement.


Recognition of Nuclear Proteins

Although it seems possible that a protein without its own
nuclear targeting signal enters the nucleus via cotransport
with a protein that has one, many nuclear proteins have their
own targeting signals.  The most common type is that of SV40
large T antigen.  PSORT uses the following two rules to detect
it: (1) a 4-residue pattern composed of basic amino acids (K or
R), or composed of three basic amino acids (K or R) and either
H or P; (2) a pattern starting with P and followed within 3
residues by a basic segment containing 3 K or R residues out of
4 residues.

Another type of nuclear targeting signal is that of Xenopus
nucleoplasmin, proposed by Robbins et al. (J. Robbins, S. M.
Dilworth, R. A. Laskey, and C. Dingwall, Cell, 64, 615 (1991)).
The pattern is: 2 basic residues, a 10-residue spacer, and
another basic region consisting of at least 3 basic residues
out of 5 residues.

PSORT uses the heuristic that nuclear proteins are generally
rich in basic residues:  if the sum of the K and R compositions
is higher than 20%, the protein is considered more likely to be
nuclear than cytoplasmic.  In addition, it examines the
presence of the RNP (ribonucleoprotein) consensus motif because
some RNPs are transported to the nucleus by signals existing in
the bound RNAs.  However, this is apparently insufficient for
actual prediction.
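
The basic-residue heuristic amounts to a single percentage test,
as in this short Python sketch (the function name is
hypothetical; the 20% cutoff is the one stated above).

```python
def is_basic_rich(seq, cutoff=20.0):
    """True if K plus R make up more than `cutoff` percent of seq."""
    if not seq:
        raise ValueError("empty sequence")
    percent = 100.0 * sum(1 for a in seq if a in "KR") / len(seq)
    return percent > cutoff


if __name__ == "__main__":
    print(is_basic_rich("KKRAAAAAAA"))
```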

Note that we classify ribosomal proteins as nuclear proteins
because they have nuclear targeting signals and are transported
into the nucleus at one stage.


Recognition of Chloroplast Proteins

Proteins targeted to chloroplasts have cleavable signals at the
N-terminus, the chloroplast (stroma) targeting signals.  PSORT
postulates that all stromal proteins and thylakoid membrane
proteins have this kind of signal.  It uses a discriminant
score calculated from partial amino acid compositions
(positions 3-10 and 1-30) and from the amplitude of the maximum
hydrophobic moment at 165 degrees (potential beta-structure)
for residues 25 to 70 (Nakai and Kanehisa, 1992).  The form of
the discriminant function reflects the abundance of alanine and
serine residues in the N-terminal 30 residues.  In addition,
the observation that the second residue is often alanine is
also used.

Like some mitochondrial proteins, proteins of the chloroplast
thylakoid lumen have a bipartite signal at their N-terminus.
Its N-terminal half is essentially the same as a stroma
targeting signal, and the C-terminal half is used for
translocation from the stroma to the thylakoid lumen.  To
detect the latter signal, PSORT uses both the result of the
APOLAR algorithm applied to the limited region of residues 40
to 90 and a weight-matrix score around the cleavage sites (C.
J. Howe and T. P. Wallace, Nucl. Acids Res., 18, 3417 (1990)).

Thylakoid membrane proteins are discriminated by ALOM.  The
remaining chloroplast proteins are tentatively regarded as
stromal proteins.


Recognition of Peroxisomal Proteins

Peroxisomes, sometimes called glyoxysomes, glycosomes, or
microbodies, are organelles found in almost every eukaryotic
cell.  As a sorting signal, the importance of the C-terminal
three residues, (S/A(/C))(K/R/H)L, has been indicated (the SKL
motif).  However, since many peroxisomal proteins do not have
this motif at the C-terminus, PSORT uses the heuristic that the
presence of this motif at other positions also suggests
peroxisomal localization.
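
A search for the SKL motif, both at the C-terminus and at
internal positions, might look like the following Python sketch.
The function name is hypothetical, and all three alternatives at
the first position (S, A, C) are treated as equally acceptable,
which is a simplification.

```python
import re

SKL_TRIPEPTIDE = re.compile(r"[SAC][KRH]L")
# A lookahead finds overlapping occurrences as well.
SKL_ANYWHERE = re.compile(r"(?=[SAC][KRH]L)")


def skl_report(seq):
    """Return (c_terminal, internal).

    c_terminal: True if the last three residues match the motif.
    internal:   1-based start positions of all other occurrences.
    """
    c_terminal = bool(SKL_TRIPEPTIDE.fullmatch(seq[-3:]))
    internal = [m.start() + 1 for m in SKL_ANYWHERE.finditer(seq)
                if m.start() + 3 < len(seq)]
    return c_terminal, internal


if __name__ == "__main__":
    print(skl_report("MAAASKLGGGSKL"))
```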

Although some peroxisomal proteins have N-terminal presequences
that are cleaved off after translocation, it is not clear
whether they are sorting signals.  According to our preliminary
analysis, the amino acid composition of the N-terminal 20
residues was not very effective as a variable for discriminant
analysis.  Therefore, the amino acid composition of the entire
sequence is used as supplemental information for prediction.

The sorting signal of peroxisomal membrane proteins is not 
known.  Our training data of peroxisomal proteins contained a 
70 K membrane protein.  It was unclear whether our rule could 
also be applied to this protein, but it had three internal SKL 
motifs and was positive with the discriminant score although 
this protein was not included in the derivation of the 
function.


Recognition of ER (endoplasmic reticulum) Proteins

PSORT postulates that proteins with an N-terminal signal
sequence are transported to the cell surface by default unless
they have other signals for specific retention or commitment:
a luminal protein is secreted constitutively to the
extracellular space, and a membrane protein resides at the
plasma membrane.

The signal that retains ER luminal proteins from the bulk flow
is the sequence motif KDEL at the C-terminus.  In yeast and
some plants, the consensus motif is HDEL.  Although some
variations of this motif are allowed in some organisms and cell
types, they were not required for the discrimination of our
current data.
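
The C-terminal retention motif can be checked as in this Python
sketch.  It follows the simplification stated earlier in this
manual, HDEL for yeast and KDEL for the other categories; the
function name and the spelling of the category keywords are
assumptions.

```python
def has_er_retention_signal(seq, source):
    """True if seq ends with the ER-lumen retention motif.

    source: category of the source origin, e.g. "yeast" or "animal".
    """
    motif = "HDEL" if source == "yeast" else "KDEL"
    return seq.endswith(motif)


if __name__ == "__main__":
    print(has_er_retention_signal("MKLSAAEEKDEL", "animal"))
    print(has_er_retention_signal("MKLSAAEEHDEL", "yeast"))
```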

Compared with the KDEL motif, the retention signal(s) for ER
membrane proteins seem less evident as a sequence motif:  in
one mutagenesis analysis, two lysines positioned three and four
or five residues from the C-terminus turned out to be important
in some type Ia proteins (M. R. Jackson, T. Nilsson, and P. A.
Peterson, EMBO J., 9, 3153 (1990)).  This is one example of the
various comparton signals existing in cytoplasmic tails (see
below).  However, many ER membrane proteins do not have this
kind of sequence motif.  The preference of membrane topology
was a rather useful clue.


Analysis of Proteins in Vesicular Pathway

As already exemplified above, many sorting signals in the 
membrane proteins have been found in cytoplasmic tails which 
are short terminal segments exposed to the cytoplasm in type 
Ia, Ib, and II proteins (in Singer's terminology).

In relation to the default pathway of secretion, there is a 
pathway for protein internalization through coated-pit mediated 
endocytosis.  Two sequence motifs, NPXY and YXRF, have been 
identified as signals for this rapid internalization process.  
PSORT uses these sequence motifs as clues identifying plasma 
membrane proteins.

PSORT also uses a proposed consensus motif, (S/T)X(E/Q)(R/K),
located near the probable transmembrane domain of all
Golgi-localized glycosyltransferases (B. Bendiak, Biochem.
Biophys. Res. Comm., 170, 879 (1990)), in addition to the
above-mentioned heuristic on membrane topology.
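
The motifs named in this section translate directly into regular
expressions, with X standing for any residue.  The Python sketch
below only lists where each motif occurs; it ignores the
positional conditions (cytoplasmic tail, vicinity of the
transmembrane domain), and the dictionary keys are labels chosen
for this example.

```python
import re

# Lookaheads are used so that overlapping occurrences are reported.
MOTIFS = {
    "NPXY": re.compile(r"(?=NP.Y)"),
    "YXRF": re.compile(r"(?=Y.RF)"),
    "Golgi": re.compile(r"(?=[ST].[EQ][RK])"),
}


def find_motifs(seq):
    """Return {label: [1-based start positions]} for each motif."""
    return {label: [m.start() + 1 for m in pattern.finditer(seq)]
            for label, pattern in MOTIFS.items()}


if __name__ == "__main__":
    print(find_motifs("AANPVYGGYTRFSAEK"))
```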


Lipid Anchors

The protein modification reactions which bind lipid molecules 
to proteins are important because a linked lipid moiety can be 
integrated into various membranes and can anchor the bound 
protein.

For example, myristoylation occurs at a consensus sequence
within the N-terminal 9 residues.  However, recent studies
suggest that many of these modifications may not take part in
direct anchoring.  Thus, PSORT does not use the result for
further reasoning, although the observation is reported.

In contrast, all proteins linked to
glycosyl-phosphatidylinositol (GPI) molecules are thought to be
anchored at the extracellular surface of the plasma membrane.
PSORT recognizes GPI-anchored proteins using the knowledge that
most of them are predicted to be type Ia membrane proteins with
a very short cytoplasmic tail (within 10 residues), and uses
the result to predict the localization site (plasma membrane)
of the modified protein.

Lastly, there is a lipid modification known as isoprenylation
or farnesylation.  This modification requires a CaaX motif at
the C-terminus, where 'a' denotes an aliphatic amino acid.
Isoprenylated proteins have been found in the plasma membrane
and the nuclear envelope.  PSORT recognizes isoprenylated
proteins by the motif and an additional rule that they have
neither transmembrane segments nor signal sequences.
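
The CaaX rule can be sketched in Python as follows.  The set of
residues counted as aliphatic (A, V, L, I) is an assumption, since
this manual does not list them, and the function names are
hypothetical.  The numbers of transmembrane segments and the
presence of a signal sequence are taken as given, e.g., from the
analyses described earlier.

```python
def has_caax(seq, aliphatic="AVLI"):
    """True if the last four residues are C, aliphatic, aliphatic, any."""
    tail = seq[-4:]
    return (len(tail) == 4 and tail[0] == "C"
            and tail[1] in aliphatic and tail[2] in aliphatic)


def predicted_isoprenylated(seq, n_tm_segments, has_signal_sequence):
    """CaaX motif plus: no transmembrane segment, no signal sequence."""
    return (has_caax(seq) and n_tm_segments == 0
            and not has_signal_sequence)


if __name__ == "__main__":
    print(predicted_isoprenylated("MTEYKCVLS", 0, False))
```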


Lysosomal and Vacuolar Proteins

Lysosomes are acidic organelles that contain numerous 
hydrolytic enzymes.  In yeast and plant cells, similar 
functions are recognized in vacuoles, which have diverse 
functions.

For soluble lysosomal proteins, the pathway which utilizes the 
post-translational modification of mannose 6-phosphate has been 
clarified.  However, there are no clear consensus patterns 
except for the NX(S/T) pattern necessary for N-glycosylation, 
likely because the modification is conformation-dependent.  
Since the prediction of protein conformation is very difficult, 
PSORT uses the discriminant score based on amino acid 
composition.

It is likely that yeast and most plant cells share part of
their sorting mechanism for vacuolar proteins.  Many vacuolar
proteins have signal sequences at their N-terminus and have pro
regions that are cleaved off after translocation.
Nevertheless, no common sequence features have been observed.
Again, PSORT uses amino acid composition for discrimination.
The amino acid compositions of lysosomal and vacuolar soluble
proteins turned out to be totally different.

The sorting mechanism of lysosomal membrane proteins seems 
different from that of lysosomal luminal proteins.  The 
existence of a GY motif within 17 residues from the membrane 
boundary in the cytoplasmic tails of type Ia proteins is used 
as a rule for discrimination.
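
The GY rule can be sketched in Python as below.  The reading of
"within 17 residues" is an assumption: here the whole GY
dipeptide must lie in the first 17 residues of the tail, counted
from the membrane boundary.  The function name is hypothetical.

```python
def has_gy_in_tail(tail, window=17):
    """True if GY occurs within the first `window` residues of tail.

    tail: cytoplasmic tail of a type Ia protein, written starting
    from the residue next to the membrane boundary.
    """
    return "GY" in tail[:window]


if __name__ == "__main__":
    print(has_gy_in_tail("RKRSHAGYQTI"))
```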


V. NOTES ON THE KNOWLEDGE-BASED SYSTEM ---------------------

OPS83

The whole system is organized as an expert system with a
knowledge base, which is a collection of 'if-then'-type rules.
An expert system is an artificial intelligence technique in
which computers are equipped with domain-specific knowledge.
The core part of our system is written in the programming
language OPS83; calculations involving sequence data are
written in the C language and are called from rules when
necessary.  Currently, about 100 core rules are stored in the
knowledge base.


Reliability of Prediction Result

The current version of PSORT correctly classifies 83% of the
106 Gram-negative proteins into one of the four localization
sites.  However, the prediction accuracy for unknown sequences
has not been estimated.

Of the 295 eukaryotic proteins used for the tuning of our
system, 66% were correctly discriminated.  Moreover, of the 106
test proteins selected randomly from the localization sites
that include more than 10 members, 59% were correctly
predicted.  Many falsely predicted proteins seemed to be
transported by specific pathways.

The prediction accuracy will certainly improve as newly
accumulated knowledge is incorporated.


ACKNOWLEDGMENTS -----------------------------------------------

I would like to thank Minoru Kanehisa, Tomoki Miwa, Ken'ichi
Kawashima, Ikuo Uchiyama, and Atsushi Ogiwara.  Special thanks
go to Toshiyuki Okumura for setting up this server.

----------------- end of help message -------------------------


