PSORT Users' Manual

WWW version (date of last revision: Oct. 8, 1999)

Kenta Nakai

Human Genome Center, IMS, U. Tokyo


CONTENTS

Introduction

Quick Start
System Requirements

Input Information

Source of Input Sequence
Sequence ID
Sequence Field

Output for Bacterial Sequences

Gram-positive or Gram-negative
Recognition of Signal Sequence
Recognition of Transmembrane Segments
Analysis of Lipoproteins
Analysis of Amino Acid Composition

Output for Eukaryotic Information

Yeast, Animal, or Plant
Recognition of Signal Sequence
Recognition of Transmembrane Segments
Prediction of Membrane Topology
Recognition of Mitochondrial Proteins
Recognition of Nuclear Proteins
Recognition of Peroxisomal Proteins
Recognition of Chloroplast Proteins
Recognition of ER (endoplasmic reticulum) Proteins
Analysis of Proteins in Vesicular Pathway
Lipid Anchors
Lysosomal and Vacuolar Proteins

Notes on the Knowledge-based System

OPS83
Reliability of Prediction Result


Introduction

Quick Start

In the input form, select the button that matches the source organism of your sequence and paste the sequence into the sequence field. Click the "Submit" button to obtain the PSORT output. Its first part is a summary of your input for confirmation. The rest reports the analysis of various sequence features related to protein sorting signals. For convenience, the calculations are divided into two reasoning steps. The final prediction, i.e., the five most probable localization sites with their certainty factors, is given at the end.

System Requirements

The form-fill feature is used to specify input information. Thus, a WWW browser that supports this feature, such as NCSA Mosaic 2.x, is needed.


Input Information

Source of Input Sequence

Select one of the radio buttons to specify the source organism of the input sequence. This selection determines the candidate localization sites for prediction, as listed below:

Gram-positive bacterium:
(cytoplasmic) membrane, cytoplasm, and outside, i.e., the protein will be secreted.
Gram-negative bacterium:
cytoplasm, inner membrane, periplasm, and outer membrane.
yeast:
cytoplasm, mitochondria (outer membrane, intermembrane space, inner membrane, and matrix space), microbody (peroxisome), nucleus, endoplasmic reticulum (ER; lumen and membrane), Golgi body, vacuole, plasma membrane, and outside.
animal:
cytoplasm, mitochondria (outer membrane, intermembrane space, inner membrane, and matrix space), microbody (peroxisome), nucleus, endoplasmic reticulum (lumen and membrane), Golgi body, lysosome (lumen and membrane), plasma membrane, and outside.
plant:
cytoplasm, mitochondria (outer membrane, intermembrane space, inner membrane, and matrix space), microbody (peroxisome), nucleus, endoplasmic reticulum (lumen and membrane), Golgi body, vacuole, plasma membrane, outside, and chloroplast (stroma, thylakoid membrane, and thylakoid space).

Sequence ID

If specified, the sequence ID will be embedded in the result. Only the first word of the input is used, and there is no restriction on its format. The default ID is "MYSEQ".

Sequence Field

Enter the sequence here by typing it directly or by using the copy-and-paste feature of your window system. Characters other than the standard one-letter codes for the 20 amino acids, e.g., spaces, digits, and carriage returns, will be removed by the system. Lowercase letters will be converted to uppercase. The input sequence is expected to be a direct translation of the genetic information and to contain all the information needed for sorting. Thus, a warning message will be issued if it starts with an amino acid other than M (methionine).
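The clean-up described above can be sketched as follows. This is only an illustration, not the code used by the server: the function name clean_sequence is hypothetical, and the sketch assumes that lowercase letters are capitalized before the non-amino-acid characters are discarded.

```python
import warnings

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes


def clean_sequence(raw):
    """Normalize a pasted sequence roughly as the Sequence Field text describes."""
    # Assumption: capitalize first, so that "mkv" is kept as "MKV"
    # instead of being discarded as non-standard characters.
    seq = "".join(ch for ch in raw.upper() if ch in AMINO_ACIDS)
    if seq and not seq.startswith("M"):
        warnings.warn("sequence does not start with M (methionine)")
    return seq
```

For example, clean_sequence("mkv 10\nLLa") keeps only the six residues MKVLLA, whereas an input such as "akv" is returned as AKV together with a warning.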


Output for Bacterial Sequences

Gram-positive or Gram-negative

In the current version, the programs and parameters are the same for both kinds of bacteria. The inner membrane of Gram-negative bacteria is regarded as equivalent to the membrane of Gram-positive bacteria, and the outside in Gram-positive bacteria corresponds to either the periplasm or the outer membrane in Gram-negative ones.

Recognition of Signal Sequence

In Gram-negative bacteria, most periplasmic and outer membrane proteins have a signal sequence (also called a leader peptide) in the N-terminus, which is cleaved off after translocation across the cytoplasmic membrane. Some cytoplasmic membrane proteins also have cleavable signal sequences, but in others the N-terminal signal sequence is not cleaved off and remains as a transmembrane segment.

PSORT first predicts the presence of signal sequences by McGeoch's method (D. J. McGeoch, Virus Research, 3, 271, 1985) as modified by Nakai and Kanehisa (1991). It considers the N-terminal positively charged region (CR) and the central hydrophobic region (UR) of signal sequences. A discriminant score is calculated from three values: the length of UR, the peak value of UR, and the net charge of CR. These results are summarized in "McG". A large positive discriminant score indicates a high probability that the protein has a signal sequence, whether or not it is cleaved off.

Next, PSORT applies von Heijne's method of signal sequence recognition (G. von Heijne, Nucl. Acids Res., 14, 4683, 1986). It is a weight-matrix method that incorporates the consensus pattern around the cleavage sites (the (-3,-1)-rule); thus, it can be used to distinguish cleavable signal sequences from uncleavable ones. The output score of this "GvH" is the original weight-matrix score (for prokaryotes) minus 7.5. A large positive output indicates a high probability that the protein has a cleavable signal sequence. The position of the possible cleavage site, i.e., the most C-terminal position of the signal sequence, is also reported.

Recognition of Transmembrane Segments

In general, hydrophobic transmembrane segments are found only in cytoplasmic membrane proteins. Thus, these segments can be regarded as the sorting signal for the cytoplasmic membrane.

PSORT employs the method of Klein et al. ("ALOM", also called KKD) to detect potential transmembrane segments (P. Klein, M. Kanehisa, and C. DeLisi, Biochim. Biophys. Acta, 815, 468, 1985). It attempts to identify the most probable transmembrane segment, if any, from the average hydrophobicity of 17-residue segments. It predicts whether the segment is a transmembrane segment (INTEGRAL) or not (PERIPHERAL) by comparing the discriminant score (reported as 'value') with a threshold parameter ('threshold') predefined as 0.0 for bacteria. For an integral membrane protein, the position(s) of the transmembrane segment(s) are also reported. Their length is fixed at 17, but their extension, i.e., the maximal range that satisfies the discriminant criterion, is also given in parentheses. The discrimination step is then repeated with the detected segment left out, until no predicted transmembrane segment remains. The item 'count' is the number of predicted transmembrane segments.
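The iterative search described above can be illustrated with the following sketch. It is not the ALOM program: the discriminant function of Klein et al. is replaced here by a plain mean hydropathy, the Kyte-Doolittle scale is an assumed choice of hydrophobicity values, and the cutoff of 1.6 as well as the name find_segments are hypothetical. The input is assumed to contain only the 20 standard one-letter codes.

```python
# Kyte-Doolittle hydropathy values (an assumed scale for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
WINDOW = 17  # segment length stated in the text


def find_segments(seq, cutoff=1.6):
    """Return 1-based (start, end, mean hydropathy) of hydrophobic windows.

    `cutoff` is a hypothetical value, not ALOM's discriminant threshold.
    """
    masked = [False] * len(seq)
    segments = []
    while True:
        best = None
        for i in range(len(seq) - WINDOW + 1):
            if any(masked[i:i + WINDOW]):
                continue  # skip windows overlapping an accepted segment
            mean = sum(KD[res] for res in seq[i:i + WINDOW]) / WINDOW
            if best is None or mean > best[1]:
                best = (i, mean)
        if best is None or best[1] < cutoff:
            break  # no remaining window passes the cutoff
        start, mean = best
        segments.append((start + 1, start + WINDOW, round(mean, 2)))
        for j in range(start, start + WINDOW):
            masked[j] = True  # leave this segment out of the next round
    return sorted(segments)
```

The length of the returned list plays the role of the item 'count'; an empty list corresponds to a PERIPHERAL prediction in this simplified picture.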

Analysis of Lipoproteins

The signal sequences of lipoproteins, i.e., proteins with a lipid molecule covalently attached to their mature N-terminus, are essentially the same as those of other proteins except for the region around their cleavage sites. Thus, they can be recognized by the combination of McGeoch's method and the consensus motif around the cleavage site formulated by von Heijne (G. von Heijne, Protein Eng., 2, 531, 1989). The program is called "Lipop" here. For a probable lipoprotein, it gives the possible modification site, which lies around the end position of the preceding CR region defined in McGeoch's method; otherwise, it returns a dummy modification site, -1.

Since the N-terminal lipid moieties of lipoproteins are thought to be integrated into membranes, lipoproteins are predicted to be membrane-associated proteins. Further discrimination between the cytoplasmic membrane and the outer membrane is done as follows, based on the experiment of Yamaguchi et al. (K. Yamaguchi, F. Yu, and M. Inouye, Cell, 53, 423, 1988): if a lipoprotein has a negatively charged residue at the second or third position of the mature part, it is sorted to the inner membrane; otherwise, it is sorted to the outer membrane.
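The rule above is simple enough to write down directly. The following is a minimal sketch, assuming that "negatively charged residue" means D or E and that the mature part (after cleavage) is already known; the function name lipoprotein_membrane is hypothetical.

```python
def lipoprotein_membrane(mature):
    """Apply the second/third-position rule to the mature part of a lipoprotein."""
    # Assumption: "negatively charged residue" means D (Asp) or E (Glu).
    # mature[1:3] holds the second and third residues of the mature part.
    if any(res in "DE" for res in mature[1:3]):
        return "inner membrane"
    return "outer membrane"
```

With this sketch, a mature part starting with CDS is assigned to the inner membrane and one starting with CSS to the outer membrane.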

Analysis of Amino Acid Composition

Although outer membrane proteins are integrated into the membrane, they do not have the hydrophobic segments that characterize usual integral membrane proteins. This is probably because their membrane-spanning parts consist of beta strands. In addition, the sorting signal that discriminates outer membrane proteins from periplasmic proteins is not well characterized. Therefore, PSORT discriminates them using the amino acid composition of the predicted mature portion, which is determined by taking the N-terminal signal sequence into account (Nakai and Kanehisa, 1991). That is, a discriminant score is calculated as a linear combination of the percentages of 10 amino acids. A large positive value indicates a tendency to be an outer membrane protein.


Output for Eukaryotic Information

Yeast, Animal, or Plant

In this version of PSORT, the parameters for analyzing yeast and plant sequences are almost the same as those for animal sequences. Yeast and plants have the vacuole as a candidate site instead of the lysosome in animals. In yeast, the consensus sequence for ER-lumen retention is HDEL rather than the KDEL used for the others. Lastly, plants have chloroplasts (stroma, etc.) as additional candidate sites.

Recognition of Signal Sequence

In eukaryotes, proteins sorted through the so-called vesicular pathway (bulk flow) usually have a signal sequence (also called a leader peptide) in the N-terminus, which is cleaved off after translocation through the ER membrane. Some N-terminal signal sequences are not cleaved off and remain as transmembrane segments, but this does not mean that these proteins are retained in the ER; they can be sorted further within vesicles.

PSORT first predicts the presence of signal sequences by McGeoch's method (D. J. McGeoch, Virus Research, 3, 271, 1985) as modified by Nakai and Kanehisa (1991). It considers the N-terminal positively charged region (CR) and the central hydrophobic region (UR) of signal sequences. A discriminant score is calculated from three values: the length of UR, the peak value of UR, and the net charge of CR. These results are summarized in "McG". A large positive discriminant score indicates a high probability that the protein has a signal sequence, whether or not it is cleaved off.

Next, PSORT applies von Heijne's method of signal sequence recognition (G. von Heijne, Nucl. Acids Res., 14, 4683, 1986). It is a weight-matrix method that incorporates the consensus pattern around the cleavage sites (the (-3,-1)-rule); thus, it can be used to distinguish cleavable signal sequences from uncleavable ones. The output score of this "GvH" is the original weight-matrix score (for eukaryotes) minus 3.5. A large positive output indicates a high probability that the protein has a cleavable signal sequence. The position of the possible cleavage site, i.e., the most C-terminal position of the signal sequence, is also reported.

Recognition of Transmembrane Segments

The current version of PSORT assumes that all integral membrane proteins have hydrophobic transmembrane segment(s) which are thought to be alpha-helices in membranes.

PSORT employs the method of Klein et al. ("ALOM", also called KKD) to detect potential transmembrane segments (P. Klein, M. Kanehisa, and C. DeLisi, Biochim. Biophys. Acta, 815, 468, 1985). It attempts to identify the most probable transmembrane segment, if any, from the average hydrophobicity of 17-residue segments. It predicts whether the segment is a transmembrane segment (INTEGRAL) or not (PERIPHERAL) by comparing the discriminant score (reported as 'value') with a threshold parameter ('threshold') predefined as 0.0 for bacteria. For an integral membrane protein, the position(s) of the transmembrane segment(s) are also reported. Their length is fixed at 17, but their extension, i.e., the maximal range that satisfies the discriminant criterion, is also given in parentheses. The discrimination step is then repeated with the detected segment left out, until no predicted transmembrane segment remains. The item 'count' is the number of predicted transmembrane segments.

However, the ALOM program, although it has been ranked as one of the best methods in evaluations, is not sufficient to predict the exact number of transmembrane segments of polytopic, i.e., multiple membrane-spanning, proteins. Thus, we use two threshold values for more precise prediction of eukaryotic membrane proteins: when a protein is predicted to be polytopic, a less stringent value is employed to predict a more realistic number of transmembrane segments. It seems probable that once a protein is integrated into the membrane, its less hydrophobic segments are also integrated.

Prediction of Membrane Topology

Membrane proteins have a specific way of integrating into the membrane with respect to its two sides (cytoplasmic or exo-cytoplasmic), which is called the membrane topology. We use Singer's classification of membrane topology (S. J. Singer, Ann. Rev. Cell Biol., 6, 247, 1990). Prediction of membrane topology is important because some sorting signals reside at specific positions in specific topologies, e.g., in the cytoplasmic tail (see below).

PSORT uses the method of Hartmann et al. (E. Hartmann, T. A. Rapoport, and H. F. Lodish, Proc. Natl. Acad. Sci. USA, 86, 5786, 1989; called "MTOP" here) for the prediction of membrane topology. The method assumes that the overall topology is determined by the difference in net charge between the two 15-residue regions flanking the most N-terminal transmembrane segment. In the output, 'I(middle)' means the central position of the most N-terminal segment.
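The charge-difference calculation can be sketched as below. This is an illustration rather than the MTOP program itself: it assumes integer charges of +1 for K and R and -1 for D and E, it assumes the difference is taken as the C-terminal flank minus the N-terminal flank, and the function names are hypothetical.

```python
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}  # assumed residue charges
FLANK = 15  # flanking length stated in the text


def net_charge(fragment):
    """Sum the assumed charges of the residues in a fragment."""
    return sum(CHARGE.get(res, 0) for res in fragment)


def charge_difference(seq, tm_start, tm_end):
    """Net charge of the C-terminal flank minus that of the N-terminal flank.

    tm_start and tm_end are the 1-based, inclusive positions of the most
    N-terminal transmembrane segment. Flanks are truncated at the sequence ends.
    """
    n_flank = seq[max(0, tm_start - 1 - FLANK):tm_start - 1]
    c_flank = seq[tm_end:tm_end + FLANK]
    return net_charge(c_flank) - net_charge(n_flank)
```

For a segment at positions 4 to 20 preceded by KKK and followed by DDE, the sketch gives -3 - 3 = -6, i.e., the N-terminal side is the more positive one.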

Since the N-terminal transmembrane segments of type Ib proteins were often wrongly predicted to be cleaved off by von Heijne's method, we introduced the hypothesis that if the charge difference of the most N-terminal transmembrane segment is opposite to that of usual ER-transferons, the segment is not cleaved. Since some cleavable ER-transferons had a reversed charge difference, we had to change the originally reported threshold value. PSORT also uses a heuristic that the transmembrane segments of many type II proteins lie at some distance from the N-terminus.

In addition, there seems to be a preference of membrane topology in each localization site. For example, type Ib proteins are favored at the ER while type II tend towards the Golgi complex and the plasma membrane. PSORT uses such empirical knowledge for prediction.

Recognition of Mitochondrial Proteins

In mitochondria, many proteins are sorted through a 'conservative' pathway while others are sorted through 'nonconservative' pathways from the cytoplasm. The proteins sorted through the former have mitochondrial matrix targeting signals in their N-terminus. In contrast, the sequence features of the sorting signals for 'nonconservative' pathways are hardly recognizable.

PSORT employs a simple method to recognize mitochondrial targeting signals using the discriminant analysis from values of partial amino acid composition (Nakai and Kanehisa, 1992). For example, the arginine content turned out to be effective for prediction. PSORT also reports some consensus sequence patterns around cleavage sites (the item "Gavel" from Y. Gavel and G. von Heijne, Prot. Eng., 4, 33, 1990). However, the result is not used in our prediction.

Proteins targeted to the mitochondrial intermembrane space via the 'conservative' pathway have an N-terminal signal of bipartite structure: its N-terminal half appears to be essentially a mitochondrial targeting signal and its C-terminal half is the signal for the translocation from the matrix to the intermembrane space. PSORT recognizes the N-terminal halves by the above-mentioned discriminant analysis. As for the C-terminal halves, PSORT uses an original method for the detection of apolar segments ("APOLAR").

Since only a few mitochondrial outer membrane proteins have been sequenced, the prediction for this site cannot be expected to be generally applicable. Many proteins localized at the mitochondrial inner membrane are likely to be peripheral membrane proteins that exist as members of large membrane complexes, and their degree of hydrophobicity is relatively low compared with membrane proteins in the vesicular pathway. Thus, although PSORT uses the ALOM program for detecting them, this part awaits further improvement.

Recognition of Nuclear Proteins

Although it seems possible that a protein without its own nuclear targeting signal enters the nucleus via cotransport with a protein that has one, many nuclear proteins have their own targeting signals. The most common type is that of SV40 large T antigen. PSORT uses the following two rules to detect it: (1) a 4-residue pattern composed either of four basic amino acids (K or R) or of three basic amino acids (K or R) and H or P; (2) a pattern starting with P and followed within 3 residues by a basic segment containing 3 K or R residues out of 4 residues.
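The two rules above might be coded as follows. This sketch rests on several assumptions that the text does not settle: the H or P in the first rule may occupy any of the four positions, "within 3 residues" is read as the basic segment starting at one of the three positions after P, and "3 out of 4" is read as at least three. The function names are hypothetical.

```python
BASIC = set("KR")


def has_basic_quartet(seq):
    """Rule 1: four residues that are all K/R, or three K/R plus one H or P."""
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        n_basic = sum(res in BASIC for res in window)
        n_hp = sum(res in "HP" for res in window)
        if n_basic == 4 or (n_basic == 3 and n_hp == 1):
            return True
    return False


def has_p_basic_pattern(seq):
    """Rule 2: P followed shortly by a 4-residue window with at least 3 K/R."""
    for i, res in enumerate(seq):
        if res != "P":
            continue
        # Assumption: the basic window starts 1, 2, or 3 positions after P.
        for start in range(i + 1, i + 4):
            window = seq[start:start + 4]
            if len(window) == 4 and sum(r in BASIC for r in window) >= 3:
                return True
    return False
```

Both functions return True for the fragment PKKKRKV, which contains the stretch KKKR directly after a P.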

Another type of nuclear targeting signal is the type of Xenopus nucleoplasmin proposed by Robbins et al. (J. Robbins, S. M. Dilworth, R. A. Laskey, and C. Dingwall, Cell, 64, 615, 1991). The pattern is: 2 basic residues, 10 residue spacer, and another basic region consisting of at least 3 basic residues out of 5 residues.
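The bipartite pattern described above spans 2 + 10 + 5 = 17 residues and can be scanned for as in the sketch below. It assumes that "basic" means K or R, as in the rules for the SV40-type signal; the function name has_bipartite_signal is hypothetical.

```python
def has_bipartite_signal(seq):
    """Two basic residues, a 10-residue spacer, then >= 3 basic residues in 5."""
    basic = set("KR")  # assumption: basic residues are K and R only
    for i in range(len(seq) - 16):  # the whole pattern is 17 residues long
        if seq[i] in basic and seq[i + 1] in basic:
            tail = seq[i + 12:i + 17]  # the 5 residues after the spacer
            if sum(res in basic for res in tail) >= 3:
                return True
    return False
```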

PSORT uses a heuristic that nuclear proteins are generally rich in basic residues: if the sum of the K and R compositions is higher than 20%, the protein is considered more likely to be nuclear than cytoplasmic. In addition, PSORT examines the presence of the RNP (ribonucleoprotein) consensus motif because some RNPs are transported to the nucleus by signals existing in the bound RNAs. However, this is apparently insufficient for actual prediction.
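The composition heuristic above amounts to a single ratio. A minimal sketch, with hypothetical function names and the 20% figure from the text as the default cutoff:

```python
def basic_fraction(seq):
    """Fraction of residues that are K or R (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(res in "KR" for res in seq) / len(seq)


def favors_nuclear(seq, cutoff=0.20):
    """True when the K + R content is strictly higher than the cutoff."""
    return basic_fraction(seq) > cutoff
```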

Note that we classify ribosomal proteins as nuclear proteins because they have nuclear targeting signals and are transported into the nucleus at one point.

Recognition of Peroxisomal Proteins

Peroxisomes, sometimes called glyoxysomes, glycosomes, or microbodies, are organelles found in almost every eukaryotic cell. As a sorting signal, the importance of the C-terminal three residues, (S/A(/C))(K/R/H)L, has been indicated (the SKL motif). However, since many peroxisomal proteins do not have this motif at the appropriate position, PSORT uses a heuristic that the presence of this motif at other positions also implies peroxisomal localization.
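A scan for the SKL motif, both at the C-terminus and at internal positions, could look like the sketch below. It treats S, A, and C as equally acceptable first residues, which simplifies the notation (S/A(/C)) in the text; the function name is hypothetical.

```python
import re

SKL = re.compile(r"[SAC][KRH]L")


def skl_motif_positions(seq):
    """Return (c_terminal, internal) for the (S/A/C)(K/R/H)L motif.

    c_terminal is True when the last three residues match the motif;
    internal lists the 1-based start positions of all other matches.
    """
    c_terminal = SKL.fullmatch(seq[-3:]) is not None
    # A lookahead is used so that overlapping occurrences are all reported.
    internal = [m.start() + 1
                for m in re.finditer(r"(?=[SAC][KRH]L)", seq)
                if m.start() + 3 < len(seq)]
    return c_terminal, internal
```

For example, a sequence ending in SKL yields (True, []), while one that only carries SKL at residues 2 to 4 yields (False, [2]).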

Although some peroxisomal proteins have N-terminal presequences which are cleaved off after translocation, it is not clear whether they are sorting signals. According to our preliminary analysis, the amino acid composition of the N-terminal 20 residues was not very effective as a variable for discriminant analysis. Therefore, the amino acid composition of the entire sequence is used as supplemental information for prediction.

The sorting signal of peroxisomal membrane proteins is not known. Our training data of peroxisomal proteins contained a 70 K membrane protein. It was unclear whether our rule could also be applied to this protein, but it had three internal SKL motifs and was positive with the discriminant score although this protein was not included in the derivation of the function.

Recognition of Chloroplast Proteins

Proteins targeted to chloroplasts have cleavable signals in the N-terminus, the chloroplast (stroma) targeting signals. PSORT postulates that all stromal proteins and thylakoid membrane proteins have this kind of signal. It uses a discriminant score calculated from partial amino acid compositions (positions 3-10 and 1-30) and from the amplitude of the maximum hydrophobic moment of 165 degrees (potential beta-structure) for residues 25 to 70 (Nakai and Kanehisa, 1992). The form of the discriminant function reflects the abundance of alanine and serine residues in the N-terminal 30 residues. In addition, the observation that the second residue is often alanine is also used.

Like some mitochondrial proteins, proteins of the chloroplast thylakoid lumen have a bipartite signal in their N-terminus. Its N-terminal half is essentially the same as a stroma targeting signal and its C-terminal half is used for the translocation from the stroma to the thylakoid lumen. For the detection of the latter signal, PSORT uses both the result of the APOLAR algorithm applied to the limited region of residues 40 to 90 and, as another clue, a weight matrix score around the cleavage sites (C. J. Howe and T. P. Wallace, Nucl. Acids Res., 18, 3417, 1990).

Thylakoid membrane proteins are discriminated by ALOM. The remaining chloroplast proteins are tentatively regarded as stromal proteins.

Recognition of ER (endoplasmic reticulum) Proteins

PSORT postulates that the proteins with N-terminal signal sequence will be transported to the cell surface by default unless they have any other signals for specific retention or commitment; a luminal protein will be secreted constitutively to the extracellular space and a membrane protein will reside at the plasma membrane.

The retention signal of ER luminal proteins from the bulk flow is the existence of the sequence motif KDEL in the C-terminus. In yeast and some plants, the consensus motif is HDEL. Although some variations of this motif are allowed in some organisms and cell types, they were not required for the discrimination of our current data.
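Checking for the C-terminal retention motif is a plain suffix test, as in the sketch below. The table follows the statement in the section "Yeast, Animal, or Plant" that HDEL is used for yeast and KDEL for the others; the names ER_MOTIF and has_er_lumen_signal are hypothetical, and the motif variants mentioned above are not covered.

```python
# Motif per source organism, as described for this version of PSORT.
ER_MOTIF = {"yeast": "HDEL", "animal": "KDEL", "plant": "KDEL"}


def has_er_lumen_signal(seq, source):
    """True when the sequence ends with the retention motif for `source`."""
    return seq.endswith(ER_MOTIF[source])
```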

Compared with the KDEL motif, the retention signal(s) for ER membrane proteins seem less evident as a sequence motif: in one analysis using mutagenesis, two lysines positioned three and four or five residues from the C-terminus turned out to be important in some type Ia proteins (M. R. Jackson, T. Nilsson, and P. A. Peterson, EMBO J., 9, 3153, 1990). This is one example of the various sorting signals existing in cytoplasmic tails (see below). However, many ER membrane proteins do not have this kind of sequence motif. The preference of membrane topology was a rather useful clue.
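The two-lysine pattern reported by Jackson et al. can be tested as in the following sketch. It assumes that positions are counted from the C-terminal residue as position 1, so that the lysines sit at the third and at the fourth or fifth residue from the end; the function name is hypothetical, and the check for a type Ia topology is left to the caller.

```python
def has_dilysine_signal(seq):
    """K at the third residue from the C-terminus plus K at the fourth or fifth."""
    if len(seq) < 4 or seq[-3] != "K":
        return False
    fourth = seq[-4] == "K"
    fifth = len(seq) >= 5 and seq[-5] == "K"
    return fourth or fifth
```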

Analysis of Proteins in Vesicular Pathway

As already exemplified above, many sorting signals in the membrane proteins have been found in cytoplasmic tails which are short terminal segments exposed to the cytoplasm in type Ia, Ib, and II proteins (in Singer's terminology).

In relation to the default pathway of secretion, there is a pathway for protein internalization through coated-pit mediated endocytosis. Two sequence motifs, NPXY and YXRF, have been identified as signals for this rapid internalization process. PSORT uses these sequence motifs as clues for identifying plasma membrane proteins.
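The two internalization motifs can be located with regular expressions in which X matches any residue. In this sketch the argument is assumed to be the cytoplasmic tail, since that is where such signals are described as residing; the names used are hypothetical.

```python
import re

# X in the motif names stands for any residue, written as "." in the patterns.
INTERNALIZATION_MOTIFS = {"NPXY": r"NP.Y", "YXRF": r"Y.RF"}


def find_internalization_motifs(tail):
    """Return (motif name, 1-based start) pairs found in a cytoplasmic tail."""
    hits = []
    for name, pattern in INTERNALIZATION_MOTIFS.items():
        # Lookahead so that overlapping occurrences are not missed.
        for m in re.finditer(f"(?={pattern})", tail):
            hits.append((name, m.start() + 1))
    return sorted(hits, key=lambda hit: hit[1])
```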

PSORT also uses a proposed consensus motif, (S/T)X(E/Q)(R/K), near the probable transmembrane domain of all Golgi-localized glycosyltransferases (B. Bendiak, Biochem. Biophys. Res. Comm., 170, 879, 1990) in addition to the above-mentioned heuristic on membrane topology.

Lipid Anchors

The protein modification reactions which bind lipid molecules to proteins are important because a linked lipid moiety can be integrated into various membranes and can anchor the bound protein.

For example, myristoylation occurs at a consensus sequence within the N-terminal 9 residues. However, recent studies suggest that many of these modifications may not take part in direct anchoring. Thus, PSORT does not use the result for further reasoning, although the observation is reported.

In contrast, all proteins linked to glycosyl-phosphatidylinositol (GPI) molecules are thought to be anchored at the extracellular surface of the plasma membrane. PSORT recognizes GPI-anchored proteins using the knowledge that most of them are predicted to be type Ia membrane proteins with a very short cytoplasmic tail (within 10 residues), and uses the result for the prediction of the localization site (plasma membrane) of the modified protein.

Lastly, there is a lipid modification known as isoprenylation or farnesylation. This modification requires a CaaX motif in the C-terminus, where 'a' denotes an aliphatic amino acid. Isoprenylated proteins have been found in the plasma membrane and the nuclear envelope. PSORT recognizes isoprenylated proteins by the motif and an additional rule that they have neither transmembrane segments nor signal sequences.
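The CaaX rule above can be sketched as follows. The set of residues counted as aliphatic (A, V, L, I) is an assumption, since the text does not list them, and the results of the transmembrane and signal sequence analyses are simply passed in as arguments; all names are hypothetical.

```python
ALIPHATIC = set("AVLI")  # assumed definition of an aliphatic amino acid


def is_isoprenylated(seq, n_tm_segments, has_signal_sequence):
    """CaaX at the C-terminus, with no transmembrane segment or signal sequence."""
    if n_tm_segments > 0 or has_signal_sequence:
        return False
    if len(seq) < 4:
        return False
    c, a1, a2, _x = seq[-4:]  # the last residue (X) may be anything
    return c == "C" and a1 in ALIPHATIC and a2 in ALIPHATIC
```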

Lysosomal and Vacuolar Proteins

Lysosomes are acidic organelles that contain numerous hydrolytic enzymes. In yeast and plant cells, similar functions are carried out by vacuoles, which also have various other functions.

For soluble lysosomal proteins, the pathway which utilizes the post-translational modification of mannose 6-phosphate has been clarified. However, there are no clear consensus patterns except for the NX(S/T) pattern necessary for N-glycosylation, likely because the modification is conformation-dependent. Since the prediction of protein conformation is very difficult, PSORT uses the discriminant score based on amino acid composition.

It is likely that yeast and most plant cells share part of their vacuolar sorting mechanism. Many vacuolar proteins have signal sequences in their N-terminus and pro regions that are cleaved off after translocation. Nevertheless, no common sequence features have been observed. Again, PSORT uses the information of amino acid composition for discrimination. The amino acid compositions of lysosomal and vacuolar soluble proteins turned out to be totally different.

The sorting mechanism of lysosomal membrane proteins seems different from that of lysosomal luminal proteins. The existence of a GY motif within 17 residues from the membrane boundary in the cytoplasmic tails of type Ia proteins is used as a rule for discrimination.
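The GY rule above can be sketched as a search in the first residues of the cytoplasmic tail. The sketch assumes that in a type Ia protein the cytoplasmic tail follows the transmembrane segment on its C-terminal side, and that both residues of the motif must fall within the 17-residue limit; the function name has_gy_motif is hypothetical.

```python
def has_gy_motif(seq, tm_end, max_distance=17):
    """Look for GY in the first `max_distance` residues of the cytoplasmic tail.

    tm_end is the 1-based position of the last residue of the transmembrane
    segment; the tail is assumed to start at the next residue.
    """
    tail = seq[tm_end:tm_end + max_distance]
    return "GY" in tail
```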


Notes on the Knowledge-based System

OPS83

The whole system is organized as an expert system with a knowledge base, which is a collection of 'if-then'-type rules. An expert system is an artificial intelligence technique in which computers are equipped with domain-specific knowledge. The core part of our system is written in the programming language OPS83, while calculations involving sequence data are written in the C language and are called from rules when necessary. Currently, about 100 core rules are stored in the knowledge base.

Reliability of Prediction Result

The current version of PSORT correctly classifies 83% of the 106 Gram-negative proteins into one of the four localization sites. However, the prediction accuracy when applied to unknown sequences has not been estimated.

Of the 295 eukaryotic proteins used for the tuning of our system, 66% were correctly discriminated. Moreover, of the 106 proteins selected randomly from the localization sites including more than 10 members for testing, 59% were correctly predicted. Many falsely predicted proteins seemed to be transported by specific pathways.

The prediction accuracy will certainly improve as new knowledge accumulates and is incorporated into the system.


nakai@imcb.osaka-u.ac.jp