Intelligent Systems for Molecular Biology 5 147-152 (1997)

Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier

Paul Horton and Kenta Nakai

Computer Sci. Div., Univ. California, Berkeley

We have compared four classifiers on the problem of predicting the cellular localization sites of proteins in yeast and E. coli. The same set of sequence-derived features, such as regions of high hydrophobicity, was used for each classifier. The methods compared were a structured probabilistic model specifically designed for the localization problem, the k nearest neighbors classifier, the binary decision tree classifier, and the naive Bayes classifier. Tests using stratified cross-validation show that the k nearest neighbors classifier performs better than the other methods. In the case of yeast, this difference was statistically significant according to a cross-validated paired t test. The k nearest neighbors classifier achieves an accuracy of approximately 60% for 10 yeast classes and 86% for 8 E. coli classes. The best previously reported accuracies for these datasets were 55% and 81%, respectively.
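
To illustrate the evaluation procedure named above, here is a minimal sketch of a k nearest neighbors classifier scored with stratified cross-validation. It is not the authors' implementation. The function names (knn_predict, stratified_folds, cross_validated_accuracy), the Euclidean distance, the round-robin fold assignment, and the two-feature toy data are all assumptions made for illustration; the toy data are not the yeast or E. coli datasets.

```python
from collections import Counter, defaultdict
import math


def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query.

    Assumption: Euclidean distance over numeric feature tuples.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def stratified_folds(data, n_folds):
    """Deal the examples of each class round-robin into n_folds folds,
    so every fold has roughly the same class proportions as the full set."""
    by_class = defaultdict(list)
    for item in data:
        by_class[item[1]].append(item)
    folds = [[] for _ in range(n_folds)]
    for items in by_class.values():
        for i, item in enumerate(items):
            folds[i % n_folds].append(item)
    return folds


def cross_validated_accuracy(data, k, n_folds):
    """Hold out each fold in turn, classify it using the remaining folds,
    and return the fraction of held-out examples labeled correctly."""
    folds = stratified_folds(data, n_folds)
    correct = 0
    for i, test_fold in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        correct += sum(knn_predict(train, x, k) == y for x, y in test_fold)
    return correct / len(data)


# Hypothetical toy data: (feature tuple, class label) pairs in two
# well-separated clusters. These are NOT the datasets from the paper.
toy = ([((0.1 * i, 0.0), "inner") for i in range(6)]
       + [((5.0 + 0.1 * i, 1.0), "outer") for i in range(6)])

accuracy = cross_validated_accuracy(toy, k=3, n_folds=3)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

Stratifying the folds matters when some classes are rare: a plain random split could leave a fold with no examples of a small class, which would distort the accuracy estimate for that fold.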


Last update November 7, 1997
nakai@imcb.osaka-u.ac.jp