Protein networks markedly improve prediction of subcellular localization in multiple eukaryotic species
KiYoung Lee [0,1,2,6], Han-Yu Chuang [2,5], Andreas Beyer [2,4], Min-Kyung Sung [3], Won-Ki Huh [3], Bonghee Lee [1], Trey Ideker [2,5]
[0] Structural Biology Laboratory, Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, CA 92037, USA
[1] Center for Genomics and Proteomics, Lee Gil Ya Cancer and Diabetes Institute, Gachon University of Medicine and Science, Incheon 406-799, Republic of Korea
[2] Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
[3] School of Biological Sciences, Research Center for Functional Cellulomics, Institute of Microbiology, Seoul National University, Seoul 151-747, Republic of Korea
[4] Biotechnology Center, Technische Universität, 01062 Dresden, Germany
[5] Bioinformatics Program, University of California San Diego, La Jolla, CA 92093, USA
[6] Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 305-701, Republic of Korea
The function of a protein is intimately tied to its subcellular localization. Although localizations have been measured for many yeast proteins through systematic GFP fusions, similar studies in other branches of life are still forthcoming. In the interim, various machine-learning methods have been proposed to predict localization using physical characteristics of a protein, such as amino acid content, hydrophobicity, side-chain mass and domain composition. However, there has been comparatively little work on predicting localization using protein networks. Here, we predict protein localizations by integrating an extensive set of protein physical characteristics over a protein's extended protein-protein interaction neighborhood, using a classification framework called 'Divide and Conquer k-Nearest Neighbors' (DC-kNN). These predictions achieve significantly higher accuracy than two well-known methods for predicting protein localization in yeast. Using new GFP imaging experiments, we show that the network-based approach can extend and revise previous annotations made from high-throughput studies. Finally, we show that our approach remains highly predictive in higher eukaryotes such as fly and human, in which most localizations are unknown and the protein network coverage is less substantial.
INTRODUCTION
For a protein to operate properly, it must reside in the
correct compartment of a cell. Knowing the subcellular
localization of a protein, therefore, is an important step
to understanding its function (1,2). In budding and fission
yeast (1–4), systematic protein localization experiments
have been carried out through GFP fusions to each
open reading frame at the 3′- or 5′-end. Such studies
have not yet been performed in higher eukaryotes such
as Caenorhabditis elegans, Drosophila melanogaster or
mammals, due to the larger proteome sizes and the
technical difficulties associated with protein tagging in those
species (5–7). In the interim, reliable and efficient
computational methods are required to predict the subcellular
localization of a newly identified protein.
A considerable number of classification methods have
been developed for this purpose (5–24). Typically, these
algorithms input a list of features with which to
characterize a protein, such as its molecular weight, amino acid
content, codon bias, hydrophobicity, side-chain mass
and so on. During the training phase, they learn to
recognize which features, or patterns of features, are best able
to classify a set of gold-standard proteins whose
localizations are well known. To date, amino acid content
has been a very successful and widely used feature
(5,6,8,11–16). Other informative features have been
protein sorting signal motifs near the N-terminus (18), as well
as protein sequence motifs (7,9–12,16,24) and Gene
Ontology terms (5). Classification of these features has
relied on a variety of algorithms, including Least
Distance Algorithms (20,21), an Artificial Neural
Network (10), a Nearest Neighbor approach (5,14), a
Markov Model (22), a Bayesian Network approach (9),
Support Vector Machines (SVMs) (13,15,16) and Support
Vector Data Description (SVDD) (6).
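
To make the feature-plus-classifier idea above concrete, the following is a minimal sketch of one such combination: amino acid content as the only feature, classified with a plain nearest-neighbor vote. It is an illustration under stated assumptions, not the DC-kNN method of this paper or any of the cited tools; the function names, the Euclidean distance and the toy sequences and labels are all hypothetical.

```python
from collections import Counter
import math

# The 20 standard amino acids (one-letter codes); other symbols are ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Return the 20-dimensional amino acid frequency vector of a sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(query_seq, training, k=1):
    """Predict a localization label by majority vote among the k nearest
    training proteins. `training` is a list of (sequence, label) pairs."""
    q = aa_composition(query_seq)
    ranked = sorted(training, key=lambda item: euclidean(q, aa_composition(item[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: made-up sequences and labels, for illustration only.
TOY_TRAINING = [
    ("MKKLLLAAALLLVVVAAA", "membrane"),
    ("MSDEEKRKRDEEDKKRSE", "nucleus"),
]

if __name__ == "__main__":
    print(knn_predict("MLLLAAAVVVLLLAAA", TOY_TRAINING))
```

In practice the training set would be a gold standard of proteins with well-known localizations, and additional features (for example hydrophobicity or sequence motifs) would be appended to the feature vector before computing distances.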
Early methods attempted to classify proteins into a
small number of compartments, e.g. intracellular versus
extracellular (19). More recently, many compartmental
localizations have been defined, including not only
membrane-enclosed organelles but also categories such
as spindle pole or microtubule association. Current
prediction algorithms in yeast cover as many as 22 distinct
cellular localizations (5,6). Not surprisingly, approaches
which limit their predictions to smaller numbers of
localizations have performed better than approaches which
attempt to predict many. Moreover, most of these studies
have demonstrated their predictions assuming a single
localization per protein within a single species such as
yeast. Therefore, some open challenges for the development of new
methods are to: (i) increase the classification accuracy
when predicting across many cellular (...truncated)