Protein networks markedly improve prediction of subcellular localization in multiple eukaryotic species

Nucleic Acids Research, Nov 2008

The function of a protein is intimately tied to its subcellular localization. Although localizations have been measured for many yeast proteins through systematic GFP fusions, similar studies in other branches of life are still forthcoming. In the interim, various machine-learning methods have been proposed to predict localization using physical characteristics of a protein, such as amino acid content, hydrophobicity, side-chain mass and domain composition. However, there has been comparatively little work on predicting localization using protein networks. Here, we predict protein localizations by integrating an extensive set of protein physical characteristics over a protein's extended protein–protein interaction neighborhood, using a classification framework called ‘Divide and Conquer k-Nearest Neighbors’ (DC-kNN). These predictions achieve significantly higher accuracy than two well-known methods for predicting protein localization in yeast. Using new GFP imaging experiments, we show that the network-based approach can extend and revise previous annotations made from high-throughput studies. Finally, we show that our approach remains highly predictive in higher eukaryotes such as fly and human, in which most localizations are unknown and the protein network coverage is less substantial.
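To make the network idea concrete, the following minimal sketch augments each protein's own feature vector with the mean feature vector of its direct interaction partners and then classifies by a plain k-nearest-neighbor vote. It is not the authors' DC-kNN implementation: the function names, the two-dimensional toy features, the toy network and the labels are all hypothetical, and ordinary kNN is swapped in for the divide-and-conquer scheme.

```python
import math
from collections import Counter

def neighborhood_features(protein, features, network):
    """Concatenate a protein's own features with the mean of its neighbors' features."""
    own = features[protein]
    neighbors = [n for n in network.get(protein, ()) if n in features]
    if not neighbors:
        mean = [0.0] * len(own)  # assumption: isolated proteins get a zero neighborhood
    else:
        mean = [sum(features[n][i] for n in neighbors) / len(neighbors)
                for i in range(len(own))]
    return own + mean

def knn_predict(query, labeled, k=3):
    """Majority vote among the k labeled vectors closest to `query` (Euclidean distance)."""
    ranked = sorted(labeled, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: per-protein features, interaction partners, known labels.
features = {"A": [0.9, 0.1], "B": [0.8, 0.2], "C": [0.1, 0.9],
            "D": [0.2, 0.8], "Q": [0.5, 0.5]}
network = {"A": ["B"], "B": ["A", "Q"], "C": ["D"], "D": ["C"], "Q": ["A", "B"]}
labels = {"A": "nucleus", "B": "nucleus", "C": "mitochondrion", "D": "mitochondrion"}

labeled = [(neighborhood_features(p, features, network), loc) for p, loc in labels.items()]
prediction = knn_predict(neighborhood_features("Q", features, network), labeled, k=3)
print(prediction)
```

In this toy example the query protein Q has uninformative features of its own, so any signal in its combined vector comes from its interaction partners.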


KiYoung Lee (0,1,2,6), Han-Yu Chuang (2,5), Andreas Beyer (2,4), Min-Kyung Sung (3), Won-Ki Huh (3), Bonghee Lee (1), Trey Ideker (2,5)

0 Structural Biology Laboratory, Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, CA 92037, USA
1 Center for Genomics and Proteomics, Lee Gil Ya Cancer and Diabetes Institute, Gachon University of Medicine and Science, Incheon 406-799, Republic of Korea
2 Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
3 School of Biological Sciences, Research Center for Functional Cellulomics, Institute of Microbiology, Seoul National University, Seoul 151-747, Republic of Korea
4 Biotechnology Center, Technische Universität, 01062 Dresden, Germany
5 Bioinformatics Program, University of California San Diego, La Jolla, CA 92093, USA
6 Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 305-701, Republic of Korea

INTRODUCTION

For a protein to operate properly, it must reside in the correct compartment of a cell. Knowing the subcellular localization of a protein is therefore an important step toward understanding its function (1,2). In budding and fission yeast (1–4), systematic protein localization experiments have been carried out through GFP fusions to each open reading frame at the 3′- or 5′-end. Such studies have not yet been performed in higher eukaryotes such as Caenorhabditis elegans, Drosophila melanogaster or mammals, due to the larger proteome sizes and the technical difficulties associated with protein tagging in those species (5–7). In the interim, reliable and efficient computational methods are required to predict the subcellular localization of a newly identified protein.

A considerable number of classification methods have been developed for this purpose (5–24). Typically, these algorithms take as input a list of features that characterize a protein, such as its molecular weight, amino acid content, codon bias, hydrophobicity, side-chain mass and so on. During the training phase, they learn to recognize which features, or patterns of features, best classify a set of gold-standard proteins whose localizations are well known. To date, amino acid content has been a very successful and widely used feature (5,6,8,11–16). Other informative features have been protein sorting signal motifs near the N-terminus (18), as well as protein sequence motifs (7,9–12,16,24) and Gene Ontology terms (5).
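As an illustration of the amino acid content feature mentioned above, the sketch below turns a protein sequence into a 20-dimensional vector of residue fractions. This is a generic formulation and not taken from any of the cited methods; the function name and the example sequence are hypothetical, and non-standard residue codes are simply ignored as a simplifying assumption.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues

def amino_acid_composition(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Characters outside the 20 standard codes (e.g. X, B, Z) are skipped,
    which is an assumption made for this sketch.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Hypothetical six-residue sequence used only to exercise the function.
composition = amino_acid_composition("MKKLLA")
for aa, fraction in zip(AMINO_ACIDS, composition):
    if fraction > 0:
        print(aa, round(fraction, 3))
```

A vector of this kind can be fed directly to any of the classifiers discussed in this section, alone or concatenated with other per-protein features.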
Classification of these features has relied on a variety of algorithms, including Least Distance Algorithms (20,21), an Artificial Neural Network (10), a Nearest Neighbor approach (5,14), a Markov Model (22), a Bayesian Network approach (9), Support Vector Machines (SVMs) (13,15,16) and Support Vector Data Description (SVDD) (6).

Early methods attempted to classify proteins into a small number of compartments, e.g. intracellular versus extracellular (19). More recently, many compartmental localizations have been defined, including not only membrane-enclosed organelles but also categories such as spindle pole or microtubule association. Current prediction algorithms in yeast cover as many as 22 distinct cellular localizations (5,6). Not surprisingly, approaches which limit their predictions to smaller numbers of localizations have performed better than approaches which attempt to predict many. Moreover, most of these studies have demonstrated their predictions assuming a single localization per protein within a single species such as yeast. Therefore, some open challenges for new methods development are to: (i) increase the classification accuracy when predicting across many cellular (...truncated)


This is a preview of a remote PDF: https://nar.oxfordjournals.org/content/36/20/e136.full.pdf

KiYoung Lee, Han-Yu Chuang, Andreas Beyer, Min-Kyung Sung, Won-Ki Huh, Bonghee Lee, Trey Ideker. Protein networks markedly improve prediction of subcellular localization in multiple eukaryotic species, Nucleic Acids Research, 2008, pp. e136-e136, 36/20, DOI: 10.1093/nar/gkn619