Physical protein–protein interactions predicted from microarrays

BIOINFORMATICS ORIGINAL PAPER
Vol. 24 no. 22 2008, pages 2608–2614
doi:10.1093/bioinformatics/btn498

Systems biology

Ta-tsen Soong1,2,∗, Kazimierz O. Wrzeszczynski1,3,4 and Burkhard Rost1,3,5

1 Columbia University Center for Computational Biology and Bioinformatics (C2B2), 2 Department of Biomedical Informatics, 3 Department of Biochemistry and Molecular Biophysics, 4 Integrated Program in Cellular, Molecular and Biomedical Studies and 5 NorthEast Structural Genomics Consortium (NESG) and New York Consortium on Membrane Proteins (NYCOMPS), Columbia University, New York, NY, USA

Received on April 26, 2008; revised on August 30, 2008; accepted on September 17, 2008

Associate Editor: Alfonso Valencia

ABSTRACT

Motivation: Microarray expression data reveal functionally associated proteins. However, most proteins that are associated are not actually in direct physical contact. Predicting physical interactions directly from microarrays is both a challenging and important task that we addressed by developing a novel machine learning method optimized for this task.

Results: We validated our support vector machine-based method on several independent datasets. At the same levels of accuracy, our method recovered more experimentally observed physical interactions than a conventional correlation-based approach. Pairs predicted by our method to very likely interact were close in the overall network of interaction, suggesting our method as an aid for functional annotation. We applied the method to predict interactions in yeast (Saccharomyces cerevisiae). A Gene Ontology function annotation analysis and literature search revealed several probable and novel predictions worthy of future experimental validation. We therefore hope our new method will improve the annotation of interactions as one component of multi-source integrated systems.
Contact:

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

1.1 Protein interactions are crucial to medical biology

Networks of protein–protein interactions provide a framework for understanding biological processes and can give insights into the mechanisms of diseases. Interaction networks can assist in designing drugs that modulate specific disease pathways (Ofran et al., 2005; Ryan and Matthews, 2005). The identification of protein–protein interactions is, therefore, of primary importance. Recent years have seen great advances in experimental techniques, such as yeast two-hybrid (Y2H) and coimmunoprecipitation (CoIP), that probe protein interactions in a high-throughput fashion (Gavin et al., 2006; Giot et al., 2003; Ho et al., 2002; Ito et al., 2001; Uetz and Pankratz, 2004; Uetz et al., 2000). Y2H focuses on the physical interaction between two proteins, while CoIP detects groups of proteins that are part of the same permanent or temporary complex. Most interactions are deposited in databases such as IntAct (Kerrien et al., 2007), DIP (Salwinski et al., 2004), BIND (Bader et al., 2003) and MIPS (Guldener et al., 2006). In this study, we focus on physical protein–protein interactions.

∗ To whom correspondence should be addressed.

1.2 Physical interaction versus association

The term ‘protein interaction’ has different meanings. We consider two proteins to interact physically if and only if some of their residues are in contact at some point in time. Assume that protein A activates B at time T1 and separates from B at T2, and that B regulates C at T3. A and C do not interact by our definition; instead, they are associated. Even if T1 = T2 and the three proteins form a somewhat stable complex, A and C would still not physically interact by our definition.
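The distinction between physical interaction and association can be illustrated with a small sketch of the A, B, C example. It assumes, as a simplification not stated in the text, that "associated" can be modelled as "linked through a chain of contacts without being in direct contact". The function names and the toy contact set are hypothetical.

```python
from collections import deque

# Toy contact set for the A/B/C example: A contacts B, B contacts C.
contacts = {frozenset(("A", "B")), frozenset(("B", "C"))}

def physically_interact(p, q, contacts):
    """True if p and q share a direct contact (an edge in the network)."""
    return frozenset((p, q)) in contacts

def associated(p, q, contacts):
    """True if p and q are linked by a chain of contacts but not directly.

    Modelling association as indirect connectivity is an assumption made
    for this illustration only.
    """
    if physically_interact(p, q, contacts):
        return False
    # Build an adjacency map from the undirected contact pairs.
    neighbors = {}
    for pair in contacts:
        a, b = tuple(pair)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Breadth-first search from p, looking for q.
    seen, queue = {p}, deque([p])
    while queue:
        current = queue.popleft()
        for nxt in neighbors.get(current, ()):
            if nxt == q:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(physically_interact("A", "B", contacts))  # direct contact
print(physically_interact("A", "C", contacts))  # no direct contact
print(associated("A", "C", contacts))           # linked only through B
```

In this toy network, A and C are reported as associated but not physically interacting, matching the definition given in the text.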
1.3 Expression correlation poorly predicts physical interactions

The Gene Expression Omnibus (GEO) database (Barrett et al., 2005) at the National Center for Biotechnology Information (NCBI) holds >200 000 microarray experiments (February 2008), and this is only one resource (Parkinson et al., 2005; Sherlock et al., 2001). Microarray data have been widely used in elucidating biological mechanisms, specifically in discovering functional modules and pathways (Bar-Joseph et al., 2003; Segal et al., 2003a) and in reverse engineering regulatory networks (Hartemink, 2005; Margolin et al., 2006; Segal et al., 2003c). Microarrays provide noisy measures of the states of a complex biological system. Various types of systematic and stochastic fluctuations contribute to noise during biological sample preparation, hybridization, expression measurement and image processing (Schuchhardt et al., 2000). Another level of noise originates from the fact that each microarray experiment measures a single value for a gene that reflects its activity averaged across many biological processes. This mixing of underlying signals renders the inference of interactions particularly challenging. One approach to filtering systematic noise is the projection technique, which includes methods such as principal component analysis (PCA) and independent component analysis (ICA). These methods transform high-dimensional input data into lower dimensional components that capture the most important variations in the original data (Alter et al., 2000; Lee and Batzoglou, 2003; Liebermeister, 2002).
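As an illustration of the projection idea, the following minimal sketch computes a PCA projection with NumPy's singular value decomposition. It is not the authors' implementation. The toy matrix, its layout (rows as experiments, columns as genes), the number of retained components and the function name pca_project are all assumptions made for this example.

```python
import numpy as np

def pca_project(X, k):
    """Project the columns of X onto its top-k principal directions.

    X has shape (m, n): m dimensions (here, toy experiments) by
    n samples (here, toy genes). This layout is an assumption.
    """
    # Center each dimension (row) across the samples.
    Xc = X - X.mean(axis=1, keepdims=True)
    # Thin SVD: the columns of U are the principal directions.
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    P = U[:, :k].T                        # (k, m) projection matrix
    Y = P @ Xc                            # (k, n) component scores
    explained = S[:k] ** 2 / np.sum(S ** 2)
    return P, Y, explained

# Toy data: one shared signal mixed into 6 experiments, plus small noise.
rng = np.random.default_rng(0)
m, n = 6, 40
loadings = np.array([[1.0], [-0.5], [2.0], [0.5], [-1.0], [1.5]])
signal = rng.normal(size=(1, n))
X = loadings @ signal + 0.05 * rng.normal(size=(m, n))

P, Y, explained = pca_project(X, k=2)
print(P.shape, Y.shape)
print(explained)
```

Because the toy data contain a single dominant signal, the first component should capture most of the variance, while the remaining components mostly reflect the added noise. ICA would replace the orthogonal directions found here with statistically independent ones and is not shown.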
Since interacting proteins need to be present at the same time and place to physically contact each other, their expression as measured

© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Advance Access publication October 1, 2008

2 2.1 Microarray data Protein–protein interaction data

We downloaded the core yeast dataset from DIP (Deane et al., 2002; Salwinski et al., 2004) as our trusted interaction network. This network consisted of 5299 interactions between 2312 proteins. DIP considers these interactions to be of high quality; they mostly originated from Y2H or detailed experiments. These interactions constituted the body of all positives. Since current databases do not document negatives, we generated 5299 non-interactions by randomly pairing the 2312 proteins and excluding those pairs known to interact (i.e. annotated in DIP). Our solution provides a more conservative estimate of accuracy than common approaches that pair proteins from different compartments (Ben-Hur and Noble, 2006; Jansen and Gerstein, 2004; Jansen et al., 2003).

2.3

where X is a 349 × 5823 matrix containing the original microarray expression values, P is a 349 × 349 matrix discovered by PCA or ICA representing the important directions of variation in the microarray data and Y is a 349 × 5823 matrix of principal components contai (...truncated)