Physical protein–protein interactions predicted from microarrays
BIOINFORMATICS
ORIGINAL PAPER
Vol. 24 no. 22 2008, pages 2608–2614
doi:10.1093/bioinformatics/btn498
Systems biology
Ta-tsen Soong1,2,∗, Kazimierz O. Wrzeszczynski1,3,4 and Burkhard Rost1,3,5
1 Columbia University Center for Computational Biology and Bioinformatics (C2B2), 2 Department of Biomedical
Informatics, 3 Department of Biochemistry and Molecular Biophysics, 4 Integrated Program in Cellular, Molecular and
Biomedical Studies and 5 NorthEast Structural Genomics Consortium (NESG) and New York Consortium on
Membrane Proteins (NYCOMPS), Columbia University, New York, NY, USA
Received on April 26, 2008; revised on August 30, 2008; accepted on September 17, 2008
Associate Editor: Alfonso Valencia
ABSTRACT
Motivation: Microarray expression data reveal functionally
associated proteins. However, most proteins that are associated
are not actually in direct physical contact. Predicting physical
interactions directly from microarrays is a challenging and
important task, which we addressed by developing a novel machine
learning method optimized for it.
Results: We validated our support vector machine-based method
on several independent datasets. At the same levels of accuracy,
our method recovered more experimentally observed physical
interactions than a conventional correlation-based approach. Pairs
predicted by our method to very likely interact were close in the
overall network of interaction, suggesting our method as an aid for
functional annotation. We applied the method to predict interactions
in yeast (Saccharomyces cerevisiae). A Gene Ontology function
annotation analysis and literature search revealed several probable
and novel predictions worthy of future experimental validation. We
therefore hope our new method will improve the annotation of
interactions as one component of multi-source integrated systems.
Contact:
Supplementary information: Supplementary data are available at
Bioinformatics online.
1 INTRODUCTION
1.1 Protein interactions are crucial to medical biology
Networks of protein–protein interactions provide a framework for
the understanding of biological processes and can give insights
into the mechanisms of diseases. Interaction networks can assist
in designing drugs that modulate specific disease pathways (Ofran
et al., 2005; Ryan and Matthews, 2005). The identification of
protein–protein interactions is, therefore, of primary importance.
Recent years have seen great advances in experimental
techniques, such as yeast two-hybrid (Y2H) and coimmunoprecipitation (CoIP), that probe protein interactions in a high-throughput
fashion (Gavin et al., 2006; Giot et al., 2003; Ho et al., 2002; Ito
et al., 2001; Uetz and Pankratz, 2004; Uetz et al., 2000). Y2H
focuses on physical interaction between two proteins, while CoIP
detects groups of proteins that are part of the same permanent or
temporary complex. Most interactions are deposited in databases,
such as IntAct (Kerrien et al., 2007), DIP (Salwinski et al., 2004),
BIND (Bader et al., 2003) and MIPS (Guldener et al., 2006). In this
study, we focus on physical protein–protein interactions.
∗ To whom correspondence should be addressed.
1.2 Physical interaction versus association
The term ‘protein interaction’ has different meanings. We consider
two proteins to interact physically if and only if some of their
residues are in contact at some point in time. Assume that protein A
activates B at time T1, separates from B at T2 and that B regulates C
at T3. A and C do not interact by our definition; instead, they are
associated. Even if T1 = T2 and the three proteins formed a somewhat
stable complex, A and C would still not physically interact by our
definition.
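The distinction can be illustrated with a minimal sketch. It assumes, purely for illustration, that direct contacts are stored as unordered pairs and that "association" is approximated as being linked through a chain of contacts without touching directly; association in the text is a broader, functional notion. The function names and the toy network are hypothetical.

```python
from collections import deque

# Hypothetical toy network mirroring the example in the text:
# A contacts B, and B contacts C; A and C never touch. D is unrelated.
physical_contacts = {frozenset(("A", "B")), frozenset(("B", "C"))}


def interacts_physically(p, q, contacts):
    """True only if p and q share a direct contact."""
    return frozenset((p, q)) in contacts


def is_associated(p, q, contacts):
    """True if p and q are linked through a chain of contacts but do not
    touch directly (a simplifying assumption, not the paper's definition)."""
    if interacts_physically(p, q, contacts):
        return False
    # Build an adjacency map from the unordered contact pairs.
    neighbours = {}
    for pair in contacts:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    # Breadth-first search from p, looking for q.
    seen, queue = {p}, deque([p])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt == q:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


a_b_physical = interacts_physically("A", "B", physical_contacts)
a_c_physical = interacts_physically("A", "C", physical_contacts)
a_c_associated = is_associated("A", "C", physical_contacts)
```

In this toy network, A and B interact physically, whereas A and C are only associated because every path between them runs through B.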
1.3 Expression correlation poorly predicts physical interactions
The Gene Expression Omnibus (GEO) database (Barrett et al.,
2005) at the National Center for Biotechnology Information (NCBI)
holds >200 000 microarray experiments (February 2008), and this
is only one resource (Parkinson et al., 2005; Sherlock et al.,
2001). Microarray data have been widely used to elucidate
biological mechanisms, specifically to discover functional
modules and pathways (Bar-Joseph et al., 2003; Segal et al., 2003a)
and to reverse engineer regulatory networks (Hartemink, 2005;
Margolin et al., 2006; Segal et al., 2003c).
Microarrays provide noisy measures for the states of a
complex biological system. Various types of systematic and
stochastic fluctuations contribute to noise during biological sample
preparation, hybridization, expression measurement and image
processing (Schuchhardt et al., 2000). Another level of noise
originates from the fact that each microarray experiment measures a
single value for a gene that reflects its activity averaged across many
biological processes. This mixing of underlying signals renders the
inference of interactions particularly challenging. One approach
to filtering systematic noise is the projection technique, which
includes methods such as principal component analysis (PCA)
and independent component analysis (ICA). These methods transform high-dimensional input data into lower-dimensional components that
capture the most important variations in the original data (Alter et al.,
2000; Lee and Batzoglou, 2003; Liebermeister, 2002).
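As a rough illustration of the projection idea, the sketch below applies PCA to a small synthetic expression matrix using a singular value decomposition. It is not the pipeline used in this study: the matrix sizes, the random data, the function name pca_project and the choice to center each condition across genes are all assumptions made for the example.

```python
import numpy as np


def pca_project(X, n_components):
    """Project X (conditions x genes) onto its top principal directions.

    Returns the projection matrix P (n_components x conditions), the
    projected data Y (n_components x genes) and the fraction of variance
    captured by each retained component.
    """
    # Center each condition (row) across genes; an assumption of this sketch.
    Xc = X - X.mean(axis=1, keepdims=True)
    # Columns of U are the principal directions in condition space.
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    P = U[:, :n_components].T
    Y = P @ Xc
    explained = (s ** 2) / np.sum(s ** 2)
    return P, Y, explained[:n_components]


# Synthetic data: two underlying signals mixed across 20 conditions, plus noise.
rng = np.random.default_rng(0)
n_conditions, n_genes = 20, 200
signal = rng.normal(size=(n_conditions, 2)) @ rng.normal(size=(2, n_genes))
X = signal + 0.1 * rng.normal(size=(n_conditions, n_genes))

P, Y, explained = pca_project(X, n_components=2)
```

Because the synthetic matrix is built from two signals, the two retained components should capture nearly all of the variance; real microarray data would typically need more components and a principled way of choosing how many to keep.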
Since interacting proteins need to be present at the same time and
place to physically contact each other, their expression as measured
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/)
which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Advance Access publication October 1, 2008
2
2.1 Microarray data
2.2 Protein–protein interaction data
We downloaded the core yeast dataset from DIP (Deane et al., 2002;
Salwinski et al., 2004) as our trusted interaction network. This
network consisted of 5299 interactions between 2312 proteins. DIP
considers these interactions to be of high quality; most originated from
Y2H or detailed experiments. These interactions constituted our set of
positives. Since current databases do not document negatives, we generated
5299 non-interactions by randomly pairing the 2312 proteins and excluding
pairs known to interact (i.e. annotated in DIP). This procedure provides a
more conservative estimate of accuracy than the common approach of pairing
proteins from different compartments (Ben-Hur and Noble, 2006; Jansen and
Gerstein, 2004; Jansen et al., 2003).
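The negative-set construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function name sample_non_interactions, the rejection-sampling strategy, the fixed seed and the toy protein identifiers are all hypothetical.

```python
import random


def sample_non_interactions(proteins, positives, n_negatives, seed=0):
    """Draw random protein pairs that are not among the known interactions."""
    rng = random.Random(seed)
    proteins = sorted(set(proteins))
    protein_set = set(proteins)
    known = {frozenset(pair) for pair in positives}

    # Check that enough candidate pairs exist before sampling.
    total_pairs = len(proteins) * (len(proteins) - 1) // 2
    known_valid = {k for k in known if len(k) == 2 and k <= protein_set}
    if n_negatives > total_pairs - len(known_valid):
        raise ValueError("not enough non-interacting pairs available")

    negatives = set()
    while len(negatives) < n_negatives:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in known:  # exclude pairs annotated as interacting
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Toy example with hypothetical identifiers; the study used 2312 proteins
# and 5299 positives from the DIP core set.
proteins = ["P1", "P2", "P3", "P4", "P5", "P6"]
positives = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
negatives = sample_non_interactions(proteins, positives, n_negatives=len(positives))
```

Drawing as many negatives as positives mirrors the balanced design in the text. Note that some randomly paired proteins may interact in reality without being annotated, which is one reason such a negative set yields a conservative accuracy estimate.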
2.3
where X is a 349 × 5823 matrix containing the original microarray expression
values, P is a 349 × 349 matrix discovered by PCA or ICA representing
the important directions of variation in the microarray data and Y is a
349 × 5823 matrix of principal components contai (...truncated)