SNPsyn: detection and exploration of SNP–SNP interactions
W444–W449 Nucleic Acids Research, 2011, Vol. 39, Web Server issue
doi:10.1093/nar/gkr321
Published online 16 May 2011
Tomaz Curk1,*, Gregor Rot1 and Blaz Zupan1,2,*
1Faculty of Computer and Information Science, University of Ljubljana, Trzaska cesta 25, SI-1000 Ljubljana,
Slovenia and 2Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza,
Houston, TX 77030, USA
Received March 5, 2011; Revised April 15, 2011; Accepted April 20, 2011
*To whom correspondence should be addressed. Tel: +386 1 4768 267; Fax: +386 1 4264 647; Email:
Correspondence may also be addressed to Blaz Zupan. Tel: +386 1 4768 402; Fax: +386 1 4264 647; Email:
© The Author(s) 2011. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
ABSTRACT
SNPsyn (http://snpsyn.biolab.si) is an interactive
software tool for the discovery of synergistic pairs
of single nucleotide polymorphisms (SNPs) from
large genome-wide case-control association studies
(GWAS) data on complex diseases. Synergy among
SNPs is estimated using an information-theoretic
approach called interaction analysis. SNPsyn is
both a stand-alone C++/Flash application and a
web server. The computationally intensive part is
implemented in C++ and can run in parallel on a
dedicated cluster or grid. The graphical user interface is written in Adobe Flash Builder 4 and can
run in most web browsers or as a stand-alone application. The SNPsyn web server hosts the Flash
application, receives GWAS data submissions,
invokes the interaction analysis and serves result
files. The user can explore details on identified synergistic pairs of SNPs, perform gene set enrichment
analysis and interact with the constructed SNP
synergy network.
INTRODUCTION
Current genome-wide case-control association studies
(GWAS) focus on identifying a set of single nucleotide
polymorphisms (SNPs) that are most associated with the
disease under study. While individual SNPs are important
indicators of main genetic components of complex
diseases, they explain only a fraction of the genetic risk
(1). Because of the low or at best modest information
content of individual SNPs, it has been suggested (2)
that uncovering synergy among genes may improve the
predictive accuracy of models. A recent report by Gerke
et al. (3) also suggests that synergistic combinations may
carry information about the phenotype that cannot be
discovered from observations of individual SNPs alone.
An unequivocal proof of existence of SNP synergy
would push the modeling efforts from trying to add
effects of individual most informative SNPs towards
models that include non-additive SNP interactions, in
this way providing important insight into complex
diseases and underlying molecular mechanisms.
Various approaches to detecting synergy have been
proposed, and the phenomenon is variously referred to as positive interaction (4), k-way interaction information (5), epistasis
(6,7) or SNP synergy (8). In this article, we use the term
‘synergy’ and present a software tool that implements an
information-theoretic approach to synergistic interaction
analysis (4,5,8). Contrary to other approaches, interaction
analysis does not require the user to specify which gene
interaction models to test, but instead it discovers them
from data. It assumes an additive model, where the
expected amount of information on the phenotype for a
combination of SNPs is equal to the sum of information
of individual SNPs. Synergy is said to occur when a combination carries more information than the sum of information provided by individual SNPs (4,8). This difference
between the ‘whole’ and ‘sum of parts’ cannot be gained
from observations of individual SNPs alone, but only by
simultaneously observing a combination of SNPs.
Various degrees of synergy are associated with different
SNP pair models (9). An extreme case is when the
outcome is an XOR function of two SNPs. There, neither
SNP alone carries any information on the
phenotype, while a simultaneous consideration of the
two SNPs produces a perfect association with disease.
This extreme case illustrates that, by definition, it is not
possible to predict which SNPs will form a synergistic
combination by observing individual SNPs alone. Two
SNPs must first be combined into a new feature, and
only then can the total information content for that particular combination be computed.
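The XOR case can be checked numerically. The following is a minimal sketch, not SNPsyn's C++ implementation: it assumes two binary markers (real SNP genotypes usually take three values) and a toy data set with one sample per marker combination. The helper names `entropy` and `mutual_information` are illustrative only.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


# Toy data (an assumption): every combination of two binary markers,
# with the phenotype defined as their XOR.
m1, m2 = zip(*product([0, 1], repeat=2))
phenotype = [a ^ b for a, b in zip(m1, m2)]

# Each marker on its own carries no information on the phenotype.
info_m1 = mutual_information(m1, phenotype)
info_m2 = mutual_information(m2, phenotype)

# Combining the two markers into one feature (Cartesian product of
# their values) recovers the full bit of phenotype information.
info_pair = mutual_information(list(zip(m1, m2)), phenotype)

print(info_m1, info_m2, info_pair)
```

On this toy data both single-marker estimates are zero, while the combined feature carries the full one bit of phenotype entropy.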
Consequently, to discover a set of best-interacting SNPs,
we need to exhaustively test all possible combinations.
The number of SNP combinations grows exponentially
Mutual information I(M; P), also called information gain,
is based on calculations of entropy and corresponds to the
level of association (i.e. shared information) between
marker M and phenotype P. Given the value of marker
M, mutual information estimates how well we can predict
the value of phenotype P. The new feature f(M1, M2) may
be derived as the Cartesian product of the values of SNPs M1
and M2 or by other methods for feature construction,
e.g. Kramer's method (11) or constructive induction by
feature decomposition (12). For reasons of simplicity
and speed, SNPsyn uses the Cartesian product. Pairs of
SNPs with positive synergy (Syn > 0) are called synergistic. Negative synergy (Syn < 0) indicates that the two
SNPs carry redundant information, an effect typically
observed among highly correlated SNPs. For further
details on interaction analysis see Jakulin and Bratko (4)
and a review by Anastassiou (8).
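As an illustration of the quantities described above, the sketch below computes synergy as the information carried by the combined feature f(M1, M2) minus the sum of the information carried by M1 and M2 individually, with f taken to be the Cartesian product. It is a plain-Python approximation of the idea rather than SNPsyn's C++ code. The genotype coding (minor-allele counts 0, 1, 2), the sample values and the function names are assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def synergy(m1, m2, phenotype):
    """Syn = I(f(M1, M2); P) - I(M1; P) - I(M2; P), f = Cartesian product."""
    combined = list(zip(m1, m2))
    return (mutual_information(combined, phenotype)
            - mutual_information(m1, phenotype)
            - mutual_information(m2, phenotype))


# Hypothetical genotypes coded as minor-allele counts (0, 1 or 2).
snp_a = [0, 0, 1, 1, 2, 2, 0, 1]
snp_b = list(snp_a)  # a perfectly correlated SNP
phenotype = [0, 0, 0, 1, 1, 1, 0, 1]

# Perfectly correlated SNPs are redundant, so their synergy is negative.
redundant = synergy(snp_a, snp_b, phenotype)

# An XOR phenotype of two binary markers gives positive synergy.
xor_synergy = synergy([0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0])

print(redundant, xor_synergy)
```

For the duplicated SNP, the combined feature adds nothing beyond a single copy, so the synergy equals the negative of that SNP's own mutual information with the phenotype.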
METHODS AND IMPLEMENTATION
Compact data format
SNPsyn aims to minimize computational time while
providing an interaction-rich graphical user
interface. The computationally intensive data analysis is
implemented in C++. This computational library implements functions for calculating the mutual information
(information gain) of individual SNPs and of SNP pairs, and the
synergy of SNP pairs. The library also includes functions for random data sampling and shuffling, estimation
of probability distributions, calculation of the false discovery
rate [FDR, (10)] and functions for subdividing the
analysis into independent subtasks that can run in parallel.
Example scripts to perform the analysis in parallel on a
cluster or grid are included in the distribution package.
SNPsyn’s C++ library can be used to build custom applications for interaction analysis. A command-line interface
to the library is provided and is used by
SNPsyn’s web server to perform interaction analysis.
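One way to picture the subdivision into independent subtasks is to split the list of SNP pairs into fixed-size chunks, each of which can be scored on a separate cluster node. The sketch below is an assumption about how such a split could look, not a description of the scheme used by SNPsyn's library or its example scripts. The function name `pair_chunks` and its parameters are hypothetical.

```python
from itertools import combinations, islice


def pair_chunks(n_snps, chunk_size):
    """Yield lists of SNP index pairs (i, j), i < j, at most chunk_size long."""
    pairs = combinations(range(n_snps), 2)
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk


# Example: 5 SNPs give 5 * 4 / 2 = 10 pairs, split into chunks of up to 4.
chunks = list(pair_chunks(n_snps=5, chunk_size=4))
for task_id, chunk in enumerate(chunks):
    print(task_id, chunk)
```

Because each pair appears in exactly one chunk and scoring a pair needs only the two SNP columns and the phenotype, the chunks can be processed in any order and their results concatenated afterwards.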
Results of interaction analysis are presented to the user
through an interactive web application with a graphical
user interface (GUI). The (...truncated)