A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis
Benjamin A Logsdon (0), Gabriel E Hoffman (0), Jason G Mezey (0, 1)
(0) Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY, USA
(1) Department of Genetic Medicine, Weill Cornell Medical College, NY, NY, USA
Background: The success achieved by genome-wide association (GWA) studies in the identification of candidate loci for complex diseases has been accompanied by an inability to explain the bulk of heritability. Here, we describe the algorithm V-Bay, a variational Bayes algorithm for multiple locus GWA analysis, which is designed to identify weaker associations that may contribute to this missing heritability.
Results: V-Bay provides a novel solution to the computational scaling constraints of most multiple locus methods and can complete a simultaneous analysis of a million genetic markers in a few hours on a desktop computer. Using a range of simulated genetic and GWA experimental scenarios, we demonstrate that V-Bay is highly accurate and reliably identifies associations that are too weak to be discovered by single-marker testing approaches. V-Bay can also outperform a multiple locus analysis method based on the lasso, which has similar scaling properties for large numbers of genetic markers. For demonstration purposes, we also use V-Bay to confirm associations with gene expression in cell lines derived from the Phase II individuals of HapMap.
Conclusions: V-Bay is a versatile, fast, and accurate multiple locus GWA analysis tool for the practitioner interested in identifying weaker associations without high false positive rates.
Background
Genome-wide association (GWA) studies have identified
genetic loci associated with complex diseases and other
aspects of human physiology [1,2]. All replicable
associations identified to date have been discovered using
GWA analysis techniques that analyze one genetic
marker at a time [3]. While successful, it is well appreciated
that single-marker analysis strategies may not be the
most powerful approaches for GWA analysis [4].
Multiple locus inference is an alternative to single-marker
GWA analysis that can have greater power to identify
weaker associations, which can arise due to small allelic
effects, low minor allele frequencies (MAF), and weak
correlations with genotyped markers [4]. By correctly
accounting for the effects of multiple loci, such
approaches can reduce the estimate of the error
variance, which in turn increases the power to detect
weaker associations for a fixed sample size. Since loci
with weaker associations may contribute to a portion of
the so-called missing or dark heritability [5-7],
multiple locus analyses have the potential to provide a more
complete picture of heritable variation.
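The variance argument above can be illustrated with a small simulation. The sketch below is not part of V-Bay. It assumes two unlinked loci with allele frequency 0.5, illustrative effect sizes of 1.0 and 0.2, and unit noise variance, and it compares the residual variance estimated when the weak locus is tested alone with the estimate obtained when both loci are fitted jointly.

```python
import random


def simulate(n=2000, seed=1):
    """Simulate genotypes (0/1/2) at two unlinked loci and a phenotype."""
    rng = random.Random(seed)
    # choice((0, 1, 1, 2)) gives Hardy-Weinberg proportions for allele frequency 0.5.
    x1 = [rng.choice((0, 1, 1, 2)) for _ in range(n)]
    x2 = [rng.choice((0, 1, 1, 2)) for _ in range(n)]
    # Assumed effects: locus 1 is strong (1.0), locus 2 is weak (0.2).
    y = [1.0 * a + 0.2 * b + rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]
    return x1, x2, y


def center(v):
    m = sum(v) / len(v)
    return [t - m for t in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_variance_single(x, y):
    """Residual variance from a one-marker least squares fit."""
    x, y = center(x), center(y)
    beta = dot(x, y) / dot(x, x)
    rss = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
    return rss / (len(y) - 2)


def residual_variance_joint(x1, x2, y):
    """Residual variance from a two-marker least squares fit (2x2 normal equations)."""
    x1, x2, y = center(x1), center(x2), center(y)
    a, b, c = dot(x1, x1), dot(x1, x2), dot(x2, x2)
    d, e = dot(x1, y), dot(x2, y)
    det = a * c - b * b
    b1 = (c * d - b * e) / det
    b2 = (a * e - b * d) / det
    rss = sum((yi - b1 * u - b2 * v) ** 2 for u, v, yi in zip(x1, x2, y))
    return rss / (len(y) - 3)


x1, x2, y = simulate()
var_single = residual_variance_single(x2, y)  # weak locus tested alone
var_joint = residual_variance_joint(x1, x2, y)  # both loci in the model
print(var_single, var_joint)
```

Under these assumed settings, the single-marker fit absorbs the unmodeled strong locus into its error term, so its residual variance estimate is larger than the joint estimate. A larger error variance lowers the test statistic for the weak locus at a fixed sample size.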
Methods for multiple locus GWA analysis must address
a number of problems, including over-fitting, where too
many associations are included in the genetic model, as
well as difficulties associated with model inference when
the number of genetic markers is far larger than the
sample size [8]. Two general approaches have been suggested
to address these challenges: hierarchical models and
partitioning/classification. Hierarchical modeling approaches
[9-14] employ an underlying regression framework to
model multiple marker-phenotype associations and use
the hierarchical model structure to implement penalized
likelihood [10], shrinkage estimation [15], or related
approaches to control over-fitting. These methods have
appealing statistical properties for GWA analysis when
both the sample size and the number of true associations
expected are far less than the number of markers analyzed,
which is generally considered a reasonable assumption in
GWA studies [8]. Alternatively, partitioning methods do
not (necessarily) assume a specific form of the
marker-phenotype relationships but rather assume that markers
fall into non-overlapping classes, which specify phenotype
association or no phenotype association [13,16]. Control
of model over-fitting in high dimensional GWA marker
space can then be achieved by appropriate priors on
marker representation in these classes [13].
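As a rough illustration of how penalization controls over-fitting in the hierarchical approaches described above, the sketch below shows two textbook shrinkage rules for a single standardized marker effect. The L2 (ridge) and L1 (lasso) penalties are used only as familiar examples and are not taken from the cited methods; `z` stands for an unpenalized effect estimate and `lam` for a penalty strength, both hypothetical names.

```python
def ridge_shrink(z, lam):
    """Ridge (L2) estimate: minimizes 0.5*(z - b)**2 + 0.5*lam*b**2 over b."""
    return z / (1.0 + lam)


def soft_threshold(z, lam):
    """Lasso (L1) estimate: minimizes 0.5*(z - b)**2 + lam*abs(b) over b."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small effects are set exactly to zero


for z in (0.3, 2.0, -2.0):
    print(z, ridge_shrink(z, 1.0), soft_threshold(z, 0.5))
```

The ridge rule shrinks every estimate toward zero proportionally, whereas the soft-threshold rule removes small effects from the model entirely, which is one way a penalized fit keeps the number of included markers small.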
Despite the appealing theoretical properties of multiple
locus methods that make use of hierarchical models or
partitioning, these methods have not seen wide
acceptance for GWA analysis. There are at least two reasons
for this. First, an ideal multiple locus analysis involves
simultaneous assessment of all markers in a study and,
given the scale of typical GWA experiments, most
techniques are not computationally practical options
[9,10,16-18]. Second, there are concerns about the
accuracy and performance of multiple locus GWA analysis.
This is largely an empirical question that needs to be
addressed with simulations and analysis of real data.
Here we introduce the algorithm V-Bay, a
(V)ariational method for (Bay)esian hierarchical regression, that
can address some of the computational limitations
shared by many multiple locus methods [9,10,16-18].
The variational Bayes algorithm of V-Bay is part of a
broad class of approximate inference methods, which
have been successfully applied to develop scalable
algorithms for complex statistical problems in the fields of
machine learning and computational statistics [19-22].
The specific type of variational method implemented in
V-Bay is a mean-field approximation, where a high
dimensional joint distribution of many variables (in this
case genetic marker effects) is approximated by a
product of many lower dimensional distributions [23]. This
method is extremely versatile and can be easily extended
to a range of models proposed for multiple locus
analysis [4,11,14,24].
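In generic notation, the mean-field idea described above can be written as follows. This is the standard textbook form, not the specific V-Bay derivation; the symbols (marker effects \beta_j, phenotype vector y, number of markers m, factors q_j) are assumptions introduced here for illustration.

```latex
% Generic mean-field approximation; symbols are illustrative assumptions.
p(\beta_1,\dots,\beta_m \mid y) \;\approx\; q(\beta) \;=\; \prod_{j=1}^{m} q_j(\beta_j)

% The factors are chosen to maximize a lower bound on the log marginal likelihood:
\log p(y) \;=\; \mathcal{L}(q) + \mathrm{KL}\!\left(q \,\|\, p(\cdot \mid y)\right),
\qquad
\mathcal{L}(q) \;=\; \mathbb{E}_q\!\left[\log p(y,\beta)\right] - \mathbb{E}_q\!\left[\log q(\beta)\right]

% Coordinate-wise optimum for factor j, holding the other factors fixed:
\log q_j^{*}(\beta_j) \;=\; \mathbb{E}_{q_{-j}}\!\left[\log p(y,\beta)\right] + \text{const}
```

Because the Kullback-Leibler term is non-negative, increasing \mathcal{L}(q) one factor at a time tightens the approximation without requiring the full joint posterior to be evaluated.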
The specific model implemented in V-Bay is a
hierarchical linear model that uses marker class
partitioning to control model over-fitting. This is particularly
well suited for maintaining a low false-positive rate
when identifying weaker associations [13]. V-Bay
implements a simultaneous analysis of all markers in a GWA
study and, since the computational time complexity per
iteration of V-Bay is linear with respect to sample size
and marker number, each iteration is inexpensive and the algorithm runs quickly.
For example, simultaneous analysis of a million markers,
genotyped in more than a thousand individuals, can be
completed using a standard desktop (with large memory
capacity) in a matter of hours.
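One common way to obtain a per-iteration cost that is linear in both sample size and marker number is to update one marker at a time while maintaining a running residual vector. The sketch below illustrates only that bookkeeping. The update rule is a simple ridge-type shrinkage chosen as a stand-in, not the V-Bay update, and the names `sweep`, `X_cols`, `resid`, and `shrink` are hypothetical.

```python
def sweep(X_cols, beta, resid, shrink=1.0):
    """One pass over all m markers; X_cols is a list of m genotype columns of length n.

    resid must equal y minus the current fitted values on entry.
    Each marker update costs O(n), so a full sweep costs O(n * m).
    """
    for j, x in enumerate(X_cols):
        xx = sum(v * v for v in x)
        # Inner product of marker j with the partial residual, i.e. the residual
        # with marker j's current contribution added back.
        xr = sum(v * r for v, r in zip(x, resid)) + xx * beta[j]
        # Ridge-type shrinkage update (an assumption, not V-Bay's update).
        new = xr / (xx + shrink)
        delta = new - beta[j]
        # Keep the residual consistent with the updated coefficient.
        for i, v in enumerate(x):
            resid[i] -= delta * v
        beta[j] = new
    return beta, resid


# Tiny usage example with two orthogonal marker columns and no shrinkage.
X_cols = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]
beta = [0.0, 0.0]
resid = list(y)  # all coefficients start at zero, so the residual is y
beta, resid = sweep(X_cols, beta, resid, shrink=0.0)
print(beta, resid)
```

Because the residual is updated in place, no update ever needs the full marker matrix at once. This is what keeps the cost of a sweep proportional to the number of genotypes.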
We take advantage of the computational speed of
V-Bay to perform a simulation study of its performance
for GWA data ranging from a hundred thousand to
more than a million markers. In the Results we focus
on the simulation results for single population
simulations, but we also implement a version of the algorithm
to accommodate known population structure and
missing genotype data. We demonstrate that in practice,
V-Bay consistently and reliably identifies both strong
marker associations and those too weak to be
identified by single-marker analysis. We also demonstrate that
V-Bay can outperform a recently proposed multiple
locus method tha (...truncated)