A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis (pdf)

Article PDF cannot be displayed. You can download it here:

http://www.biomedcentral.com/content/pdf/1471-2105-11-58.pdf

A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis

Benjamin A Logsdon 0 Gabriel E Hoffman 0 Jason G Mezey 0 1 0 Department of Biological Statistics and Computational Biology, Cornell University , Ithaca, NY , USA 1 Department of Genetic Medicine, Weill Cornell Medical College , NY, NY , USA Background: The success achieved by genome-wide association (GWA) studies in the identification of candidate loci for complex diseases has been accompanied by an inability to explain the bulk of heritability. Here, we describe the algorithm V-Bay, a variational Bayes algorithm for multiple locus GWA analysis, which is designed to identify weaker associations that may contribute to this missing heritability. Results: V-Bay provides a novel solution to the computational scaling constraints of most multiple locus methods and can complete a simultaneous analysis of a million genetic markers in a few hours, when using a desktop. Using a range of simulated genetic and GWA experimental scenarios, we demonstrate that V-Bay is highly accurate, and reliably identifies associations that are too weak to be discovered by single-marker testing approaches. V-Bay can also outperform a multiple locus analysis method based on the lasso, which has similar scaling properties for large numbers of genetic markers. For demonstration purposes, we also use V-Bay to confirm associations with gene expression in cell lines derived from the Phase II individuals of HapMap. Conclusions: V-Bay is a versatile, fast, and accurate multiple locus GWA analysis tool for the practitioner interested in identifying weaker associations without high false positive rates. - Background Genome-wide association (GWA) studies have identified genetic loci associated with complex diseases and other aspects of human physiology [1,2]. All replicable associations identified to date have been discovered using GWA analysis techniques that analyze one genetic marker at a time [3]. While successful, it is well appreciated that single-marker analysis strategies may not be the most powerful approaches for GWA analysis [4]. Multiple locus inference is an alternative to single-marker GWA analysis that can have greater power to identify weaker associations, which can arise due to small allelic effects, low minor allele frequencies (MAF), and weak correlations with genotyped markers [4]. By correctly accounting for the effects of multiple loci, such approaches can reduce the estimate of the error variance, which in turn increases the power to detect weaker associations for a fixed sample size. Since loci with weaker associations may contribute to a portion of the so-called missing or dark heritability [5-7], multiple locus analyses have the potential to provide a more complete picture of heritable variation. Methods for multiple locus GWA analysis must address a number of problems, including over-fitting where too many associations are included in the genetic model, as well as difficulties associated with model inference when the number of genetic markers is far larger than the sample size [8]. Two general approaches have been suggested to address these challenges: hierarchical models and partitioning/classification. Hierarchical modeling approaches [9-14] employ an underlying regression framework to model multiple marker-phenotype associations and use the hierarchical model structure to implement penalized likelihood [10], shrinkage estimation [15], or related approaches to control over-fitting. These methods have appealing statistical properties for GWA analysis when both the sample size and the number of true associations expected are far less than the number of markers analyzed, which is generally considered a reasonable assumption in GWA studies [8]. Alternatively, partitioning methods do not (necessarily) assume a specific form of the markerphenotype relationships but rather assume that markers fall into non-overlapping classes, which specify phenotype association or no phenotype association [13,16]. Control of model over-fitting in high dimensional GWA marker space can then be achieved by appropriate priors on marker representation in these classes [13]. Despite the appealing theoretical properties of multiple locus methods that make use of hierarchical models or partitioning, these methods have not seen wide acceptance for GWA analysis. There are at least two reasons for this. First, an ideal multiple locus analysis involves simultaneous assessment of all markers in a study and, given the scale of typical GWA experiments, most techniques are not computationally practical options [9,10,16-18]. Second, there are concerns about the accuracy and performance of multiple locus GWA analysis. This is largely an empirical question that needs to be addressed with simulations and analysis of real data. Here we introduce the algorithm V-Bay, a (V)ariational method for (Bay)esian hierarchical regression, that can address some of the computational limitations shared by many multiple locus methods [9,10,16-18]. The variational Bayes algorithm of V-Bay is part of a broad class of approximate inference methods, which have been successfully applied to develop scalable algorithms for complex statistical problems, in the fields of machine learning and computational statistics [19-22]. The specific type of variational method implemented in V-Bay is a mean-field approximation, where a high dimensional joint distribution of many variables (in this case genetic marker effects) is approximated by a product of many lower dimensional distributions [23]. This method is extremely versatile and can be easily extended to a range of models proposed for multiple locus analysis [4,11,14,24]. The specific model implemented in V-Bay is a hierarchical linear model, which includes marker class partitioning control of model over-fitting. This is particularly well suited for maintaining a low false-positive rate when identifying weaker associations [13]. V-Bay implements a simultaneous analysis of all markers in a GWA study and, since the computational time complexity per iteration of V-Bay is linear with respect to sample size and marker number, the algorithm has fast convergence. For example, simultaneous analysis of a million markers, genotyped in more than a thousand individuals, can be completed using a standard desktop (with large memory capacity) in a matter of hours. We take advantage of the computational speed of V-Bay to perform a simulation study of performance, for GWA data ranging from a hundred thousand to more than a million markers. In the Results we focus on the simulation results for single population simulations, but we also implement a version of the algorithm to accommodate known population structure and missing genotype data. We demonstrate that in practice, VBay consistently and reliably identifies both strong marker associations, as well as those too weak to be identified by single-marker analysis. We also demonstrate that V-Bay can outperform a recently proposed multiple locus methods tha (...truncated)