Polygenic Modeling with Bayesian Sparse Linear Mixed Models

Article PDF: http://www.plosgenetics.org/article/fetchObject.action?uri=info%3Adoi%2F10.1371/journal.pgen.1003264&representation=PDF

Authors: Xiang Zhou, Peter Carbonetto, Matthew Stephens

Affiliations: 1 Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America; 2 Department of Statistics, University of Chicago, Chicago, Illinois, United States of America

Editor: Peter M. Visscher, The University of Queensland, Australia

Citation: Zhou X, Carbonetto P, Stephens M. Polygenic Modeling with Bayesian Sparse Linear Mixed Models.

Funding: This work was supported by NIH grant HG02585 to MS and by NIH grant HL092206 (PI Y Gilad) and a cross-disciplinary postdoctoral fellowship from the Human Frontiers Science Program to PC. Funding for the Wellcome Trust Case Control Consortium project was provided by the Wellcome Trust under awards 076113 and 085475. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Abstract

Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applications, including, recently, polygenic modeling in genome-wide association studies. These two approaches make very different assumptions, so are expected to perform well in different situations. However, in practice, for a given dataset one typically does not know which assumptions will be more accurate. Motivated by this, we consider a hybrid of the two, which we refer to as a "Bayesian sparse linear mixed model" (BSLMM) that includes both these models as special cases. We address several key computational and statistical issues that arise when applying BSLMM, including appropriate prior specification for the hyper-parameters and a novel Markov chain Monte Carlo algorithm for posterior inference. We apply BSLMM and compare it with other methods for two polygenic modeling applications: estimating the proportion of variance in phenotypes explained (PVE) by available genotypes, and phenotype (or breeding value) prediction.
For PVE estimation, we demonstrate that BSLMM combines the advantages of both standard LMMs and sparse regression modeling. For phenotype prediction it considerably outperforms either of the other two methods, as well as several other large-scale regression methods previously suggested for this problem. Software implementing our method is freely available from http://stephenslab.uchicago.edu/software.html.

Author Summary

The goal of polygenic modeling is to better understand the relationship between genetic variation and variation in observed characteristics, including variation in quantitative traits (e.g. cholesterol level in humans, milk production in cattle) and disease susceptibility. Improvements in polygenic modeling will help improve our understanding of this relationship and could ultimately lead to, for example, changes in clinical practice in humans or better breeding/mating strategies in agricultural programs. Polygenic models present important challenges, both at the modeling/statistical level (what modeling assumptions produce the best results) and at the computational level (how should these models be effectively fit to data). We develop novel approaches to help tackle both these challenges, and we demonstrate the gains in accuracy that result in both simulated and real data examples.

Introduction

Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applications. For example, LMMs are often used to control for population stratification, individual relatedness, or unmeasured confounding factors when performing association tests in genetic association studies [1–9] and gene expression studies [10–12]. They have also been used in genetic association studies to jointly analyze groups of SNPs [13,14]. Similarly, sparse regression models have been used in genome-wide association analyses [15–20] and in expression QTL analysis [21]. Further, both LMMs and sparse regression models have been applied to, and garnered renewed interest in, polygenic modeling in association studies. Here, by polygenic modeling we mean any attempt to relate phenotypic variation to many genetic variants simultaneously (in contrast to single-SNP tests of association). The particular polygenic modeling problems that we focus on here are estimating chip heritability, being the proportion of variance in phenotypes explained (PVE) by available genotypes [19,22–24], and predicting phenotypes based on genotypes [25–29].

Despite the considerable overlap in their applications, in the context of polygenic modeling, LMMs and sparse regression models are based on almost diametrically opposed assumptions. Precisely, applications of LMMs to polygenic modeling (e.g. [22]) effectively assume that every genetic variant affects the phenotype, with effect sizes normally distributed, whereas sparse regression models, such as Bayesian variable selection regression models (BVSR) [18,19], assume that a relatively small proportion of all variants affect the phenotype. The relative performance of these two models for polygenic modeling applications would therefore be expected to vary depending on the true underlying genetic architecture of the phenotype. However, in practice, one does not know the true genetic architecture, so it is unclear which of the two models to prefer.

Motivated by this observation, we consider a hybrid of these two models, which we refer to as the Bayesian sparse linear mixed model, or BSLMM. This hybrid includes both the LMM and a sparse regression model, BVSR, as special cases, and is to some extent capable of learning the genetic architecture from the data, yielding good performance across a wide range of scenarios. By being adaptive to the data in this way, our approach obviates the need to choose one model over the other, and attempts to combine the benefits of both.

The idea of a hybrid between LMM and sparse regression models is, in itself, not new. Indeed, models like these have been used in breeding value prediction to assist genomic selection in animal and plant breeding programs [30–35], gene selection in expression analysis while controlling for batch effects [36], phenotype prediction of complex traits in model organisms and dairy cattle [37–40], and more recently, mapping complex traits by jointly modeling all SNPs in structured populations [41]. Compared with these previous papers, our work makes two key contributions. First, we consider in detail the specification of appropriate prior distributions for the hyper-parameters of the model. We particularly emphasize the benefits of estimating the hyper-parameters from the data, rather than fixing them to prespecified values, to achieve the adaptive behavior mentioned above. Second, we provide a novel computational algorithm that exploits a recently described linear algebra trick for LMMs [8,9]. The resulting algorithm avoids ad hoc approximations that are sometimes made when fitting these types of model (e.g. [37,41]), and yields reliable results for datasets containing thousands of individuals and hundreds of thousands of markers. (Most pre (...truncated)