Polygenic Modeling with Bayesian Sparse Linear Mixed Models
Citation: Zhou X, Carbonetto P, Stephens M (
Xiang Zhou
Peter Carbonetto
Matthew Stephens
Peter M. Visscher, The University of Queensland, Australia
Funding: This work was supported by NIH grant HG02585 to MS and by NIH grant HL092206 (PI Y Gilad) and a cross-disciplinary postdoctoral fellowship from the Human Frontiers Science Program to PC. Funding for the Wellcome Trust Case Control Consortium project was provided by the Wellcome Trust under awards 076113 and 085475. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
1 Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America, 2 Department of Statistics, University of Chicago, Chicago, Illinois, United States of America
Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applications, including, recently, polygenic modeling in genome-wide association studies. These two approaches make very different assumptions, so are expected to perform well in different situations. However, in practice, for a given dataset one typically does not know which assumptions will be more accurate. Motivated by this, we consider a hybrid of the two, which we refer to as a "Bayesian sparse linear mixed model" (BSLMM) that includes both these models as special cases. We address several key computational and statistical issues that arise when applying BSLMM, including appropriate prior specification for the hyper-parameters and a novel Markov chain Monte Carlo algorithm for posterior inference. We apply BSLMM and compare it with other methods for two polygenic modeling applications: estimating the proportion of variance in phenotypes explained (PVE) by available genotypes, and phenotype (or breeding value) prediction. For PVE estimation, we demonstrate that BSLMM combines the advantages of both standard LMMs and sparse regression modeling. For phenotype prediction it considerably outperforms either of the other two methods, as well as several other large-scale regression methods previously suggested for this problem. Software implementing our method is freely available from http://stephenslab.uchicago.edu/software.html.
Both linear mixed models (LMMs) and sparse regression models
are widely used in genetics applications. For example, LMMs are
often used to control for population stratification, individual
relatedness, or unmeasured confounding factors when performing
association tests in genetic association studies [1-9] and gene
expression studies [10-12]. They have also been used in genetic
association studies to jointly analyze groups of SNPs [13,14].
Similarly, sparse regression models have been used in genome-wide
association analyses [15-20] and in expression QTL analysis
[21]. Further, both LMMs and sparse regression models have been
applied to, and garnered renewed interest in, polygenic modeling
in association studies. Here, by polygenic modeling we mean any
attempt to relate phenotypic variation to many genetic variants
simultaneously (in contrast to single-SNP tests of association). The
particular polygenic modeling problems that we focus on here are
estimating "chip heritability", that is, the proportion of variance
in phenotypes explained (PVE) by available genotypes [19,22-24],
and predicting phenotypes based on genotypes [25-29].
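
As a rough illustration of what PVE measures, the following Python sketch simulates genotypes and a phenotype and evaluates PVE as the variance of the genetic values divided by that variance plus the residual variance. Everything in it is an assumption made for illustration (the function names `pve` and `simulate`, the sample size, the number of variants, the allele frequency of 0.3, and the sparse set of causal variants); it is not the estimation procedure developed in this paper.

```python
import random
import statistics


def pve(genetic_values, residual_variance):
    """PVE = Var(genetic values) / (Var(genetic values) + residual variance)."""
    var_g = statistics.pvariance(genetic_values)
    return var_g / (var_g + residual_variance)


def simulate(n=500, p=200, n_causal=10, target_pve=0.5, seed=0):
    """Simulate a sparse architecture; all settings here are hypothetical."""
    rng = random.Random(seed)
    # Genotypes coded 0/1/2: two independent allele draws at frequency 0.3.
    genotypes = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(p)]
                 for _ in range(n)]
    # Only n_causal variants get a non-zero effect.
    effects = [0.0] * p
    for j in rng.sample(range(p), n_causal):
        effects[j] = rng.gauss(0.0, 1.0)
    genetic_values = [sum(b * x for b, x in zip(effects, row)) for row in genotypes]
    # Pick the residual variance that yields the requested PVE.
    var_g = statistics.pvariance(genetic_values)
    residual_variance = var_g * (1.0 - target_pve) / target_pve
    noise_sd = residual_variance ** 0.5
    phenotypes = [g + rng.gauss(0.0, noise_sd) for g in genetic_values]
    return genetic_values, residual_variance, phenotypes
```

In a simulation the effects and the residual variance are known, so `pve` can be evaluated directly. With real data neither is observed, and only the genotypes and phenotypes are available, which is why PVE has to be estimated under a model.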
Despite the considerable overlap in their applications, in the
context of polygenic modeling, LMMs and sparse regression
models are based on almost diametrically opposed assumptions.
Specifically, applications of LMMs to polygenic modeling (e.g. [22])
effectively assume that every genetic variant affects the phenotype,
with effect sizes normally distributed, whereas sparse regression
models, such as Bayesian variable selection regression models
(BVSR) [18,19], assume that a relatively small proportion of all
variants affect the phenotype. The relative performance of these
two models for polygenic modeling applications would therefore
be expected to vary depending on the true underlying genetic
architecture of the phenotype. However, in practice, one does not
know the true genetic architecture, so it is unclear which of the two
models to prefer. Motivated by this observation, we consider a
hybrid of these two models, which we refer to as the Bayesian
sparse linear mixed model, or BSLMM. This hybrid includes
both the LMM and a sparse regression model, BVSR, as special
cases, and is to some extent capable of learning the genetic
architecture from the data, yielding good performance across a
wide range of scenarios. By being adaptive to the data in this
way, our approach obviates the need to choose one model over the
other, and attempts to combine the benefits of both.
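
One way to make this contrast concrete is to write each model as a prior on the per-variant effect sizes. The sketch below uses illustrative notation that is assumed here rather than taken from the text above: \beta_i is the effect of variant i, \pi is the proportion of variants with a sparse effect, \sigma_a^2 and \sigma_b^2 are variances, and \delta_0 is a point mass at zero. Any scaling of the variances is omitted.

```latex
% Illustrative notation only (assumed for this sketch).
\begin{align*}
\text{LMM:}    \quad & \beta_i \sim \mathrm{N}(0, \sigma_b^2), \\
\text{BVSR:}   \quad & \beta_i \sim \pi\, \mathrm{N}(0, \sigma_a^2) + (1-\pi)\, \delta_0, \\
\text{Hybrid:} \quad & \beta_i \sim \pi\, \mathrm{N}(0, \sigma_a^2 + \sigma_b^2) + (1-\pi)\, \mathrm{N}(0, \sigma_b^2).
\end{align*}
```

Under this parameterization, setting \sigma_a^2 = 0 in the hybrid recovers the LMM prior, and setting \sigma_b^2 = 0 recovers the BVSR prior, which is the sense in which a hybrid can contain both models as special cases.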
Author Summary: The goal of polygenic modeling is to better understand
the relationship between genetic variation and variation in
observed characteristics, including variation in quantitative
traits (e.g. cholesterol level in humans, milk production in
cattle) and disease susceptibility. Improvements in
polygenic modeling will help improve our understanding of
this relationship and could ultimately lead to, for example,
changes in clinical practice in humans or better breeding/mating
strategies in agricultural programs. Polygenic
models present important challenges, both at the
modeling/statistical level (what modeling assumptions produce
the best results) and at the computational level (how
should these models be effectively fit to data). We develop
novel approaches to help tackle both these challenges,
and we demonstrate the gains in accuracy that result in
both simulated and real data examples.
The idea of a hybrid between LMM and sparse regression
models is, in itself, not new. Indeed, models like these have been
used in breeding value prediction to assist genomic selection in
animal and plant breeding programs [30-35], gene selection in
expression analysis while controlling for batch effects [36],
phenotype prediction of complex traits in model organisms and
dairy cattle [37-40], and more recently, mapping complex traits
by jointly modeling all SNPs in structured populations [41].
Compared with these previous papers, our work makes two key
contributions. First, we consider in detail the specification of
appropriate prior distributions for the hyper-parameters of the
model. We particularly emphasize the benefits of estimating the
hyper-parameters from the data, rather than fixing them to
pre-specified values, to achieve the adaptive behavior mentioned
above. Second, we provide a novel computational algorithm that
exploits a recently described linear algebra trick for LMMs [8,9].
The resulting algorithm avoids ad hoc approximations that are
sometimes made when fitting these types of model (e.g. [37,41]),
and yields reliable results for datasets containing thousands of
individuals and hundreds of thousands of markers. (Most pre (...truncated)