Improving Bayesian Population Dynamics Inference: A Coalescent-Based Model for Multiple Loci
Mandev S. Gill
2
Philippe Lemey
1
Nuno R. Faria
1
Andrew Rambaut
0
Beth Shapiro
5
Marc A. Suchard
2
3
4
0
Institute of Evolutionary Biology, University of Edinburgh
,
Edinburgh, United Kingdom
1
Department of Microbiology and Immunology, Rega Institute
, KU Leuven,
Leuven, Belgium
2
Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California
,
Los Angeles
3
Department of Human Genetics, David Geffen School of Medicine at UCLA
, Universtiy of California,
Los Angeles
4
Department of Biomathematics, David Geffen School of Medicine at UCLA, University of California
,
Los Angeles
5
Department of Ecology and Evolutionary Biology, University of California
, Santa Cruz
Effective population size is fundamental in population genetics and characterizes genetic diversity. To infer past population dynamics from molecular sequence data, coalescent-based models have been developed for Bayesian nonparametric estimation of effective population size over time. Among the most successful is a Gaussian Markov random field (GMRF) model for a single gene locus. Here, we present a generalization of the GMRF model that allows for the analysis of multilocus sequence data. Using simulated data, we demonstrate the improved performance of our method to recover true population trajectories and the time to the most recent common ancestor (TMRCA). We analyze a multilocus alignment of HIV-1 CRF02_AG gene sequences sampled from Cameroon. Our results are consistent with HIV prevalence data and uncover some aspects of the population history that go undetected in Bayesian parametric estimation. Finally, we recover an older and more reconcilable TMRCA for a classic ancient DNA data set.
-
Introduction
Coalescent theory has become a cornerstone of
computational population genetics. First introduced by Kingman
(1982), the coalescent is a stochastic process that generates
genealogies relating a random sample of individuals arising
from a classic forward-time population model (such as the
FisherWright model). The basic assumptions on such an
idealized population are a constant population size, no
selection or migration, nonoverlapping generations, and an equal
propensity among individuals to produce offspring.
Researchers have extended coalescent theory to
accommodate a range of relaxed assumptions about the population
of interest, including a variable population size (Griffiths and
Tavare 1994; Donnelly and Tavare 1995), and serially sampled
data (Rodrigo and Felsenstein 1999). Notably,
coalescentbased inference methods enable estimation of population
genetic parameters from a fixed genealogy and, because
genealogical shapes leave their imprints in the genomes of
sampled individuals, directly from molecular sequence data
(Hein et al. 2005).
One parameter of great scientific interest is the effective
population size over time (often called the demographic
model or demographic function). The effective population
size is an abstract quantity that corresponds to the
population size under an idealized model of reproduction. The
census population size can be recovered from the effective
population size by appropriate scaling. The utility of the
effective population size is that it provides a measure of genetic
diversity and its fluctuations over time, and acts as a
common denominator, allowing researchers to compare
populations arising from different reproductive models. As recent
examples, Campos et al. (2010) reconstruct the demographic
history of musk ox from ancient DNA sequences to examine
the cause of the reduction in their mitochondrial diversity,
Rambaut et al. (2008) uncover trends of genetic diversity of
the influenza A virus and compare them with the seasonal
occurrence of influenza, and Faria et al. (2012) explore past
population dynamics of HIV-1 CRF02_AG gene sequences
sampled in Cameroon.
Computational biologists and statisticians have posited a
number of coalescent-based models to infer population
dynamics across time. Many of these models (Kuhner et al. 1998;
Drummond et al. 2002) characterize the effective population
size over time using simple parametric functions (examples of
such scenarios include constant size, exponentially growing,
or logistically growing populations). This approach is
advantageous in that there are relatively few parameters to be
estimated, and hypothesis testing is convenient. However, a
priori parametric functions may not accurately characterize
important aspects of the population history of a given sample,
and finding an appropriate parametric model can be difficult
and time consuming (Baele et al. 2012). Accordingly,
The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please
e-mail:
nonparametric approaches have become popular in recent
years. These approaches typically center around
approximating the population history with a piecewise constant or linear
function. Some of the first nonparametric models (Pybus et al.
2000; Strimmer and Pybus 2001) provide fast but noisy
estimation of population trajectories from a fixed genealogy.
More recent models (Drummond et al. 2005; Opgen-Rhein
et al. 2005; Minin et al. 2008) estimate population trajectories
jointly, along with genealogies and mutational parameters,
directly from sequence data in a Bayesian framework. These
models differ primarily by the priors they place on the
effective population size, and the choice of prior influences not only
the estimated effective population size trajectory but also the
estimated genealogy (in particular, the time to the most
recent common ancestor [TMRCA]). Opgen-Rhein et al.
(2005) and Drummond et al. (2005) use multiple
changepoint models to estimate past population dynamics. The
latter model is called the Bayesian Skyline, and Heled and
Drummond (2008) have developed an extension called the
Extended Bayesian Skyline Plot (EBSP) model that
incorporates data from multiple unlinked genetic loci. The model
proposed by Minin et al. (2008), called the Skyride, exploits
a Gaussian Markov random field (GMRF) prior to achieve
temporal smoothing.
Here, we present a novel Bayesian nonparametric model,
named the Skygrid, to estimate effective population size
trajectories. Like the Skyride, we parameterize the effective
population size as a piecewise constant function and employ a
GMRF prior to smooth the trajectory. However, while the
Skyride allows the estimated trajectory to change at
coalescent times, our improved method does so at prespecified
fixed points in real time. This grants the user extra flexibility
and provides a natural framework to extend the model in the
future to incorporate covariate values. Furthermore, this
distinction enables the Skygrids GMRF prior to be independent
of the genealogy, which has important implications for
estimation of the TMRCA. Another departure from the Skyride,
and a major advantage of our model, i (...truncated)