Improving Bayesian Population Dynamics Inference: A Coalescent-Based Model for Multiple Loci

Molecular Biology and Evolution, Mar 2013

Effective population size is fundamental in population genetics and characterizes genetic diversity. To infer past population dynamics from molecular sequence data, coalescent-based models have been developed for Bayesian nonparametric estimation of effective population size over time. Among the most successful is a Gaussian Markov random field (GMRF) model for a single gene locus. Here, we present a generalization of the GMRF model that allows for the analysis of multilocus sequence data. Using simulated data, we demonstrate the improved performance of our method to recover true population trajectories and the time to the most recent common ancestor (TMRCA). We analyze a multilocus alignment of HIV-1 CRF02_AG gene sequences sampled from Cameroon. Our results are consistent with HIV prevalence data and uncover some aspects of the population history that go undetected in Bayesian parametric estimation. Finally, we recover an older and more reconcilable TMRCA for a classic ancient DNA data set.

Article PDF cannot be displayed. You can download it here:

https://mbe.oxfordjournals.org/content/30/3/713.full.pdf

Improving Bayesian Population Dynamics Inference: A Coalescent-Based Model for Multiple Loci

Mandev S. Gill 2 Philippe Lemey 1 Nuno R. Faria 1 Andrew Rambaut 0 Beth Shapiro 5 Marc A. Suchard 2 3 4 0 Institute of Evolutionary Biology, University of Edinburgh , Edinburgh, United Kingdom 1 Department of Microbiology and Immunology, Rega Institute , KU Leuven, Leuven, Belgium 2 Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California , Los Angeles 3 Department of Human Genetics, David Geffen School of Medicine at UCLA , Universtiy of California, Los Angeles 4 Department of Biomathematics, David Geffen School of Medicine at UCLA, University of California , Los Angeles 5 Department of Ecology and Evolutionary Biology, University of California , Santa Cruz Effective population size is fundamental in population genetics and characterizes genetic diversity. To infer past population dynamics from molecular sequence data, coalescent-based models have been developed for Bayesian nonparametric estimation of effective population size over time. Among the most successful is a Gaussian Markov random field (GMRF) model for a single gene locus. Here, we present a generalization of the GMRF model that allows for the analysis of multilocus sequence data. Using simulated data, we demonstrate the improved performance of our method to recover true population trajectories and the time to the most recent common ancestor (TMRCA). We analyze a multilocus alignment of HIV-1 CRF02_AG gene sequences sampled from Cameroon. Our results are consistent with HIV prevalence data and uncover some aspects of the population history that go undetected in Bayesian parametric estimation. Finally, we recover an older and more reconcilable TMRCA for a classic ancient DNA data set. - Introduction Coalescent theory has become a cornerstone of computational population genetics. First introduced by Kingman (1982), the coalescent is a stochastic process that generates genealogies relating a random sample of individuals arising from a classic forward-time population model (such as the FisherWright model). The basic assumptions on such an idealized population are a constant population size, no selection or migration, nonoverlapping generations, and an equal propensity among individuals to produce offspring. Researchers have extended coalescent theory to accommodate a range of relaxed assumptions about the population of interest, including a variable population size (Griffiths and Tavare 1994; Donnelly and Tavare 1995), and serially sampled data (Rodrigo and Felsenstein 1999). Notably, coalescentbased inference methods enable estimation of population genetic parameters from a fixed genealogy and, because genealogical shapes leave their imprints in the genomes of sampled individuals, directly from molecular sequence data (Hein et al. 2005). One parameter of great scientific interest is the effective population size over time (often called the demographic model or demographic function). The effective population size is an abstract quantity that corresponds to the population size under an idealized model of reproduction. The census population size can be recovered from the effective population size by appropriate scaling. The utility of the effective population size is that it provides a measure of genetic diversity and its fluctuations over time, and acts as a common denominator, allowing researchers to compare populations arising from different reproductive models. As recent examples, Campos et al. (2010) reconstruct the demographic history of musk ox from ancient DNA sequences to examine the cause of the reduction in their mitochondrial diversity, Rambaut et al. (2008) uncover trends of genetic diversity of the influenza A virus and compare them with the seasonal occurrence of influenza, and Faria et al. (2012) explore past population dynamics of HIV-1 CRF02_AG gene sequences sampled in Cameroon. Computational biologists and statisticians have posited a number of coalescent-based models to infer population dynamics across time. Many of these models (Kuhner et al. 1998; Drummond et al. 2002) characterize the effective population size over time using simple parametric functions (examples of such scenarios include constant size, exponentially growing, or logistically growing populations). This approach is advantageous in that there are relatively few parameters to be estimated, and hypothesis testing is convenient. However, a priori parametric functions may not accurately characterize important aspects of the population history of a given sample, and finding an appropriate parametric model can be difficult and time consuming (Baele et al. 2012). Accordingly, The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: nonparametric approaches have become popular in recent years. These approaches typically center around approximating the population history with a piecewise constant or linear function. Some of the first nonparametric models (Pybus et al. 2000; Strimmer and Pybus 2001) provide fast but noisy estimation of population trajectories from a fixed genealogy. More recent models (Drummond et al. 2005; Opgen-Rhein et al. 2005; Minin et al. 2008) estimate population trajectories jointly, along with genealogies and mutational parameters, directly from sequence data in a Bayesian framework. These models differ primarily by the priors they place on the effective population size, and the choice of prior influences not only the estimated effective population size trajectory but also the estimated genealogy (in particular, the time to the most recent common ancestor [TMRCA]). Opgen-Rhein et al. (2005) and Drummond et al. (2005) use multiple changepoint models to estimate past population dynamics. The latter model is called the Bayesian Skyline, and Heled and Drummond (2008) have developed an extension called the Extended Bayesian Skyline Plot (EBSP) model that incorporates data from multiple unlinked genetic loci. The model proposed by Minin et al. (2008), called the Skyride, exploits a Gaussian Markov random field (GMRF) prior to achieve temporal smoothing. Here, we present a novel Bayesian nonparametric model, named the Skygrid, to estimate effective population size trajectories. Like the Skyride, we parameterize the effective population size as a piecewise constant function and employ a GMRF prior to smooth the trajectory. However, while the Skyride allows the estimated trajectory to change at coalescent times, our improved method does so at prespecified fixed points in real time. This grants the user extra flexibility and provides a natural framework to extend the model in the future to incorporate covariate values. Furthermore, this distinction enables the Skygrids GMRF prior to be independent of the genealogy, which has important implications for estimation of the TMRCA. Another departure from the Skyride, and a major advantage of our model, i (...truncated)


This is a preview of a remote PDF: https://mbe.oxfordjournals.org/content/30/3/713.full.pdf
Article home page: http://mbe.oxfordjournals.org/content/30/3/713.abstract

Mandev S. Gill, Philippe Lemey, Nuno R. Faria, Andrew Rambaut, Beth Shapiro, Marc A. Suchard. Improving Bayesian Population Dynamics Inference: A Coalescent-Based Model for Multiple Loci, Molecular Biology and Evolution, 2013, pp. 713-724, 30/3, DOI: 10.1093/molbev/mss265