Sampling strategies for frequency spectrum-based population genomic inference

http://www.biomedcentral.com/content/pdf/s12862-014-0254-4.pdf

John D Robinson (0, 1), Alec J Coffman (2), Michael J Hickerson (1, 3), Ryan N Gutenkunst (2)

0 Current Address: South Carolina Department of Natural Resources, Marine Resources Research Institute, Charleston, SC 29412, USA
1 Department of Biology, City College of New York, New York, NY 10031, USA
2 Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
3 Subprogram in Ecology, Evolution and Behavior, the Graduate Center of the

Background: The allele frequency spectrum (AFS) consists of counts of the number of single nucleotide polymorphism (SNP) loci with derived variants present at each given frequency in a sample. Multiple approaches have recently been developed for parameter estimation and calculation of model likelihoods based on the joint AFS from two or more populations. We conducted a simulation study of one of these approaches, implemented in the Python module ∂a∂i, to compare parameter estimation and model selection accuracy given different sample sizes under one- and two-population models.

Results: Our simulations included a variety of demographic models and two parameterizations that differed in the timing of events (divergence or size change). Using a number of SNPs reasonably obtained through next-generation sequencing approaches (10,000-50,000), accurate parameter estimates and model selection were possible for models with more ancient demographic events, even given relatively small numbers of sampled individuals. However, for recent events, larger numbers of individuals were required to achieve accuracy and precision in parameter estimates similar to that seen for models with older divergence or population size changes.
We quantify i) the uncertainty in model selection, using tools from information theory, and ii) the accuracy and precision of parameter estimates, using the root mean squared error, as a function of the timing of demographic events, sample sizes used in the analysis, and complexity of the simulated models.

Conclusions: Here, we illustrate the utility of the genome-wide AFS for estimating demographic history and provide recommendations to guide sampling in population genomics studies that seek to draw inference from the AFS. Our results indicate that larger samples of individuals (and thus larger AFS) provide greater power for model selection and parameter estimation for more recent demographic events.

Background

Population genetic data can be useful for comparing alternative representations of demographic history and for estimating parameter values under potentially complex models. The declining costs associated with next-generation sequencing, along with recent developments allowing multiple individual genomes to be simultaneously sequenced [1,2], have led to increases in the number of researchers generating genomic-scale datasets that include population-level samples of individuals. These datasets have the potential to provide unprecedented insight into the demographic history of populations and the evolutionary history of divergence among species [3].

Analyses based on the allele frequency spectrum (AFS) have become increasingly popular when considering population genomic datasets, in part due to the development of analytical software packages that consider the joint AFS between two or more populations [4-7]. The AFS is a P-dimensional array, where P is the number of populations considered, that gives the number of single nucleotide polymorphism (SNP) loci with derived alleles present at a given joint frequency in the sampled populations. Each dimension contains 2n_i + 1 elements, where n_i is the number of diploid individuals sampled from population i.
These elements are ordered [0, 1, ..., 2n_i] along each dimension, and each value in the body of the array is the number of derived variants across the sample that are present at a given joint frequency. For instance, considering two populations, each SNP locus contributes one unit to the value in the AFS located at [x_1, x_2], where x_i is the number of derived allele copies (indexed from 0) in samples from population i. The joint AFS is based on these data, summed across the set of SNPs genotyped in two or more populations. For datasets composed of biallelic, unlinked SNPs, the AFS is a complete summary of the data [4], and many commonly used statistics, such as the number of segregating sites, F_ST, and Tajima's D [8], can be calculated directly from the frequency spectrum.

Additionally, patterns in the AFS can be indicative of demographic and/or selective events in the evolutionary history of the population or populations under consideration. For instance, gene flow between populations increases the correlation in allele frequencies, increasing the proportion of variable sites that fall along the diagonal of the AFS (Figure 1). The AFS is therefore well suited for the analysis of population genomic data, which are increasingly feasible to collect due to the rapid pace of development in sequencing technologies. Estimates of historical demography from the AFS can also be used to provide a baseline against which tests for the signatures of selection can be carried out [9-11]. However, the utility of parameter estimates obtained from analysis of the AFS will depend on their accuracy and precision, as well as the power of the analytical framework for model selection.

Several related computer programs have recently been introduced to analyze joint frequency spectra from two or more populations [4-7].
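
As a concrete illustration of the AFS definition given earlier, the following minimal sketch builds a two-population joint AFS from per-SNP derived allele counts using NumPy. It is not the ∂a∂i implementation; the function names `joint_afs` and `segregating_sites` and the toy data are hypothetical, chosen only to show how each SNP adds one unit to the cell [x_1, x_2] and how a simple statistic can be read off the resulting array.

```python
import numpy as np


def joint_afs(derived_counts, n_diploid):
    """Build a two-population joint AFS (hypothetical helper, not from any package).

    derived_counts: sequence of (x1, x2) pairs, the number of derived allele
        copies observed at each SNP in population 1 and population 2.
    n_diploid: (n1, n2), the number of diploid individuals sampled per population.
    """
    counts = np.asarray(derived_counts, dtype=int)
    n1, n2 = n_diploid
    # Each dimension has 2*n_i + 1 entries, for 0..2*n_i derived copies.
    afs = np.zeros((2 * n1 + 1, 2 * n2 + 1), dtype=int)
    # np.add.at accumulates correctly when several SNPs fall in the same cell.
    np.add.at(afs, (counts[:, 0], counts[:, 1]), 1)
    return afs


def segregating_sites(afs):
    """Count SNPs that are polymorphic in the combined sample.

    The two corner cells (derived allele absent everywhere, or fixed
    everywhere) are not variable across the pooled sample and are excluded.
    """
    return int(afs.sum() - afs[0, 0] - afs[-1, -1])


# Toy data (assumed for illustration): 5 SNPs, 2 diploids per population.
counts = [(1, 0), (1, 0), (2, 3), (0, 4), (4, 4)]
afs = joint_afs(counts, (2, 2))
marginal_pop1 = afs.sum(axis=1)  # one-population AFS for population 1

print(afs.shape)
print(segregating_sites(afs))
```

Summing the array over one axis, as in `marginal_pop1`, recovers the single-population spectrum, which is one way to see why the joint AFS carries more information than the two marginal spectra taken separately.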
These programs differ in the specifics of their approach to modeling the AFS, using either a diffusion approximation [4,6] or coalescent simulations [5,7] to model the density of SNPs in cells of the AFS. For comparisons between models and observed data, all of these methods employ composite likelihoods, which estimate the overall likelihood using combinations of likelihoods calculated from independent subsets of the data. For instance, in the context of the AFS, the composite likelihood is the product of the likelihoods calculated for individual cells of the spectrum. The similarities between software packages have resulted in similar performance of the different analytical methods in cases where they are directly compared [5-7], although some minor differences have also been noted [6]. Of these alternatives, ∂a∂i [4] has been most widely applied, with applications to genomic data collected from humans [12-14], cattle [15], rice [16], and bees [17], among others.

Here, we use a simulation study to investigate the influences of sample size on the power for model selection and the accuracy of parameter estimates obtained from ∂a∂i [4]. Because we employ an information-theoretic model selection approach, our use of the term "power" does not follow the standard statistical definition (the probability of rejecting the null hypothesis). Instead, we define power as the probability of selecting the true (simulated) model from a set of competing candidates. Given the similarities (...truncated)