Sampling strategies for frequency spectrum-based population genomic inference
John D Robinson (0, 1), Alec J Coffman (2), Michael J Hickerson (1, 3), Ryan N Gutenkunst (2)
0 Current Address: South Carolina Department of Natural Resources, Marine Resources Research Institute, Charleston, SC 29412, USA
1 Department of Biology, City College of New York, New York, NY 10031, USA
2 Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
3 Subprogram in Ecology, Evolution and Behavior, the Graduate Center of the
Background: The allele frequency spectrum (AFS) consists of counts of the number of single nucleotide polymorphism (SNP) loci with derived variants present at each given frequency in a sample. Multiple approaches have recently been developed for parameter estimation and calculation of model likelihoods based on the joint AFS from two or more populations. We conducted a simulation study of one of these approaches, implemented in the Python module ∂a∂i, to compare parameter estimation and model selection accuracy given different sample sizes under one- and two-population models.
Results: Our simulations included a variety of demographic models and two parameterizations that differed in the timing of events (divergence or size change). Using a number of SNPs reasonably obtained through next-generation sequencing approaches (10,000-50,000), accurate parameter estimates and model selection were possible for models with more ancient demographic events, even given relatively small numbers of sampled individuals. However, for recent events, larger numbers of individuals were required to achieve accuracy and precision in parameter estimates similar to those seen for models with older divergence or population size changes. We quantify i) the uncertainty in model selection, using tools from information theory, and ii) the accuracy and precision of parameter estimates, using the root mean squared error, as a function of the timing of demographic events, sample sizes used in the analysis, and complexity of the simulated models.
Conclusions: Here, we illustrate the utility of the genome-wide AFS for estimating demographic history and provide recommendations to guide sampling in population genomics studies that seek to draw inference from the AFS. Our results indicate that larger samples of individuals (and thus larger AFS) provide greater power for model selection and parameter estimation for more recent demographic events.
Background
Population genetic data can be useful for comparing
alternative representations of demographic history and
for estimating parameter values under potentially
complex models. The declining costs associated with
next-generation sequencing, along with recent developments
allowing multiple individual genomes to be simultaneously
sequenced [1,2], have led to increases in the number of
researchers generating genomic-scale datasets that include
population-level samples of individuals. These datasets
have the potential to provide unprecedented insight into
the demographic history of populations and the
evolutionary history of divergence among species [3]. Analyses
based on the allele frequency spectrum (AFS) have
become increasingly popular when considering population
genomic datasets, in part due to the development of
analytical software packages that consider the joint AFS
between two or more populations [4-7].
The AFS is a P-dimensional array, where P is the
number of populations considered, that gives the number of
single nucleotide polymorphism (SNP) loci with derived
alleles present at a given joint frequency in the sampled
populations. Each dimension contains 2n_i + 1 elements,
where n_i is the number of diploid individuals sampled from
population i. These elements are ordered [0, 1, ..., 2n_i]
along each dimension, and each value in the body of the
array is the number of derived variants across the sample
that are present at a given joint frequency. For instance,
considering two populations, each SNP locus contributes
one unit to the value in the AFS located at [x_1, x_2], where
x_i is the number of derived allele copies (indexed from 0) in
samples from population i. The joint AFS is based on these
data, summed across the set of SNPs genotyped in two or
more populations.
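The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the routine used by any particular package: the function name `joint_afs` and its input layout (one row per SNP, one column per population, entries giving derived allele counts) are assumptions made for the example.

```python
import numpy as np

def joint_afs(derived_counts, n_diploid):
    """Tally a joint AFS from per-SNP derived allele counts.

    derived_counts: integer array of shape (n_snps, P); entry [s, i] is the
        number of derived allele copies of SNP s in the sample from
        population i (hypothetical input layout).
    n_diploid: length-P sequence of diploid sample sizes n_i.
    """
    counts = np.asarray(derived_counts, dtype=int)
    # Each dimension has 2 * n_i + 1 entries, indexed 0 .. 2 * n_i.
    shape = tuple(2 * n + 1 for n in n_diploid)
    if counts.ndim != 2 or counts.shape[1] != len(shape):
        raise ValueError("derived_counts must have one column per population")
    if (counts < 0).any() or (counts >= np.array(shape)).any():
        raise ValueError("derived allele count outside 0 .. 2 * n_i")
    afs = np.zeros(shape, dtype=int)
    # Every SNP adds one unit to the cell at its joint frequency [x_1, ..., x_P].
    np.add.at(afs, tuple(counts.T), 1)
    return afs

# Two populations with 2 and 3 diploid individuals: a 5 x 7 spectrum.
snps = [[1, 0], [1, 0], [4, 6], [2, 3]]
afs = joint_afs(snps, n_diploid=(2, 3))
print(afs.shape, afs[1, 0], afs.sum())
```

`np.add.at` is used instead of plain fancy-index assignment so that SNPs sharing the same joint frequency are each counted.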
For datasets composed of biallelic, unlinked SNPs, the
AFS is a complete summary of the data [4], and many
commonly used statistics, such as the number of
segregating sites, F_ST, and Tajima's D [8], can be calculated
directly from the frequency spectrum. Additionally,
patterns in the AFS can be indicative of demographic
and/or selective events in the evolutionary history of
the population or populations under consideration. For
instance, gene flow between populations increases the
correlation in allele frequencies, increasing the
proportion of variable sites that fall along the diagonal of the
AFS (Figure 1). The AFS is therefore well suited for
the analysis of population genomic data, which are
increasingly feasible to collect due to the rapid pace of
development in sequencing technologies. Estimates of
historical demography from the AFS can also be used
to provide a baseline against which tests for the
signatures of selection can be carried out [9-11]. However,
the utility of parameter estimates obtained from
analysis of the AFS will depend on their accuracy and
precision, as well as the power of the analytical framework
for model selection.
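As an illustration of how summary statistics follow directly from the frequency spectrum, the sketch below computes the number of segregating sites, nucleotide diversity, Watterson's estimator, and Tajima's D from a one-population unfolded AFS. The function name `sfs_summaries` and the returned dictionary keys are choices made for this example; the formulas are the standard ones for Tajima's D, and the code assumes that entries 0 and n of the spectrum (absent and fixed classes) are ignored.

```python
import math
import numpy as np

def sfs_summaries(sfs):
    """Summary statistics from an unfolded one-population AFS.

    sfs: length n + 1 sequence; sfs[i] is the number of SNPs with i derived
        copies among n sampled chromosomes (n = 2 * diploid individuals).
    """
    sfs = np.asarray(sfs, dtype=float)
    n = len(sfs) - 1
    i = np.arange(1, n)          # segregating frequency classes 1 .. n - 1
    xi = sfs[1:n]
    S = xi.sum()                 # number of segregating sites
    a1 = (1.0 / i).sum()
    a2 = (1.0 / i**2).sum()
    # Mean pairwise differences: each class-i SNP differs in i * (n - i) pairs.
    pi = (xi * i * (n - i)).sum() / (n * (n - 1) / 2.0)
    theta_w = S / a1             # Watterson's estimator
    # Variance constants for Tajima's D.
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    var = e1 * S + e2 * S * (S - 1)
    D = (pi - theta_w) / math.sqrt(var) if var > 0 else float("nan")
    return {"S": S, "pi": pi, "theta_w": theta_w, "tajima_D": D}

# A spectrum dominated by singletons (excess rare variants) gives D < 0.
print(sfs_summaries([0, 50, 5, 2, 1, 0, 0, 0, 0, 0, 0]))
```

Under the standard neutral expectation, where the count in class i is proportional to 1/i, nucleotide diversity and Watterson's estimator coincide and D is zero; departures in either direction reflect the demographic or selective signals discussed above.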
Several related computer programs have recently been
introduced to analyze joint frequency spectra from two
or more populations [4-7]. These programs differ in the
specifics of their approach to modeling the AFS, using
either diffusion approximation [4,6] or coalescent
simulations [5,7] to model the density of SNPs in cells of the
AFS. For comparisons between models and observed
data, all of these methods employ composite likelihoods,
which estimate the overall likelihood using combinations
of likelihoods calculated from independent subsets of
the data. For instance, in the context of the AFS, the
composite likelihood is the product of the likelihoods
calculated for individual cells of the spectrum. The
similarities between software packages have resulted in similar
performance of the different analytical methods in cases
where they are directly compared [5-7], although some
minor differences have also been noted [6]. Of these
alternatives, ∂a∂i [4] has been most widely applied, with
applications to genomic data collected from humans [12-14],
cattle [15], rice [16], and bees [17], among others.
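The product over cells described above can be written as a sum of per-cell log-likelihoods. The sketch below assumes, for illustration, that the observed count in each cell is Poisson-distributed with mean equal to the model's expected count for that cell; the function name `poisson_composite_loglik` is hypothetical, and the code is not taken from any of the packages cited here. Cells that should not contribute (for example, the monomorphic corners of the spectrum) are assumed to have been removed or masked by the caller.

```python
import math
import numpy as np

def poisson_composite_loglik(model, data):
    """Composite log-likelihood of an observed AFS given expected counts.

    model: array of expected SNP counts per AFS cell.
    data: array of observed SNP counts with the same shape.
    Assumes independent Poisson counts per cell (an illustrative choice).
    """
    model = np.asarray(model, dtype=float)
    data = np.asarray(data, dtype=float)
    if model.shape != data.shape:
        raise ValueError("model and data must have the same shape")
    loglik = 0.0
    for m, d in zip(model.ravel(), data.ravel()):
        if m <= 0:
            # A cell the model says is empty cannot contain observed SNPs.
            if d > 0:
                return float("-inf")
            continue
        # log Poisson(d; m) = d * log(m) - m - log(d!)
        loglik += d * math.log(m) - m - math.lgamma(d + 1)
    return loglik

observed = np.array([[0.0, 30.0, 5.0], [25.0, 12.0, 3.0]])
print(poisson_composite_loglik(observed, observed))
print(poisson_composite_loglik(1.5 * observed, observed))
```

Working on the log scale avoids numerical underflow when the spectrum has many cells, and the first call above returns the larger value because, for Poisson counts, each cell's likelihood is maximized when the expected count equals the observed count.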
Here, we use a simulation study to investigate the
influences of sample size on the power for model selection
and the accuracy of parameter estimates obtained from
∂a∂i [4]. Because we employ an information-theoretic
model selection approach, our use of the term power
does not follow the standard statistical definition (the
probability of rejecting the null hypothesis). Instead we
define power as the probability of selecting the true
(simulated) model from a set of competing candidates.
Given the similarities (...truncated)