Gaussian process test for high-throughput sequencing time series: application to experimental evolution
Bioinformatics, 31(11), 2015, 1762–1770
doi: 10.1093/bioinformatics/btv014
Advance Access Publication Date: 21 January 2015
Original paper
Genetics and population analysis
Gaussian process test for high-throughput
sequencing time series: application to
experimental evolution
Hande Topa1,*,†, Ágnes Jónás2,3,*,†, Robert Kofler2, Carolin Kosiol2,*
Antti Honkela4,*
1
Helsinki Institute for Information Technology (HIIT), Department of Information and Computer Science, Aalto
University, Espoo, Finland, 2Institut für Populationsgenetik, Vetmeduni Vienna, 1210 Wien, Austria, 3Vienna
Graduate School of Population Genetics, Wien, Austria and 4Helsinki Institute for Information Technology (HIIT),
Department of Computer Science, University of Helsinki, Helsinki, Finland
*To whom correspondence should be addressed.
†
The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.
Associate Editor: Alfonso Valencia
Received on March 13, 2014; revised on November 7, 2014; accepted on January 7, 2015
Abstract
Motivation: Recent advances in high-throughput sequencing (HTS) have made it possible to monitor genomes in great detail. New experiments not only use HTS to measure genomic features at
one time point but also monitor them changing over time with the aim of identifying significant
changes in their abundance. In population genetics, for example, allele frequencies are monitored
over time to detect significant frequency changes that indicate selection pressures. Previous attempts at analyzing data from HTS experiments have been limited as they could not simultaneously include data at intermediate time points, replicate experiments and sources of uncertainty
specific to HTS such as sequencing depth.
Results: We present the beta-binomial Gaussian process model for ranking features with significant non-random variation in abundance over time. The features are assumed to represent proportions, such as proportion of an alternative allele in a population. We use the beta-binomial model
to capture the uncertainty arising from finite sequencing depth and combine it with a Gaussian process model over the time series. In simulations that mimic the features of experimental evolution
data, the proposed method clearly outperforms classical testing in average precision of finding selected alleles. We also present simulations exploring different experimental design choices and results on real data from Drosophila experimental evolution experiment in temperature adaptation.
Availability and implementation: R software implementing the test is available at https://github.
com/handetopa/BBGP.
Contact: , , ,
Supplementary information: Supplementary data are available at Bioinformatics online.
C The Author 2015. Published by Oxford University Press.
V
1762
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Gaussian process test for high-throughput sequencing time series
1 Introduction
Stumpf, 2009; Liu et al., 2010; Liu and Niranjan, 2012; Stegle et al.,
2010; Titsias et al., 2012; Yuan, 2006). In differential analysis, GPs
have been applied to detect differences in gene expression time series
in a two-sample setting by Stegle et al. (2010) and for detecting significant changes by Kalaitzis and Lawrence (2011). Although these
methods provide a sensible basis for detecting the changing alleles,
they fail to properly take into account all aspects of the available
HTS data, such as differences in sequencing depth between different
alleles and time points. These differences can have a huge impact in
the reliability of different measured allele frequencies and taking
them into account is vital for achieving good accuracy with the
available short time series.
2 Methods
To identify the candidate alleles which evolve under selection, we
model the allele frequencies by GP regression. We fit time-dependent
and time-independent GP models and rank the alleles according to
their corresponding Bayes factors (BFs), i.e. the ratio of the marginal
likelihoods under the different models.
GPs provide a convenient approach for modelling short time series. However, when applying them to a large number of short parallel time series as in many genomic applications, naive application
leads to overfitting or underfitting in some examples. Although these
problems are rare, the bad examples can easily dominate the ranking. We overcome these challenges by excluding non-sensical parameter values, for example using a good variance model that can be
incorporated into the GP models.
2.1 Data and preprocessing
In the following, we use the term SNP for the markers and alleles
under study, but the methods can be applied to any features whose
abundance can be quantified in a similar manner. We consider SNPs
that are bi-allelic for a specific position of the genome in a population. Multi-allelic SNPs, however, exist but are rare and likely to be
sequencing errors (Burke et al., 2010). Multi-allelic cases can be
treated by simply ignoring the least frequent allele or transformed
to bi-allelic site by summing up the frequencies of the most
infrequent alleles. Here, we assumed that only two of the alleles
from (A, T, C, G) can be observed at each SNP position. After determining the abundances of these two specific alleles, we model
the time dependency of the rising allele’s frequency over several
generations. We will refer to generations as time points for
simplicity.
We denote the replicate index of each observation by rj and the
time point by tj, j ¼ 1, . . . , J, with J denoting the total number of observations. For each of these points, we assume HTS reads have
been aligned to a reference genome with yij reads with a specific allele at SNP position i. We use nij to denote the total sequencing
depth at the position.
2.2 Mean and variance inference: beta-binomial model
We model yij as a draw from a binomial distribution with parameters nij and pij:
yij jnij ; pij Binðnij ; pij Þ;
(1)
where pij denotes the frequency of the specific allele in the population. We set a uniform Beta(1,1) prior on pij:
pij ja; b Betaða; bÞ;
where a ¼ 1, b ¼ 1.
(2)
Most biological processes are dynamic and analysis of time series
data is necessary to understand them. Recent advances in highthroughput sequencing (HTS) technologies have provided new
experimental approaches to collect genome-wide time series. For
example, experimental evolution now uses a new evolve and resequencing (ER) approach to understand which genes are targeted
by selection and how (Burke and Long, 2012; Kawecki et al., 2012).
Such experiments enable phenotypic divergence to be forced in response to changes in only few environmental conditions in the laboratory while other conditions (...truncated)