Gaussian process test for high-throughput sequencing time series: application to experimental evolution (pdf)

Article PDF cannot be displayed. You can download it here:

https://academic.oup.com/bioinformatics/article-pdf/31/11/1762/50314391/bioinformatics_31_11_1762.pdf

Gaussian process test for high-throughput sequencing time series: application to experimental evolution

Bioinformatics, 31(11), 2015, 1762–1770 doi: 10.1093/bioinformatics/btv014 Advance Access Publication Date: 21 January 2015 Original paper Genetics and population analysis Gaussian process test for high-throughput sequencing time series: application to experimental evolution Hande Topa1,*,†, Ágnes Jónás2,3,*,†, Robert Kofler2, Carolin Kosiol2,* Antti Honkela4,* 1 Helsinki Institute for Information Technology (HIIT), Department of Information and Computer Science, Aalto University, Espoo, Finland, 2Institut für Populationsgenetik, Vetmeduni Vienna, 1210 Wien, Austria, 3Vienna Graduate School of Population Genetics, Wien, Austria and 4Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Helsinki, Finland *To whom correspondence should be addressed. † The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors. Associate Editor: Alfonso Valencia Received on March 13, 2014; revised on November 7, 2014; accepted on January 7, 2015 Abstract Motivation: Recent advances in high-throughput sequencing (HTS) have made it possible to monitor genomes in great detail. New experiments not only use HTS to measure genomic features at one time point but also monitor them changing over time with the aim of identifying significant changes in their abundance. In population genetics, for example, allele frequencies are monitored over time to detect significant frequency changes that indicate selection pressures. Previous attempts at analyzing data from HTS experiments have been limited as they could not simultaneously include data at intermediate time points, replicate experiments and sources of uncertainty specific to HTS such as sequencing depth. Results: We present the beta-binomial Gaussian process model for ranking features with significant non-random variation in abundance over time. The features are assumed to represent proportions, such as proportion of an alternative allele in a population. We use the beta-binomial model to capture the uncertainty arising from finite sequencing depth and combine it with a Gaussian process model over the time series. In simulations that mimic the features of experimental evolution data, the proposed method clearly outperforms classical testing in average precision of finding selected alleles. We also present simulations exploring different experimental design choices and results on real data from Drosophila experimental evolution experiment in temperature adaptation. Availability and implementation: R software implementing the test is available at https://github. com/handetopa/BBGP. Contact: , , , Supplementary information: Supplementary data are available at Bioinformatics online. C The Author 2015. Published by Oxford University Press. V 1762 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Gaussian process test for high-throughput sequencing time series 1 Introduction Stumpf, 2009; Liu et al., 2010; Liu and Niranjan, 2012; Stegle et al., 2010; Titsias et al., 2012; Yuan, 2006). In differential analysis, GPs have been applied to detect differences in gene expression time series in a two-sample setting by Stegle et al. (2010) and for detecting significant changes by Kalaitzis and Lawrence (2011). Although these methods provide a sensible basis for detecting the changing alleles, they fail to properly take into account all aspects of the available HTS data, such as differences in sequencing depth between different alleles and time points. These differences can have a huge impact in the reliability of different measured allele frequencies and taking them into account is vital for achieving good accuracy with the available short time series. 2 Methods To identify the candidate alleles which evolve under selection, we model the allele frequencies by GP regression. We fit time-dependent and time-independent GP models and rank the alleles according to their corresponding Bayes factors (BFs), i.e. the ratio of the marginal likelihoods under the different models. GPs provide a convenient approach for modelling short time series. However, when applying them to a large number of short parallel time series as in many genomic applications, naive application leads to overfitting or underfitting in some examples. Although these problems are rare, the bad examples can easily dominate the ranking. We overcome these challenges by excluding non-sensical parameter values, for example using a good variance model that can be incorporated into the GP models. 2.1 Data and preprocessing In the following, we use the term SNP for the markers and alleles under study, but the methods can be applied to any features whose abundance can be quantified in a similar manner. We consider SNPs that are bi-allelic for a specific position of the genome in a population. Multi-allelic SNPs, however, exist but are rare and likely to be sequencing errors (Burke et al., 2010). Multi-allelic cases can be treated by simply ignoring the least frequent allele or transformed to bi-allelic site by summing up the frequencies of the most infrequent alleles. Here, we assumed that only two of the alleles from (A, T, C, G) can be observed at each SNP position. After determining the abundances of these two specific alleles, we model the time dependency of the rising allele’s frequency over several generations. We will refer to generations as time points for simplicity. We denote the replicate index of each observation by rj and the time point by tj, j ¼ 1, . . . , J, with J denoting the total number of observations. For each of these points, we assume HTS reads have been aligned to a reference genome with yij reads with a specific allele at SNP position i. We use nij to denote the total sequencing depth at the position. 2.2 Mean and variance inference: beta-binomial model We model yij as a draw from a binomial distribution with parameters nij and pij: yij jnij ; pij Binðnij ; pij Þ; (1) where pij denotes the frequency of the specific allele in the population. We set a uniform Beta(1,1) prior on pij: pij ja; b Betaða; bÞ; where a ¼ 1, b ¼ 1. (2) Most biological processes are dynamic and analysis of time series data is necessary to understand them. Recent advances in highthroughput sequencing (HTS) technologies have provided new experimental approaches to collect genome-wide time series. For example, experimental evolution now uses a new evolve and resequencing (ER) approach to understand which genes are targeted by selection and how (Burke and Long, 2012; Kawecki et al., 2012). Such experiments enable phenotypic divergence to be forced in response to changes in only few environmental conditions in the laboratory while other conditions (...truncated)