Effective Online Bayesian Phylogenetics via Sequential Monte Carlo with Guided Proposals

Article PDF: https://academic.oup.com/sysbio/article-pdf/67/3/490/25013345/syx090.pdf

Syst. Biol. 67(3):490–502, 2018
© The Author(s) 2017. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
DOI:10.1093/sysbio/syx090
Advance Access publication November 27, 2017

MATHIEU FOURMENT1,∗, BRIAN C. CLAYWELL2, VU DINH2, CONNOR MCCOY2, FREDERICK A. MATSEN IV2, AND AARON E. DARLING1

1ithree institute, University of Technology Sydney, Ultimo, NSW 2007, Australia; 2Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA

∗Correspondence to be sent to: ithree institute, University of Technology Sydney, PO Box 123, Ultimo, NSW 2007, Australia; E-mail: .

Received 2 June 2017; reviews returned 16 November 2017; accepted 20 November 2017
Associate Editor: Edward Susko

Phylogenetic techniques are quickly becoming essential tools in the investigation and surveillance of infectious disease outbreaks (Gardy et al. 2015; Neher and Bedford 2015; Rusu et al. 2015). Meanwhile, advances in DNA sequencing technology have made the generation of complete genome data for isolates of bacteria and viruses a routine practice in public health laboratories. These genome data are collected into public databases such as the FDA GenomeTrakr (FDA 2016), which in 2016 accumulated new data at an average rate of over 1000 pathogen genomes per week. Sequencing technology itself continues to evolve, with new devices based on nanopore detection capable of generating a continuous stream of sequence data, supporting interactive real-time analysis (Loose et al. 2016).
Ideally these new data streams would be matched with appropriate sequence analysis tools, including Bayesian phylogenetic inference. Bayesian inference has particular value in epidemiological investigations due to its ability to operate on models with a wide range of unknown parameters, including divergence times, lineage-specific mutation rates, population demographics, and geography (Lemey et al. 2009; Kühnert et al. 2014). However, all current methods for Bayesian inference treat the data set as a static entity that has been observed in its entirety at the time that computation of the posterior probability distribution begins. Updating a data set with new sequences, as might be required when a new case of an infection is presented and sequenced, necessitates that the entire analysis be restarted. Although Izquierdo-Carrasco et al. (2014) have proposed a maximum likelihood approach to update a phylogenetic tree with new sequences, no such tool exists for Bayesian phylogenetic inference. Each run using popular Bayesian phylogenetic inference tools like MrBayes (Ronquist et al. 2012) or BEAST (Bouckaert et al. 2014) can take days or weeks of CPU time to approximate a posterior distribution for realistic models and data sets. The inability to quickly incorporate new data into an existing analysis is a major impediment to the use of Bayesian phylogenetics as a decision support tool for infectious disease management and surveillance, where interventions are most likely to be effective if made within hours or days.

Recently, Dinh et al. (2016) described a theoretical framework for updating a phylogenetic posterior approximation, called Online Phylogenetic Sequential Monte Carlo (OPSMC). An overview of OPSMC is given in Figure 1. At each generation, a population of particles representing a posterior sample of trees on n−1 sequences is updated to give a sample from the corresponding posterior on n sequences.
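The generation-by-generation update just described can be illustrated with a generic SMC step. The following is a minimal sketch under stated assumptions, not the authors' implementation: particles are tuples of real numbers standing in for trees, "adding a sequence" is mimicked by appending one coordinate, and the names `smc_update`, `propose`, `log_target`, and `normalize` are hypothetical. The order of operations (resample, extend, reweight) and the toy standard-normal target with a wider normal proposal are also assumptions chosen for illustration.

```python
import math
import random


def smc_update(particles, log_weights, propose, log_target_old, log_target_new, rng):
    """One generic SMC generation: resample, extend each particle, reweight."""
    n = len(particles)
    # Convert log-weights to relative weights in a numerically stable way.
    m = max(log_weights)
    rel = [math.exp(lw - m) for lw in log_weights]
    # Multinomial resampling: afterwards every parent carries equal weight.
    parents = rng.choices(particles, weights=rel, k=n)
    new_particles, new_log_weights = [], []
    for old in parents:
        new, log_q = propose(old, rng)
        # Importance weight: new target / (old target * proposal density).
        new_log_weights.append(log_target_new(new) - log_target_old(old) - log_q)
        new_particles.append(new)
    return new_particles, new_log_weights


def log_normal_pdf(x, sd):
    """Log density of a zero-mean normal with standard deviation sd."""
    return -0.5 * (x / sd) ** 2 - math.log(sd) - 0.5 * math.log(2 * math.pi)


def log_target(state):
    """Toy target (assumption): independent standard normals per coordinate."""
    return sum(log_normal_pdf(x, 1.0) for x in state)


def propose(state, rng):
    """Toy proposal (assumption): append a draw from a wider normal."""
    x = rng.gauss(0.0, 2.0)
    return state + (x,), log_normal_pdf(x, 2.0)


def normalize(log_weights):
    """Turn log-weights into weights that sum to one."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    return [x / total for x in w]


rng = random.Random(1)
# Start from an equally weighted sample of the one-coordinate target.
particles = [(rng.gauss(0.0, 1.0),) for _ in range(2000)]
log_weights = [0.0] * len(particles)

particles, log_weights = smc_update(
    particles, log_weights, propose, log_target, log_target, rng
)
weights = normalize(log_weights)

# Weighted average of a test function (the square of the new coordinate).
second_moment = sum(w * p[-1] ** 2 for w, p in zip(weights, particles))
```

In this toy setting the weighted average `second_moment` should be close to 1, the value of the test function under the target. In a phylogenetic setting the proposal would instead attach a new leaf to an existing tree, and the target densities would be phylogenetic posteriors.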
Optionally, one or more Metropolis–Hastings steps (not shown in the figure) can be applied to increase the effective sample size. Dinh et al. (2016) show consistency of OPSMC in terms of weak convergence: as the number of particles goes to infinity, the weighted average of a test function over a collection of particles converges to the integral of that test function with respect to the posterior distribution. In addition, the effective sample size (ESS) (defined below) is bounded below by a constant multiple of the number of particles. However, even given these attractive theoretical properties, it was not clear if OPSMC could be translated into a competitive sampler. In addition, more research is needed on the design of effective transition kernels for OPSMC, a subject of some debate in related literature (Teh et al. 2008; Bouchard-Côté et al. 2012).

Abstract. Modern infectious disease outbreak surveillance produces continuous streams of sequence data which require phylogenetic analysis as data arrives. Current software packages for Bayesian phylogenetic inference are unable to quickly incorporate new sequences as they become available, making them less useful for dynamically unfolding evolutionary stories. This limitation can be addressed by applying a class of Bayesian statistical inference algorithms called sequential Monte Carlo (SMC) to conduct online inference, wherein new data can be continuously incorporated to update the estimate of the posterior probability distribution. In this article, we describe and evaluate several different online phylogenetic sequential Monte Carlo (OPSMC) algorithms. We show that proposing new phylogenies with a density similar to the Bayesian prior suffers from poor performance, and we develop “guided” proposals that better match the proposal density to the posterior. Furthermore, we show that the simplest guided proposals can exhibit pathological behavior in some situations, leading to poor results, and that the situation can be resolved by heating the proposal density. The results demonstrate that relative to the widely used MCMC-based algorithm implemented in MrBayes, the total time required to compute a series of phylogenetic posteriors as sequences arrive can be significantly reduced by the use of OPSMC, without incurring a significant loss in accuracy. [Bayesian inference; online inference; phylogenetics; sequential Monte Carlo.]

In this work, we implement OPSMC with a variety of transition kernels and compare their ability to efficiently update phylogenetic posteriors with new data. In particular, we compare the efficiency of naïve proposals to guided proposals, showing that the extra effort required to compute a guided proposal leads to a significant overall improvement in sampler efficiency. Finally, we discuss prospects for the incorporation of OPSMC into widely used algorithms and software packages for Bayesian phylogenetic inference. For this article, we restrict ourselves to “pure” SMC without Metropolis–Hastings steps. Our implementation is available at (...truncated)