Effective Online Bayesian Phylogenetics via Sequential Monte Carlo with Guided Proposals
Syst. Biol. 67(3):490–502, 2018
© The Author(s) 2017. Published by Oxford University Press, on behalf of the Society of Systematic Biologists.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
DOI:10.1093/sysbio/syx090
Advance Access publication November 27, 2017
MATHIEU FOURMENT1,∗, BRIAN C. CLAYWELL2, VU DINH2, CONNOR MCCOY2, FREDERICK A. MATSEN IV2,
AND AARON E. DARLING1
1 ithree institute, University of Technology Sydney, Ultimo, NSW 2007, Australia; 2 Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
∗ Correspondence to be sent to: ithree institute, University of Technology Sydney, PO Box 123, Ultimo, NSW 2007, Australia;
E-mail: .
Received 2 June 2017; reviews returned 16 November 2017; accepted 20 November 2017
Associate Editor: Edward Susko
Abstract.—Modern infectious disease outbreak surveillance produces continuous streams of sequence data which require
phylogenetic analysis as data arrives. Current software packages for Bayesian phylogenetic inference are unable to quickly
incorporate new sequences as they become available, making them less useful for dynamically unfolding evolutionary
stories. This limitation can be addressed by applying a class of Bayesian statistical inference algorithms called sequential
Monte Carlo (SMC) to conduct online inference, wherein new data can be continuously incorporated to update the estimate of
the posterior probability distribution. In this article, we describe and evaluate several different online phylogenetic sequential
Monte Carlo (OPSMC) algorithms. We show that proposing new phylogenies with a density similar to the Bayesian prior
suffers from poor performance, and we develop “guided” proposals that better match the proposal density to the posterior.
Furthermore, we show that the simplest guided proposals can exhibit pathological behavior in some situations, leading to
poor results, and that the situation can be resolved by heating the proposal density. The results demonstrate that relative to the
widely used MCMC-based algorithm implemented in MrBayes, the total time required to compute a series of phylogenetic
posteriors as sequences arrive can be significantly reduced by the use of OPSMC, without incurring a significant loss in
accuracy. [Bayesian inference; online inference; phylogenetics; sequential Monte Carlo.]
Phylogenetic techniques are quickly becoming an
essential tool in the investigation and surveillance
of infectious disease outbreaks (Gardy et al. 2015;
Neher and Bedford 2015; Rusu et al. 2015). Meanwhile,
advances in DNA sequencing technology have made
the generation of complete genome data for isolates
of bacteria and viruses a routine practice in public
health laboratories. These genome data are collected
into public databases such as the FDA GenomeTrakr
(FDA 2016), which in 2016 accumulated new data at
an average rate of over 1000 pathogen genomes per
week. Sequencing technology itself continues to evolve,
with new devices based on nanopore detection capable
of generating a continuous stream of sequence data,
supporting interactive real-time analysis (Loose et al.
2016).
Ideally these new data streams would be matched
with appropriate sequence analysis tools, including
Bayesian phylogenetic inference. Bayesian inference
has particular value in epidemiological investigations
due to its ability to operate on models with a wide
range of unknown parameters, including divergence
times, lineage-specific mutation rates, population
demographics, and geography (Lemey et al. 2009;
Kühnert et al. 2014). However, all current methods for
Bayesian inference treat the data set as a static entity
that has been observed in its entirety at the time that
computation of the posterior probability distribution
begins. Updating a data set with new sequences, as
might be required when a new case of an infection is
presented and sequenced, necessitates that the entire
analysis be restarted.
Although Izquierdo-Carrasco et al. (2014) have
proposed a maximum likelihood approach to update
a phylogenetic tree with new sequences, no such tool
exists for Bayesian phylogenetic inference. Each run
using popular Bayesian phylogenetic inference tools like
MrBayes (Ronquist et al. 2012) or BEAST (Bouckaert et al.
2014) can take days or weeks of CPU time to approximate
a posterior distribution for realistic models and data sets.
The inability to quickly incorporate new data into an
existing analysis is a major impediment to the use of
Bayesian phylogenetics as a decision support tool for
infectious disease management and surveillance, where
interventions are most likely to be effective if made
within hours or days.
Recently, Dinh et al. (2016) described a theoretical
framework for updating a phylogenetic posterior
approximation, called Online Phylogenetic Sequential
Monte Carlo (OPSMC). An overview of OPSMC is
given in Figure 1. At each generation, a population
of particles representing a posterior sample of trees
on n−1 sequences is updated to give a sample from
the corresponding posterior on n sequences. Optionally,
one or more Metropolis–Hastings steps (not shown
in the figure) can be applied to increase the effective
sample size. Dinh et al. (2016) show consistency of
OPSMC in terms of weak convergence: as the number
of particles goes to infinity, the weighted average of
a test function over a collection of particles converges
to the integral of that test function with respect to the
posterior distribution. In addition, the effective sample
size (ESS) (defined below) is bounded below by a
constant multiple of the number of particles. However,
even given these attractive theoretical properties, it was
not clear if OPSMC could be translated into a competitive
sampler. In addition, more research is needed on the
design of effective transition kernels for OPSMC, a
subject of some debate in related literature (Teh et al.
2008; Bouchard-Côté et al. 2012).
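The two quantities mentioned above, the weighted average of a test function over the particles and the effective sample size, can be illustrated with a small sketch. It assumes the commonly used ESS estimate, 1 / sum(w_i^2) over normalized weights, which may differ in detail from the definition given later in the article. The function names and the toy numbers are hypothetical and are not taken from the authors' implementation.

```python
import math


def normalize_log_weights(log_weights):
    """Convert unnormalized log-weights into weights that sum to 1.

    Subtracting the maximum before exponentiating avoids underflow
    when the log-weights are large and negative, as log-likelihoods are.
    """
    m = max(log_weights)
    unnormalized = [math.exp(lw - m) for lw in log_weights]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def effective_sample_size(weights):
    """ESS estimate 1 / sum(w_i^2) for normalized weights (an assumed definition)."""
    return 1.0 / sum(w * w for w in weights)


def weighted_average(values, weights):
    """Weighted average of a test function evaluated on each particle."""
    return sum(v * w for v, w in zip(values, weights))


# Toy example: four particles, each carrying one value of a test function
# (for instance, total tree length) and an unnormalized log-weight.
tree_lengths = [1.0, 2.0, 3.0, 4.0]
log_weights = [-10.0, -10.0, -10.0, -10.0]

weights = normalize_log_weights(log_weights)
ess = effective_sample_size(weights)
estimate = weighted_average(tree_lengths, weights)
```

With equal weights the ESS equals the number of particles. When a single particle carries all of the weight it falls to 1, which is the degenerate case that a well-matched proposal is meant to avoid.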
In this work, we implement OPSMC with a variety
of transition kernels and compare their ability to
efficiently update phylogenetic posteriors with new
data. In particular, we compare the efficiency of naïve
proposals to guided proposals, showing that the extra
effort required to compute a guided proposal leads to
a significant overall improvement in sampler efficiency.
Finally, we discuss prospects for the incorporation of
OPSMC into widely used algorithms and software
packages for Bayesian phylogenetic inference. For this
article, we restrict ourselves to “pure” SMC without
Metropolis–Hastings steps. Our implementation is
available at (...truncated)