Bacterial whole genome-based phylogeny: construction of a new benchmarking dataset and assessment of some existing methods
Ahrenfeldt et al. BMC Genomics (2017) 18:19
DOI 10.1186/s12864-016-3407-6
RESEARCH ARTICLE
Open Access
Johanne Ahrenfeldt1*, Carina Skaarup1, Henrik Hasman2, Anders Gorm Pedersen1, Frank Møller Aarestrup3
and Ole Lund1
Abstract
Background: Whole genome sequencing (WGS) is increasingly used in diagnostics and surveillance of infectious
diseases. A major application for WGS is to use the data for identifying outbreak clusters, and there is therefore a
need for methods that can accurately and efficiently infer phylogenies from sequencing reads. In the present study
we describe a new dataset that we have created for the purpose of benchmarking such WGS-based methods for
epidemiological data, and also present an analysis where we use the data to compare the performance of some
current methods.
Results: Our aim was to create a benchmark data set that mimics sequencing data of the sort that might be collected
during an outbreak of an infectious disease. This was achieved by letting an E. coli hypermutator strain grow in the lab
for 8 consecutive days, each day splitting the culture in two while also collecting samples for sequencing. The result is
a data set consisting of 101 whole genome sequences with known phylogenetic relationship. Among the sequenced
samples 51 correspond to internal nodes in the phylogeny because they are ancestral, while the remaining
50 correspond to leaves.
We also used the newly created data set to compare three methods, all available online, that infer
phylogenies from whole-genome sequencing reads: NDtree, CSI Phylogeny and REALPHY. One complication
when comparing the output of these methods with the known phylogeny is that phylogenetic methods
typically build trees where all observed sequences are placed as leaves, even though some of them are in fact
ancestral. We therefore devised a method for post-processing the inferred trees by collapsing short branches
(thus relocating some leaves to internal nodes), and also present two new measures of tree similarity that take
into account the identity of both internal and leaf nodes.
Conclusions: Based on this analysis we find that, among the investigated methods, CSI Phylogeny had the
best performance, correctly identifying 73% of all branches in the tree and 71% of all clades.
We have made all data from this experiment (raw sequencing reads, consensus whole-genome sequences, as
well as descriptions of the known phylogeny in a variety of formats) publicly available, with the hope that
other groups may find this data useful for benchmarking and exploring the performance of epidemiological
methods. All data is freely available at: https://cge.cbs.dtu.dk/services/evolution_data.php.
Keywords: Phylogeny, Evolution, Benchmark, WGS
* Correspondence: ;
1 Center for Biological Sequence Analysis, DTU Bioinformatics, Technical
University of Denmark, Kongens Lyngby, Denmark
Full list of author information is available at the end of the article
© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Background
The ability to detect and track outbreaks of infectious
diseases is of vital importance to maintain public health.
Advances in Next Generation Sequencing (NGS)
technology have led to decreasing cost and growing speed
of Whole Genome Sequencing (WGS) [1]. Due to this,
the technology has gained increasing importance in routine clinical microbiology and for studying and detecting
outbreaks and epidemics [2–4]. Various studies have
shown that inference of the phylogenetic relationship
between WGS isolates is helpful for determining epidemiological relationships [5, 6], and a number of
methods for inferring phylogenies directly from NGS
data have been created. Methods available online that
accept raw read data include snpTree [7], NDtree [8, 9]
and CSI Phylogeny [10], which are available from the Center for Genomic Epidemiology. Furthermore, REALPHY from the
Swiss Institute of Bioinformatics is also available online
and can be downloaded for local installation [11]. In
addition, many groups are building in-house pipelines for outbreak detection.
There were two main goals of the present study: (1) to
create a data set that could be used to benchmark NGS-based methods for epidemiological data, and (2) to use
this for comparing the performance of some current
methods. We wanted the benchmark data set to mimic
NGS data of the sort that might be collected during an
outbreak of an infectious disease. This was achieved by
letting an E. coli hypermutator strain grow in the lab for
8 consecutive days. Each day all growing cultures were
divided in two, and samples were taken for sequencing.
The result was a total of 255 samples corresponding to
both internal (ancestral) and external (leaf) nodes on a
bifurcating phylogenetic tree.
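
The sample count follows from the shape of the splitting scheme. The sketch below is only an illustration of that arithmetic, not the laboratory protocol: it assumes a single starting culture, one sample per culture per day, and a split of every culture between consecutive sampling days. The function name and the node labels ("R", "R0", "R01", ...) are hypothetical.

```python
def build_split_tree(days: int = 8) -> dict[str, list[str]]:
    """Return a parent -> children map for cultures split in two once per day.

    Assumption: one culture on the first day, every culture sampled each day.
    Node names are hypothetical: the root is "R" and each split appends
    "0" or "1" to the parent's name.
    """
    children: dict[str, list[str]] = {}
    current = ["R"]                      # cultures sampled on the first day
    for _ in range(days - 1):            # one split between consecutive days
        next_level = []
        for name in current:
            kids = [name + "0", name + "1"]
            children[name] = kids
            next_level.extend(kids)
        current = next_level
    for name in current:                 # cultures sampled on the last day
        children[name] = []              # leaves have no descendants
    return children


tree = build_split_tree(8)
n_total = len(tree)
n_leaves = sum(1 for kids in tree.values() if not kids)
n_internal = n_total - n_leaves
print(n_total, n_internal, n_leaves)     # prints: 255 127 128
```

Under these assumptions the eight sampling days give 1 + 2 + 4 + ... + 128 = 2^8 − 1 = 255 samples, which matches the total stated above. Only a subset of these samples was sequenced, as noted in the abstract.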
To the best of our knowledge there are currently no
other large-scale in vitro WGS data sets with known
phylogeny for evaluation of WGS phylogeny methods,
and it is our hope that this data will prove useful for
benchmarking and optimization of future methods. The
group of Richard Lenski at Michigan State University
has performed a long-term experimental evolution project that has been running since 1988 [12, 13] and
which might also be useful for this purpose, although
only a limited number of full genome sequences are so
far available.
Results
Escherichia coli hypermutator strain
To ensure a measurable difference between sequenced samples in the data set, the experiment was set
up to give a high probability of observing at least one
mutation between each sequenced culture sample. Wild
type E. coli has a mutation rate of around 10⁻³ mutations
per genome per generation [14], corresponding to about
0.05 mutations per genome per day at a generation time
of 30 min [15]. At this rate each sample would have to
grow for 20 days to undergo an average of one mutation
per genome. The E. coli hypermutator strain CSH114,
on the other hand, has been reported to have a mutation rate that is about 100–1000-fold higher due to a
mutation in the mutT gene which makes it prone to
AT → GC mutations [14, 16]. Using an assay based
on the frequency of spontaneous development of
Rifampicin resistance (see Methods), we estimated the
mutation rate of the hypermutator strain to be about (...truncated)