Roary: rapid large-scale prokaryote pan genome analysis
Bioinformatics, 31(22), 2015, 3691–3693
doi: 10.1093/bioinformatics/btv421
Advance Access Publication Date: 20 July 2015
Applications Note
Sequence analysis
Andrew J. Page1,*, Carla A. Cummins1, Martin Hunt1,
Vanessa K. Wong1,2, Sandra Reuter2, Matthew T.G. Holden3,
Maria Fookes1, Daniel Falush4, Jacqueline A. Keane1 and Julian Parkhill1
1Pathogen Genomics, The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge,
2Department of Medicine, University of Cambridge, Cambridge, 3School of Medicine, University of St. Andrews,
North Haugh, St Andrews and 4College of Medicine, Swansea University, Swansea, UK
*To whom correspondence should be addressed.
Associate Editor: John Hancock
Received on May 14, 2015; revised on June 26, 2015; accepted on July 14, 2015
Abstract
Summary: A typical prokaryote population sequencing study can now consist of hundreds or
thousands of isolates. Interrogating these datasets can provide detailed insights into the genetic
structure of prokaryotic genomes. We introduce Roary, a tool that rapidly builds large-scale pan
genomes, identifying the core and accessory genes. Roary makes construction of the pan genome
of thousands of prokaryote samples possible on a standard desktop without compromising on the
accuracy of results. Using a single CPU, Roary can produce a pan genome consisting of 1000 isolates in 4.5 hours using 13 GB of RAM, with further speedups possible using multiple processors.
Availability and implementation: Roary is implemented in Perl and is freely available under an
open source GPLv3 license from http://sanger-pathogens.github.io/Roary
Contact:
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
The term microbial pan genome was first used in 2005 (Medini
et al., 2005) to describe the union of genes shared by genomes of
interest (Vernikos et al., 2014). Since then, availability of microbial sequencing data has grown exponentially. Aligning whole-genome-sequenced isolates to a single reference genome can fail to
incorporate non-reference sequences. By using de novo assemblies,
non-reference sequences can also be analyzed. Microbial organisms
can rapidly acquire genes from other organisms that can increase
virulence or promote antimicrobial drug resistance (Medini et al.,
2005). Gaining a better picture of the conserved genes of an organism, and the accessory genome, can lead to a better understanding of
key processes such as selection and evolution.
© The Author 2015. Published by Oxford University Press.
The construction of a pan genome is NP-hard (Nguyen et al., 2014), with additional difficulties
arising in real data from contamination, fragmented assemblies and poor annotation. Therefore, any
approach must employ heuristics to produce a pan genome (reviewed in Vernikos et al., 2014).
The most complete standalone pan genome tools are PanOCT (Fouts et al., 2012), which uses
conserved gene neighborhood in addition to homology to accurately place proteins into
orthologous clusters; LS-BSR (Sahl et al., 2014), which uses a pre-clustering step before running
BLAST to rapidly assign genes to families; and PGAP, which takes annotated assemblies,
performs an all-against-all BLAST, clusters the results and produces a pan genome
(Zhao et al., 2012).
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
PanOCT and PGAP require an all-against-all comparison using BLAST, so their running time
grows approximately quadratically with the size of the input data and they become
computationally infeasible with large datasets. They also have quadratic memory requirements,
quickly exceeding the RAM available in high performance servers for large datasets. LS-BSR
introduces a pre-clustering step that makes it an order of magnitude faster than PGAP; however,
it is less sensitive (Sahl et al., 2014). We have developed a method to generate the pan genome
of a set of related prokaryotic isolates. It works with thousands of isolates in a computationally
feasible time, beginning with annotated fragmented de novo assemblies. We address the
computational issues by performing a rapid clustering of highly similar sequences, which can
reduce the running time of BLAST substantially, and by carefully managing RAM usage so that it
increases linearly. Together, these make it possible to analyze datasets with thousands of
samples using commonly available computing hardware.
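To make the scaling argument concrete, the following is a minimal Python sketch of how the number of pairwise comparisons in an all-against-all search grows with the number of sequences, and how keeping one representative per cluster of highly similar sequences shrinks it. The function names and the example gene counts are illustrative assumptions; they are not taken from Roary or from the benchmarks reported here.

```python
def pairwise_comparisons(n_sequences: int) -> int:
    """Number of unordered pairs in an all-against-all comparison."""
    return n_sequences * (n_sequences - 1) // 2


def reduction_factor(n_total: int, n_representatives: int) -> float:
    """How many times fewer pairs remain when only one representative
    per pre-cluster is compared instead of every sequence."""
    return pairwise_comparisons(n_total) / pairwise_comparisons(n_representatives)


if __name__ == "__main__":
    # Hypothetical numbers: 10 isolates with 4500 genes each, collapsed by
    # pre-clustering to 5000 representative sequences.
    total = 10 * 4500
    representatives = 5000
    print("pairs without pre-clustering:", pairwise_comparisons(total))
    print("pairs with pre-clustering:   ", pairwise_comparisons(representatives))
    print("reduction factor:            ", round(reduction_factor(total, representatives), 1))
```

Because the pair count is quadratic, reducing the number of sequences by some factor reduces the comparisons by roughly the square of that factor, which is why a cheap pre-clustering step can pay for itself on closely related isolates.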
2 Description
3 Results
We evaluated the accuracy, running time and memory usage of
Roary against three similar standalone pan genome applications. In
each case, we performed the analysis using a single processor (AMD
Opteron 6272) and provided 60 GB of RAM. We constructed a
simulated dataset based on Salmonella enterica serovar Typhi
(S.typhi) CT18 (acc. no. AL513382), allowing us to accurately assess the quality of the clustering. We created 12 genomes with 994
identical core genes and 23 accessory genes in varying combinations.
All the applications created clusters that are within 1% of the expected results, with Roary correctly building all genes as shown in
Table 1. The overlap of the clusters is virtually identical in all applications except LS-BSR, which over-clusters in 2% of cases.
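As an illustration of how clustering quality can be scored against a simulated ground truth, the Python sketch below counts split and merged gene families. The definitions used are assumptions made for this example, since the text does not state how the counts in Table 1 were derived: a "split" is a true family whose genes land in more than one predicted cluster, and a "merge" is a predicted cluster containing genes from more than one true family. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def count_split_and_merge(truth: dict[str, str], predicted: dict[str, str]) -> tuple[int, int]:
    """Return (splits, merges) for a predicted clustering.

    truth maps gene id -> true family; predicted maps gene id -> cluster id.
    """
    clusters_per_family = defaultdict(set)
    families_per_cluster = defaultdict(set)
    for gene, family in truth.items():
        cluster = predicted[gene]
        clusters_per_family[family].add(cluster)
        families_per_cluster[cluster].add(family)
    # A family spread over several clusters counts as one split.
    splits = sum(1 for clusters in clusters_per_family.values() if len(clusters) > 1)
    # A cluster holding several families counts as one merge.
    merges = sum(1 for families in families_per_cluster.values() if len(families) > 1)
    return splits, merges


if __name__ == "__main__":
    # Toy data: family A is split over c1 and c2; families B and C are merged in c3.
    truth = {"g1": "A", "g2": "A", "g3": "B", "g4": "C"}
    predicted = {"g1": "c1", "g2": "c2", "g3": "c3", "g4": "c3"}
    print(count_split_and_merge(truth, predicted))
```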
In addition, a set of 1000 real annotated assemblies of S.typhi
genomes was used. Subsets of the data were provided to each
Table 1. Accuracy of each pan genome application on a dataset of simulated data

                   Expected   PGAP   PanOCT   LS-BSR   Roary
Core genes              994    991      993      974     994
Total genes            1017   1012     1015      994    1017
Incorrect split           0      0        1        0       0
Incorrect merge           0      4        1       23       0
Fig. 1. Effect of dataset size on the wall time of multiple applications. Only
analyses that completed within 2 days and within 60 GB of RAM are shown
Table 2. Comparison of pan genome applications using real S.typhi data (ERP001718)

Samples   Software   Core (a)   Total   RAM (MB)   Wall time (s)
      8   PGAP           4545    4929        569          41 397
      8   PanOCT         4544    4936        663            1457
      8   LS-BSR         4476    4816        270            2585
      8   Roary          4459    4871        156              44
     24   PGAP              —       —          —               —
     24   PanOCT         4522    4991       5313          96 093
     24   LS-BSR         4451    4843        554            7807
     24   Roary          4436    4941        444             382
   1000   PGAP              —       —          —               —
   1000   PanOCT            —       —          —               —
   1000   LS-BSR         4272    7265     17 413         345 019
   1000   Roary          4016    9201     13 752          15 465
(a) Core is defined as a gene being in at least 99% of samples, which allows
for some assembly errors in very large datasets. Where there are no results,
the applications failed to complete within 5 days or used more than 60 GB of
RAM. The first column is the number of unique S.typhi genomes in the input
set with a mean of 54 contigs over all 1000 assemblies.
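The 99% core threshold in the footnote can be expressed as a simple rule over a gene presence table. The Python sketch below is only an illustration of that rule under stated assumptions: the function name, the dictionary layout (gene cluster name mapped to the set of samples containing it) and the toy sample names are hypothetical and do not reflect how Roary, which is written in Perl, stores or processes its data.

```python
def classify_genes(
    presence: dict[str, set[str]],
    n_samples: int,
    core_fraction: float = 0.99,
) -> tuple[list[str], list[str]]:
    """Split gene clusters into core and accessory.

    A cluster is core when it is present in at least core_fraction
    of the n_samples input samples; otherwise it is accessory.
    """
    core, accessory = [], []
    for gene, samples in presence.items():
        if len(samples) / n_samples >= core_fraction:
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory


if __name__ == "__main__":
    # Toy input with 100 hypothetical samples named s0..s99.
    all_samples = {f"s{i}" for i in range(100)}
    presence = {
        "geneA": set(all_samples),                  # in 100 of 100 samples
        "geneB": all_samples - {"s0"},              # in 99 of 100 samples
        "geneC": all_samples - {"s0", "s1"},        # in 98 of 100 samples
    }
    core, accessory = classify_genes(presence, n_samples=100)
    print("core:", core)
    print("accessory:", accessory)
```

With a threshold below 100%, a gene that is missing from a handful of assemblies because of assembly or annotation errors is still counted as core, which is the motivation given in the footnote.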
The input to Roary is one annotated assembly per sample in
GFF3 format (Stein, 2013), such as that produced by Prokka
(Seemann, 2014), where all s (...truncated)