Roary: rapid large-scale prokaryote pan genome analysis

Bioinformatics, 31(22), 2015, 3691–3693
doi: 10.1093/bioinformatics/btv421
Advance Access Publication Date: 20 July 2015
Applications Note

Sequence analysis

Andrew J. Page1,*, Carla A. Cummins1, Martin Hunt1, Vanessa K. Wong1,2, Sandra Reuter2, Matthew T.G. Holden3, Maria Fookes1, Daniel Falush4, Jacqueline A. Keane1 and Julian Parkhill1

1Pathogen Genomics, The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, 2Department of Medicine, University of Cambridge, Cambridge, 3School of Medicine, University of St. Andrews, North Haugh, St Andrews and 4College of Medicine, Swansea University, Swansea, UK

*To whom correspondence should be addressed.
Associate Editor: John Hancock
Received on May 14, 2015; revised on June 26, 2015; accepted on July 14, 2015

Abstract

Summary: A typical prokaryote population sequencing study can now consist of hundreds or thousands of isolates. Interrogating these datasets can provide detailed insights into the genetic structure of prokaryotic genomes. We introduce Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and accessory genes. Roary makes construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising on the accuracy of results. Using a single CPU Roary can produce a pan genome consisting of 1000 isolates in 4.5 hours using 13 GB of RAM, with further speedups possible using multiple processors.
Availability and implementation: Roary is implemented in Perl and is freely available under an open source GPLv3 license from http://sanger-pathogens.github.io/Roary
Contact:
Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

The term microbial pan genome was first used in 2005 (Medini et al., 2005) to describe the union of genes shared by genomes of interest (Vernikos et al., 2014).
Since then, availability of microbial sequencing data has grown exponentially. Aligning whole-genome-sequenced isolates to a single reference genome can fail to incorporate non-reference sequences. By using de novo assemblies, non-reference sequences can also be analyzed. Microbial organisms can rapidly acquire genes from other organisms that can increase virulence or promote antimicrobial drug resistance (Medini et al., 2005). Gaining a better picture of the conserved genes of an organism, and the accessory genome, can lead to a better understanding of key processes such as selection and evolution.

The construction of a pan genome is NP-hard (Nguyen et al., 2014) with additional difficulties from real data due to contamination, fragmented assemblies and poor annotation. Therefore, any approach must employ heuristics to produce a pan genome (reviewed in Vernikos et al. 2014). The most complete standalone pan genome tools are PanOCT (Fouts et al., 2012), which uses a conserved gene neighborhood in addition to homology to accurately place proteins into orthologous clusters; LS-BSR (Sahl et al., 2014), which uses a preclustering step before running BLAST to rapidly assign genes to families; and PGAP (Zhao et al., 2012), which takes annotated assemblies, performs an all-against-all BLAST, clusters the results and produces a pan genome. PanOCT and PGAP require an all-against-all comparison using BLAST, so their running time grows approximately quadratically with the size of the input data, which makes them computationally infeasible for large datasets. They also have quadratic memory requirements, quickly exceeding the RAM available in high performance servers for large datasets. LS-BSR introduces a pre-clustering step that makes it an order of magnitude faster than PGAP; however, it is less sensitive (Sahl et al., 2014).

© The Author 2015. Published by Oxford University Press.

We have developed a method to generate the pan genome of a set of related prokaryotic isolates.
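As a rough illustration of why all-against-all comparison becomes infeasible and why a pre-clustering step helps, the following sketch counts pairwise comparisons before and after collapsing identical sequences. It is a toy model under stated assumptions, not the procedure used by any of the tools named above: the function names, the placeholder sequences and the dataset sizes are all hypothetical, and exact-duplicate collapsing merely stands in for a real similarity-based clustering step.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def all_against_all_pairs(n_sequences: int) -> int:
    """Number of unordered sequence pairs in an all-against-all comparison."""
    return n_sequences * (n_sequences - 1) // 2


def collapse_identical(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group record names by exact sequence; each key acts as one representative.

    Assumption: exact matching is a stand-in for real similarity clustering.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, sequence in records:
        groups[sequence].append(name)
    return dict(groups)


def toy_dataset(n_isolates: int, n_core: int, n_unique: int) -> List[Tuple[str, str]]:
    """Build hypothetical (name, sequence) records with placeholder sequences."""
    records = []
    for isolate in range(n_isolates):
        for gene in range(n_core):
            # Core genes: the same placeholder sequence in every isolate.
            records.append((f"isolate{isolate}_core{gene}", f"CORE{gene}"))
        for gene in range(n_unique):
            # Accessory genes: a placeholder sequence unique to this isolate.
            records.append((f"isolate{isolate}_acc{gene}", f"ACC{isolate}_{gene}"))
    return records


records = toy_dataset(n_isolates=10, n_core=50, n_unique=2)
groups = collapse_identical(records)

pairs_before = all_against_all_pairs(len(records))
pairs_after = all_against_all_pairs(len(groups))
print(f"{len(records)} sequences -> {pairs_before} pairs")
print(f"{len(groups)} representatives -> {pairs_after} pairs")
```

With these toy numbers, 520 sequences require 134940 pairwise comparisons, whereas the 70 distinct representatives require 2415. The saving depends entirely on how much of the input is redundant, which is why the effect is largest for closely related isolates.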
The method works with thousands of isolates in a computationally feasible time, beginning with annotated fragmented de novo assemblies. We address the computational issues by performing a rapid clustering of highly similar sequences, which can reduce the running time of BLAST substantially, and by carefully managing RAM usage so that it increases linearly. Together, these make it possible to analyze datasets with thousands of samples using commonly available computing hardware.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

2 Description

3 Results

We evaluated the accuracy, running time and memory usage of Roary against three similar standalone pan genome applications. In each case, we performed the analysis using a single processor (AMD Opteron 6272) and provided 60 GB of RAM. We constructed a simulated dataset based on Salmonella enterica serovar Typhi (S.typhi) CT18 (acc. no. AL513382), allowing us to accurately assess the quality of the clustering. We created 12 genomes with 994 identical core genes and 23 accessory genes in varying combinations. All the applications created clusters that are within 1% of the expected results, with Roary correctly building all genes as shown in Table 1. The overlap of the clusters is virtually identical in all applications except LS-BSR, which over-clusters in 2% of cases. In addition, a set of 1000 real annotated assemblies of S.typhi genomes was used. Subsets of the data were provided to each

Table 1. Accuracy of each pan genome application on a dataset of simulated data

                   Expected   PGAP   PanOCT   LS-BSR   Roary
Core genes              994    991      993      974     994
Total genes            1017   1012     1015      994    1017
Incorrect split           0      0        1        0       0
Incorrect merge           0      4        1       23       0

Fig. 1.
Effect of dataset size on the wall time of multiple applications. Only analysis that completed within 2 days and 60 GB of RAM is shown

Table 2. Comparison of pan genome applications using real S.typhi data (ERP001718)

Samples   Software   Core (a)   Total   RAM (MB)   Wall time (s)
8         PGAP           4545    4929        569           41397
8         PanOCT         4544    4936        663            1457
8         LS-BSR         4476    4816        270            2585
8         Roary          4459    4871        156              44
24        PGAP              -       -          -               -
24        PanOCT         4522    4991       5313           96093
24        LS-BSR         4451    4843        554            7807
24        Roary          4436    4941        444             382
1000      PGAP              -       -          -               -
1000      PanOCT            -       -          -               -
1000      LS-BSR         4272    7265      17413          345019
1000      Roary          4016    9201      13752           15465

(a) Core is defined as a gene being in at least 99% of samples, which allows for some assembly errors in very large datasets. Where there are no results, the applications failed to complete within 5 days or used more than 60 GB of RAM. The first column is the number of unique S.typhi genomes in the input set with a mean of 54 contigs over all 1000 assemblies.

The input to Roary is one annotated assembly per sample in GFF3 format (Stein, 2013), such as that produced by Prokka (Seemann, 2014), where all s (...truncated)