Mugsy: fast multiple alignment of closely related whole genomes

BIOINFORMATICS ORIGINAL PAPER
Sequence analysis
Vol. 27 no. 3 2011, pages 334–342
doi:10.1093/bioinformatics/btq665
Advance Access publication December 9, 2010

Samuel V. Angiuoli (1,2,*) and Steven L. Salzberg (1)
(1) Center for Bioinformatics and Computational Biology, University of Maryland, College Park and (2) Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
(*) To whom correspondence should be addressed.

Associate Editor: Dmitrij Frishman
Received on June 23, 2010; revised on November 29, 2010; accepted on November 30, 2010

1 INTRODUCTION

There are numerous sequenced genomes from organisms spanning the tree of life. This number of genomes is expected to continue to grow dramatically in coming years due to advances in sequencing technologies and decreasing costs. For particular populations of interest, many individual genomes will be sequenced to study genetic diversity. The Cancer Genome Atlas, 1000 Genomes Project and the Personal Genome Project will generate genome sequences from at least several thousand people. There are already over one thousand complete bacterial genomes in public databases. Often, a pan-genome concept is needed to describe a species or population (Medini et al., 2005), requiring multiple sequenced genomes from the same species. There are already nine bacterial species with ten or more sequenced genomes in a recent version of RefSeq. Hundreds of individual sequenced genomes are expected for some medically relevant species and model organisms, such as Escherichia coli. Many of these genomes will be in the form of “draft” genomes, where the sequencing reads are assembled into numerous contigs that together represent a fraction of the actual genome, but are incomplete and contain physical sequencing gaps.
To make use of this explosive growth in the number of sequenced genomes, the scientific community requires tools that can quickly compare large numbers of long and highly similar sequences from whole genomes. Whole-genome alignment has become instrumental for studying genome evolution and genetic diversity (Batzoglou, 2005; Dewey and Pachter, 2006). There are a number of whole-genome alignment tools that can align multiple whole genomes (Blanchette et al., 2004; Darling et al., 2004; Dubchak et al., 2009; Hohl et al., 2002; Paten et al., 2008). Whole-genome alignment tools are distinguished from collinear multiple sequence alignment tools, such as those of Bradley et al. (2009), Edgar (2004) and Thompson et al. (1994), in that they can align very long sequences, millions of base pairs in length, while detecting rearrangements, duplications, and large-scale sequence gain and loss. The resulting alignments can be used to build phylogenies, determine orthology, find recently duplicated regions, and identify species-specific DNA.

For divergent sequences, alignment accuracy is difficult to assess and popular methods can disagree, as was demonstrated by the relatively low level of agreement between outputs for the ENCODE regions (Chen and Tompa, 2010; Margulies et al., 2007). Given the difficulties in assessing accuracy, recent development has included methods that are statistically motivated and show improved specificity (Bradley et al., 2009; Paten et al., 2008). At shorter evolutionary distances with large fractions of identical sequences, there is less ambiguity in alignment outcomes. Yet, even within a bacterial species, aligning multiple genomes is not a trivial task, especially if the sequences contain rearrangements and duplications and exhibit sequence gain and loss.
Also, despite relatively short chromosome lengths for bacteria, typically a few million base pairs, the computational complexity of multiple sequence alignment makes it a formidable challenge. Calculation of multiple alignments with a simple sum-of-pairs scoring scheme is known to be an NP-hard problem (Elias, 2006), which makes calculation of an exact solution infeasible for large inputs. Multiple genome alignment tools rely on heuristics to achieve reasonable run times.

There are numerous methods to compare a single pair of whole-genome sequences (Bray et al., 2003; Schwartz et al., 2003). The Nucmer and MUMmer package is a fast whole-genome alignment method that utilizes a suffix tree to seed an alignment with maximal unique matches (MUMs) (Kurtz et al., 2004). The suffix tree implementation of MUMmer is especially efficient and can be both [...]

ABSTRACT

Motivation: The relative ease and low cost of current-generation sequencing technologies have led to a dramatic increase in the number of sequenced genomes for species across the tree of life. This increasing volume of data requires tools that can quickly compare multiple whole-genome sequences, millions of base pairs in length, to aid in the study of populations, pan-genomes, and genome evolution.

Results: We present a new multiple alignment tool for whole genomes named Mugsy. Mugsy is computationally efficient and can align 31 Streptococcus pneumoniae genomes in less than 2 hours, producing alignments that compare favorably to other tools. Mugsy is also the fastest program evaluated for the multiple alignment of assembled human chromosome sequences from four individuals. Mugsy does not require a reference sequence, can align mixtures of assembled draft and completed genome data, and is robust in identifying a rich complement of genetic variation including duplications, rearrangements, and large-scale gain and loss of sequence.

Availability: Mugsy is free, open-source software available from http://mugsy.sf.net.
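As an illustration of the sum-of-pairs scoring scheme mentioned in the introduction, the following Python sketch scores a small set of sequences that are already aligned. The scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for this example; they are not taken from Mugsy or from Elias (2006). Evaluating the score of a given alignment is cheap, as shown here; the NP-hard part is finding the alignment that maximizes it.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (illustrative values only)."""
    if a == "-" and b == "-":
        return 0          # two gaps facing each other contribute nothing
    if a == "-" or b == "-":
        return gap        # a residue aligned against a gap
    return match if a == b else mismatch


def sum_of_pairs_score(rows, **scoring):
    """Sum pair_score over every pair of rows in every alignment column."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*rows):                 # walk the alignment column by column
        for a, b in combinations(column, 2):  # every unordered pair of rows
            total += pair_score(a, b, **scoring)
    return total


if __name__ == "__main__":
    alignment = ["ACGT", "A-GT", "ACGA"]
    print(sum_of_pairs_score(alignment))  # prints 2
```

For k sequences and an alignment of length n, this evaluation takes on the order of n·k² character comparisons. The difficulty described in the text comes from the number of candidate alignments, not from scoring any single one.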
Contact:

Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

[...] duplications, with the segment-based multiple alignment method provided by the SeqAn C++ library. Mugsy also implements a novel algorithm for identifying locally collinear blocks (LCBs) from an alignment graph. The LCBs represent aligned regions from two or more genomes that are collinear and free of rearrangements but may also contain segments that lack homology and introduce gaps in the alignment. Mugsy is run as a single command line invocation that accepts a set of multi-FASTA files, one per genome, and outputs a multiple alignment in MAF format. The Mugsy aligner is open-source software and is available for download at http://mugsy.sf.net.

2 METHODS

The Mugsy alignment tool is comprised of four primary steps (Fig. 1): (1) an all-agains (...truncated)