Mugsy: fast multiple alignment of closely related whole genomes
BIOINFORMATICS
ORIGINAL PAPER
Sequence analysis
Vol. 27 no. 3 2011, pages 334–342
doi:10.1093/bioinformatics/btq665
Advance Access publication December 9, 2010
Samuel V. Angiuoli1,2,∗ and Steven L. Salzberg1
1 Center for Bioinformatics and Computational Biology, University of Maryland, College Park and
2 Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
Associate Editor: Dmitrij Frishman
Received on June 23, 2010; revised on November 29, 2010; accepted on November 30, 2010
1 INTRODUCTION
There are numerous sequenced genomes from organisms spanning
the tree of life. This number of genomes is expected to
continue to grow dramatically in coming years due to advances
in sequencing technologies and decreasing costs. For particular
populations of interest, many individual genomes will be sequenced
to study genetic diversity. The Cancer Genome Atlas, 1000 Genomes
Project and the Personal Genome Project will generate genome
sequences from at least several thousand people. For bacteria,
there are already over one thousand complete
genomes in public databases. Often, a pan-genome concept is
needed to describe a species or population (Medini et al., 2005),
requiring multiple sequenced genomes from the same species. There
are already nine bacterial species with ten or more sequenced
genomes in a recent version of RefSeq. Hundreds of individual
sequenced genomes are expected for some medically relevant
species and model organisms, such as Escherichia coli. Many of
these genomes will be in the form of “draft” genomes, where the
sequencing reads are assembled into numerous contigs that together
represent a fraction of the actual genome, but are incomplete and
contain physical sequencing gaps. In order to make use of this
explosive growth in the number of sequenced genomes, the scientific
community requires tools that can quickly compare large numbers
of long and highly similar sequences from whole genomes.
∗ To whom correspondence should be addressed.
Whole-genome alignment has become instrumental for studying
genome evolution and genetic diversity (Batzoglou, 2005; Dewey
and Pachter, 2006). There are a number of whole-genome alignment
tools that can align multiple whole genomes (Blanchette et al.,
2004; Darling et al., 2004; Dubchak et al., 2009; Hohl et al., 2002;
Paten et al., 2008). Whole-genome alignment tools are distinguished
from collinear multiple sequence alignment tools, such as those of
Bradley et al. (2009), Edgar (2004) and Thompson et al. (1994), in
that they can align very long sequences, millions of base pairs in
length, detecting the presence of rearrangements, duplications, and
large-scale sequence gain and loss. The resulting alignments can
be utilized to build phylogenies, determine orthology, find recently
duplicated regions, and identify species-specific DNA. For divergent
sequences, alignment accuracy is difficult to assess and popular
methods can disagree, as was demonstrated by the relatively low
level of agreement between outputs for the ENCODE regions (Chen
and Tompa, 2010; Margulies et al., 2007). Given the difficulties in
assessing accuracy, recent development has included methods that
are statistically motivated and show improved specificity (Bradley
et al., 2009; Paten et al., 2008).
At shorter evolutionary distances with large fractions of identical
sequences, there is less ambiguity in alignment outcomes. Yet,
even within a bacterial species, aligning multiple genomes is not
a trivial task, especially if the sequences contain rearrangements
and duplications and exhibit sequence gain and loss. Also, despite
relatively short chromosome lengths for bacteria, typically a few
million base pairs, the computational complexity of multiple
sequence alignment makes it a formidable challenge. Calculation
of multiple alignments with a simple sum of pairs scoring scheme
is known to be an NP-hard problem (Elias, 2006), which makes
calculation of an exact solution infeasible for large inputs. Multiple
genome alignment tools rely on heuristics to achieve reasonable run
times.
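To make the scoring scheme concrete, the following minimal Python sketch computes a sum-of-pairs score for a small alignment. The function name and the match, mismatch and gap values are illustrative assumptions, not parameters taken from Mugsy or from any of the cited tools.

```python
from itertools import combinations


def sum_of_pairs(rows, match=1, mismatch=-1, gap=-2):
    """Score a multiple alignment by summing scores over all row pairs.

    rows is a list of equal-length aligned strings with "-" for gaps.
    The scoring values are illustrative assumptions only.
    """
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all alignment rows must have the same length")
    total = 0
    for column in zip(*rows):
        # Every unordered pair of rows contributes once per column.
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # gap-gap pairs are treated as neutral here
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


print(sum_of_pairs(["ACGT", "AC-T", "ACGA"]))
```

Scoring a given alignment this way is cheap. The hardness result concerns finding the alignment that maximizes the score, since the number of candidate alignments grows rapidly with the number and length of the sequences.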
There are numerous methods to compare a single pair of whole-genome sequences (Bray et al., 2003; Schwartz et al., 2003). The
Nucmer and MUMmer package is a fast whole-genome alignment
method that utilizes a suffix tree to seed an alignment with maximal
unique matches (MUMs) (Kurtz et al., 2004). The suffix tree
implementation of MUMmer is especially efficient and can be both
ABSTRACT
Motivation: The relative ease and low cost of current generation
sequencing technologies have led to a dramatic increase in the
number of sequenced genomes for species across the tree of
life. This increasing volume of data requires tools that can quickly
compare multiple whole-genome sequences, millions of base pairs in
length, to aid in the study of populations, pan-genomes, and genome
evolution.
Results: We present a new multiple alignment tool for whole
genomes named Mugsy. Mugsy is computationally efficient and
can align 31 Streptococcus pneumoniae genomes in less than
2 hours producing alignments that compare favorably to other
tools. Mugsy is also the fastest program evaluated for the multiple
alignment of assembled human chromosome sequences from four
individuals. Mugsy does not require a reference sequence, can
align mixtures of assembled draft and completed genome data,
and is robust in identifying a rich complement of genetic variation
including duplications, rearrangements, and large-scale gain and loss
of sequence.
Availability: Mugsy is free, open-source software available from
http://mugsy.sf.net.
Contact:
Supplementary information: Supplementary data are available at
Bioinformatics online.
© The Author(s) 2010. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
duplications, with the segment-based multiple alignment method
provided by the SeqAn C++ library. Mugsy also implements a novel
algorithm for identifying locally collinear blocks (LCBs) from an
alignment graph. The LCBs represent aligned regions from two or
more genomes that are collinear and free of rearrangements but may
also contain segments that lack homology and introduce gaps in the
alignment. Mugsy is run as a single command line invocation that
accepts a set of multi-FASTA files, one per genome, and outputs
a multiple alignment in MAF format. The Mugsy aligner is open
source software and available for download at http://mugsy.sf.net.
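As an illustration of consuming such output, here is a minimal Python sketch that reads MAF text into alignment blocks. It assumes the standard MAF layout, in which an "a" line opens a block and each "s" line lists source, start, size, strand, source size and aligned text. The function name and the genome and contig names in the example are hypothetical, and the sketch is not part of Mugsy.

```python
def parse_maf(lines):
    """Yield each MAF alignment block as a list of row dictionaries."""
    block = None
    for raw in lines:
        line = raw.strip()
        if line == "a" or line.startswith("a "):
            # An "a" line opens a new block; flush the previous one if any.
            if block:
                yield block
            block = []
        elif line.startswith("s ") and block is not None:
            _, src, start, size, strand, src_size, text = line.split()
            block.append({
                "src": src,
                "start": int(start),        # zero-based start in the source
                "size": int(size),          # aligned bases, gaps excluded
                "strand": strand,
                "src_size": int(src_size),  # full length of the source
                "text": text,               # aligned row, "-" marks gaps
            })
        elif not line:
            # A blank line closes the current block.
            if block:
                yield block
            block = None
    if block:
        yield block


# Hypothetical example input; names and coordinates are made up.
EXAMPLE = """##maf version=1
a score=0
s genomeA.contig1 10 8 + 100 ACGT-ACGT
s genomeB.contig7 0 8 - 50 ACGTTACG-

a score=0
s genomeA.contig1 40 4 + 100 TTGA
s genomeC.contig2 5 4 + 60 TTGA
"""

for block in parse_maf(EXAMPLE.splitlines()):
    print([(row["src"], row["start"], row["strand"]) for row in block])
```

Each yielded block corresponds to one aligned region, so a block whose rows come from different subsets of genomes reflects sequence that is present in some inputs and absent from others.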
2 METHODS
The Mugsy alignment tool comprises four primary steps (Fig. 1):
(1) an all-agains (...truncated)