Mugsy: fast multiple alignment of closely related whole genomes
Samuel V. Angiuoli (1,2) and Steven L. Salzberg (2)
(1) Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
(2) Center for Bioinformatics and Computational Biology, University of Maryland, College Park
Associate Editor: Dmitrij Frishman
Motivation: The relative ease and low cost of current generation sequencing technologies have led to a dramatic increase in the number of sequenced genomes for species across the tree of life. This increasing volume of data requires tools that can quickly compare multiple whole-genome sequences, millions of base pairs in length, to aid in the study of populations, pan-genomes, and genome evolution.
Results: We present a new multiple alignment tool for whole genomes named Mugsy. Mugsy is computationally efficient and can align 31 Streptococcus pneumoniae genomes in less than 2 hours, producing alignments that compare favorably to those of other tools. Mugsy is also the fastest program evaluated for the multiple alignment of assembled human chromosome sequences from four individuals. Mugsy does not require a reference sequence, can align mixtures of assembled draft and completed genome data, and is robust in identifying a rich complement of genetic variation including duplications, rearrangements, and large-scale gain and loss of sequence.
Availability: Mugsy is free, open-source software available from http://mugsy.sf.net.
Contact:
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 INTRODUCTION
There are numerous sequenced genomes from organisms spanning
the tree of life. The number of genomes is expected to
continue to grow dramatically in coming years due to advances
in sequencing technologies and decreasing costs. For particular
populations of interest, many individual genomes will be sequenced
to study genetic diversity. The Cancer Genome Atlas, 1000 Genomes
Project and the Personal Genome Project will generate genome
sequences from at least several thousand people. For bacteria,
there are already over one thousand complete
genomes in public databases. Often, a pan-genome concept is
needed to describe a species or population (Medini et al., 2005),
requiring multiple sequenced genomes from the same species. There
are already nine bacterial species with ten or more sequenced
genomes in a recent version of RefSeq. Hundreds of individual
sequenced genomes are expected for some medically relevant
species and model organisms, such as Escherichia coli. Many of
these genomes will be draft genomes, in which the
sequencing reads are assembled into numerous contigs that together
represent only a fraction of the actual genome, leaving it incomplete
and interrupted by physical sequencing gaps. To make use of this
explosive growth in the number of sequenced genomes, the scientific
community requires tools that can quickly compare large numbers
of long and highly similar sequences from whole genomes.
Whole-genome alignment has become instrumental for studying
genome evolution and genetic diversity (Batzoglou, 2005; Dewey
and Pachter, 2006). There are a number of whole-genome alignment
tools that can align multiple whole genomes (Blanchette et al.,
2004; Darling et al., 2004; Dubchak et al., 2009; Hohl et al., 2002;
Paten et al., 2008). Whole-genome alignment tools are distinguished
from collinear multiple sequence alignment tools
(Bradley et al., 2009; Edgar, 2004; Thompson et al., 1994) in
that they can align very long sequences, millions of base pairs in
length, detecting the presence of rearrangements, duplications, and
large-scale sequence gain and loss. The resulting alignments can
be utilized to build phylogenies, determine orthology, find recently
duplicated regions, and identify species-specific DNA. For divergent
sequences, alignment accuracy is difficult to assess and popular
methods can disagree, as was demonstrated by the relatively low
level of agreement between outputs for the ENCODE regions (Chen
and Tompa, 2010; Margulies et al., 2007). Given the difficulties in
assessing accuracy, recent development has included methods that
are statistically motivated and show improved specificity (Bradley
et al., 2009; Paten et al., 2008).
At shorter evolutionary distances with large fractions of identical
sequences, there is less ambiguity in alignment outcomes. Yet,
even within a bacterial species, aligning multiple genomes is not
a trivial task, especially if the sequences contain rearrangements
and duplications and exhibit sequence gain and loss. Also, despite
relatively short chromosome lengths for bacteria, typically a few
million base pairs, the computational complexity of multiple
sequence alignment makes it a formidable challenge. Calculation
of multiple alignments with a simple sum-of-pairs scoring scheme
is known to be an NP-hard problem (Elias, 2006), which makes
calculation of an exact solution infeasible for large inputs. Multiple
genome alignment tools rely on heuristics to achieve reasonable run
times.
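To make the sum-of-pairs objective concrete, the following minimal sketch scores a given multiple alignment by summing a pairwise score over every pair of rows in every column. The function names and the match, mismatch and gap values are illustrative assumptions, not parameters taken from Mugsy or any cited tool. Scoring a fixed alignment is cheap; the hardness result concerns finding the alignment that maximizes this score.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (illustrative values only)."""
    if x == "-" and y == "-":
        return 0  # two gaps facing each other carry no information
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(rows, **scoring):
    """Sum pair_score over all row pairs in every alignment column."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all alignment rows must have the same length")
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            total += pair_score(x, y, **scoring)
    return total


alignment = ["ACG-T", "ACGAT", "A-GAT"]
print(sum_of_pairs(alignment))
```

For k sequences of alignment length n, this evaluation costs O(n k^2) operations, which illustrates why the expense of multiple alignment lies in the search over alignments rather than in the scoring itself.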
There are numerous methods to compare a single pair of
whole-genome sequences (Bray et al., 2003; Schwartz et al., 2003).
Nucmer, part of the MUMmer package, is a fast whole-genome alignment
method that utilizes a suffix tree to seed an alignment with maximal
unique matches (MUMs) (Kurtz et al., 2004). The suffix tree
implementation of MUMmer is especially efficient and can be both
built and searched in time and space that are linear in
the input sequence length.
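As an illustration of what a MUM is, the sketch below finds maximal unique matches between two short strings by brute force. It is not the suffix tree algorithm used by MUMmer and runs in far worse than linear time; the function names and the min_len parameter are hypothetical, and only the forward strand is considered. A match is reported when it cannot be extended in either direction and its sequence occurs exactly once in each input.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (start_in_a, start_in_b, length) for each MUM of a and b.

    Brute-force illustration only; MUMmer uses a suffix tree instead.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left (not maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            seed = a[i:i + length]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(a, seed) == 1 and count_occurrences(b, seed) == 1:
                mums.append((i, j, length))
    return mums


print(find_mums("GATTACA", "CATTAGA"))
```

In this toy input the only match of length three or more is ATTA, which starts at position 1 in both strings. Repeated sequence is excluded by the uniqueness test, which is what makes MUMs useful as unambiguous alignment anchors.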
Graph-based methods have been widely employed for pairwise
and multiple alignment of long sequences (Raphael et al., 2004;
Zhang and Waterman, 2005). The segment-based progressive
alignment approach implemented in SeqAn::T-Coffee (Rausch et al.,
2008) utilizes an alignment graph scored for consistency and a
progressive alignment scheme to calculate multiple alignments. In
brief, an alignment graph is composed of vertices corresponding
to non-overlapping genomic regions with edges indicating matches
between regions. The alignment graph can be built efficiently
for multiple sequences from a set of pairwise alignments and is
scored for consistency. Consistency scoring has been demonstrated
to perform well in resolving problems in progressive alignment
(Notredame et al., 2000; Paten et al., 2009). A multiple alignment
can then be derived from the graph using an efficient heaviest
common subsequence algorithm (Jacobson and Vo, 1992). A
noteworthy property of the alignment graph is that each genomic
segment that is aligned without gaps in all pairwise alignments is
represented as a single vertex in t (...truncated)