Mugsy: fast multiple alignment of closely related whole genomes

Article PDF: https://bioinformatics.oxfordjournals.org/content/27/3/334.full.pdf

Samuel V. Angiuoli (0, 1), Steven L. Salzberg (1)
Associate Editor: Dmitrij Frishman

(0) Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
(1) Center for Bioinformatics and Computational Biology, University of Maryland, College Park

Motivation: The relative ease and low cost of current generation sequencing technologies has led to a dramatic increase in the number of sequenced genomes for species across the tree of life. This increasing volume of data requires tools that can quickly compare multiple whole-genome sequences, millions of base pairs in length, to aid in the study of populations, pan-genomes, and genome evolution.

Results: We present a new multiple alignment tool for whole genomes named Mugsy. Mugsy is computationally efficient and can align 31 Streptococcus pneumoniae genomes in less than 2 hours producing alignments that compare favorably to other tools. Mugsy is also the fastest program evaluated for the multiple alignment of assembled human chromosome sequences from four individuals. Mugsy does not require a reference sequence, can align mixtures of assembled draft and completed genome data, and is robust in identifying a rich complement of genetic variation including duplications, rearrangements, and large-scale gain and loss of sequence.

Availability: Mugsy is free, open-source software available from http://mugsy.sf.net.

Contact:

Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 INTRODUCTION

There are numerous sequenced genomes from organisms spanning across the tree of life.
This number of genomes is expected to continue to grow dramatically in coming years due to advances in sequencing technologies and decreasing costs. For particular populations of interest, many individual genomes will be sequenced to study genetic diversity. The Cancer Genome Atlas, 1000 Genomes Project and the Personal Genome Project will generate genome sequences from at least several thousand people. For bacteria, there are already over one thousand complete genomes in public databases. Often, a pan-genome concept is needed to describe a species or population (Medini et al., 2005), requiring multiple sequenced genomes from the same species. There are already nine bacterial species with ten or more sequenced genomes in a recent version of RefSeq. Hundreds of individual sequenced genomes are expected for some medically relevant species and model organisms, such as Escherichia coli. Many of these genomes will be in the form of draft genomes, where the sequencing reads are assembled into numerous contigs that together represent a fraction of the actual genome, but are incomplete and contain physical sequencing gaps.

To make use of this explosive growth in the number of sequenced genomes, the scientific community requires tools that can quickly compare large numbers of long and highly similar sequences from whole genomes.

Whole-genome alignment has become instrumental for studying genome evolution and genetic diversity (Batzoglou, 2005; Dewey and Pachter, 2006). Several whole-genome alignment tools can align multiple whole genomes (Blanchette et al., 2004; Darling et al., 2004; Dubchak et al., 2009; Hohl et al., 2002; Paten et al., 2008).
Whole-genome alignment tools are distinguished from collinear multiple sequence alignment tools (Bradley et al., 2009; Edgar, 2004; Thompson et al., 1994) in that they can align very long sequences, millions of base pairs in length, while detecting rearrangements, duplications, and large-scale sequence gain and loss. The resulting alignments can be used to build phylogenies, determine orthology, find recently duplicated regions, and identify species-specific DNA.

For divergent sequences, alignment accuracy is difficult to assess and popular methods can disagree, as was demonstrated by the relatively low level of agreement between outputs for the ENCODE regions (Chen and Tompa, 2010; Margulies et al., 2007). Given the difficulties in assessing accuracy, recent development has included methods that are statistically motivated and show improved specificity (Bradley et al., 2009; Paten et al., 2008). At shorter evolutionary distances, where large fractions of the sequences are identical, there is less ambiguity in alignment outcomes. Yet even within a bacterial species, aligning multiple genomes is not a trivial task, especially if the sequences contain rearrangements and duplications and exhibit sequence gain and loss. Also, despite the relatively short chromosome lengths of bacteria, typically a few million base pairs, the computational complexity of multiple sequence alignment makes it a formidable challenge. Calculation of multiple alignments with a simple sum-of-pairs scoring scheme is known to be an NP-hard problem (Elias, 2006), which makes calculation of an exact solution infeasible for large inputs. Multiple genome alignment tools therefore rely on heuristics to achieve reasonable run times.

There are numerous methods to compare a single pair of whole-genome sequences (Bray et al., 2003; Schwartz et al., 2003).
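To make the sum-of-pairs scoring scheme mentioned above concrete, the following minimal Python sketch scores an already-computed multiple alignment by summing a pairwise score over every pair of rows in every column. The function names and the match, mismatch and gap values are illustrative assumptions, not parameters taken from Mugsy or any cited tool. Scoring a given alignment this way is cheap; the NP-hard part is searching for the alignment that maximizes the score.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (values are illustrative assumptions)."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other contribute nothing
    if a == "-" or b == "-":
        return gap  # a residue aligned against a gap
    return match if a == b else mismatch


def sum_of_pairs(alignment, **scores):
    """Sum pair_score over all row pairs in every column of the alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must be non-empty in number and equal in length")
    total = 0
    for column in zip(*alignment):
        # every unordered pair of rows contributes once per column
        for a, b in combinations(column, 2):
            total += pair_score(a, b, **scores)
    return total


if __name__ == "__main__":
    # Toy alignment of three short sequences, with '-' marking gaps.
    rows = ["ACG-T", "AC-GT", "ACGGT"]
    print(sum_of_pairs(rows))
```

The cost of evaluating one alignment grows with the number of columns times the number of row pairs, which is manageable; it is the number of candidate alignments that grows too fast for exact optimization on genome-scale inputs.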
The Nucmer and MUMmer package is a fast whole-genome alignment method that utilizes a suffix tree to seed an alignment with maximal unique matches (MUMs) (Kurtz et al., 2004). The suffix tree implementation of MUMmer is especially efficient and can be both built and searched in time and space that are linear in the input sequence length.

Graph-based methods have been widely employed for pairwise and multiple alignment of long sequences (Raphael et al., 2004; Zhang and Waterman, 2005). The segment-based progressive alignment approach implemented in SeqAn::T-Coffee (Rausch et al., 2008) utilizes an alignment graph scored for consistency and a progressive alignment scheme to calculate multiple alignments. In brief, an alignment graph is composed of vertices corresponding to non-overlapping genomic regions, with edges indicating matches between regions. The alignment graph can be built efficiently for multiple sequences from a set of pairwise alignments and is scored for consistency. Consistency scoring has been demonstrated to perform well in resolving problems in progressive alignment (Notredame et al., 2000; Paten et al., 2009). A multiple alignment can then be derived from the graph using an efficient heaviest common subsequence algorithm (Jacobson and Vo, 1992). A noteworthy property of the alignment graph is that each genomic segment that is aligned without gaps in all pairwise alignments is represented as a single vertex in t (...truncated)