progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement
Citation: Darling AE, Mau B, Perna NT (
Aaron E. Darling
Bob Mau
Nicole T. Perna
Editor: Jason E. Stajich, University of California Riverside, United States of America
1 Genome Center and Department of Computer Science, University of Wisconsin, Madison, Wisconsin, United States of America, 2 Biotechnology Center and Department of Oncology, University of Wisconsin, Madison, Wisconsin, United States of America, 3 Genome Center and Department of Genetics, University of Wisconsin, Madison, Wisconsin, United States of America
Background: Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms.
Methodology/Principal Findings: We describe a new method to align two or more genomes that have undergone rearrangements due to recombination and substantial amounts of segmental gain and loss (flux). We demonstrate that the new method can accurately align regions conserved in some, but not all, of the genomes, an important case not handled by our previous work. The method uses a novel alignment objective score called a sum-of-pairs breakpoint score, which facilitates accurate detection of rearrangement breakpoints when genomes have unequal gene content. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The new genome alignment algorithm demonstrates high accuracy in situations where genomes have undergone biologically feasible amounts of genome rearrangement, segmental gain and loss. We apply the new algorithm to a set of 23 genomes from the genera Escherichia, Shigella, and Salmonella. Analysis of whole-genome multiple alignments allows us to extend the previously defined concepts of core- and pan-genomes to include not only annotated genes, but also non-coding regions with potential regulatory roles. The 23 enterobacteria have an estimated core-genome of 2.46 Mbp conserved among all taxa and a pan-genome of 15.2 Mbp. We document substantial population-level variability among these organisms driven by segmental gain and loss. Interestingly, much variability lies in intergenic regions, suggesting that the Enterobacteriaceae may exhibit regulatory divergence.
Conclusions: The multiple genome alignments generated by our software provide a platform for comparative genomic and population genomic studies. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve.
Funding: This work was supported in part by National Institutes of Health (NIH) grant R01-GM62994 to N.T.P. and National Science Foundation (NSF) grant
DBI0630765 to A.E.D. This project has also been funded in part with federal funds from the National Institute of Allergy and Infectious Diseases, NIH, Department of
Health and Human Services, under Contract No. HHSN266200400040C. The funders had no role in study design, data collection and analysis, decision to publish,
or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Multiple genome alignment is among the most basic tools in the
comparative genomics toolbox; however, its application has been
hampered by concerns of accuracy and practicality [1–3].
Accurate genome alignment represents a necessary prerequisite
for myriad comparative genomic analyses.
During the course of evolution, genomes undergo both local
and large-scale mutational processes. Local mutations affect only a
small number of nucleotides and include nucleotide substitution
and insertion or deletion of nucleotides. Large-scale mutations can
include gain and loss or duplication of large segments, generated
by unequal recombination or other processes. Homologous
recombination can lead to replacement of whole genes, or even
larger segments of the chromosome with non-identical but
homologous sequences. Together, these mutational processes
cause otherwise identical regions in two or more genomes to be
fragmented, reordered, possibly missing, and even to occur in
multiple copies.
The genome alignment task seeks to identify the homologous
nucleotides in two or more genomes; that is, a genome alignment
identifies nucleotides that descended from a single site in some
ancestral organism. Homologous sites can be classified in any
number of ways, and the genome alignment task usually targets
the identification of certain classes of nucleotides. Homologous
sites are commonly classified by evolutionary history such as
orthology, paralogy, and xenology [4,5]. Sites can also be classified
by non-evolutionary relationships such as the number or identity
of organisms involved (e.g. only homologous sites involving an
important reference organism such as Homo sapiens), or even by
ordering relationships relative to other homologous nucleotides
(e.g. collinearity). Genome alignment methods often define their
target alignment to consist of homologous nucleotides falling into
one or more of those classes.
Early work in genome alignment included development of
MUMmer, which identifies homologous sites in pairs of genomes
[6–8]. MUMmer aligns orthologous and xenologous sequences
with the further constraint that any site in a genome can be aligned
to at most one site in the other genome. Pairs of homologous sites
within a single genome (paralogs) are never aligned to each other.
The first stage of MUMmer alignment involves identifying
alignment anchors. Alignment anchors are local alignments of
highly identical sequence that, by virtue of their high identity, can
be easily found algorithmically, and are presumed to be part of the
true alignment. MUMmer then aggregates local alignment
anchors into one or more groups that cover collinear regions of
the two genomes. Each group of anchors is internally free from
rearrangement, but the order of groups may be shuffled from one
genome to another. As such, MUMmer can identify and align
genomes with rearranged homologous sequences. However,
MUMmer does not align paralogous sequences (repeats within a
genome), nor does it align all copies of multi-copy orthologous
sequence. Because it aligns any site to at most one site in the other
genome, and due to the way it anchors alignment of repetitive
sequence using neighboring unique regions, MUMmer often
aligns the positionally conserved copy of a repeat element. We
term this type of alignment a positional homology genome alignment;
such alignments are also generated by a method we developed
previously [9].
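
The anchoring and grouping steps described above can be illustrated with a minimal sketch. It is not MUMmer's implementation: MUMmer finds maximal unique matches with suffix-based data structures, whereas this sketch assumes fixed-length k-mers that occur exactly once in each sequence as a simple stand-in for anchors, and it considers the forward strand only. The function names (`unique_kmer_positions`, `find_anchors`, `group_collinear`) and the toy sequences are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def unique_kmer_positions(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start position."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos[0] for kmer, pos in positions.items() if len(pos) == 1}


def find_anchors(a, b, k):
    """Return (pos_a, pos_b) pairs for k-mers unique in both sequences.

    Assumption: a unique shared k-mer stands in for a high-identity anchor.
    The result is sorted by position in sequence a.
    """
    ua = unique_kmer_positions(a, k)
    ub = unique_kmer_positions(b, k)
    return sorted((ua[kmer], ub[kmer]) for kmer in ua.keys() & ub.keys())


def group_collinear(anchors):
    """Split anchors (sorted by pos_a) into collinear groups.

    Two anchors that are neighbors in sequence a stay in the same group only
    if they are also neighbors, in the same order, in sequence b.
    """
    if not anchors:
        return []
    # Rank of each anchor when the anchors are ordered by position in b.
    order_b = sorted(range(len(anchors)), key=lambda i: anchors[i][1])
    rank_b = {idx: rank for rank, idx in enumerate(order_b)}
    groups = [[anchors[0]]]
    for i in range(1, len(anchors)):
        if rank_b[i] == rank_b[i - 1] + 1:
            groups[-1].append(anchors[i])
        else:
            groups.append([anchors[i]])
    return groups


# Toy example: sequence b carries the same two segments as a, in swapped order.
segment_x = "AACCGGTTAC"
segment_y = "GATTACAGGA"
genome_a = segment_x + segment_y
genome_b = segment_y + segment_x

for group in group_collinear(find_anchors(genome_a, genome_b, k=5)):
    print(group[0], "...", group[-1])
```

Each resulting group is internally free of rearrangement, while the order of the groups differs between the two sequences. Because every anchor pairs one site in the first sequence with exactly one site in the second, repeats within a sequence are never anchored in this sketch.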
In the present work, we describe a new method to construct
positional homology multiple ge (...truncated)