progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement

Article PDF: http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0011147&type=printable

Citation: Darling AE, Mau B, Perna NT. progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement.

Authors: Aaron E. Darling, Bob Mau, Nicole T. Perna

Editor: Jason E. Stajich, University of California Riverside, United States of America

Affiliations: 1 Genome Center and Department of Computer Science, University of Wisconsin, Madison, Wisconsin, United States of America; 2 Biotechnology Center and Department of Oncology, University of Wisconsin, Madison, Wisconsin, United States of America; 3 Genome Center and Department of Genetics, University of Wisconsin, Madison, Wisconsin, United States of America

Background: Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms.

Methodology/Principal Findings: We describe a new method to align two or more genomes that have undergone rearrangements due to recombination and substantial amounts of segmental gain and loss (flux). We demonstrate that the new method can accurately align regions conserved in some, but not all, of the genomes, an important case not handled by our previous work. The method uses a novel alignment objective score called a sum-of-pairs breakpoint score, which facilitates accurate detection of rearrangement breakpoints when genomes have unequal gene content. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The new genome alignment algorithm demonstrates high accuracy in situations where genomes have undergone biologically feasible amounts of genome rearrangement, segmental gain and loss.
We apply the new algorithm to a set of 23 genomes from the genera Escherichia, Shigella, and Salmonella. Analysis of whole-genome multiple alignments allows us to extend the previously defined concepts of core- and pan-genomes to include not only annotated genes, but also non-coding regions with potential regulatory roles. The 23 enterobacteria have an estimated core-genome of 2.46 Mbp conserved among all taxa and a pan-genome of 15.2 Mbp. We document substantial population-level variability among these organisms driven by segmental gain and loss. Interestingly, much variability lies in intergenic regions, suggesting that the Enterobacteriaceae may exhibit regulatory divergence.

Conclusions: The multiple genome alignments generated by our software provide a platform for comparative genomic and population genomic studies. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve.

Funding: This work was supported in part by National Institutes of Health (NIH) grant R01-GM62994 to N.T.P. and National Science Foundation (NSF) grant DBI0630765 to A.E.D. This project has also been funded in part with federal funds from the National Institute of Allergy and Infectious Diseases, NIH, Department of Health and Human Services, under Contract No. HHSN266200400040C. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

Multiple genome alignment is among the most basic tools in the comparative genomics toolbox; however, its application has been hampered by concerns of accuracy and practicality [1-3]. Accurate genome alignment represents a necessary prerequisite for myriad comparative genomic analyses. During the course of evolution, genomes undergo both local and large-scale mutational processes.
Local mutations affect only a small number of nucleotides and include nucleotide substitution and insertion or deletion of nucleotides. Large-scale mutations can include gain and loss or duplication of large segments, generated by unequal recombination or other processes. Homologous recombination can lead to replacement of whole genes, or even larger segments of the chromosome, with non-identical but homologous sequences. Together, these mutational processes cause otherwise identical regions in two or more genomes to be fragmented, reordered, possibly missing, and even to occur in multiple copies.

The genome alignment task seeks to identify the homologous nucleotides in two or more genomes; that is, a genome alignment identifies nucleotides that descended from a single site in some ancestral organism. Homologous sites can be classified in any number of ways, and the genome alignment task usually targets the identification of certain classes of nucleotides. Homologous sites are commonly classified by evolutionary history such as orthology, paralogy, and xenology [4,5]. Sites can also be classified by non-evolutionary relationships such as the number or identity of organisms involved (e.g. only homologous sites involving an important reference organism such as Homo sapiens), or even by ordering relationships relative to other homologous nucleotides (e.g. collinearity). Genome alignment methods often define their target alignment to consist of homologous nucleotides falling into one or more of those classes.

Early work in genome alignment included development of MUMmer, which identifies homologous sites in pairs of genomes [6-8]. MUMmer aligns orthologous and xenologous sequences with the further constraint that any site in a genome can be aligned to at most one site in the other genome. Pairs of homologous sites within a single genome (paralogs) are never aligned to each other. The first stage of MUMmer alignment involves identifying alignment anchors.
Alignment anchors are local alignments of highly identical sequence that, by virtue of their high identity, can be easily found algorithmically and are presumed to be part of the true alignment. MUMmer then aggregates local alignment anchors into one or more groups that cover collinear regions of the two genomes. Each group of anchors is internally free from rearrangement, but the order of groups may be shuffled from one genome to another. As such, MUMmer can identify and align genomes with rearranged homologous sequences. However, MUMmer does not align paralogous sequences (repeats within a genome), nor does it align all copies of multi-copy orthologous sequence. Because it aligns any site to at most one site in the other genome, and because of the way it anchors alignment of repetitive sequence using neighboring unique regions, MUMmer often aligns the positionally conserved copy of a repeat element. We term this type of alignment a positional homology genome alignment; such alignments are also generated by a method we developed previously [9]. In the present work, we describe a new method to construct positional homology multiple ge (...truncated)
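
The anchoring and grouping idea described above for MUMmer can be illustrated with a small sketch. This is not MUMmer's implementation, which finds maximal unique matches with suffix-based data structures; as a simplifying assumption, the sketch uses fixed-length k-mers that occur exactly once in each sequence as stand-in anchors, considers the forward strand only, and starts a new group whenever consecutive anchors in one sequence are not also consecutive in the other. The function names `unique_kmers`, `find_anchors`, and `collinear_groups` and the toy sequences are hypothetical.

```python
from collections import Counter


def unique_kmers(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}


def find_anchors(a, b, k):
    """Return (pos_a, pos_b) pairs for k-mers unique in both a and b, sorted by pos_a.

    Assumption: exact, forward-strand matches only (no reverse complements).
    """
    ua, ub = unique_kmers(a, k), unique_kmers(b, k)
    return sorted((ua[kmer], ub[kmer]) for kmer in ua.keys() & ub.keys())


def collinear_groups(anchors):
    """Split anchors (sorted by pos_a) into runs that are also consecutive in b.

    Each run is internally free of rearrangement; the runs themselves may
    appear in a different order in the two sequences.
    """
    order_b = sorted(range(len(anchors)), key=lambda i: anchors[i][1])
    rank_b = {idx: rank for rank, idx in enumerate(order_b)}
    groups = []
    for i, anchor in enumerate(anchors):
        if groups and rank_b[i] == rank_b[i - 1] + 1:
            groups[-1].append(anchor)  # extends the current collinear run
        else:
            groups.append([anchor])    # order in b breaks here: new group
    return groups


# Toy example: two blocks whose order is swapped between the sequences.
block_x, block_y = "ACGTTGCA", "GGATCCTA"
genome_a = block_x + block_y
genome_b = block_y + block_x
for group in collinear_groups(find_anchors(genome_a, genome_b, 4)):
    print(group[0], "...", group[-1], "anchors:", len(group))
```

In the toy example the two swapped blocks come out as two separate groups, each internally collinear. Because only k-mers unique in both sequences are used, repeated sequence contributes no anchors, which loosely mirrors the reliance on neighboring unique regions described above.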