Gene tree parsimony for incomplete gene trees: addressing true biological loss

Algorithms for Molecular Biology, Jan 2018

Species tree estimation from gene trees can be complicated by gene duplication and loss, and “gene tree parsimony” (GTP) is one approach for estimating species trees from multiple gene trees. In its standard formulation, the objective is to find a species tree that minimizes the total number of gene duplications and losses with respect to the input set of gene trees. Although much is known about GTP, little is known about how to treat inputs containing some incomplete gene trees (i.e., gene trees lacking one or more of the species). We present new theory for GTP considering whether the incompleteness is due to gene birth and death (i.e., true biological loss) or taxon sampling, and present dynamic programming algorithms that can be used for an exact but exponential time solution for small numbers of taxa, or as a heuristic for larger numbers of taxa. We also prove that the “standard” calculations for duplications and losses exactly solve GTP when incompleteness results from taxon sampling, although they can be incorrect when incompleteness results from true biological loss. The software for the DP algorithm is freely available as open source code at https://github.com/smirarab/DynaDup .

Bayzid and Warnow, Algorithms Mol Biol (2018) 13:1. https://doi.org/10.1186/s13015-017-0120-1. Algorithms for Molecular Biology, Open Access, Research.

Md Shamsuzzoha Bayzid and Tandy Warnow

Keywords: Gene duplication and loss, Gene tree parsimony, Deep coalescence, Dynamic programming

Background

The estimation of species trees is often performed by estimating multiple sequence alignments for some collection of genes, concatenating these alignments into one supermatrix, and then estimating a tree (often using maximum likelihood or a Bayesian technique) on the resultant supermatrix.
However, this approach cannot be used when the species’ genomes contain multiple copies of some gene, which can result from gene duplication. Since gene duplication and loss is a common phenomenon, the estimation of species trees requires a different type of approach in this case. The most powerful approaches for species tree estimation for multi-copy gene families are likely to be methods such as Phyldog [1], which co-estimate gene trees and species trees under parametric models of gene evolution that include duplications and losses. Another type of approach uses initial assignments of orthology and paralogy to inform gene tree and species tree estimation [2]. However, by far the most common approach for estimating species trees uses gene tree parsimony, which we now describe.

Correspondence: Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh.

Gene tree parsimony (GTP) is an optimization problem for estimating species trees from a set of gene trees (estimated from individual gene sequence alignments). In its most typical formulations, only gene duplication and loss are considered, so that GTP depends upon two parameters: c_d (the cost for a duplication) and c_l (the cost for a loss). The two most popular versions of GTP are MGD (minimize gene duplication), for which c_d = 1 and c_l = 0, and MGDL (minimize gene duplication and loss), for which c_d = c_l = 1. The version of GTP that seeks the tree minimizing the total number of losses has also been studied; for this, c_d = 0 and c_l = 1. These variants of GTP are NP-hard optimization problems [3], but software such as DupTree [4] and iGTP [5] for GTP are in wide use. Basic to all these problems is the ability to compute the number of duplications and losses implied by a species tree and gene tree.
This problem is called the “reconciliation problem”; it is surveyed in [6] and has been intensively studied in the literature (see, for example, [3, 7–17]). The mathematical formulation of the reconciliation problem was derived for the case where the gene tree and the species tree have the same set of taxa, and was then extended to incomplete gene trees, i.e., trees that can miss some taxa.

© The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Incomplete gene trees are quite common, and can arise for two different reasons: (1) taxon sampling: the gene may be available in the species’ genome, but was not included for some reason in the dataset for that gene, or (2) gene birth/death: as a result of gene birth and death (true biological gene loss), the species does not have the gene in its genome.

Given a gene tree gt and a species tree ST, two formulations for the number of losses have been defined. The most commonly used one computes the number of losses by first computing the “homeomorphic subtree” ST(gt) of ST induced by gt, and then computing the number of losses required to reconcile gt with ST(gt) (see, for example, [3, 8, 17]). Although this formulation is in wide use (and is the basis of both iGTP [5] and DupTree [4], two popular methods for “solving” GTP), we will show that it can be incorrect when incompleteness is due to true biological loss.
We refer to this formulation as the “standard” approach because of its widespread use in both software and the theoretical literature on GTP. The other, described in [18, 19], correctly computes the number of losses when incompleteness is a result of true gene loss, as we will prove.

This paper addresses the GTP problem for the case where some of the input gene trees may be incomplete due to either sampling or true biological loss. The main results are as follows:

• We formalize the duploss reconciliation problem when gene trees are incomplete due to taxon sampling as the “optimal completion of a gene tree”, and we prove (Theorem 1) that the standard calculation correctly computes losses for this case.
• We show by example that the standard calculation for losses in GTP can be incorrect when incompleteness is due to true biological loss.
• We show how to compute the number of losses implied by a gene tree and species tree, when incompleteness is due to true biological loss.
• We formulate variants of the GTP problem (when gene tree incompleteness is due to true biological loss) as minimum weight maximum clique problems (see Theorem 11 for one duploss variant), and show ho (...truncated)
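
As a rough illustration of the cost parameterization described in the Background, the sketch below scores one candidate species tree from per-gene-tree duplication and loss counts. The function name `gtp_cost`, the parameter names `dup_cost` and `loss_cost`, and the example counts are assumptions made for illustration only; they are not taken from DynaDup, DupTree, or iGTP.

```python
def gtp_cost(counts, dup_cost=1, loss_cost=1):
    """Weighted GTP score of one candidate species tree.

    counts: iterable of (duplications, losses) pairs, one per gene tree,
            assumed to come from some reconciliation routine (hypothetical).
    dup_cost, loss_cost: the duplication and loss costs from the text.
    """
    return sum(dup_cost * dups + loss_cost * losses for dups, losses in counts)


# Hypothetical counts for two gene trees reconciled with one species tree.
example_counts = [(2, 3), (1, 0)]

mgd = gtp_cost(example_counts, dup_cost=1, loss_cost=0)        # duplications only
mgdl = gtp_cost(example_counts, dup_cost=1, loss_cost=1)       # duplications + losses
loss_only = gtp_cost(example_counts, dup_cost=0, loss_cost=1)  # losses only
```

Under these assumed counts the three variants give 3, 6, and 3 respectively. A GTP search would compare such scores across candidate species trees and keep the minimum.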

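The duplication side of reconciliation can be sketched with the widely used LCA mapping: each gene-tree node is mapped to the lowest species-tree node whose cluster contains all of its species, and an internal gene-tree node is counted as a duplication when it maps to the same node as one of its children. The sketch first restricts the species tree to the homeomorphic subtree ST(gt), as the standard loss calculation does. Trees are encoded as nested tuples with species names at the leaves; this encoding and every function name are assumptions for illustration, not the DynaDup implementation, and loss counting is deliberately left out.

```python
def leaves(tree):
    """Set of leaf labels (species names) of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Homeomorphic subtree induced by `keep`: drop other leaves and
    suppress nodes left with a single child. Returns None if empty."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def clusters(tree):
    """Leaf sets below every node of the tree, root first."""
    out = [leaves(tree)]
    if not isinstance(tree, str):
        for child in tree:
            out.extend(clusters(child))
    return out


def lca_cluster(species_clusters, species_set):
    """Smallest cluster containing species_set, i.e. the LCA mapping."""
    return min((c for c in species_clusters if species_set <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Number of gene-tree nodes mapped to the same node as a child."""
    st = restrict(species_tree, leaves(gene_tree))
    st_clusters = clusters(st)

    def visit(g):
        if isinstance(g, str):
            return 0
        here = lca_cluster(st_clusters, leaves(g))
        is_dup = any(lca_cluster(st_clusters, leaves(c)) == here for c in g)
        return int(is_dup) + sum(visit(c) for c in g)

    return visit(gene_tree)


species = (("A", "B"), ("C", "D"))
print(count_duplications((("A", "B"), ("C", "D")), species))  # 0
print(count_duplications((("A", "C"), ("B", "D")), species))  # 1
print(count_duplications((("A", "A"), "B"), species))         # 1
```

The last call uses an incomplete, multi-copy gene tree: species C and D are absent, so the species tree is first reduced to the subtree on A and B before the mapping is computed.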

This is a preview of a remote PDF: https://almob.biomedcentral.com/track/pdf/10.1186/s13015-017-0120-1
Article home page: https://almob.biomedcentral.com/articles/10.1186/s13015-017-0120-1

Md Shamsuzzoha Bayzid, Tandy Warnow. Gene tree parsimony for incomplete gene trees: addressing true biological loss, Algorithms for Molecular Biology, 2018, pp. 1-12, Volume 13, Issue 1, DOI: 10.1186/s13015-017-0120-1