Gene tree parsimony for incomplete gene trees: addressing true biological loss
Bayzid and Warnow Algorithms Mol Biol (2018) 13:1
https://doi.org/10.1186/s13015-017-0120-1
Algorithms for Molecular Biology
RESEARCH (Open Access)
Md Shamsuzzoha Bayzid1* and Tandy Warnow2
Abstract
Motivation: Species tree estimation from gene trees can be complicated by gene duplication and loss, and “gene
tree parsimony” (GTP) is one approach for estimating species trees from multiple gene trees. In its standard formulation, the objective is to find a species tree that minimizes the total number of gene duplications and losses with
respect to the input set of gene trees. Although much is known about GTP, little is known about how to treat inputs
containing some incomplete gene trees (i.e., gene trees lacking one or more of the species).
Results: We present new theory for GTP considering whether the incompleteness is due to gene birth and death (i.e.,
true biological loss) or taxon sampling, and present dynamic programming algorithms that can be used for an exact
but exponential time solution for small numbers of taxa, or as a heuristic for larger numbers of taxa. We also prove
that the “standard” calculations for duplications and losses exactly solve GTP when incompleteness results from taxon
sampling, although they can be incorrect when incompleteness results from true biological loss. The software for the
DP algorithm is freely available as open source code at https://github.com/smirarab/DynaDup.
Keywords: Gene duplication and loss, Gene tree parsimony, Deep coalescence, Dynamic programming
Background
The estimation of species trees is often performed by
estimating multiple sequence alignments for some collection of genes, concatenating these alignments into
one supermatrix, and then estimating a tree (often using
maximum likelihood or a Bayesian technique) on the
resultant supermatrix. However, this approach cannot be
used when the species’ genomes contain multiple copies
of some gene, which can result from gene duplication.
Since gene duplication and loss is a common phenomenon, the estimation of species trees requires a different
type of approach in this case.
The most powerful approaches for species tree estimation for multi-copy gene families are likely to be methods
such as Phyldog [1], which co-estimate gene trees and
species trees under parametric models of gene evolution that include duplications and losses. Another type
of approach uses initial assignments of orthology and
*Correspondence:
1
Department of Computer Science and Engineering, Bangladesh
University of Engineering and Technology, Dhaka, Bangladesh
Full list of author information is available at the end of the article
paralogy to inform gene tree and species tree estimation
[2]. However, by far the most common approach for estimating species trees from multi-copy gene families uses gene tree parsimony, which we
now describe.
Gene tree parsimony (GTP) is an optimization problem
for estimating species trees from a set of gene trees (estimated from individual gene sequence alignments). In its
most typical formulations, only gene duplication and loss
are considered, so that GTP depends upon two parameters: c_d (the cost for a duplication) and c_l (the cost for a
loss). The two most popular versions of GTP are MGD
(minimize gene duplication), for which c_d = 1 and c_l = 0,
and MGDL (minimize gene duplication and loss), for
which c_d = c_l = 1. The version of GTP that seeks the tree
minimizing the total number of losses has also been studied; for this, c_d = 0 and c_l = 1. These variants of GTP are
NP-hard optimization problems [3], but software such as
DupTree [4] and iGTP [5] for GTP are in wide use.
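The role of the two cost parameters can be illustrated with a minimal sketch. It assumes that the duplication and loss counts for each gene tree against a candidate species tree have already been computed by some reconciliation routine, and that the objective is the weighted sum of those counts over all gene trees. The names GTPCost, MGD, MGDL, LOSS_ONLY, and total_cost are hypothetical and are introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class GTPCost:
    """Hypothetical container for the two GTP cost parameters."""
    c_d: float  # cost charged per duplication
    c_l: float  # cost charged per loss

    def score(self, duplications: int, losses: int) -> float:
        # Weighted cost of reconciling one gene tree with the species tree.
        return self.c_d * duplications + self.c_l * losses


# The three variants named in the text.
MGD = GTPCost(c_d=1, c_l=0)        # minimize gene duplication
MGDL = GTPCost(c_d=1, c_l=1)       # minimize gene duplication and loss
LOSS_ONLY = GTPCost(c_d=0, c_l=1)  # minimize losses only (name is an assumption)


def total_cost(cost: GTPCost, counts: Iterable[Tuple[int, int]]) -> float:
    """Sum the weighted cost over (duplications, losses) pairs, one per gene tree."""
    return sum(cost.score(d, l) for d, l in counts)
```

A GTP search would evaluate total_cost for each candidate species tree and keep the tree with the smallest value. The sketch covers only the scoring step, not the search.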
Basic to all these problems is the ability to compute
the number of duplications and losses implied by a species tree and gene tree. This problem is called the “reconciliation problem”, surveyed in [6], and intensively
studied in the literature (see, for example, [3, 7–17]). The
© The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium,
provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license,
and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/
publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
mathematical formulation of the reconciliation problem
was derived for the case where the gene tree and the species tree have the same set of taxa, and was then extended to
incomplete gene trees, i.e., trees that
are missing some taxa.
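For the complete case, where the gene tree and the species tree have the same set of taxa, duplications are commonly counted with the well-known LCA mapping. Each gene tree node is mapped to the lowest species tree node whose leaf set contains the species below it, and an internal gene tree node is counted as a duplication when it maps to the same species tree node as one of its children. The sketch below is a minimal illustration under several assumptions: trees are rooted and binary, they are encoded as nested 2-tuples with species names at the leaves, and every species in the gene tree occurs in the species tree. The function names are hypothetical, and losses are not counted.

```python
def clusters(tree):
    """Return (leaf set of tree, list of the leaf sets of all its nodes)."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    left_set, left_all = clusters(tree[0])
    right_set, right_all = clusters(tree[1])
    here = left_set | right_set
    return here, left_all + right_all + [here]


def lca_cluster(species_clusters, leafset):
    """Smallest species-tree cluster containing leafset (the LCA, as a leaf set)."""
    # Clusters containing a common leaf are nested, so the minimum is unique.
    return min((c for c in species_clusters if leafset <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Count gene tree nodes mapped to the same species node as a child."""
    _, species_clusters = clusters(species_tree)
    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            leaf = frozenset([node])
            return leaf, lca_cluster(species_clusters, leaf)
        left_set, left_map = visit(node[0])
        right_set, right_map = visit(node[1])
        here = left_set | right_set
        mapped = lca_cluster(species_clusters, here)
        if mapped == left_map or mapped == right_map:
            duplications += 1
        return here, mapped

    visit(gene_tree)
    return duplications


species_tree = (("a", "b"), "c")
print(count_duplications((("a", "b"), "c"), species_tree))         # 0
print(count_duplications((("a", "c"), ("b", "c")), species_tree))  # 1
```

Representing each species tree node by its leaf set keeps the sketch short at the expense of efficiency. Production tools typically use faster LCA data structures.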
Incomplete gene trees are quite common, and can arise
for two different reasons: (1) taxon sampling: the gene
may be available in the species’ genome, but was not
included for some reason in the dataset for that gene, or
(2) gene birth/death: as a result of gene birth and death
(true biological gene loss), the species does not have the
gene in its genome.
Given a gene tree gt and a species tree ST, two formulations for the number of losses have been defined. The
most commonly used one computes the number of losses
by first computing the “homeomorphic subtree” ST(gt)
of ST induced by gt, and then computing the number
of losses required to reconcile gt with ST(gt) (see, for
example, [3, 8, 17]). We refer to this formulation as
the “standard” approach because of its widespread use
in both software and the theoretical literature on GTP.
Although the standard formulation is in wide use (and is
the basis of both iGTP [5] and DupTree [4], two popular
methods for “solving” GTP), we will show that it can be
incorrect when incompleteness is due to true biological loss.
The other formulation, described in [18, 19], correctly computes the
number of losses when incompleteness is a result of true
gene loss, as we will prove.
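The first step of the standard approach, computing the homeomorphic subtree ST(gt), can be sketched as follows. The sketch assumes the usual meaning of an induced homeomorphic subtree: ST is restricted to the species that appear in gt, and any node left with a single child is suppressed. Trees are assumed to be rooted, binary, and encoded as nested 2-tuples with species names at the leaves. The function names leaf_set and induced_subtree are hypothetical.

```python
def leaf_set(tree):
    """Set of species labels at the leaves of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return leaf_set(tree[0]) | leaf_set(tree[1])


def induced_subtree(species_tree, keep):
    """Restrict species_tree to the leaves in keep, suppressing unary nodes.

    Returns None when no leaf of species_tree is in keep.
    """
    if isinstance(species_tree, str):
        return species_tree if species_tree in keep else None
    left = induced_subtree(species_tree[0], keep)
    right = induced_subtree(species_tree[1], keep)
    if left is None:
        return right  # node has at most one surviving child: suppress it
    if right is None:
        return left
    return (left, right)


ST = ((("a", "b"), "c"), ("d", "e"))
gt = (("a", "c"), "d")  # incomplete: species b and e are missing
print(induced_subtree(ST, leaf_set(gt)))  # (('a', 'c'), 'd')
```

In the standard approach, gt would then be reconciled against the returned tree rather than against ST itself, so species absent from gt contribute no losses.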
This paper addresses the GTP problem for the case
where some of the input gene trees may be incomplete
due to either sampling or true biological loss. The main
results are as follows:
• We formalize the duploss reconciliation problem
when gene trees are incomplete due to taxon sampling as the “optimal completion of a gene tree”, and
we prove (Theorem 1) that the standard calculation
correctly computes losses for this case.
• We show by example that the standard calculation
for losses in GTP can be incorrect when incompleteness is due to true biological loss.
• We show how to compute the number of losses
implied by a gene tree and species tree, when incompleteness is due to true biological loss.
• We formulate variants of the GTP problem (when
gene tree incompleteness is due to true biological
loss) as minimum weight maximum clique problems (see Theorem 11 for one duploss variant), and
show ho (...truncated)