An ILP solution for the gene duplication problem
Chang et al. BMC Bioinformatics 2011, 12(Suppl 1):S14
http://www.biomedcentral.com/1471-2105/12/S1/S14
Wen-Chieh Chang1, Gordon J Burleigh2, David F Fernández-Baca1, Oliver Eulenstein1*
From The Ninth Asia Pacific Bioinformatics Conference (APBC 2011)
Inchon, Korea. 11-14 January 2011
Abstract
Background: The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication
events across a given collection of gene trees. Solving this problem makes it possible to use large gene families
with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard,
and therefore, most analyses use heuristics that lack any performance guarantee.
Results: We describe the first integer linear programming (ILP) formulation to solve instances of the gene
duplication problem exactly. With simulations, we demonstrate that the ILP solution can solve problem instances
with up to 14 taxa. Furthermore, we apply the new ILP solution to solve the gene duplication problem for the
seed plant phylogeny using a 12-taxon, 6,084-gene data set. The unique, optimal solution, which places Gnetales
sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in
plant systematics.
Conclusions: Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets
consisting of as many as 14 taxa and 1,000 genes in a few hours. These are the largest instances that have been
solved to optimality to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic
questions that previously could only be addressed by heuristic estimates.
Background
With recent advances in DNA sequencing technology,
there is much interest in using genomic data sets to
infer phylogenetic trees. However, evolutionary events
such as gene duplication and loss, incomplete lineage
sorting (deep coalescence), and lateral gene transfer can
produce discordance between gene trees and the phylogeny of the species in which the genes evolve (e.g., [1]).
The gene tree parsimony (GTP) problem [1-4] provides
a direct approach to infer a species phylogeny from discordant gene trees. Given a collection of gene trees, this
problem seeks a species tree that implies the minimum
reconciliation cost, i.e., the fewest evolutionary events that can explain discordance in the gene
phylogenies.
* Correspondence:
1 Department of Computer Science, Iowa State University, Ames, 50011, USA
Full list of author information is available at the end of the article
One of the most widely studied variants of the GTP
problem is the gene duplication (GD) problem, in
which the reconciliation cost is based on gene
duplication events. The GD problem is W[2]-hard when
parameterized by the number of gene duplication
events and hard to approximate better than a logarithmic factor [5]. One way to cope with this intractability
in practice is using heuristics [6,7]. Although these heuristics do not guarantee optimal solutions or any nontrivial theoretical bound, in many cases they appear to
have produced credible estimates [8-11]. However, the
lack of performance guarantees makes the pursuit of
exact solutions for the GD problem desirable.
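To make the objective concrete, the following minimal sketch counts the gene duplications that one gene tree implies under a fixed species tree. It assumes the standard least common ancestor (LCA) reconciliation, in which a gene tree node is a duplication when it maps to the same species tree node as one of its children. The nested-tuple tree encoding and the names `leaves`, `species_clusters`, `lca_cluster` and `count_duplications` are hypothetical choices made for illustration; they are not taken from the paper.

```python
def leaves(tree):
    """Return the set of species labels below a nested-tuple tree node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])


def species_clusters(species_tree):
    """Collect the leaf set (cluster) below every node of the species tree."""
    if isinstance(species_tree, str):
        return [frozenset([species_tree])]
    left, right = species_tree
    return (species_clusters(left) + species_clusters(right)
            + [leaves(species_tree)])


def lca_cluster(clusters, taxa):
    """The LCA of `taxa` corresponds to the smallest cluster containing them."""
    return min((c for c in clusters if taxa <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Count duplication nodes of a binary gene tree under the LCA mapping."""
    clusters = species_clusters(species_tree)

    def visit(node):
        if isinstance(node, str):
            return 0
        here = lca_cluster(clusters, leaves(node))
        dups = sum(visit(child) for child in node)
        # A node is a duplication if a child maps to the same species node.
        if any(lca_cluster(clusters, leaves(child)) == here for child in node):
            dups += 1
        return dups

    return visit(gene_tree)


# Illustrative (made-up) input: gene tree leaves carry species labels.
species = (("a", "b"), "c")
genes = [(("a", "b"), "c"), (("a", "c"), "b"), (("a", "a"), "b")]
total_cost = sum(count_duplications(g, species) for g in genes)
```

The GD problem asks for the species tree that minimizes this total over the whole collection of gene trees; the sketch only evaluates one candidate species tree.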
Exact solutions can be found by exhaustive search for
every NP-complete problem, but run times typically
become prohibitively large even for rather small
instances. However, exact algorithms that are substantially faster than exhaustive search have been progressively developed (e.g. [12,13]). Unfortunately, little work
has focused on such algorithms for the GD problem
[14]. Here, we describe an ILP formulation solving the
GD problem exactly and demonstrate its performance
on both simulated and empirical data sets.
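The cost of exhaustive search can be illustrated by counting candidate species trees. The sketch below uses the well-known fact that there are (2n-3)!! rooted, leaf-labeled binary trees on n taxa; the function name `num_rooted_binary_trees` is a hypothetical helper introduced here only for illustration.

```python
def num_rooted_binary_trees(n_taxa):
    """Number of rooted, leaf-labeled binary trees on n_taxa >= 2 taxa.

    This is the double factorial (2n-3)!! = 3 * 5 * ... * (2n-3).
    """
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    total = 1
    for odd in range(3, 2 * n_taxa - 2, 2):
        total *= odd
    return total


for n in (8, 12, 14):
    print(n, num_rooted_binary_trees(n))
```

With 8 taxa there are 135,135 candidate trees, while 14 taxa already give 7,905,853,580,625, which suggests why scoring every tree against thousands of gene trees quickly becomes impractical.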
© 2011 Chang et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Related work
Exact solutions to the GD problem were obtained by
exhaustively searching all possible species trees in data
sets with up to 8 taxa [15,16]. More recently, a branch-and-bound algorithm to identify exact solutions for the
GD problem was introduced [14]. This algorithm was
applied to a data set consisting of 1,111 gene trees with
29 taxa, but it did not run to completion. However, the
branch-and-bound algorithm was able to solve this
instance on reduced search spaces that resulted from
providing some of the relationships in the species tree.
Although constraining the search space for a species
tree can help solve difficult instances of the GD problem, there are no theoretical guarantees to support this
approach.
ILP formulations have provided an effective strategy to
solve moderately sized instances of several NP-hard phylogenetic problems (e.g. [17-22]). Most similar to the
GD problem, ILP formulations have been introduced for
the deep coalescence problem, the variant of the GTP
problem in which the reconciliation cost is based on the
deep coalescence events [23]. These formulations solved
instances with up to 8 taxa. However, perhaps due to
the difficulty of directly expressing gene duplications in
terms of linear equations, there have been no ILP formulations for the GD problem.
Our contributions
We introduce a novel approach to solve the GD problem exactly by describing the first ILP formulation for
this problem. This solution is made possible by revealing
an alternate description of the GD problem, called the
triple inconsistency problem, which expresses gene
duplications in terms of rooted triples. Rooted triples
are rooted full binary trees with three leaves, and are
the smallest unit of phylogenetic information. They,
together with an equivalent presentation of species trees
through cluster hierarchies, provide the fundamental
elements of our ILP solution.
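The two building blocks named above can be sketched in a few lines. The code below lists the clusters (leaf sets below each node) of a rooted tree, checks the hierarchy property that any two clusters are either nested or disjoint, and enumerates the rooted triples the tree displays. The nested-tuple encoding and the names `leaf_set`, `clusters`, `is_hierarchy` and `rooted_triples` are assumptions made for illustration; the sketch does not reproduce the triple inconsistency problem or the ILP constraints themselves.

```python
from itertools import combinations


def leaf_set(tree):
    """Return the set of leaf labels below a nested-tuple tree node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clusters(tree):
    """Return the clusters of the tree: one leaf set per node."""
    found = {leaf_set(tree)}
    if not isinstance(tree, str):
        for child in tree:
            found |= clusters(child)
    return found


def is_hierarchy(cluster_family):
    """True if every pair of clusters is nested or disjoint."""
    return all(s <= t or t <= s or not (s & t)
               for s, t in combinations(cluster_family, 2))


def rooted_triples(tree):
    """Return the triples xy|z displayed by the tree as (frozenset({x, y}), z)."""
    tree_clusters = clusters(tree)
    triples = set()
    for a, b, c in combinations(sorted(leaf_set(tree)), 3):
        for x, y, z in ((a, b, c), (a, c, b), (b, c, a)):
            # xy|z is displayed if some cluster holds x and y but not z.
            if any(x in s and y in s and z not in s for s in tree_clusters):
                triples.add((frozenset([x, y]), z))
    return triples


example = ((("a", "b"), "c"), "d")
print(sorted((sorted(pair), out) for pair, out in rooted_triples(example)))
```

For a binary tree, each set of three leaves contributes exactly one triple, so the example tree on four taxa displays four triples.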
We demonstrate that our ILP formulation can solve
non-trivial instances with up to 14 taxa and 1,000 gene
trees. This greatly improves upon the largest (unconstrained) instances of the GD problem that have been
solved exactly to date. Finally, we use the ILP formulation to address the relationships among the major seed
plant lineages. Our ILP formulation was able to solve the
GD problem exactly for a 12-taxon data set using 6,084
gene trees.
Methods
Preliminaries
Basic definitions
A rooted tree T is a connected and acyclic graph, consisting of a vertex set V(T) and an edge set E(T), that
has exactly one distinguished vertex called the root, which
we denote by Rt(T). Let T be a rooted tree. We define
≤T to be the partial order on V(T), where u ≤T v if v is a
vertex o (...truncated)