An ILP solution for the gene duplication problem

BMC Bioinformatics, Feb 2011

Background: The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard, and therefore most analyses use heuristics that lack any performance guarantee.

Results: We describe the first integer linear programming (ILP) formulation to solve instances of the gene duplication problem exactly. With simulations, we demonstrate that the ILP solution can solve problem instances with up to 14 taxa. Furthermore, we apply the new ILP solution to solve the gene duplication problem for the seed plant phylogeny using a 12-taxon, 6,084-gene data set. The unique, optimal solution, which places Gnetales sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in plant systematics.

Conclusions: Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1,000 genes in a few hours. These are the largest instances that have been solved to optimality to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.

Chang et al. BMC Bioinformatics 2011, 12(Suppl 1):S14
http://www.biomedcentral.com/1471-2105/12/S1/S14

RESEARCH Open Access

Wen-Chieh Chang1, Gordon J Burleigh2, David F Fernández-Baca1, Oliver Eulenstein1*

* Correspondence: 1 Department of Computer Science, Iowa State University, Ames, 50011, USA. Full list of author information is available at the end of the article.

From The Ninth Asia Pacific Bioinformatics Conference (APBC 2011), Inchon, Korea. 11-14 January 2011

Background

With recent advances in DNA sequencing technology, there is much interest in using genomic data sets to infer phylogenetic trees. However, evolutionary events such as gene duplication and loss, incomplete lineage sorting (deep coalescence), and lateral gene transfer can produce discordance between gene trees and the phylogeny of the species in which the genes evolve (e.g., [1]). The gene tree parsimony (GTP) problem [1-4] provides a direct approach to infer a species phylogeny from discordant gene trees. Given a collection of gene trees, this problem seeks a species tree that implies the minimum reconciliation cost, i.e., the fewest evolutionary events that can explain the discordance in the gene phylogenies. One of the most widely studied variants of the GTP problem is the gene duplication (GD) problem, in which the reconciliation cost is based on gene duplication events.

The GD problem is W[2]-hard when parameterized by the number of gene duplication events and hard to approximate to within better than a logarithmic factor [5]. One way to cope with this intractability in practice is to use heuristics [6,7]. Although these heuristics do not guarantee optimal solutions or any nontrivial theoretical bound, in many cases they appear to have produced credible estimates [8-11]. However, the lack of performance guarantees makes the pursuit of exact solutions for the GD problem desirable. Exact solutions can be found by exhaustive search for every NP-complete problem, but run times typically become prohibitively large even for rather small instances. Exact algorithms that are substantially faster than exhaustive search have nonetheless been progressively developed (e.g., [12,13]). Unfortunately, little work has focused on such algorithms for the GD problem [14]. Here, we describe an ILP formulation that solves the GD problem exactly and demonstrate its performance on both simulated and empirical data sets.
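To make the reconciliation cost concrete, the following is a minimal Python sketch of how gene duplications implied by a fixed species tree are commonly counted. It assumes the standard least-common-ancestor (LCA) mapping from the reconciliation literature: a gene-tree vertex counts as a duplication when it maps to the same species-tree vertex as one of its children. The nested-tuple tree encoding and the function names (`leaf_set`, `species_clusters`, `count_duplications`, `gd_cost`) are illustrative assumptions, not notation from the article, and the sketch is not the ILP formulation itself.

```python
def leaf_set(tree):
    """Return the set of leaf labels below a nested-tuple binary tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaf_set(left) | leaf_set(right)


def species_clusters(species_tree):
    """List the cluster (leaf set) of every vertex of the species tree."""
    if isinstance(species_tree, str):
        return [frozenset([species_tree])]
    left, right = species_tree
    return ([leaf_set(species_tree)]
            + species_clusters(left) + species_clusters(right))


def lca_cluster(labels, clusters):
    """LCA mapping: the smallest species-tree cluster containing `labels`.

    Raises ValueError if a label does not occur in the species tree.
    """
    return min((c for c in clusters if labels <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Count gene-tree vertices mapped to the same vertex as a child."""
    clusters = species_clusters(species_tree)

    def visit(node):
        # Returns (cluster the node maps to, duplications in its subtree).
        if isinstance(node, str):
            return lca_cluster(frozenset([node]), clusters), 0
        left, right = node
        map_left, dup_left = visit(left)
        map_right, dup_right = visit(right)
        mapped = lca_cluster(map_left | map_right, clusters)
        is_dup = 1 if mapped in (map_left, map_right) else 0
        return mapped, dup_left + dup_right + is_dup

    return visit(gene_tree)[1]


def gd_cost(gene_trees, species_tree):
    """Total duplications over a collection of gene trees."""
    return sum(count_duplications(g, species_tree) for g in gene_trees)


species = (("A", "B"), "C")
genes = [(("A", "B"), "C"),   # agrees with the species tree
         (("A", "C"), "B"),   # discordant topology
         (("A", "A"), "B")]   # two gene copies sampled from species A
print(gd_cost(genes, species))
```

Gene-tree leaves are labeled here by the species each gene was sampled from, so a label may repeat within one gene tree. The GD problem then asks for the species tree that minimizes `gd_cost` over all candidate species trees, which is the part that makes the problem hard.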
© 2011 Chang et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Related work

Exact solutions to the GD problem were obtained by exhaustively searching all possible species trees in data sets with up to 8 taxa [15,16]. More recently, a branch-and-bound algorithm to identify exact solutions for the GD problem was introduced [14]. This algorithm was applied to a data set consisting of 1,111 gene trees with 29 taxa, but it did not run to completion. However, the branch-and-bound algorithm was able to solve this instance when the search space was reduced by fixing some of the relationships in the species tree. Although constraining the search space for a species tree can help solve difficult instances of the GD problem, there are no theoretical guarantees to support this approach.

ILP formulations have provided an effective strategy to solve moderately sized instances of several NP-hard phylogenetic problems (e.g., [17-22]). Most similar to the GD problem, ILP formulations have been introduced for the deep coalescence problem, the variant of the GTP problem in which the reconciliation cost is based on deep coalescence events [23]. These formulations solved instances with up to 8 taxa. However, perhaps due to the difficulty of directly expressing gene duplications in terms of linear equations, there have been no ILP formulations for the GD problem.

Our contributions

We introduce a novel approach to solve the GD problem exactly by describing the first ILP formulation for this problem.
This solution is made possible by revealing an alternate description of the GD problem, called the triple inconsistency problem, which expresses gene duplications in terms of rooted triples. Rooted triples are rooted full binary trees with three leaves and are the smallest unit of phylogenetic information. Together with an equivalent representation of species trees as cluster hierarchies, they provide the fundamental elements of our ILP solution. We demonstrate that our ILP formulation can solve non-trivial instances with up to 14 taxa and 1,000 gene trees. This greatly improves upon the largest (unconstrained) instances of the GD problem that have been solved exactly to date. Finally, we use the ILP formulation to address the relationships among the major seed plant lineages. Our ILP formulation was able to solve the GD problem exactly for a 12-taxon data set using 6,084 gene trees.

Methods

Preliminaries

Basic definitions

A rooted tree T is a connected and acyclic graph consisting of a vertex set V(T) and an edge set E(T) that has exactly one distinguished vertex called the root, which we denote by Rt(T). Let T be a rooted tree. We define ≤T to be the partial order on V(T), where u ≤T v if v is a vertex o (...truncated)
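As an illustration of the rooted triples mentioned under "Our contributions", the sketch below enumerates every rooted triple displayed by a rooted binary tree. It assumes a nested-tuple tree encoding with unique leaf labels, and the names `leaves` and `rooted_triples` are hypothetical. It only shows what a rooted triple is; it does not reproduce the triple inconsistency problem or the ILP constraints, whose definitions are not included in this preview.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of leaf labels of a nested-tuple binary tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaves(left) | leaves(right)


def rooted_triples(tree):
    """Return all rooted triples ab|c displayed by `tree`.

    A triple ab|c is encoded as (frozenset({a, b}), c): leaves a and b
    share an ancestor that is not an ancestor of c. Leaf labels are
    assumed to be unique within the tree.
    """
    if isinstance(tree, str):
        return set()
    left, right = tree
    triples = rooted_triples(left) | rooted_triples(right)
    left_leaves, right_leaves = leaves(left), leaves(right)
    # At this vertex, any two leaves on one side form the "cherry" of a
    # triple whose third leaf lies on the other side.
    for inside, outside in ((left_leaves, right_leaves),
                            (right_leaves, left_leaves)):
        for a, b in combinations(sorted(inside), 2):
            for c in outside:
                triples.add((frozenset((a, b)), c))
    return triples


balanced = (("A", "B"), ("C", "D"))
for pair, outlier in sorted(rooted_triples(balanced),
                            key=lambda t: (sorted(t[0]), t[1])):
    print("".join(sorted(pair)) + "|" + outlier)
```

Each set of three leaves contributes exactly one triple, found at the vertex where the three leaves first come together, so a tree with n uniquely labeled leaves displays n(n-1)(n-2)/6 rooted triples.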


This is a preview of a remote PDF: http://www.biomedcentral.com/content/pdf/1471-2105-12-S1-S14.pdf
Article home page: http://www.biomedcentral.com/1471-2105/12/S1/S14

Wen-Chieh Chang, Gordon J Burleigh, David F Fernández-Baca, Oliver Eulenstein. An ILP solution for the gene duplication problem. BMC Bioinformatics 2011, 12(Suppl 1):S14. DOI: 10.1186/1471-2105-12-S1-S14