ASTRID: Accurate Species TRees from Internode Distances


Vachaspati and Warnow BMC Genomics 2015, 16(Suppl 10):S3
http://www.biomedcentral.com/1471-2164/16/S10/S3

RESEARCH Open Access

ASTRID: Accurate Species TRees from Internode Distances

Pranjal Vachaspati, Tandy Warnow*

From 13th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics
Frankfurt, Germany. 4-7 October 2015

Abstract

Background: Incomplete lineage sorting (ILS), modelled by the multi-species coalescent (MSC), is known to create discordance between gene trees and species trees, and lead to inaccurate species tree estimations unless appropriate methods are used to estimate the species tree. While many statistically consistent methods have been developed to estimate the species tree in the presence of ILS, only ASTRAL-2 and NJst have been shown to have good accuracy on large datasets. Yet, NJst is generally slower and less accurate than ASTRAL-2, and cannot run on some datasets.

Results: We have redesigned NJst to enable it to run on all datasets, and we have expanded its design space so that it can be used with different distance-based tree estimation methods. The resultant method, ASTRID, is statistically consistent under the MSC model, and has accuracy that is competitive with ASTRAL-2. Furthermore, ASTRID is much faster than ASTRAL-2, completing in minutes on some datasets for which ASTRAL-2 used hours.

Conclusions: ASTRID is a new coalescent-based method for species tree estimation that is competitive with the best current method in terms of accuracy, while being much faster. ASTRID is available in open source form on GitHub.

Background

Species tree estimation in the presence of gene tree incongruence is a major challenge for many biological analyses. Gene tree incongruence can result from a variety of processes, notably incomplete lineage sorting (ILS) [1], which is modelled by the multispecies coalescent (MSC) [2].
Concatenated maximum likelihood analysis is generally the most common method for species tree estimation from multiple loci, but it can be statistically inconsistent, and even positively misleading, in some cases [3], thus converging to an incorrect tree with increasing amounts of sequence data. In recent years, a number of species tree estimation methods have been developed that are statistically consistent under the MSC, and so will converge in probability to the true species tree as the amount of data increases; see [4-6]. Methods that are statistically consistent under the MSC include ASTRAL [7], ASTRAL-2 [8], *BEAST [9], BEST [10], the population tree from BUCKy [11], METAL [12], MP-EST [13], NJst [14], SNAPP [15], STEAC [16], STEM [17], and SVDquartets [18]. While little is yet known about some of these methods (either because they have not yet been adequately studied or because they are not yet implemented), only a few of them (MP-EST, NJst, and ASTRAL-2) have been shown to be able to analyze very large datasets (especially those with large numbers of taxa) with high accuracy. MP-EST has been used more than either NJst or ASTRAL-2, but NJst is more accurate than MP-EST, and ASTRAL-2 is more accurate than both [8]. Furthermore, the currently available implementation of NJst is slower than ASTRAL-2, and cannot run on some datasets [8,14].

* Correspondence: Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Avenue, Urbana, IL, 61801 USA

In this paper, we present ASTRID, a new ILS-aware distance-based method for species tree estimation. Our approach is based on NJst, but is substantially faster, and, unlike NJst, functions even when each gene tree contains only a small portion of the data.

The input to NJst is a set of unrooted gene trees.
In the first step, an n × n matrix D[x, y] is computed, where D[x, y] is the average distance (in terms of number of edges) between x and y among all the gene trees. In the second step, neighbor joining [19], a very popular distance-based method of phylogeny estimation, is used to produce the species tree.

© 2015 Vachaspati and Warnow. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

ASTRID improves on NJst by enabling other distance-based methods to be used in the second step. In particular, although NJ cannot be run on datasets with missing entries, other distance-based methods can, and ASTRID enables the use of these other methods. We also explore the use of more accurate distance-based methods. Thus, ASTRID is a very simple modification to NJst.

As we will show, ASTRID is much faster than NJst. The comparison between ASTRID and ASTRAL-2 and MP-EST, two established coalescent-based summary methods, is also interesting. ASTRID completed in minutes on some datasets where the other methods took hours, and was fast enough to analyze datasets with 1000 species and 1000 genes on a single processor within an hour (ASTRAL-2 and MP-EST take much more time on datasets of this size). Furthermore, ASTRID clearly dominates MP-EST in terms of accuracy, and is competitive with ASTRAL-2 (more accurate in some cases, and less accurate in others).
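The first step of NJst described above (averaging, over the gene trees, the number of edges separating each pair of species) can be illustrated with a short Python sketch. This is not the authors' implementation: the edge-list representation of a gene tree, the function names leaf_distances and average_internode_distances, and the toy trees are all assumptions made for illustration. Pairs of species that never appear together in any gene tree are marked with -1 to denote a missing value.

```python
from collections import deque
from itertools import combinations


def leaf_distances(edges):
    """Return {(p, q): edge count on the p-q path} for all leaf pairs of one tree.

    Assumption: a tree is a list of (node, node) edges with string labels,
    and every node of degree 1 is a leaf (a species).
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    leaves = sorted(node for node, nbrs in adj.items() if len(nbrs) == 1)
    dist = {}
    for src in leaves:
        # Breadth-first search gives the number of edges from src to every node.
        depth = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in depth:
                    depth[nxt] = depth[node] + 1
                    queue.append(nxt)
        for dst in leaves:
            if src < dst:
                dist[(src, dst)] = depth[dst]
    return dist


def average_internode_distances(gene_trees):
    """Average the internode distance of each species pair over the gene
    trees that contain both species; -1 marks pairs that never co-occur."""
    totals, counts, taxa = {}, {}, set()
    for edges in gene_trees:
        for pair, value in leaf_distances(edges).items():
            totals[pair] = totals.get(pair, 0) + value
            counts[pair] = counts.get(pair, 0) + 1
            taxa.update(pair)
    return {
        pair: (totals[pair] / counts[pair] if pair in counts else -1)
        for pair in combinations(sorted(taxa), 2)
    }


# Two hypothetical unrooted gene trees on overlapping species sets:
# ((A,B),(C,D)) and ((A,C),(B,E)); lowercase labels are internal nodes.
tree_1 = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"), ("D", "v")]
tree_2 = [("A", "x"), ("C", "x"), ("x", "y"), ("B", "y"), ("E", "y")]
matrix = average_internode_distances([tree_1, tree_2])
for pair in sorted(matrix):
    print(pair, matrix[pair])
```

In this toy input, species D and E never share a gene tree, so their entry is missing. Neighbor joining cannot be run on a matrix with such an entry, which is the situation that motivates allowing other distance-based methods in the second step.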
Finally, ASTRID has desirable theoretical properties: it runs in polynomial time, and it remains statistically consistent under the MSC model without assuming a molecular clock or requiring rooted gene trees as input.

Methods

ASTRID

The input to ASTRID is a set of unrooted gene trees T_1, ..., T_k. We let S_i = L(T_i) denote the leafset of T_i, and S = ∪_i L(T_i). Let |S| = n.

Step 1: Construct n × n matrix M̄:

1. For all i = 1, 2, ..., k, compute n × n matrix M_i, as follows. For pairs p, q of species where both are in S_i, set M_i(p, q) to be the number of edges in the path between p and q in T_i. For all other pairs p, q (i.e., where one or both are not in S_i), set M_i(p, q) = 0. Thus, the only non-zero entries in M_i are for pairs of species in T_i.

2. For all {p, q} ⊂ S, let n(p, q) be the number of trees T_i that contain both p and q.

3. Define n × n matrix M̄ by setting

\[
\bar{M}(p, q) =
\begin{cases}
\dfrac{\sum_{i=1}^{k} M_i(p, q)}{n(p, q)} & \text{if } n(p, q) > 0, \\
-1 \ \text{(to denote a missing value)} & \text{otherwise.}
\end{cases}
\]

Step 2: Compute tree on M̄ using a selected distance-based method.

ASTRID on the mammalian biological dataset of 37 species, originally studied in [20]. Here we briefly describe the simulation procedures used to generate these datasets, and provide empirical statistics for the datasets in Table 1. See the original publications for details about the simulation protocols, and our supplementary online m (...truncated)