Tree of Life Based on Genome Context Networks

http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0003357&type=printable

Citation: Ding G, Yu Z, Zhao J, Wang Z, Li Y, et al. Tree of Life Based on Genome Context Networks.

Guohui Ding, Zhonghao Yu, Jing Zhao, Zhen Wang, Yun Li, Xiaobin Xing, Chuan Wang, Lei Liu, Yixue Li

Editor: Alan Christoffels, University of Western Cape, South Africa

1 Bioinformatics Center, Key Lab of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People's Republic of China, 2 Graduate School of the Chinese Academy of Sciences, Shanghai, People's Republic of China, 3 College of Life Science & Biotechnology, Shanghai Jiao Tong University, Shanghai, People's Republic of China, 4 Shanghai Center for Bioinformation Technology, Shanghai, People's Republic of China

Efforts in phylogenomics have greatly improved our understanding of the backbone tree of life. However, owing to systematic error in sequence data, a sequence-based phylogenomic approach leads to well-resolved but statistically significant incongruence. Thus, an independent test of current phylogenetic knowledge is required. Here, we have devised a distance-based strategy to reconstruct a highly resolved backbone tree of life on the basis of the genome context networks of 195 fully sequenced representative species. Along with strongly supporting the monophyly of the three superkingdoms and most taxonomic subdivisions, the derived tree also suggests some intriguing results, such as a high G+C Gram-positive origin of Bacteria and the classification of Symbiobacterium thermophilum and Alcanivorax borkumensis in Firmicutes. Furthermore, simulation analyses indicate that adding more gene relationships with high accuracy can greatly improve the resolution of the phylogenetic tree. Our results demonstrate the feasibility of reconstructing a highly resolved phylogenetic tree with extensible gene networks across all three domains of life.
This strategy also implies that the relationships between the genes (the gene network) can define what kind of species it is.

Funding: The 973 National Key Basic Research Program of China (grant no. 2006CB910705, 2003CB715901, 2002CB512801). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

These authors contributed equally to this work.

A highly resolved tree of life is a useful tool for biologists to make inferences about the dynamic processes of biological phenomena and to present evolutionary explanations [1]. Even though horizontal gene transfer (HGT) challenges the concept of a tree of life and suggests using a ticket-like network to depict evolution [2,3], the backbone of the tree of life remains intact [4], revealing the prevailing trend in the evolution of genome-scale gene sets or species [5]. This intact backbone tree could be inferred from whole-genome information. To construct a species tree rather than gene trees, several phylogenomic methods have been developed (reviewed in [6]). However, owing to compositional bias in sequences and rate variation across lineages and within sites [6,7], a sequence-based phylogenomic approach leads to well-resolved but statistically significant incongruence, and questions that are not resolved by a kilobase of sequence are seldom resolved by a megabase [8]. In addition, phylogenetic reconstruction methods based on rare genomic changes (RGC) are limited in their ability to produce highly resolved phylogenetic trees. This limitation stems mainly from the difficulty of reliably identifying these Hennigian markers, insufficient use of genomic information, and the absence of statistical evaluation [9]. Thus, more sophisticated strategies are required to reconstruct the backbone tree of life as well as to test it independently.
As the question posed in the tale of the oracle at Delphi suggests, the relationships between the planks determine what kind of boat it is [10]. Similarly, in the evolution of genomes, the relationships between the genes (gene networks), which make the genome function in its molecular and cellular contexts, determine what kind of species it is. With the development of computational methods for deriving gene networks from heterogeneous functional genomics data [11,12] and for measuring the similarity between two networks [13], it is now possible to infer the tree of life by comparing gene networks among species. The guiding principle underlying this approach is that a gene network is possibly the most subtle representation of the phenotype of an organism, and vast amounts of evolutionary information may be hidden within it (Figure 1A and Figure S1). To demonstrate the feasibility of this strategy, we have sought to construct a tree of life by considering the information contained within gene relationships at the genome level, as opposed to examining primary sequence identity. Such a strategy has been tested on metabolic pathways [13,14].

Herein we employed multi-edge gene networks to represent the information of genomic gene relationships. These networks allow two or more edges to link the same gene pair (Figure 1A) and associate evidence (e.g., the method used to infer the edge) as a property of each edge. We refer to such a multi-edge gene network as a gene relationship network (GRN). Ideally, if all possible relationships among genes could be obtained, this network would be a full-information representation of an organism. The differences between GRNs could then be interpreted as a consequence of the fundamental properties of the species, which can be used to explore the tree of life. In practice, however, these differences can also be induced by the methods used to construct the networks.
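The multi-edge representation described above can be illustrated with a minimal sketch. It assumes that genes from different species are labelled with shared identifiers (for example, orthologous-group labels), and it uses a Jaccard distance over evidence-labelled edges purely for illustration. The class name, the function name and the distance are hypothetical and are not the comparison measure used in this work.

```python
from collections import defaultdict


class GeneRelationshipNetwork:
    """Illustrative multi-edge gene network (hypothetical helper, not the authors' code).

    A gene pair may carry several edges, one per evidence type.
    """

    def __init__(self):
        # Maps an unordered gene pair to the set of evidence labels supporting it.
        self._edges = defaultdict(set)

    def add_edge(self, gene_a, gene_b, evidence):
        if gene_a == gene_b:
            raise ValueError("self-loops are not modelled in this sketch")
        self._edges[frozenset((gene_a, gene_b))].add(evidence)

    def evidence(self, gene_a, gene_b):
        """Return the evidence labels recorded for a gene pair (empty set if none)."""
        return set(self._edges.get(frozenset((gene_a, gene_b)), set()))

    def labelled_edges(self):
        """Return every (gene pair, evidence) combination as a set."""
        return {(pair, ev) for pair, evs in self._edges.items() for ev in evs}


def jaccard_distance(net_a, net_b):
    """Assumed illustrative distance: 1 - |shared edges| / |all edges|."""
    edges_a = net_a.labelled_edges()
    edges_b = net_b.labelled_edges()
    union = edges_a | edges_b
    if not union:
        return 0.0  # two empty networks are treated as identical
    return 1.0 - len(edges_a & edges_b) / len(union)


# Two toy networks over hypothetical ortholog labels g1..g3.
species_x = GeneRelationshipNetwork()
species_x.add_edge("g1", "g2", "gene_fusion")
species_x.add_edge("g1", "g2", "gene_neighbor")  # second edge on the same pair
species_x.add_edge("g2", "g3", "phylogenetic_profile")

species_y = GeneRelationshipNetwork()
species_y.add_edge("g2", "g1", "gene_fusion")
species_y.add_edge("g2", "g3", "phylogenetic_profile")

distance_xy = jaccard_distance(species_x, species_y)
```

Storing evidence as a property of each edge keeps the edges from different inference methods separate, so a comparison can count a gene pair once per supporting method rather than once overall.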
For example, more gene relationships can be found in model organisms than in non-model organisms if a literature mining method is used. Hence, in the absence of ideal GRNs, unbiased methods must be used to build operational gene networks that approximate the ideal ones. In this work, we have used genome context networks (GCNs), in which nodes represent genes and edges are inferred from genome context, as they are, to our knowledge, the only networks that can currently be constructed fairly for all genome-sequenced organisms.

Results and Discussion

By integrating phylogenetic profiles, gene fusions and gene neighbors (Figure S2), we constructed GCNs from the genome sequences of 195 organisms (Table S1). Pairwise comparison of the GCNs was then conducted to obtain a 195 × 195 distance matrix. With this matrix, we created a phylogeny of the 195 species using the neighbor-joining algorithm [15]. To assess how strongly the data support the resulting tree, a specific robustness test (see Material and Methods, Figure S3) corresponding to the traditional bootstrapping approach in phylogenetics was employed. The outline of this strategy is shown in Figure 1B.

Tree Topologies

Our str (...truncated)