Tree of Life Based on Genome Context Networks
Citation: Ding G, Yu Z, Zhao J, Wang Z, Li Y, et al. (
Guohui Ding
Zhonghao Yu
Jing Zhao
Zhen Wang
Yun Li
Xiaobin Xing
Chuan Wang
Lei Liu
Yixue Li
Alan Christoffels, University of Western Cape, South Africa
1 Bioinformatics Center, Key Lab of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People's Republic of China, 2 Graduate School of the Chinese Academy of Sciences, Shanghai, People's Republic of China, 3 College of Life Science & Biotechnology, Shanghai Jiao Tong University, Shanghai, People's Republic of China, 4 Shanghai Center for Bioinformation Technology, Shanghai, People's Republic of China
Efforts in phylogenomics have greatly improved our understanding of the backbone tree of life. However, due to systematic error in sequence data, a sequence-based phylogenomic approach leads to well-resolved but statistically significant incongruence. Thus, an independent test of current phylogenetic knowledge is required. Here, we have devised a distance-based strategy to reconstruct a highly resolved backbone tree of life on the basis of the genome context networks of 195 fully sequenced representative species. Along with strongly supporting the monophyly of the three superkingdoms and most taxonomic subdivisions, the derived tree also suggests some intriguing results, such as a high G+C Gram-positive origin of Bacteria and the classification of Symbiobacterium thermophilum and Alcanivorax borkumensis in Firmicutes. Furthermore, simulation analyses indicate that adding more gene relationships with high accuracy can greatly improve the resolution of the phylogenetic tree. Our results demonstrate the feasibility of reconstructing a highly resolved phylogenetic tree with extensible gene networks across all three domains of life. This strategy also implies that the relationships between genes (the gene network) can define what kind of species an organism is.
Funding: The 973 National Key Basic Research Program of China (grant no. 2006CB910705, 2003CB715901, 2002CB512801). The funders had no role in study
design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
These authors contributed equally to this work.
A highly resolved tree of life is a useful tool for biologists to make
inferences about the dynamic processes of biological phenomena
and to present evolutionary explanations [1]. Even though
horizontal gene transfer (HGT) challenges the concept of a tree
of life and suggests using a thicket-like network to depict evolution
[2,3], the backbone of the tree of life is intact [4], revealing the
prevailing trend in the evolution of genome-scale gene sets or
species [5]. This intact backbone tree could be inferred from
whole-genome information.
To construct a species tree rather than gene trees, several
phylogenomic methods have been developed (reviewed in [6]). However,
due to compositional bias in sequences and rate variation
across lineages and within sites [6,7], a sequence-based phylogenomic
approach leads to well-resolved but statistically significant
incongruence, and questions that are not resolved by a kilobase of sequence
are seldom resolved by a megabase [8]. In addition, phylogenetic
reconstruction methods based on rare genomic changes (RGC) are
limited in their ability to produce highly resolved phylogenetic trees. This
limitation stems mainly from the difficulty of reliably identifying
these Hennigian markers, insufficient use of the genomic
information and the absence of statistical evaluation [9]. Thus, more
sophisticated strategies are required to reconstruct the backbone tree
of life as well as to test it independently.
As the question posed in the tale of the oracle at Delphi suggests,
the relationships between the planks determine what kind of boat it
is [10]. Similarly, in the evolution of genomes, the relationships
between the genes (gene networks), which make the genome
function in its molecular and cellular contexts, determine what
kind of species it is. Currently, with the development of
computational methods for deriving gene networks from
heterogeneous functional genomics data [11,12] and measuring the
similarity between two networks [13], it is possible to infer the tree
of life from the comparison of gene networks among species. The
guiding principle underlying this approach is that the gene network is
possibly the most subtle representation of the phenotype of an
organism and vast amounts of evolutionary information may be
hidden away within it (Figure 1A and Figure S1). In order to
demonstrate the feasibility of this strategy, we have sought to
construct a tree of life by considering the information contained
within gene relationships at the genome level, as opposed to
examining primary sequence identity. Such a strategy has been
tested on metabolic pathways [13,14].
Herein we employed multi-edge gene networks to represent the
information of genomic gene relationships. These networks allow
two or more edges to link the same gene pair (Figure 1A) and
associate evidence (e.g., the method used to infer the edge) as a property
of each edge. We refer to such a multi-edge gene network as a
gene relationship network (GRN). Ideally, if all the possible
relationships among genes could be obtained, this network would
be a full-information representation of an organism. Then, the
difference between GRNs can be interpreted as a consequence of
the fundamental properties of the species, which can be utilized to
explore the tree of life. In practice, however, these differences can
also be introduced by the methods used to construct the networks.
For example, more gene relationships can be found in model
organisms than in non-model organisms when using a literature mining
method. Hence, in the absence of ideal GRNs, unbiased methods
must be used to build operational gene networks that
approximate the ideal gene networks. In this work, we have used
genome context networks (GCNs), in which nodes represent
genes and edges are inferred from genome context, as these are,
to our knowledge, the only networks that can currently be
constructed fairly for all genome-sequenced organisms.
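The multi-edge structure described above can be sketched in a few lines of Python. This is only an illustrative data structure, not the authors' implementation: the class name `GeneRelationshipNetwork`, its methods, the gene identifiers and the evidence labels are all hypothetical.

```python
from collections import defaultdict


class GeneRelationshipNetwork:
    """Illustrative multi-edge gene network (hypothetical, not the paper's code).

    A gene pair may carry several edges, one per type of evidence.
    """

    def __init__(self):
        # Maps an unordered gene pair to the set of evidence labels linking it.
        self._edges = defaultdict(set)

    def add_edge(self, gene_a, gene_b, evidence):
        if gene_a == gene_b:
            raise ValueError("self-loops are not modelled in this sketch")
        self._edges[frozenset((gene_a, gene_b))].add(evidence)

    def evidence(self, gene_a, gene_b):
        """Return the evidence labels for a gene pair (empty set if unlinked)."""
        return set(self._edges.get(frozenset((gene_a, gene_b)), set()))

    def edges(self):
        """Return every (gene pair, evidence) combination as a set of tuples."""
        return {(pair, ev) for pair, evs in self._edges.items() for ev in evs}


# Toy network: gene identifiers and evidence labels are made up.
grn = GeneRelationshipNetwork()
grn.add_edge("geneA", "geneB", "gene_neighbor")
grn.add_edge("geneB", "geneA", "phylogenetic_profile")  # second edge, same pair
grn.add_edge("geneA", "geneC", "gene_fusion")
```

Storing the evidence label on each edge keeps edges inferred by different methods distinguishable, so a pair supported by two methods contributes two edges rather than one.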
Results and Discussion
By integrating phylogenetic profiles, gene fusions and gene
neighbors (Figure S2), we constructed GCNs from genome
sequences of 195 organisms (Table S1). Then, pairwise
comparison of GCNs was conducted to obtain a 195 × 195 distance
matrix. With this matrix, we created a phylogeny of 195 species
using the neighbor-joining algorithm [15]. To assess how strongly
the data supports the resulting tree, a specific robustness test (see
Materials and Methods, Figure S3) corresponding to the traditional
bootstrapping approach in phylogenetics was employed. The
outline of this strategy is shown in Figure 1B.
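The distance-matrix and tree-building steps can be illustrated with a minimal Python sketch. Two points here are assumptions and not taken from the text: the network distance is a plain Jaccard distance on edge sets (the actual measure is described in the Methods), and genes are assumed to share identifiers across species (e.g., orthologous group labels). The neighbor-joining routine is a textbook version that returns topology only, without branch lengths, and all species and gene names are hypothetical.

```python
from itertools import combinations


def edge(gene_a, gene_b, evidence):
    """Canonical representation of one undirected, evidence-typed edge."""
    gene_a, gene_b = sorted((gene_a, gene_b))
    return (gene_a, gene_b, evidence)


def jaccard_distance(edges_a, edges_b):
    """ASSUMED distance: 1 - |intersection| / |union| of the two edge sets."""
    union = edges_a | edges_b
    if not union:
        return 0.0
    return 1.0 - len(edges_a & edges_b) / len(union)


def distance_matrix(networks):
    """Pairwise distances between all networks, stored symmetrically."""
    names = sorted(networks)
    dist = {(a, a): 0.0 for a in names}
    for a, b in combinations(names, 2):
        dist[(a, b)] = dist[(b, a)] = jaccard_distance(networks[a], networks[b])
    return names, dist


def neighbor_joining(names, dist):
    """Textbook neighbor joining; returns an unrooted topology as nested tuples."""
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {i: sum(d[(i, k)] for k in nodes if k != i) for i in nodes}
        # Join the pair minimising the Q criterion.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
        new = (i, j)
        for k in nodes:
            if k not in (i, j):
                d[(new, k)] = d[(k, new)] = 0.5 * (d[(i, k)] + d[(j, k)] - d[(i, j)])
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    return tuple(nodes)


# Toy data: four hypothetical species forming two pairs with similar networks.
shared_ab = {edge("g1", "g2", "neighbor"), edge("g2", "g3", "fusion"),
             edge("g1", "g3", "profile")}
shared_cd = {edge("g4", "g5", "neighbor"), edge("g5", "g6", "fusion"),
             edge("g4", "g6", "profile")}
networks = {
    "sp_A": shared_ab | {edge("g1", "g7", "neighbor")},
    "sp_B": shared_ab | {edge("g2", "g8", "profile")},
    "sp_C": shared_cd | {edge("g1", "g2", "neighbor")},
    "sp_D": shared_cd | {edge("g5", "g9", "fusion")},
}
names, dist = distance_matrix(networks)
tree = neighbor_joining(names, dist)
```

In this toy case the species with largely shared edge sets are joined first, so the resulting topology separates sp_A and sp_B from sp_C and sp_D. For 195 genomes the same steps would yield the 195 × 195 matrix described above, although a production analysis would normally rely on an established phylogenetics package.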
Tree Topologies
Our str (...truncated)