Phylogenetic and Functional Assessment of Orthologs Inference Projects and Methods
Citation: Altenhoff AM, Dessimoz C (
Adrian M. Altenhoff 0
Christophe Dessimoz 0
Jonathan A. Eisen, University of California Davis, United States of America
0 Institute of Computational Science, ETH Zurich, and Swiss Institute of Bioinformatics, Zurich, Switzerland
Accurate genome-wide identification of orthologs is a central problem in comparative genomics, a fact reflected by the numerous orthology identification projects developed in recent years. However, only a few reports have compared their accuracy, and indeed, several recent efforts have not yet been systematically evaluated. Furthermore, orthology is typically only assessed in terms of function conservation, despite the phylogeny-based original definition of Fitch. We collected and mapped the results of nine leading orthology projects and methods (COG, KOG, Inparanoid, OrthoMCL, Ensembl Compara, Homologene, RoundUp, EggNOG, and OMA) and two standard methods (bidirectional best-hit and reciprocal smallest distance). We systematically compared their predictions with respect to both phylogeny and function, using six different tests. This required the mapping of millions of sequences, the handling of hundreds of millions of predicted pairs of orthologs, and the computation of tens of thousands of trees. In phylogenetic analysis or in functional analysis where high specificity is required, we find that OMA and Homologene perform best. At a lower level of functional specificity but a higher level of coverage, OrthoMCL outperforms Ensembl Compara, and to a lesser extent Inparanoid. Lastly, the large coverage of the recent EggNOG can be of interest for building broad functional groupings, but the method is not specific enough for phylogenetic or detailed functional analyses. In terms of general methodology, we observe that the more sophisticated tree reconstruction/reconciliation approach of Ensembl Compara was at times outperformed by pairwise comparison approaches, even in phylogenetic tests. Furthermore, we show that standard bidirectional best-hit often outperforms projects with more complex algorithms. First, the present study provides guidance for the broad community of orthology data users as to which database best suits their needs. Second, it introduces new methodology to verify orthology. And third, it sets performance standards for current and future approaches.
The identification of orthologs is an important problem in the
field of comparative genomics. Many studies, such as gene function
prediction, phylogenetic analyses, and genomics context analyses,
depend on accurate predictions of orthology. A large variety of
methods for predicting orthologs and the resulting databases have
appeared in recent years [1–8]. But although the accuracy of the
predictions strongly affects any downstream analysis, there are only a
few comparative studies of the quality of the different prediction
algorithms [9,10]. This paucity can be attributed to at least three
major challenges. The first challenge resides in the multiple and
sometimes intrinsically conflicting definitions of orthology [11–13].
The original definition of Fitch [14] is based on the evolutionary
history of genes: two genes are orthologs if they diverged through a
speciation event. On the other hand, given that orthologs often have
similar function, many people use the term orthologs to refer to
genes with conserved function. Yet another definition is used in
some studies of genome rearrangement, in which the ortholog
refers, in the event of a duplication, to the original sequence [15],
which remains in its genomic context.
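The phylogenetic definition of Fitch can be illustrated with a minimal sketch, assuming a rooted gene tree whose internal nodes are already labelled as speciation or duplication events (in practice, obtaining such labels is the hard part). The `Node` class, the `classify_pairs` function, and the toy gene names below are hypothetical and only serve the illustration; they do not correspond to any of the projects compared here.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    """Node of a rooted gene tree: leaves carry a gene name, internal nodes an event."""
    name: str = ""
    event: str = ""  # "speciation" or "duplication" (internal nodes only)
    children: list = field(default_factory=list)


def leaves(node):
    """Return the gene names found below a node."""
    if not node.children:
        return [node.name]
    return [gene for child in node.children for gene in leaves(child)]


def classify_pairs(node, out=None):
    """Label each gene pair by the event at the pair's last common ancestor."""
    if out is None:
        out = {}
    relation = "ortholog" if node.event == "speciation" else "paralog"
    # Genes in different subtrees of this node have this node as their
    # last common ancestor, so its event decides their relationship.
    for left, right in combinations(node.children, 2):
        for a in leaves(left):
            for b in leaves(right):
                out[frozenset((a, b))] = relation
    for child in node.children:
        classify_pairs(child, out)
    return out


# Hypothetical toy tree: a duplication that predates a human/mouse speciation.
tree = Node(event="duplication", children=[
    Node(event="speciation", children=[Node("human_A"), Node("mouse_A")]),
    Node(event="speciation", children=[Node("human_B"), Node("mouse_B")]),
])
relations = classify_pairs(tree)
```

In this toy tree, `human_A` and `mouse_A` coalesce at a speciation node and are therefore orthologs, whereas `human_A` and `mouse_B` coalesce at the duplication and are paralogs, even though they belong to different species.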
The second challenge resides in the difficulty of validating the
predictions. Take the case of phylogenetic orthology. Gene tree
inference can be a notoriously difficult task, but it is usually precisely
in difficult cases that the performance of methods can be
differentiated. Indeed, in simple cases, most methods perform equally
well. Validating the function-based definition is no easier:
orthology is in this context arguably impossible to verify because there is
no universally applicable, unequivocal definition of conserved
function. That is, the required similarity in terms of regulation,
chemical activity, interaction partners, etc. for two genes to qualify as
orthologs often varies across studies. For instance, in some wet lab
experiments [16,17], two genes are only considered orthologs if they
have the ability to complement each other's function.
The third challenge is of a practical nature: to compare the
different orthology inference projects, their methods must either be
replicated on a common set of data, or the results produced by the
different databases must be mapped to each other for comparison.
Replication is not always possible, because some projects depend on
human curation, or are not documented in detail. Mapping data is
complicated by the lack of homogeneity in the sources of genomic
data used by the different projects. The resulting intersection sets are
often relatively small and may not be representative.
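The mapping problem can be sketched as follows, under strong simplifying assumptions: each project is assumed to come with a ready-made table translating its own identifiers to shared ones, which is precisely what is hard to obtain in practice. All names below (`to_common_ids`, `restrict_to_shared_genes`, the project labels `X` and `Y`, and the identifiers) are hypothetical and are not taken from any of the databases discussed.

```python
def to_common_ids(pairs, id_map):
    """Translate predicted pairs to shared identifiers.

    Pairs involving a gene without a mapping are dropped.
    """
    return {
        frozenset((id_map[a], id_map[b]))
        for a, b in pairs
        if a in id_map and b in id_map
    }


def restrict_to_shared_genes(predictions, covered):
    """Keep only the pairs whose two genes are covered by every project."""
    shared = set.intersection(*covered.values())
    comparable = {
        project: {pair for pair in pairs if pair <= shared}
        for project, pairs in predictions.items()
    }
    return comparable, shared


# Hypothetical identifier tables and predictions for two toy projects.
id_map_x = {"x1": "G1", "x2": "G2", "x3": "G3"}
id_map_y = {"y1": "G1", "y2": "G2", "y4": "G4"}
predictions = {
    "X": to_common_ids([("x1", "x2"), ("x1", "x3")], id_map_x),
    "Y": to_common_ids([("y1", "y2"), ("y2", "y4")], id_map_y),
}
covered = {"X": set(id_map_x.values()), "Y": set(id_map_y.values())}
comparable, shared = restrict_to_shared_genes(predictions, covered)
```

Even in this toy case, only two of the four genes survive the intersection, and each project loses half of its predicted pairs, which illustrates why intersection sets can become small and unrepresentative.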
Author Summary
The identification of orthologs, pairs of homologous genes
in different species that started diverging through
speciation events, is a central problem in genomics with
applications in many research areas, including comparative
genomics, phylogenetics, protein function annotation, and
genome rearrangement. An increasing number of projects
aim at inferring orthologs from complete genomes, but
little is known about their relative accuracy or coverage.
Because the exact evolutionary history of entire genomes
remains largely unknown, predictions can only be
validated indirectly, that is, in the context of the different
applications of orthology. The few comparison studies
published so far have assessed orthology exclusively from
the expectation that orthologs have conserved protein
function. In the present work, we introduce methodology
to verify orthology in terms of phylogeny and perform a
comprehensive comparison of nine leading ortholog
inference projects and two methods using both
phylogenetic and functional tests. The results show large variations
among the different projects in terms of performance,
which indicates that the choice of orthology database can
have a strong impact on any downstream analysis.
In the present article, we provide an in-depth comparison of the
predictions from 11 major projects, including OMA [4], our own
orthology inference effort. We try to address the aforementioned
challenges by testing phylogenetic and functional definitions of
orthologs, using a variety of tests. We took the approach of
comparing the inferred orthologs available from the different
projects, which required mapping the data between projects. The
rest of this introduction provides a description of the projects
retained here, a review of the representation of orthology in those
projects so as to provide a common basis for comparison, and fina (...truncated)