Updated clusters of orthologous genes for Archaea: a complex ancestor of the Archaea and the byways of horizontal gene transfer
Biology Direct
Yuri I Wolf 0
Kira S Makarova 0
Natalya Yutin 0
Eugene V Koonin 0
0 National Center for Biotechnology Information, NLM, National Institutes of Health, Bethesda, MD 20894, USA
Background: Collections of Clusters of Orthologous Genes (COGs) provide indispensable tools for comparative genomic analysis, evolutionary reconstruction and functional annotation of new genomes. Initially, COGs were made for all complete genomes of cellular life forms that were available at the time. However, with the accumulation of thousands of complete genomes, construction of a comprehensive COG set has become extremely computationally demanding and prone to error propagation, necessitating the switch to taxon-specific COG collections. Previously, we reported the collection of COGs for 41 genomes of Archaea (arCOGs). Here we present a major update of the arCOGs and describe evolutionary reconstructions to reveal general trends in the evolution of Archaea.
Results: The updated version of the arCOG database incorporates 91% of the pangenome of 120 archaea (251,032 protein-coding genes altogether) into 10,335 arCOGs. Using this new set of arCOGs, we performed maximum likelihood reconstruction of the genome content of archaeal ancestral forms and gene gain and loss events in archaeal evolution. This reconstruction shows that the last common ancestor of the extant Archaea was an organism of greater complexity than most of the extant archaea, probably with over 2,500 protein-coding genes. The subsequent evolution of almost all archaeal lineages was apparently dominated by gene loss resulting in genome streamlining. Overall, in the evolution of Archaea as well as a representative set of bacteria that was similarly analyzed for comparison, gene losses are estimated to outnumber gene gains at least 4 to 1. Analysis of specific patterns of gene gain in Archaea shows that, although some groups, in particular Halobacteria, acquire substantially more genes than others, on the whole, gene exchange between major groups of Archaea appears to be largely random, with no major 'highways' of horizontal gene transfer.
Conclusions: The updated collection of arCOGs is expected to become a key resource for comparative genomics, evolutionary reconstruction and functional annotation of new archaeal genomes. Given that, in spite of the major increase in the number of genomes, the conserved core of archaeal genes appears to be stabilizing, the major evolutionary trends revealed here have a chance to stand the test of time.
Archaea; Orthologs; Horizontal gene transfer
Background
A genome-wide evolutionary classification of genes is
essential for the entire enterprise of genomics including
both functional annotation and evolutionary
reconstruction. The construction of such a classification for a large
set of diverse genomes is never an easy task due to the
complexity of evolutionary relationships between genes,
to which gene duplication, gene loss and horizontal gene
transfer (HGT) all make major contributions. The
interplay of all these evolutionary processes makes accurate
delineation of orthologous and paralogous relationships
between genes extremely complicated [1-3]. Accurate
identification of orthologs and paralogs is central to
functional characterization of genomes because
orthologs typically occupy the same functional niche in
different organisms whereas paralogs undergo functional
diversification after duplication via the processes of
neofunctionalization and subfunctionalization [3-5]. Clear
differentiation between orthologs and paralogs is equally
important for the reconstruction of evolutionary
scenarios [6-9].
In principle, orthologous and paralogous relationships
between genes have to be disentangled by means of
comprehensive phylogenetic analysis of entire families of
homologous genes in the compared genomes [2,10-13].
However, for the case of numerous, diverse genomes,
such comprehensive phylogenomic analysis remains
both an extremely labor-intensive and an error-prone
process. Accordingly, several methods have been
developed that aim at the identification of sets of likely
orthologs without performing comprehensive phylogenetic
analysis; benchmark comparisons indicate that some of
these methods perform as well as, and in some cases
better than, phylogenomic approaches [1,14-16]. Generally,
these non-phylogenomic approaches to orthology
inference are based on partitioning graphs of
genome-specific best hits for all genes (typically compared in the
form of protein sequences) from the analyzed set of
genomes. The key underlying assumption of this approach
is that the sequences of orthologous genes are more
similar to each other than to the sequences of any other
genes from the compared genomes.
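The best-hit idea described above can be illustrated with a minimal sketch. It assumes that the best hit of every gene in every other genome has already been computed from protein sequence comparisons; the names `best_hit`, `genome_of`, `symmetric_best_hit_edges` and `cluster_genes` are hypothetical and chosen only for this example. The sketch keeps only symmetric (bidirectional) best hits as graph edges and partitions the graph into connected components, which is a deliberate simplification and not the actual COG construction procedure.

```python
from collections import defaultdict


def symmetric_best_hit_edges(best_hit, genome_of):
    """Return edges between genes that are each other's best hit.

    best_hit maps (query_gene, target_genome) -> best-scoring gene in target_genome.
    genome_of maps gene -> genome it belongs to.
    """
    edges = set()
    for (query, _target_genome), hit in best_hit.items():
        # Look up the best hit of `hit` back in the query's own genome.
        back = best_hit.get((hit, genome_of[query]))
        if back == query:
            edges.add(frozenset((query, hit)))
    return edges


def cluster_genes(genome_of, edges):
    """Partition genes into connected components of the edge set (union-find)."""
    parent = {gene: gene for gene in genome_of}

    def find(gene):
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for edge in edges:
        a, b = tuple(edge)
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for gene in genome_of:
        groups[find(gene)].add(gene)
    return sorted((sorted(members) for members in groups.values()),
                  key=lambda members: members[0])


# Toy data (invented for illustration): three genomes A, B, C.
# a1, b1 and c1 are mutual best hits; a2 is a paralog in genome A whose
# best hits point to b1 and c1 but are not reciprocated.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
best_hit = {
    ("a1", "B"): "b1", ("a1", "C"): "c1",
    ("b1", "A"): "a1", ("b1", "C"): "c1",
    ("c1", "A"): "a1", ("c1", "B"): "b1",
    ("a2", "B"): "b1", ("a2", "C"): "c1",
}

edges = symmetric_best_hit_edges(best_hit, genome_of)
clusters = cluster_genes(genome_of, edges)
print(clusters)
```

In this toy case the three mutual best hits end up in one cluster, whereas the paralog a2 remains a singleton because its best hits are not reciprocated. This is exactly the situation in which the assumption that orthologs are more similar to each other than to any other genes does the work.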
The best hit graph approach, supplemented by
additional procedures for detecting co-orthologous gene sets
and for treating genes encoding multidomain proteins,
was first implemented in the Clusters of Orthologous
Groups (COGs) of proteins [17]; the acronym COG has
been subsequently reinterpreted to simply denote
Clusters of Orthologous Genes [3]. The original COG set of
1997 included only 7 complete genomes, all that were
available at the time [17]. The latest comprehensive
COG collection, released in 2003, incorporated ~70% of
the protein-coding genes from 69 genomes of
prokaryotes and unicellular eukaryotes [18]. The COGs have
been extensively used for functional annotation of new
genomes (e.g., [19,20]); comparative analysis of gene
neighborhoods [21-23] and other connections between
genes, as implemented in the popular STRING tool [24];
target selection in structural genomics (e.g., [25]); and
various genome-wide evolutionary analyses [6,8].
Subsequently, the COGs have been employed as the seed for
the EggNOG database that was constructed using
improved algorithms for graph-based automatic
construction of orthologous gene clusters [26,27].
The methods for the construction of COGs and other,
similar clusters of putative orthologous genes cannot
guarantee correct identification of the orthologous and
paralogous relationships between genes due to the
aforementioned complexity of the evolutionary processes.
The original COG analysis of small numbers of genomes
involved the final step of manual curation that was
important for detecting and resolving problems that were
not adequately addressed by the automatic procedure.
This step ceased to be feasible with the rapid increase in
the number of sequenced genomes, while the
computational cost of the analysis increased steeply.
Therefore, along with the development of improved,
lower-complexity algorithms for identification of orthologous
gene clusters [1,15,16], several smaller scale projects
have been conducted in which COGs were constructed,
annotated and analyzed in detail for compact groups of
bacteria such as the T (...truncated)