Updated clusters of orthologous genes for Archaea: a complex ancestor of the Archaea and the byways of horizontal gene transfer

Article PDF: http://www.biologydirect.com/content/pdf/1745-6150-7-46.pdf

Biology Direct

Yuri I Wolf, Kira S Makarova, Natalya Yutin, Eugene V Koonin

National Center for Biotechnology Information, NLM, National Institutes of Health, Bethesda, MD 20894, USA

Background: Collections of Clusters of Orthologous Genes (COGs) provide indispensable tools for comparative genomic analysis, evolutionary reconstruction and functional annotation of new genomes. Initially, COGs were made for all complete genomes of cellular life forms that were available at the time. However, with the accumulation of thousands of complete genomes, construction of a comprehensive COG set has become extremely computationally demanding and prone to error propagation, necessitating the switch to taxon-specific COG collections. Previously, we reported the collection of COGs for 41 genomes of Archaea (arCOGs). Here we present a major update of the arCOGs and describe evolutionary reconstructions to reveal general trends in the evolution of Archaea.

Results: The updated version of the arCOG database incorporates 91% of the pangenome of 120 archaea (251,032 protein-coding genes altogether) into 10,335 arCOGs. Using this new set of arCOGs, we performed maximum likelihood reconstruction of the genome content of archaeal ancestral forms and gene gain and loss events in archaeal evolution. This reconstruction shows that the last common ancestor of the extant Archaea was an organism of greater complexity than most of the extant archaea, probably with over 2,500 protein-coding genes. The subsequent evolution of almost all archaeal lineages was apparently dominated by gene loss resulting in genome streamlining. Overall, in the evolution of Archaea as well as a representative set of bacteria that was similarly analyzed for comparison, gene losses are estimated to outnumber gene gains at least 4 to 1.
Analysis of specific patterns of gene gain in Archaea shows that, although some groups, in particular Halobacteria, acquire substantially more genes than others, on the whole, gene exchange between major groups of Archaea appears to be largely random, with no major 'highways' of horizontal gene transfer.

Conclusions: The updated collection of arCOGs is expected to become a key resource for comparative genomics, evolutionary reconstruction and functional annotation of new archaeal genomes. Given that, in spite of the major increase in the number of genomes, the conserved core of archaeal genes appears to be stabilizing, the major evolutionary trends revealed here have a chance to stand the test of time.

Keywords: Archaea; Orthologs; Horizontal gene transfer

Background

A genome-wide evolutionary classification of genes is essential for the entire enterprise of genomics, including both functional annotation and evolutionary reconstruction. The construction of such a classification for a large set of diverse genomes is never an easy task because of the complexity of evolutionary relationships between genes, to which gene duplication, gene loss and horizontal gene transfer (HGT) all make major contributions. The interplay of all these evolutionary processes makes accurate delineation of orthologous and paralogous relationships between genes extremely complicated [1-3]. Accurate identification of orthologs and paralogs is central to functional characterization of genomes because orthologs typically occupy the same functional niche in different organisms, whereas paralogs undergo functional diversification after duplication via the processes of neofunctionalization and subfunctionalization [3-5]. Clear differentiation between orthologs and paralogs is equally important for the reconstruction of evolutionary scenarios [6-9].
In principle, orthologous and paralogous relationships between genes have to be disentangled by means of comprehensive phylogenetic analysis of entire families of homologous genes in the compared genomes [2,10-13]. However, for the case of numerous, diverse genomes, such comprehensive phylogenomic analysis remains both an extremely labor-intensive and an error-prone process. Accordingly, several methods have been developed that aim at the identification of sets of likely orthologs without performing comprehensive phylogenetic analysis; benchmark comparisons indicate that some of these methods perform as well as, and in some cases better than, phylogenomic approaches [1,14-16].

Generally, these non-phylogenomic approaches to orthology inference are based on partitioning graphs of genome-specific best hits for all genes (typically, compared in the form of protein sequences) from the analyzed set of genomes. The key underlying assumption of this approach is that the sequences of orthologous genes are more similar to each other than to the sequences of any other genes from the compared genomes. The best hit graph approach, supplemented by additional procedures for detecting co-orthologous gene sets and for treating genes encoding multidomain proteins, was first implemented in the Clusters of Orthologous Groups (COGs) of proteins [17]; the acronym COG has been subsequently reinterpreted to simply denote Clusters of Orthologous Genes [3]. The original COG set of 1997 included only 7 complete genomes, all that were available at the time [17]. The latest comprehensive COG collection, released in 2003, incorporated ~70% of the protein-coding genes from 69 genomes of prokaryotes and unicellular eukaryotes [18].
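The idea of partitioning a graph of genome-specific best hits, as described above, can be illustrated with a minimal Python sketch. This is not the COG construction procedure itself: it keeps only symmetric (reciprocal) best hits and takes connected components, and it omits the additional procedures for co-orthologs and multidomain proteins mentioned in the text. The function names, the dictionary-based score table and the toy gene identifiers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def genome_specific_best_hits(scores, genome_of):
    """For each query gene, keep its top-scoring hit in every other genome.

    scores    -- assumed input: {(query_gene, target_gene): similarity_score}
    genome_of -- assumed input: {gene: genome_label}
    """
    best = {}  # (query, target_genome) -> (score, target)
    for (query, target), score in scores.items():
        target_genome = genome_of[target]
        if genome_of[query] == target_genome:
            continue  # hits within the same genome are ignored in this sketch
        key = (query, target_genome)
        if key not in best or score > best[key][0]:
            best[key] = (score, target)
    return {key: target for key, (_, target) in best.items()}


def cluster_symmetric_best_hits(scores, genome_of):
    """Join genes linked by symmetric best hits into connected components."""
    best = genome_specific_best_hits(scores, genome_of)
    parent = {gene: gene for gene in genome_of}

    def find(gene):
        # Union-find lookup with path halving.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for (query, _), target in best.items():
        # An edge is kept only if the target's best hit in the query's
        # genome points back to the query.
        if best.get((target, genome_of[query])) == query:
            parent[find(query)] = find(target)

    clusters = defaultdict(set)
    for gene in genome_of:
        clusters[find(gene)].add(gene)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Toy data (invented): three genomes A, B, C and made-up similarity scores.
toy_genome_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
toy_scores = {
    ("a1", "b1"): 90, ("b1", "a1"): 90,
    ("a1", "c1"): 80, ("c1", "a1"): 80,
    ("b1", "c1"): 85, ("c1", "b1"): 85,
    ("a2", "b2"): 70, ("b2", "a2"): 70,
    ("a2", "b1"): 30, ("b1", "a2"): 30,  # weaker cross-hit between paralogs
}
toy_clusters = cluster_symmetric_best_hits(toy_scores, toy_genome_of)
```

In this toy example the weaker a2/b1 hit is not a best hit in either direction, so it does not merge the two groups. The sketch relies on the assumption stated above, that orthologs are more similar to each other than to any other genes from the compared genomes; when that assumption fails, for example after differential gene loss, a procedure this simple would misassign genes.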
The COGs have been extensively used for functional annotation of new genomes (e.g., [19,20]); comparative analysis of gene neighborhoods [21-23] and other connections between genes, as implemented in the popular STRING tool [24]; target selection in structural genomics (e.g., [25]); and various genome-wide evolutionary analyses [6,8]. Subsequently, the COGs have been employed as the seed for the EggNOG database that was constructed using improved algorithms for graph-based automatic construction of orthologous gene clusters [26,27].

The methods for the construction of COGs and other, similar clusters of putative orthologous genes cannot guarantee correct identification of the orthologous and paralogous relationships between genes because of the aforementioned complexity of the evolutionary processes. The original COG analysis of small numbers of genomes involved a final step of manual curation that was important for detecting and resolving problems that were not adequately addressed by the automatic procedure. This step ceased to be feasible with the rapid increase in the number of sequenced genomes, while the computational cost of the analysis has steeply increased. Therefore, along with the development of improved, lower-complexity algorithms for identification of orthologous gene clusters [1,15,16], several smaller-scale projects have been conducted in which COGs were constructed, annotated and analyzed in detail for compact groups of bacteria such as the T (...truncated)