PHOG: a database of supergenomes built from proteome complements

Igor V Merkeev (2), Pavel S Novichkov (0, 1), Andrey A Mironov (0)

(0) Department of Bioengineering and Bioinformatics, Moscow State University, Vorob'evy gory, 1-73, Moscow, 119992, Russia
(1) National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
(2) State Scientific Center GosNIIGenetica, 1st Dorozhny pr., 1, Moscow, 113545, Russia

Background: Orthologs and paralogs are widely used terms in modern comparative genomics. Existing procedures for resolving orthologous/paralogous relationships are often based on manual revision of clusters of orthologous groups and/or lack any rigorous evolutionary base.

Description: We developed a completely automated procedure that creates clusters of orthologous groups at each node of the taxonomy tree (PHOGs - Phylogenetic Orthologous Groups). As a result of this procedure, a tree of orthologous groups was obtained. Each cluster is a "supergene" and it is represented by an "ancestral" sequence obtained from the multiple alignment of orthologous and paralogous genes. The procedure has been applied to the taxonomy tree of organisms from all three domains of life. Protein complements from 50 bacterial, archaeal and eukaryotic species were used to create PHOGs at all tree nodes. 51367 PHOGs were obtained at the root node.

Conclusion: The PHOG database demonstrates that it is possible to automatically process any number of sequenced genomes and to reconstruct orthologous and paralogous relationships between genomes using a rigorous evolutionary approach. This database can become a very useful tool in various areas of comparative genomics.

Background

Evolutionary forces acting on genomes result in gene duplications, gene losses and gene acquisitions. Generally, it is difficult to reconstruct the exact evolutionary history of a protein family due to its complex nature.
A widely used approach to studying such history is to find orthologous groups by comparing completely sequenced genomes. This approach has produced several databases [1-4] that have helped to predict protein function and provided deep insights into protein evolution. These procedures, however, did not fully take into account the taxonomy tree of organisms.

Orthologs are genes derived from a single ancestral gene as a result of a speciation event, while paralogs are genes that result from gene duplication events [5-7]. The usefulness of orthologs and paralogs in modern genomics comes from the fact that the products of orthologs generally perform the same function, while the products of paralogs perform similar functions. Several examples show how knowledge of orthologs and paralogs has helped to solve difficult problems. Comparative studies of bacterial transcriptional regulation often use orthologs, assuming that orthologs tend to be regulated in the same way [8-10]. It is possible to predict functional coupling between genes if orthologs of genes forming a functional cluster in one organism also form a cluster in another organism [11]. Leonid Mirny and Mikhail Gelfand [12] found specificity-determining positions in the LacI/PurR family of bacterial transcription factors by looking for residues that are conserved among orthologs and differ between paralogs. Orthologs and paralogs also help to understand evolution by gene duplication, which is thought to be a major force in creating organismal complexity [13,14]. If clusters of orthologous groups are found that contain mainly genes from a particular group of organisms [15,16], it becomes possible to better understand the physiology specific to this group of organisms.

Figure 1. Evolution by gene duplication. Nodes N1, N2, N3 represent speciation events resulting in orthologs. Filled circles mark gene duplication events resulting in paralogs.

Fig. 1 shows the issues that can arise when resolving the orthologous/paralogous relationships between genes. An ancestral gene A creates a family of genes A1, A2, A3, A4, A5, A6, A7 through three speciation events N1, N2, N3 and two gene duplication events. The real evolution of gene families is far more complex than this simple example and creates a complex network of orthologs and paralogs. A gene is considered to be an ortholog or a paralog relative to a particular node N of the evolutionary tree if its ancestor at the child node following the node N is the result of a speciation event or a gene duplication event, respectively. For instance, the gene A3 is an ortholog of the gene A5, since both result from the speciation event that occurred at the node N3, while it is a paralog of the gene A1 because it results from a gene duplication event that occurred after the speciation event at the node N2. How can we resolve these relationships for hundreds of organisms having thousands of genes?

To correctly resolve orthologs and paralogs, we suggest that clusters of orthologous genes should be defined at each node of the taxonomy tree of organisms. Indeed, if such clusters are obtained for the tree in Fig. 1, it becomes clearer how to reconstruct the evolutionary history of the protein family A. At the node N3, the genes A3 and A5 will form one independent orthologous group since they were derived from some ancestral gene A35, and the genes A4 and A6 will form another independent orthologous group since they were derived from some ancestral gene A46. We can consider the pairwise alignment built from A3 and A5 as a representative of their ancestral gene A35. The same is true for the genes A4 and A6. Extending this idea of grouping genes to represent their ancestors, we can say that at the node N2 the genes A1, A2, A3, A4, A5 and A6 will form their own independent orthologous group.
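
The distinction between orthologs and paralogs drawn above can be illustrated with a minimal sketch. It assumes a gene tree whose internal nodes are already labeled as speciation or duplication events, and it classifies each pair of genes by the event at their last common ancestor. The tree, the gene names G1 to G5 and the function names are hypothetical; this is neither the tree of Fig. 1 nor the PHOG procedure itself.

```python
# Hypothetical toy gene tree (an assumption, NOT the tree of Fig. 1).
# Internal node: (event, left_subtree, right_subtree); leaf: gene name.
TOY_TREE = (
    "speciation",
    "G1",
    (
        "duplication",
        ("speciation", "G2", "G3"),
        ("speciation", "G4", "G5"),
    ),
)


def leaves(node):
    """Return the gene names found under a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label every gene pair by the event at its last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    relation = "orthologs" if event == "speciation" else "paralogs"
    # Genes on opposite sides of this node have this node as their
    # last common ancestor, so its event decides their relationship.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = relation
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


relations = classify_pairs(TOY_TREE)
for pair in sorted(sorted(p) for p in relations):
    print(pair, relations[frozenset(pair)])
```

In this toy tree, G2 and G3 are separated by a speciation node and are reported as orthologs, whereas G2 and G4 are separated by the duplication node and are reported as paralogs.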
Within the orthologous group at the node N2, the gene A1 and the orthologous group (A4, A6) from the node N3 will be orthologs, and the gene A2 and the orthologous group (A3, A5) from the node N3 will be paralogs.

Our procedure is based on the direct definition of orthologs and paralogs and utilizes the following idea. If we have several species with their proteomes at one node of the taxonomy tree of organisms, we can find orthologs by running a similarity search procedure (e.g. BLAST) between each pair of species, finding bi-directional best hits (BBHs), and choosing orthologs from the BBHs using some system of rules. It is then possible to find paralogs in each species by finding genes that were not declared orthologs and that have a statistically significant best hit to an already found orthologous group. We can then form a new "genome" containing all orthologous families together with the genes that did not find any match. Since this new "genome" is an artificial construct that includes all genes from both species, it is called a supergenome built from the protein complements of both species. In the same way, we can also find orthologs and paralogs between two supergenomes and build a next level supergenome. Repeating the proced (...truncated)