PHOG: a database of supergenomes built from proteome complements
Igor V Merkeev (2), Pavel S Novichkov (0, 1), Andrey A Mironov (0)
0 Department of Bioengineering and Bioinformatics, Moscow State University, Vorob'evy gory, 1-73, Moscow, 119992, Russia
1 National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
2 State Scientific Center GosNIIGenetica, 1st Dorozhny pr., 1, Moscow, 113545, Russia
Background: Orthologs and paralogs are widely used terms in modern comparative genomics. Existing procedures for resolving orthologous/paralogous relationships are often based on manual revision of clusters of orthologous groups and/or lack any rigorous evolutionary base.
Description: We developed a completely automated procedure that creates clusters of orthologous groups at each node of the taxonomy tree (PHOGs - Phylogenetic Orthologous Groups). As a result of this procedure, a tree of orthologous groups was obtained. Each cluster is a "supergene" and it is represented by an "ancestral" sequence obtained from the multiple alignment of orthologous and paralogous genes. The procedure has been applied to the taxonomy tree of organisms from all three domains of life. Protein complements from 50 bacterial, archaeal and eukaryotic species were used to create PHOGs at all tree nodes. 51367 PHOGs were obtained at the root node.
Conclusion: The PHOG database demonstrates that it is possible to automatically process any number of sequenced genomes and to reconstruct orthologous and paralogous relationships between genomes using a rigorous evolutionary approach. This database can become a very useful tool in various areas of comparative genomics.
Background
Evolutionary forces acting on genomes result in gene
duplications, gene losses and gene acquisitions.
Generally, it is difficult to reconstruct the exact evolutionary
history of a protein family due to its complex nature. A
widely used approach to study such history is to find
orthologous groups by comparing completely sequenced
genomes. This approach resulted in several databases
[1-4] that helped to predict protein function and provided
deep insights into protein evolution. These
procedures, however, did not fully take into account the
taxonomy tree of organisms.
Orthologs are genes derived from a single ancestral gene
as a result of the speciation event, while paralogs are genes
that result from gene duplication events [5-7]. The
usefulness of orthologs and paralogs in modern genomics
comes from the fact that the products of orthologs
generally perform the same function while the products of
paralogs perform a similar function. We can give several
examples of how the knowledge of orthologs and paralogs
helped to solve some difficult issues. Comparative studies
of bacterial transcriptional regulation often use orthologs,
assuming that orthologs tend to be regulated in the same
way [8-10]. It is possible to predict functional coupling
between genes if orthologs of genes forming a functional
cluster in one organism also form a cluster in another
organism [11]. Leonid Mirny and Mikhail Gelfand [12]
found specificity-determining positions in the LacI/PurR
family of bacterial transcription factors by looking for
residues that are conserved among orthologs and
different in paralogs. Orthologs and paralogs also help to
understand evolution by gene duplication, which is
thought to be a major force in creating organismal
complexity [13,14]. If clusters of orthologous groups are
found that contain mainly genes from a particular group
of organisms [15,16], it is possible to better understand
the physiology specific to this group of organisms.
Figure 1. Evolution by gene duplication. Nodes N1, N2, N3 represent
speciation events resulting in orthologs. Filled circles
mark gene duplication events resulting in paralogs.
Fig. 1 shows what issues might arise when resolving the
orthologous/paralogous relationships between genes. An
ancestral gene A creates a family of genes A1, A2, A3, A4, A5,
A6, A7 by three speciation events N1, N2, N3 and two gene
duplication events. The real evolution of gene families is
far more complex than this simple example, creating a
complex network of orthologs and paralogs. A gene is
considered to be an ortholog or a paralog relative to a
particular node N of the evolutionary tree if its ancestor at the
child node following the node N is a result of a speciation
event or a gene duplication event, respectively. For
instance, the gene A3 is an ortholog of the gene A5 since
both are the result of the speciation event that occurred at
the node N3, while this gene is a paralog of the gene
A1 because it is the result of a gene duplication event
that occurred after the speciation event at the node N2. How
can we resolve these relationships for hundreds of
organisms having thousands of genes? To correctly resolve
orthologs and paralogs, we suggest that clusters of
orthologous genes should be defined at each node of the
taxonomy tree of organisms. Indeed, if such clusters are
obtained for the tree in Fig. 1, then it will be clearer how
to reconstruct the evolutionary history of the protein
family A. At the node N3, the genes A3, A5 will form one
independent orthologous group since they were derived from
some ancestral gene A35, and the genes A4, A6 will form
another independent orthologous group since they were
derived from some ancestral gene A46. We can consider
the pairwise alignment built from A3 and A5 as a
representative of their ancestral gene A35. The same is true for the
genes A4 and A6. Extending this idea of grouping genes to
represent their ancestors, we can say that at the node N2
the genes A1, A2, A3, A4, A5 and A6 will form their own
independent orthologous group. In this orthologous
group the gene A1 and the orthologous group (A4, A6)
from the node N3 will be orthologs, and the gene A2 and
the orthologous group (A3, A5) from the node N3 will be
paralogs.
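The nested grouping described for Fig. 1 can be written down as a small data structure. The following is only an illustrative sketch: the class name `Group` and its fields are assumptions made for this example and do not describe the actual PHOG database schema. It records, for each node, which members are treated as orthologs and which as paralogs, and allows a group from a child node to appear as a member of a group at its parent node.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Group:
    """An orthologous group defined at one node of the taxonomy tree.

    Hypothetical structure for illustration only.
    """
    node: str
    # Members are plain gene names or groups inherited from a child node.
    orthologs: List[Union[str, "Group"]] = field(default_factory=list)
    paralogs: List[Union[str, "Group"]] = field(default_factory=list)

    def genes(self) -> List[str]:
        """Return all leaf genes contained in this group, sorted by name."""
        found: List[str] = []
        for member in self.orthologs + self.paralogs:
            if isinstance(member, Group):
                found.extend(member.genes())
            else:
                found.append(member)
        return sorted(found)


# Groups at node N3, as described in the text.
a35 = Group("N3", orthologs=["A3", "A5"])
a46 = Group("N3", orthologs=["A4", "A6"])

# Group at node N2: A1 and (A4, A6) are orthologs,
# A2 and (A3, A5) are paralogs.
n2 = Group("N2", orthologs=["A1", a46], paralogs=["A2", a35])

print(n2.genes())
```

Because each group keeps its child groups as members, walking down from any node recovers the genes it represents, which mirrors the idea that a group stands in for the ancestral gene at that node.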
Our procedure is based on the direct definition of
orthologs and paralogs and utilizes the following idea. If
we have several species with their proteomes at one node
of the taxonomy tree of organisms, we can find orthologs
by running a similarity search procedure (e.g. BLAST)
between each pair of species, find bi-directional best hits
(BBHs), and choose orthologs from BBHs using some
system of rules. Then it is possible to find paralogs in each
species by finding genes that are not declared orthologs
and that have a statistically significant best hit to an
already found orthologous group. Then we can form a
new "genome", putting into it all orthologous families
and genes that did not find any match. Since this new
"genome" is an artificial construct and it includes all genes
from both species, this new genome is called a
supergenome built from protein complements of both species. In
the same way, we can also find orthologs and paralogs
between two supergenomes and build a next level
supergenome. Repeating the proced (...truncated)
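The steps outlined above (best hits, bi-directional best hits, paralog assignment, supergenome) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the similarity scores are given as a hypothetical dictionary rather than produced by BLAST, every BBH is accepted as an ortholog pair because the "system of rules" is not specified here, and a plain score threshold `min_score` stands in for statistical significance. All function names are invented for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical input: (query gene, target gene) -> similarity score,
# higher is better. In practice these would come from a similarity search.
Scores = Dict[Tuple[str, str], float]


def best_hit(scores: Scores, query: str, targets: List[str]) -> Optional[str]:
    """Return the highest-scoring target for a query, or None if no hit."""
    hits = [(scores[(query, t)], t) for t in targets if (query, t) in scores]
    return max(hits)[1] if hits else None


def bidirectional_best_hits(scores: Scores, genes_a: List[str],
                            genes_b: List[str]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    pairs = []
    for a in genes_a:
        b = best_hit(scores, a, genes_b)
        if b is not None and best_hit(scores, b, genes_a) == a:
            pairs.append((a, b))
    return pairs


def build_supergenome(scores: Scores, genes_a: List[str],
                      genes_b: List[str], min_score: float) -> List[List[str]]:
    """Combine two gene sets into groups: orthologs, paralogs, singletons."""
    pairs = bidirectional_best_hits(scores, genes_a, genes_b)
    groups = [[a, b] for a, b in pairs]
    # Map each ortholog to the index of its group.
    owner = {gene: i for i, pair in enumerate(pairs) for gene in pair}
    singletons = []
    for gene in genes_a + genes_b:
        if gene in owner:
            continue
        # Assumed rule: a non-ortholog joins a group as a paralog when its
        # best hit among the orthologs reaches the threshold.
        hit = best_hit(scores, gene, list(owner))
        if hit is not None and scores[(gene, hit)] >= min_score:
            groups[owner[hit]].append(gene)
        else:
            singletons.append([gene])  # gene without any match
    return groups + singletons


# Toy data, invented for illustration.
toy_scores = {
    ("a1", "b1"): 90.0, ("b1", "a1"): 90.0,
    ("a2", "b1"): 60.0, ("b1", "a2"): 60.0,
}
supergenome = build_supergenome(toy_scores, ["a1", "a2", "a3"],
                                ["b1", "b2"], min_score=50.0)
print(supergenome)
```

In the toy data, `a1` and `b1` are each other's best hits and form an orthologous group, `a2` joins that group as a paralog, and `a3` and `b2` remain as single-gene entries of the resulting supergenome.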