TreeFam: 2008 Update
Published online 1 December 2007
Nucleic Acids Research, 2008, Vol. 36, Database issue D735–D740
doi:10.1093/nar/gkm1005
Jue Ruan1, Heng Li2, Zhongzhong Chen1, Avril Coghlan2, Lachlan James M. Coin3,
Yiran Guo1, Jean-Karim Hériché2, Yafeng Hu1, Karsten Kristiansen4, Ruiqiang Li1,4,
Tao Liu1, Alan Moses2, Junjie Qin1, Søren Vang5, Albert J. Vilella6, Abel Ureta-Vidal6,
Lars Bolund1,7, Jun Wang1,4,7 and Richard Durbin2,*
1Beijing Institute of Genomics of the Chinese Academy of Sciences, Beijing Genomics Institute, Beijing 101300,
China, 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA,
3Department of Epidemiology & Public Health, Imperial College, St Mary’s Campus, Norfolk Place, London W2
1PG, UK, 4Department of Biochemistry and Molecular Biology, University of Southern Denmark, DK-5230
Odense M, 5Research Unit for Molecular Medicine, Aarhus University Hospital and Faculty of Health Sciences,
University of Aarhus, DK-8200 Aarhus N, Denmark, 6EMBL-European Bioinformatics Institute, Hinxton, Cambridge,
UK and 7Institute of Human Genetics, University of Aarhus, DK-8000 Aarhus C, Denmark
ABSTRACT
TreeFam (http://www.treefam.org) was developed
to provide curated phylogenetic trees for all animal
gene families, as well as orthologue and paralogue
assignments. Release 4.0 of TreeFam contains
curated trees for 1314 families and automatically
generated trees for another 14 351 families. We have
expanded TreeFam to include 25 fully sequenced
animal genomes, as well as four genomes from
plant and fungal outgroup species. We have also
introduced more accurate approaches for automatically grouping genes into families, for building
phylogenetic trees, and for inferring orthologues
and paralogues. The user interface for viewing
phylogenetic trees and family information has been
improved. Furthermore, a new Perl API lets users
easily extract data from the TreeFam MySQL
database.
INTRODUCTION
Biologists studying a gene in one model organism often
wish to transfer functional information between species.
To do this, it is essential to know how the gene is related to
other genes in a family. Using a phylogenetic tree, it is
possible to infer orthologues (related genes in different
species that diverged at a speciation event) and paralogues
(related genes that originated via a duplication event
within a species) (1).
Fitch originally defined orthologues in terms of a
phylogenetic tree of a gene family (1). It is now well
established that analysis of
phylogenetic trees is a very accurate way to determine
orthology (2,3), which led us to develop the TreeFam
database and accompanying website in 2005 (4). TreeFam
aims to be a curated database of phylogenetic trees of all
animal gene families, focusing on gene sets from animals
with completely sequenced genomes. In TreeFam, orthologues and paralogues are inferred from the phylogenetic
tree of a gene family. Tree-based inference of orthologues
is more robust to rate differences than BLAST-based
orthologue inference, which has been used in other
databases such as InParanoid (5), KOGs (6),
HomoloGene (7) and OrthoMCL-DB (8). Furthermore,
tree-based results can be easily visualized and for some
purposes are more informative, since gene losses and
duplications can be inferred and dated on a tree.
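To make the tree-based definition concrete, the following minimal Python sketch labels every pair of genes in a small gene tree by the event at their last common ancestor: a speciation node gives orthologues, a duplication node gives paralogues. It is an illustration only and is not TreeFam's inference code; the Node class, the classify_pairs function and the toy gene names are hypothetical, and the speciation/duplication labels on internal nodes are assumed to be already known.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    """A gene-tree node (hypothetical structure for illustration)."""
    name: str
    event: str = "leaf"  # "speciation", "duplication" or "leaf"
    children: list = field(default_factory=list)


def leaves(node):
    """Return the names of all leaf genes below a node."""
    if not node.children:
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaves(child))
    return names


def classify_pairs(node, result=None):
    """Label each leaf pair by the event at its last common ancestor."""
    if result is None:
        result = {}
    if not node.children:
        return result
    relation = "orthologue" if node.event == "speciation" else "paralogue"
    # Genes in different child subtrees have this node as their
    # last common ancestor.
    child_leaves = [leaves(child) for child in node.children]
    for left, right in combinations(child_leaves, 2):
        for a in left:
            for b in right:
                result[frozenset((a, b))] = relation
    for child in node.children:
        classify_pairs(child, result)
    return result


# Toy family: an ancient duplication produced copies A and B,
# and each copy later diverged with the human/mouse speciation.
tree = Node("root", "duplication", [
    Node("A", "speciation", [Node("human_A"), Node("mouse_A")]),
    Node("B", "speciation", [Node("human_B"), Node("mouse_B")]),
])

pairs = classify_pairs(tree)
for pair, relation in sorted(pairs.items(), key=lambda kv: sorted(kv[0])):
    print(" / ".join(sorted(pair)), "->", relation)
```

In this toy family, human_A and mouse_A are orthologues, whereas human_A and human_B are paralogues. In practice the event labels are not given in advance and are typically obtained by comparing the gene tree with the species tree.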
In addition to the databases mentioned above, many
other databases provide animal gene families on the
genome-wide scale, such as PANTHER (9), Phylofacts
(10), PhIGs (11) and SYSTERS (12). They usually display
the phylogenetic trees, but most do not computationally
infer orthologues from the gene trees. Like TreeFam,
a few databases explicitly predict orthologues based on
phylogenetic trees. These include HOGENOM (13) and
PhylomeDB (14). While HOGENOM allows users to
calculate the orthologues on the fly with a program that
connects to their database, PhylomeDB presents orthologues as directly searchable results. Furthermore, Ensembl
now collaborates with TreeFam, and uses the same
*To whom correspondence should be addressed. Tel: +44 (0) 1223 834244; Fax: +44 (0) 1223 494919; Email:
Correspondence may also be addressed to Jun Wang. Tel: +86 (0) 10 804 81664; Fax: +86 (0) 10 804 98676; Email:
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Received September 14, 2007; Revised October 21, 2007; Accepted October 23, 2007
tree-building and orthologue inference algorithms (15). It
is clear that the tree-based methods are theoretically
attractive, but building accurate gene trees remains a
major challenge.
In this update, we have expanded TreeFam to include
25 fully sequenced animal genomes and four outgroup
genomes. Furthermore, we have made many software
improvements since the first release of TreeFam. These
include (i) new algorithms for phylogenetic inference, (ii) a
more user-friendly website and (iii) a Perl interface (API)
to the publicly available database. With these new
features, TreeFam is an even more useful resource for
identifying orthologues and paralogues in animal species
and for studying evolution of animal gene families.
MATERIALS AND METHODS
Seventeen new species have been added since TreeFam v1
(4). TreeFam v4 contains predicted protein sequences
from the fully sequenced genomes of 25 animal species:
human, chimpanzee, macaque, mouse, rat, cow, dog,
opossum, chicken, frog, two pufferfish (Takifugu and
Tetraodon), zebrafish, medaka, stickleback, sea squirts
(Ciona intestinalis and C. savignyi), two fruit-flies
(Drosophila melanogaster and D. pseudoobscura), two
mosquitoes (Aedes aegypti and Anopheles gambiae), the
flatworm Schistosoma mansoni, and the nematodes
Caenorhabditis elegans, C. briggsae and C. remanei. In
addition, four outgroup genomes are included: baker’s
yeast, fission yeast, rice and thale cress (Arabidopsis).
The C. briggsae and C. remanei proteins were downloaded from WormBase (16), D. pseudoobscura proteins
from FlyBase (17), fission yeast and flatworm proteins
from GeneDB (18), thale cress proteins from TIGR (19),
rice proteins from the Beijing Genomics Institute (20) and
the remaining sequences from Ensembl (15). In addition to
these species, TreeFam includes UniProt (21) proteins
from animal species whose genomes have not been fully
sequenced. For TreeFam v4, all sequences were downloaded in October 2006.
Overall strategy
TreeFam is a two-part database: a first part consisting of
automatically generated trees (TreeFam-B) and a second
part that (...truncated)