The PhyloFacts FAT-CAT web server: ortholog identification and function prediction using fast approximate tree classification
W242–W248 Nucleic Acids Research, 2013, Vol. 41, Web Server issue
doi:10.1093/nar/gkt399
Published online 18 May 2013
Cyrus Afrasiabi1, Bushra Samad2, David Dineen1, Christopher Meacham1 and
Kimmen Sjölander1,2,*
1QB3 Institute, University of California, Berkeley, Berkeley, CA 94720-1762, USA and 2Department of
Bioengineering, University of California, Berkeley, Berkeley, CA 94720-1762, USA
Received February 9, 2013; Revised April 8, 2013; Accepted April 19, 2013
*To whom correspondence should be addressed. Tel: +1 510 642 9932; Fax: +1 510 642 5835; Email:
Present address:
Kimmen Sjölander, Department of Bioengineering, University of California, Berkeley, Berkeley, CA 94720-1762, USA.
© The Author(s) 2013. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
ABSTRACT
The PhyloFacts ‘Fast Approximate Tree Classification’ (FAT-CAT) web server provides a novel
approach to ortholog identification using subtree
hidden Markov model-based placement of protein
sequences to phylogenomic orthology groups in
the PhyloFacts database. Results on a data set of
microbial, plant and animal proteins demonstrate
FAT-CAT’s high precision at separating orthologs
and paralogs and robustness to promiscuous
domains. We also present results documenting the
precision of ortholog identification based on subtree
hidden Markov model scoring. The FAT-CAT phylogenetic placement is used to derive a functional annotation for the query, including confidence scores
and drill-down capabilities. PhyloFacts’ broad taxonomic and functional coverage, with >7.3 M proteins
from across the Tree of Life, enables FAT-CAT to
predict orthologs and assign function for most
sequence inputs. Four pipeline parameter presets
are provided to handle different sequence types,
including partial sequences and proteins containing
promiscuous domains; users can also modify individual parameters. PhyloFacts trees matching the
query can be viewed interactively online using the
PhyloScope JavaScript tree viewer and are hyperlinked to various external databases. The FAT-CAT
web server is available at http://phylogenomics.berkeley.edu/phylofacts/fatcat/.
INTRODUCTION
FAT-CAT (Fast Approximate Tree Classification) is a
web server for protein functional annotation and
identification of orthologs. Orthology relationships are
used in many bioinformatics analyses, including functional annotation of genomes, phylogenetic profile construction, prediction of protein–protein interaction and
phylogenetic studies. The FAT-CAT web server achieves
broad taxonomic and functional coverage by making use
of pre-calculated phylogenetic trees in the PhyloFacts
database (1). FAT-CAT precision is due to the use of
hidden Markov models (HMMs) at every node of every
tree, allowing highly flexible prediction of function at all
levels of a functional hierarchy. PhyloFacts includes trees
for many Pfam-A domains (2) and multi-domain architectures (MDAs), with >7.3 M proteins from across the Tree
of Life clustered into 92.8 K families. PhyloFacts integrates experimental and annotation data from different
resources including SwissProt, the Gene Ontology, Pfam,
BioCyc, Enzyme Commission and third-party orthology
databases. These data are used to derive a profile of functional descriptions at each subtree node in the PhyloFacts
database and to provide functional annotations for user-supplied query sequences.
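To illustrate why a model at every tree node allows a query to be placed at whichever level of the hierarchy fits it best, the following minimal Python sketch walks a toy tree and returns the top-scoring node. It is not the FAT-CAT pipeline: the `Node` class, the `best_node` function, the node names and all scores are hypothetical, and the scores stand in for HMM scores that a real system would compute with an HMM package.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    score: float  # hypothetical score of the query against this node's HMM
    children: list = field(default_factory=list)


def best_node(root):
    """Return the node with the highest score anywhere in the tree."""
    best = root
    stack = [root]
    while stack:
        node = stack.pop()
        if node.score > best.score:
            best = node
        stack.extend(node.children)
    return best


# Toy tree with made-up scores (assumption: higher means a better match).
tree = Node("family", 41.0, [
    Node("subtree_A", 63.5, [Node("leaf_A1", 58.0), Node("leaf_A2", 60.2)]),
    Node("subtree_B", 37.9, [Node("leaf_B1", 30.1)]),
])

print(best_node(tree).name)
```

In this toy case the internal node `subtree_A` outscores both the family root and its own leaves, so annotations would be drawn at that intermediate level rather than from the whole family or a single sequence.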
The input to the FAT-CAT web server is a protein
sequence; the maximum sequence length allowed is 2000
amino acids. The FAT-CAT pipeline proceeds through a
series of analyses to select a set of subtrees from which
candidate orthologs are identified and functional annotations are derived. Four preset pipeline parameter
options (high recall, high precision, remote homolog detection and partial sequence search) are designed to
handle different types of inputs and to accommodate
user preferences for either high recall or high precision.
FAT-CAT default parameters are set for high recall, as
these are effective on most inputs and are robust to small
gene model errors and/or structural variants. High-precision parameter settings restrict predicted orthologs to
those that align globally to the query with high sequence
identity; we recommend these settings when the query
MATERIALS AND METHODS
The PhyloFacts database
PhyloFacts 3.0 includes >7.3 M proteins from 99 K
unique taxa across Bacteria, Archaea and Eukarya clustered into 92.8 K protein families. The number of sequences per genome in PhyloFacts follows a power law:
95.6 K taxa have ≤100 sequences, 1.2 K have between 101
and 1 K, and 2.2 K genomes have >1 K sequences each.
PhyloFacts families represent both individual Pfam
domains and MDAs (multi-domain architectures,
homology clusters where sequences align globally); sequences are drawn from the UniProt database, including
both whole and partly sequenced genomes. Each family
has a multiple sequence alignment (MSA), phylogenetic
tree, predicted orthologs, HMMs and annotation data
drawn from various sources. The PhyloFacts 3.0 library
construction pipeline differs slightly from the pipeline used
in release 2.0 (1): first, PhyloFacts 3.0 includes trees for
Pfam domains (described later in the text); second,
PhyloFacts 3.0 trees are constructed using FastTree (3).
PhyloFacts family pages and the overall website have been
redesigned for easier navigation and interpretation.
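As a quick consistency check on the database statistics quoted above, the short Python sketch below sums the three taxon buckets and computes a rough sequences-per-family ratio. It assumes the first bucket means taxa with at most 100 sequences, and it treats 7.3 M and 92.8 K as lower-bound round figures; because a protein may appear in more than one family, the ratio is only indicative. All variable names are illustrative.

```python
# Taxon counts per size bucket, in thousands, as quoted in the text.
# Assumption: the first bucket is read as "at most 100 sequences".
taxa_by_bucket_k = {
    "<=100 sequences": 95.6,
    "101 to 1 K sequences": 1.2,
    ">1 K sequences": 2.2,
}
total_taxa_k = round(sum(taxa_by_bucket_k.values()), 1)

# Round figures from the text; the protein count is stated as ">7.3 M".
proteins = 7.3e6
families = 92.8e3
ratio = proteins / families

# Share of taxa that fall in the smallest bucket.
small_share = taxa_by_bucket_k["<=100 sequences"] / total_taxa_k

print(f"total taxa: {total_taxa_k} K")
print(f"sequences per family (rough): {ratio:.1f}")
print(f"share of taxa in the smallest bucket: {small_share:.1%}")
```

The buckets add up to the 99 K unique taxa stated in the text, and the large majority of taxa fall in the smallest bucket, consistent with the heavy-tailed distribution described.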
PhyloFacts MDA families
We use the FlowerPower algorithm (4) to cluster sequences sharing a common MDA. FlowerPower is an
iterated homology clustering algorithm that uses
Subfamily Classification in Phylogenomics (SCI-PHY)
(5) to identify subfamilies and subfamily HMMs to
select and align new sequences. In each iteration, as new
sequences are retrieved and aligned, FlowerPower
examines the alignment of candidate family members for
agreement with the family consensus structure. The
resulting cluster of homologs has both high precision
and recall in clustering sequences into MDA classes (4).
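The iterate-recruit-rebuild idea behind this kind of clustering can be sketched in a few lines of Python. This is a toy under strong assumptions, not FlowerPower: subfamily HMMs from SCI-PHY are swapped for a single majority-rule consensus string, the check against the family consensus structure is reduced to a full-length percent-identity cutoff, and the sequences are pre-aligned, equal-length strings. The function names, the cutoff and the example sequences are all hypothetical.

```python
from collections import Counter


def consensus(aligned):
    """Column-wise majority residue over equal-length aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def iterative_cluster(seed, candidates, min_identity=0.65, max_iters=10):
    """Grow a cluster from `seed`: rebuild the consensus each round and
    recruit candidates that match it over their full length."""
    members = [seed]
    remaining = list(candidates)
    for _ in range(max_iters):
        model = consensus(members)
        recruited = [s for s in remaining if identity(s, model) >= min_identity]
        if not recruited:
            break  # converged: no remaining sequence passes the cutoff
        members.extend(recruited)
        remaining = [s for s in remaining if s not in recruited]
    return members


seed = "ACDEFGHIKL"
candidates = ["ACDEFGHIRM", "ACDEFGYIRM", "ACDEFGYVRM", "WWWWWWWWWW"]
cluster = iterative_cluster(seed, candidates)
print(cluster)
```

In this example the third candidate is too divergent from the seed to be recruited in the first round, but passes the cutoff once the consensus has been rebuilt from the first two recruits; the unrelated fourth sequence is never admitted.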
PhyloFacts-Pfam
We provide trees for Pfam domains, and orthologs derived
from analysis of these trees, for two reasons. First,
Pfam domains provide important clues to the functions
of proteins. Second, domain-based phylogenies are often
better resolved than those based on multiple sequence
alignments requiring sequences to align globally. A stringent requirement of global alignment can reject actual
orthologs having relatively small gene model errors.
Constructing trees for Pfam domains (requiring only
local matches within proteins) allows us to relax th (...truncated)