The PhyloFacts FAT-CAT web server: ortholog identification and function prediction using fast approximate tree classification

Nucleic Acids Research, 2013, Vol. 41, Web Server issue, W242–W248
doi:10.1093/nar/gkt399
Published online 18 May 2013

Cyrus Afrasiabi¹, Bushra Samad², David Dineen¹, Christopher Meacham¹ and Kimmen Sjölander¹,²,*

¹QB3 Institute, University of California, Berkeley, Berkeley, CA 94720-1762, USA and ²Department of Bioengineering, University of California, Berkeley, Berkeley, CA 94720-1762, USA

Received February 9, 2013; Revised April 8, 2013; Accepted April 19, 2013

*To whom correspondence should be addressed. Tel: +1 510 642 9932; Fax: +1 510 642 5835. Present address: Kimmen Sjölander, Department of Bioengineering, University of California, Berkeley, Berkeley, CA 94720-1762, USA.

© The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The PhyloFacts ‘Fast Approximate Tree Classification’ (FAT-CAT) web server provides a novel approach to ortholog identification using subtree hidden Markov model-based placement of protein sequences to phylogenomic orthology groups in the PhyloFacts database. Results on a data set of microbial, plant and animal proteins demonstrate FAT-CAT’s high precision at separating orthologs and paralogs and robustness to promiscuous domains. We also present results documenting the precision of ortholog identification based on subtree hidden Markov model scoring. The FAT-CAT phylogenetic placement is used to derive a functional annotation for the query, including confidence scores and drill-down capabilities. PhyloFacts’ broad taxonomic and functional coverage, with >7.3 M proteins from across the Tree of Life, enables FAT-CAT to predict orthologs and assign function for most sequence inputs. Four pipeline parameter presets are provided to handle different sequence types, including partial sequences and proteins containing promiscuous domains; users can also modify individual parameters. PhyloFacts trees matching the query can be viewed interactively online using the PhyloScope Javascript tree viewer and are hyperlinked to various external databases. The FAT-CAT web server is available at http://phylogenomics.berkeley.edu/phylofacts/fatcat/.

INTRODUCTION

FAT-CAT (Fast Approximate Tree Classification) is a web server for protein functional annotation and identification of orthologs. Orthology relationships are used in many bioinformatics analyses, including functional annotation of genomes, phylogenetic profile construction, prediction of protein–protein interaction and phylogenetic studies. The FAT-CAT web server achieves broad taxonomic and functional coverage by making use of pre-calculated phylogenetic trees in the PhyloFacts database (1). FAT-CAT precision is due to the use of hidden Markov models (HMMs) at every node of every tree, allowing highly flexible prediction of function at all levels of a functional hierarchy.

PhyloFacts includes trees for many Pfam-A domains (2) and multi-domain architectures (MDAs), with >7.3 M proteins from across the Tree of Life clustered into 92.8 K families. PhyloFacts integrates experimental and annotation data from different resources including SwissProt, the Gene Ontology, Pfam, BioCyc, Enzyme Commission and third-party orthology databases. These data are used to derive a profile of functional descriptions at each subtree node in the PhyloFacts database and to provide functional annotations for user-supplied query sequences.

The input to the FAT-CAT web server is a protein sequence; the maximum sequence length allowed is 2000 amino acids.
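The input constraint stated above (a single protein sequence of at most 2000 amino acids) could be checked on the client side before submitting a query. The following is a minimal sketch only, not the server's own validation: the function names, the acceptance of FASTA-formatted input and the allowed residue alphabet are all assumptions made for illustration.

```python
MAX_QUERY_LENGTH = 2000  # maximum query length stated in the text

# Assumption: the 20 standard amino-acid letters plus X for an unknown residue.
# The alphabet actually accepted by the server is not given in the text.
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")


def parse_single_sequence(text):
    """Return (header, sequence) from a one-record FASTA string or a bare sequence."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    header = ""
    if lines[0].startswith(">"):
        header = lines[0][1:].strip()
        lines = lines[1:]
    # A second '>' line would mean more than one record was pasted.
    if any(ln.startswith(">") for ln in lines):
        raise ValueError("expected exactly one sequence")
    return header, "".join(lines).upper()


def check_query(text):
    """Validate a hypothetical query; return (header, sequence) or raise ValueError."""
    header, sequence = parse_single_sequence(text)
    if not sequence:
        raise ValueError("no residues found")
    unexpected = sorted(set(sequence) - ALLOWED_RESIDUES)
    if unexpected:
        raise ValueError("unexpected characters: " + "".join(unexpected))
    if len(sequence) > MAX_QUERY_LENGTH:
        raise ValueError(
            f"sequence has {len(sequence)} residues; the limit is {MAX_QUERY_LENGTH}"
        )
    return header, sequence
```

For example, `check_query(">q1\nMKTAYIAK")` returns `("q1", "MKTAYIAK")`, while a 2001-residue input raises `ValueError`.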
The FAT-CAT pipeline proceeds through a series of analyses to select a set of subtrees from which candidate orthologs are identified and functional annotations are derived. Four preset pipeline parameter options (high recall, high precision, remote homolog detection and partial sequence search) are designed to handle different types of inputs and to accommodate user preferences for either high recall or high precision. FAT-CAT default parameters are set for high recall, as these are effective on most inputs and are robust to small gene model errors and/or structural variants. High-precision parameter settings restrict predicted orthologs to those that align globally to the query with high sequence identity; we recommend these settings when the query [...]

MATERIALS AND METHODS

The PhyloFacts database

PhyloFacts 3.0 includes >7.3 M proteins from 99 K unique taxa across Bacteria, Archaea and Eukarya clustered into 92.8 K protein families. The number of sequences per genome in PhyloFacts follows a power law: 95.6 K taxa have ≤100 sequences, 1.2 K have between 101 and 1 K, and 2.2 K genomes have >1 K sequences each. PhyloFacts families represent both individual Pfam domains and MDAs (multi-domain architectures, homology clusters where sequences align globally); sequences are drawn from the UniProt database, including both whole and partly sequenced genomes. Each family has a multiple sequence alignment (MSA), phylogenetic tree, predicted orthologs, HMMs and annotation data drawn from various sources.

The PhyloFacts 3.0 library construction pipeline differs slightly from the pipeline used in release 2.0 (1): first, PhyloFacts 3.0 includes trees for Pfam domains (described later in the text); second, PhyloFacts 3.0 trees are constructed using FastTree (3). PhyloFacts family pages and the overall website have been redesigned for easier navigation and interpretation.
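The taxon counts quoted above for PhyloFacts 3.0 can be cross-checked with a few lines of arithmetic. This sketch uses only the three figures given in the text (in thousands of taxa); reading the first bin as "at most 100 sequences" is an assumption based on the neighbouring bins, and the variable and function names are illustrative.

```python
# Taxa per bin, in thousands, as quoted in the text.
# Assumption: the first bin means "at most 100 sequences per taxon".
TAXA_BY_BIN_K = {
    "at most 100 sequences": 95.6,
    "101 to 1 K sequences": 1.2,
    "more than 1 K sequences": 2.2,
}


def total_taxa_k(bins):
    """Sum the per-bin counts (thousands of taxa)."""
    return sum(bins.values())


def bin_fractions(bins):
    """Return the share of all taxa that falls into each bin."""
    total = total_taxa_k(bins)
    return {label: count / total for label, count in bins.items()}


if __name__ == "__main__":
    print(f"total taxa: {total_taxa_k(TAXA_BY_BIN_K):.1f} K")
    for label, share in bin_fractions(TAXA_BY_BIN_K).items():
        print(f"{label}: {share:.1%}")
```

The three bins sum to the 99 K unique taxa stated in the text, and roughly 97% of taxa fall into the smallest bin, which is the heavily skewed shape the text describes.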
PhyloFacts MDA families

We use the FlowerPower algorithm (4) to cluster sequences sharing a common MDA. FlowerPower is an iterated homology clustering algorithm that uses Subfamily Classification in Phylogenomics (SCI-PHY) (5) to identify subfamilies and subfamily HMMs to select and align new sequences. In each iteration, as new sequences are retrieved and aligned, FlowerPower examines the alignment of candidate family members for agreement with the family consensus structure. The resulting cluster of homologs has both high precision and recall in clustering sequences into MDA classes (4).

PhyloFacts-Pfam

We provide trees for Pfam domains, and orthologs derived based on analysis of these trees, for two reasons. First, Pfam domains provide important clues to the functions of proteins. Second, domain-based phylogenies are often better resolved than those based on multiple sequence alignments requiring sequences to align globally. A stringent requirement of global alignment can reject actual orthologs having relatively small gene model errors. Constructing trees for Pfam domains (requiring only local matches within proteins) allows us to relax th (...truncated)