GeneSeqToFamily: a Galaxy workflow to find gene families based on the Ensembl Compara GeneTrees pipeline

GigaScience, Mar 2018

Gene duplication is a major factor contributing to evolutionary novelty, and the contraction or expansion of gene families has often been associated with morphological, physiological, and environmental adaptations. The study of homologous genes helps us to understand the evolution of gene families. It plays a vital role in finding ancestral gene duplication events as well as in identifying genes that have diverged from a common ancestor under positive selection. Various tools, such as MSOAR, OrthoMCL, and HomoloGene, are available to identify gene families and visualize syntenic information between species, providing an overview of the evolution of syntenic regions at the family level. Unfortunately, none of them provides information about structural changes within genes, such as the conservation of ancestral exon boundaries among multiple genomes. The Ensembl GeneTrees computational pipeline generates gene trees based on coding sequences, provides details about exon conservation, and is used in the Ensembl Compara project to discover gene families.


GigaScience, 7, 2018, 1–10
doi: 10.1093/gigascience/giy005
Advance Access Publication Date: 7 February 2018
Technical Note

Anil S. Thanki*, Nicola Soranzo, Wilfried Haerty and Robert P. Davey
Earlham Institute, Norwich Research Park, Norwich NR4 7UZ, UK
*Correspondence address. Anil S. Thanki, Earlham Institute, Norwich Research Park, Norwich, NR4 7UZ, UK. E-mail:

Abstract

Findings: A certain amount of expertise is required to configure and run the Ensembl Compara GeneTrees pipeline via command line. Therefore, we converted this pipeline into a Galaxy workflow, called GeneSeqToFamily, and provided additional functionality. This workflow uses existing tools from the Galaxy ToolShed, as well as providing additional wrappers and tools that are required to run the workflow.
Conclusions: GeneSeqToFamily represents the Ensembl GeneTrees pipeline as a set of interconnected Galaxy tools, so they can be run interactively within Galaxy's user-friendly workflow environment while still providing the flexibility to tailor the analysis by changing configurations and tools if necessary. Additional tools allow users to subsequently visualize the gene families produced by the workflow, using the Aequatus.js interactive tool, which has been developed as part of the Aequatus software project.

Keywords: Galaxy; Pipeline; Workflow; Genomics; Comparative Genomics; Homology; Orthology; Paralogy; Phylogeny; Gene Family; Alignment; Compara; Ensembl

Received: 30 March 2017; Revised: 31 July 2017; Accepted: 18 January 2018
© The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

The phylogenetic information inferred from the study of homologous genes helps us to understand the evolution of gene families (also referred to as "orthogroups") that comprise genes sharing common descent [1]. This plays a vital role in finding ancestral gene duplication events as well as in identifying regions under positive selection within species [2]. To support these low-level comparisons between gene families, the Ensembl Compara GeneTrees gene orthology and paralogy prediction software suite [3] was developed as a pipeline. The Ensembl GeneTrees pipeline uses TreeBeST [4, 5] (part of TreeFam [6]), which implements multiple independent phylogenetic methods and can merge the results into a consensus tree while trying to minimize duplications and deletions relative to a known species tree. This allows TreeBeST to take advantage of the fact that DNA-based methods are often more accurate for closely related parts of trees, while protein-based trees are better at longer evolutionary distances.

The Ensembl GeneTrees pipeline comprises 7 steps, starting from a set of protein sequences and performing similarity searching and multiple large-scale alignments to infer homology among them, using various tools: BLAST [7], hcluster_sg [8], T-Coffee [9], and phylogenetic tree construction tools, including TreeBeST. While these tools are freely available, most are specific to certain computing environments, are only usable via the command line, and require many dependencies to be fulfilled. Users are not always sufficiently expert in system administration to install, run, and debug the various tools at each stage in such a chain of processes.

Table 1: Galaxy tools used in the workflow

                                                                                  Developed at
                                                                                  Earlham Institute   Toolsheds
Tool                                     Tool ID                        Version   Tool     Wrapper    reference
Get sequences by Ensembl ID              get_sequences                  0.1.2     Yes      Yes        [17]
Get features by Ensembl ID               get_feature_info               0.1.2     Yes      Yes        [18]
Select longest coding sequence per gene  ensembl_longest_cds_per_gene   0.0.2     Yes      Yes        [19]
ETE species tree generator               ete_species_tree_generator     3.0.0b35  Yes      Yes        [20]
GeneSeqToFamily preparation              gstf_preparation               0.4.0     Yes      Yes        [21]
Transeq                                  EMBOSS: transeq101             5.0.0     No       No         [22]
NCBI BLAST+ makeblastdb                  ncbi_makeblastdb               0.2.01    No       No         [23]
NCBI BLAST+ blastp                       ncbi_blastp_wrapper            0.2.01    No       No         [23]
BLAST parser                             blast_parser                   0.1.2     Yes      Yes        [24]
hcluster_sg                              hcluster_sg                    0.5.1.1   No       Yes        [25]
hcluster_sg parser                       hcluster_sg_parser             ...       Yes      Yes        [26]
Filter by FASTA IDs                      filter_by_fasta_ids            ...       No       No         [27]
T-Coffee                                 t_coffee                       ...       No       Yes        [28]
Tranalign                                EMBOSS: tranalign100           ...       No       No         [22]
TreeBeST best                            treebest_best                  ...       No       Yes        [29]
Gene Alignment and Family Aggregator     gafa                           ...       Yes      Yes        [30]
Unique                                   tp_sorted_uniq                 ...       No       No         [31]
FASTA-to-Tabular                         fasta2tab                      ...       No       No         [32]
UniProt ID mapping and retrieval         uniprot_rest_interface         ...       No       No         [33]

(... = version truncated in this preview)

To reduce the complexity of running the GeneTrees pipeline, we used the Galaxy bioinformatics analysis platform, which relieves users of the burden of managing these system-level challenges. Galaxy is an open-source framework for running a broad collection of bioinformatics tools via a user-friendly web interface [10]. No client software is required other than a recent web browser, and users are able to run tools singly or aggregated into interconnected pipelines, called "workflows". Galaxy enables users not only to create workflows but also to share them with the community. In this way, it helps users who have little or no bioinformatics expertise to run potentially complex pipelines in order to analyze their own data and interrogate results within a single online platform. Furthermore, pipelines can be published in a scientific paper or in a repository such as myExperiment [11] to encourage transparency and reproducibility.

In addition to analytical tools, Galaxy also contains plugins [12] for data visualization. Galaxy visualization plugins may be (...truncated)
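
The similarity-search and clustering stages named in the Introduction (BLAST followed by hcluster_sg) can be pictured with a minimal Python sketch. It is not the hcluster_sg algorithm and not code from GeneSeqToFamily: as a stand-in it uses plain single-linkage clustering (connected components over significant hits), which is a much cruder grouping. The function name `cluster_hits`, the E-value cutoff, and the sample gene identifiers are hypothetical; the input is assumed to be BLAST+ tabular output (`-outfmt 6`) with the default column order, where the E-value is the 11th column.

```python
from collections import defaultdict


def cluster_hits(blast_lines, max_evalue=1e-10):
    """Group sequence IDs into candidate families.

    Single-linkage stand-in for a real clustering tool: two sequences end up
    in the same family if a chain of hits with E-value <= max_evalue links them.
    """
    parent = {}  # union-find forest: sequence ID -> parent ID

    def find(seq_id):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(seq_id, seq_id)
        while parent[seq_id] != seq_id:
            parent[seq_id] = parent[parent[seq_id]]  # path halving
            seq_id = parent[seq_id]
        return seq_id

    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        root_q, root_s = find(query), find(subject)
        if evalue <= max_evalue and root_q != root_s:
            parent[root_s] = root_q  # merge the two families

    families = defaultdict(set)
    for seq_id in list(parent):
        families[find(seq_id)].add(seq_id)
    return sorted(sorted(members) for members in families.values())


# Hypothetical hits: geneA~geneB and geneB~geneC are strong, geneD~geneE is weak.
hits = [
    "geneA\tgeneB\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "geneB\tgeneC\t85.0\t100\t15\t0\t1\t100\t1\t100\t1e-40\t180",
    "geneD\tgeneE\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
]
families = cluster_hits(hits)
print(families)  # [['geneA', 'geneB', 'geneC'], ['geneD'], ['geneE']]
```

Single linkage tends to fuse unrelated families through one spurious hit, which is one reason a dedicated clustering step is used in practice; the sketch only shows how pairwise similarity results become candidate gene families that later stages can align and build trees from.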


This is a preview of a remote PDF: https://academic.oup.com/gigascience/article-pdf/7/3/giy005/24622312/giy005.pdf
Article home page: https://academic.oup.com/gigascience/article/7/3/1/4841850

Thanki, Anil S, Soranzo, Nicola, Haerty, Wilfried, Davey, Robert P. GeneSeqToFamily: a Galaxy workflow to find gene families based on the Ensembl Compara GeneTrees pipeline, GigaScience, 2018, Volume 7, Issue 3, DOI: 10.1093/gigascience/giy005