GeneSeqToFamily: a Galaxy workflow to find gene families based on the Ensembl Compara GeneTrees pipeline
GigaScience, 7, 2018, 1–10
doi: 10.1093/gigascience/giy005
Advance Access Publication Date: 7 February 2018
Technical Note
Anil S. Thanki∗, Nicola Soranzo, Wilfried Haerty and Robert P. Davey
Earlham Institute, Norwich Research Park, Norwich NR4 7UZ, UK
∗Correspondence address. Anil S. Thanki, Earlham Institute, Norwich Research Park, Norwich, NR4 7UZ, UK. E-mail:
Abstract
Background: Gene duplication is a major factor contributing to evolutionary novelty, and the contraction or expansion of
gene families has often been associated with morphological, physiological, and environmental adaptations. The study of
homologous genes helps us to understand the evolution of gene families. It plays a vital role in finding ancestral gene
duplication events as well as identifying genes that have diverged from a common ancestor under positive selection. There
are various tools available, such as MSOAR, OrthoMCL, and HomoloGene, to identify gene families and visualize syntenic
information between species, providing an overview of how syntenic regions evolve at the family level. Unfortunately, none
of them provides information about structural changes within genes, such as the conservation of ancestral exon boundaries
among multiple genomes. The Ensembl GeneTrees computational pipeline generates gene trees based on coding
sequences, provides details about exon conservation, and is used in the Ensembl Compara project to discover gene families.
Findings: A certain amount of expertise is required to configure and run the Ensembl Compara GeneTrees pipeline via
command line. Therefore, we converted this pipeline into a Galaxy workflow, called GeneSeqToFamily, and provided
additional functionality. This workflow uses existing tools from the Galaxy ToolShed and provides the additional
wrappers and tools that are required to run it.
Conclusions: GeneSeqToFamily represents the Ensembl
GeneTrees pipeline as a set of interconnected Galaxy tools, so they can be run interactively within Galaxy’s
user-friendly workflow environment while still providing the flexibility to tailor the analysis by changing configurations and
tools if necessary. Additional tools allow users to subsequently visualize the gene families produced by the workflow, using
the Aequatus.js interactive tool, which has been developed as part of the Aequatus software project.
Keywords: Galaxy; Pipeline; Workflow; Genomics; Comparative Genomics; Homology; Orthology; Paralogy; Phylogeny; Gene
Family; Alignment; Compara; Ensembl
Introduction
The phylogenetic information inferred from the study of homologous genes helps us to understand the evolution of gene
families (also referred to as “orthogroups”) that comprise genes
sharing common descent [1]. This plays a vital role in finding
ancestral gene duplication events as well as in identifying regions under positive selection within species [2]. In order to
investigate these low-level comparisons between gene families, the Ensembl Compara GeneTrees gene orthology and paralogy prediction software suite [3] was developed as a pipeline.
Received: 30 March 2017; Revised: 31 July 2017; Accepted: 18 January 2018
© The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium,
provided the original work is properly cited.

Table 1: Galaxy tools used in the workflow

| Tool name | Tool ID | Version | Developed at Earlham Institute: tool | Developed at Earlham Institute: wrapper | Toolsheds reference |
|---|---|---|---|---|---|
| Get sequences by Ensembl ID | get_sequences | 0.1.2 | Yes | Yes | [17] |
| Get features by Ensembl ID | get_feature_info | 0.1.2 | Yes | Yes | [18] |
| Select longest coding sequence per gene | ensembl_longest_cds_per_gene | 0.0.2 | Yes | Yes | [19] |
| ETE species tree generator | ete_species_tree_generator | 3.0.0b35 | Yes | Yes | [20] |
| GeneSeqToFamily preparation | gstf_preparation | 0.4.0 | Yes | Yes | [21] |
| Transeq | EMBOSS: transeq101 | 5.0.0 | No | No | [22] |
| NCBI BLAST+ makeblastdb | ncbi_makeblastdb | 0.2.01 | No | No | [23] |
| NCBI BLAST+ blastp | ncbi_blastp_wrapper | 0.2.01 | No | No | [23] |
| BLAST parser | blast_parser | 0.1.2 | Yes | Yes | [24] |
| hcluster_sg | hcluster_sg | 0.5.1.1 | No | Yes | [25] |
| hcluster_sg parser | hcluster_sg_parser | | Yes | Yes | [26] |
| Filter by FASTA IDs | filter_by_fasta_ids | | No | No | [27] |
| T-Coffee | t_coffee | | No | Yes | [28] |
| Tranalign | EMBOSS: tranalign100 | | No | No | [22] |
| TreeBeST best | treebest_best | | No | Yes | [29] |
| Gene Alignment and Family Aggregator | gafa | | Yes | Yes | [30] |
| Unique | tp_sorted_uniq | | No | No | [31] |
| FASTA-to-Tabular | fasta2tab | | No | No | [32] |
| UniProt ID mapping and retrieval | uniprot_rest_interface | | No | No | [33] |

The Ensembl GeneTrees pipeline uses TreeBeST [4, 5] (part of TreeFam [6]), which implements multiple independent phylogenetic methods and can merge the results into a consensus tree
while trying to minimize duplications and deletions relative to
a known species tree. This allows TreeBeST to take advantage
of the fact that DNA-based methods are often more accurate for
closely related parts of trees, while protein-based trees are better at longer evolutionary distances.
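As a rough illustration of what inferring duplications relative to a species tree involves, the sketch below labels each internal node of a small gene tree as a duplication or a speciation using the simple species-overlap rule: a node is called a duplication when its two child subtrees share at least one species. This is a deliberately simplified stand-in, not TreeBeST's reconciliation algorithm. The nested-tuple tree encoding, the `species_gene` leaf naming, and the function names are all assumptions made for the example.

```python
def species_of(leaf):
    """Leaf names are assumed to look like 'species_gene', e.g. 'human_A'."""
    return leaf.split("_", 1)[0]


def label_nodes(tree, events=None):
    """Return (species_set, events) for a binary tree given as nested tuples.

    `events` collects one 'duplication' or 'speciation' label per internal
    node, in post-order (children before their parent).
    """
    if events is None:
        events = []
    if isinstance(tree, str):  # a leaf carries exactly one species
        return {species_of(tree)}, events
    left, right = tree
    left_species, _ = label_nodes(left, events)
    right_species, _ = label_nodes(right, events)
    # Species-overlap rule: a species present under both children
    # implies that the node is a duplication.
    kind = "duplication" if left_species & right_species else "speciation"
    events.append(kind)
    return left_species | right_species, events


# Hypothetical gene tree: two paralogous clades (A and B),
# each containing one human and one mouse copy.
gene_tree = (("human_A", "mouse_A"), ("human_B", "mouse_B"))
species, events = label_nodes(gene_tree)
```

In this toy tree the two inner nodes separate human from mouse and are labelled as speciations, while the root joins two clades that both contain human and mouse and is labelled as a duplication.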
The Ensembl GeneTrees pipeline comprises 7 steps, starting from a set of protein sequences and performing similarity
searching and multiple large-scale alignments to infer homology among them, using various tools: BLAST [7], hcluster_sg [8],
T-Coffee [9], and phylogenetic tree construction tools, including
TreeBeST. While these tools are freely available, most are specific to certain computing environments, are only usable via the
command line, and require many dependencies to be fulfilled.
Users are not always sufficiently expert in system administration to install, run, and debug the various tools at each
stage in a chain of processes. To help ease the complexity of
running the GeneTrees pipeline, we employed the Galaxy bioinformatics analysis platform to relieve the burden of managing
these system-level challenges.
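To make the step from similarity search to candidate families more concrete, the sketch below groups sequence identifiers that are linked by significant pairwise hits. It uses a plain connected-components grouping (union-find), which is a much simpler technique than the hierarchical clustering performed by hcluster_sg and is swapped in here only for illustration. The `(query, subject, evalue)` tuple layout, the `max_evalue` threshold, and the function name are assumptions, not part of the pipeline.

```python
from collections import defaultdict


def cluster_hits(hits, max_evalue=1e-10):
    """Group sequence IDs into candidate families by connected components.

    `hits` is an iterable of (query, subject, evalue) tuples; the threshold
    value is an arbitrary choice for this example.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        # Looking up both IDs registers them, so a sequence with no
        # significant hit still ends up in its own cluster.
        root_q, root_s = find(query), find(subject)
        if evalue <= max_evalue and root_q != root_s:
            parent[root_q] = root_s

    families = defaultdict(set)
    for seq_id in parent:
        families[find(seq_id)].add(seq_id)
    return sorted(sorted(members) for members in families.values())


# Hypothetical hits: p1-p2-p3 are linked by strong hits, p4-p5 likewise,
# and the weak p3-p4 hit is ignored.
hits = [
    ("p1", "p2", 1e-50),
    ("p2", "p3", 1e-30),
    ("p4", "p5", 1e-20),
    ("p3", "p4", 0.5),
]
families = cluster_hits(hits)
```

Connected components merge any two sequences joined by a chain of hits, so a single spurious hit can fuse unrelated families; avoiding that kind of over-merging is one reason to prefer a dedicated clustering tool in practice.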
Galaxy is an open-source framework for running a broad
collection of bioinformatics tools via a user-friendly web interface [10]. No client software is required other than a recent web
browser, and users are able to run tools singly or aggregated
into interconnected pipelines, called “workflows”. Galaxy enables users to not only create but also share workflows with the
community. In this way, it helps users who have little or no bioinformatics expertise to run potentially complex pipelines in order
to analyze their own data and interrogate results within a single online platform. Furthermore, pipelines can be published in
a scientific paper or in a repository such as myExperiment [11]
to encourage transparency and reproducibility.
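The idea of tools aggregated into an interconnected workflow can be sketched as a dependency graph from which a valid execution order is derived. The example below uses Python's standard `graphlib` module (Python 3.9 or later). The tool IDs are taken from Table 1, but the dependency edges are assumptions chosen for illustration and do not reproduce the published workflow definition; Galaxy resolves such dependencies itself when a workflow is run.

```python
from graphlib import TopologicalSorter

# Each key depends on the tools in its value set.
# Edges are illustrative assumptions, not the published workflow.
dependencies = {
    "ncbi_makeblastdb": {"transeq"},
    "ncbi_blastp_wrapper": {"transeq", "ncbi_makeblastdb"},
    "blast_parser": {"ncbi_blastp_wrapper"},
    "hcluster_sg": {"blast_parser"},
    "t_coffee": {"hcluster_sg"},
    "treebest_best": {"t_coffee"},
}

# static_order() yields every tool after all of its prerequisites.
run_order = list(TopologicalSorter(dependencies).static_order())
```

Because every tool in this toy graph depends on the one before it, only one run order is possible; in a graph with independent branches, several orders would be valid and the scheduler would be free to run those branches in parallel.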
In addition to analytical tools, Galaxy also contains plugins
[12] for data visualization. Galaxy visualization plugins may be
Table 1 (continued), Version column in row order from ETE species tree generator onward: 3.0.0b35, 0.4.0, 5.0.0, 0.2.01, 0.2.01, 0.1.2, 0.5.1.1 (...truncated)