Wrangling Galaxy’s reference data
Daniel Blankenberg (1,2), James E. Johnson (0), The Galaxy Team (1), James Taylor (1,3,4), Anton Nekrutenko (1,2)
Associate Editor: John Hancock
0 Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA
1 http://www.galaxyproject.org
2 Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA 16802, USA
3 Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, USA
4 Department of Biology
Summary: The Galaxy platform has developed into a fully featured collaborative workbench, with goals of inherently capturing provenance to enable reproducible data analysis, and of making it straightforward to run one's own server. However, many Galaxy platform tools rely on the presence of reference data, such as alignment indexes, to function efficiently. Until now, the building of this cache of data for Galaxy has been an error-prone manual process lacking reproducibility and provenance. The Galaxy Data Manager framework is an enhancement that changes the management of Galaxy's built-in data cache from a manual procedure to an automated graphical user interface (GUI) driven process, which provides the same openness, reproducibility and provenance that is afforded to Galaxy's analysis tools. Data Manager tools allow the Galaxy administrator to download, create and install additional datasets for any type of reference data in real time.
Availability and implementation: The Galaxy Data Manager framework is implemented in Python and has been integrated as part of the core Galaxy platform. Individual Data Manager tools can be defined locally or installed from a ToolShed, allowing the Galaxy community to define additional Data Manager tools as needed, with full versioning and dependency support.
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Galaxy (Blankenberg et al., 2010; Giardine et al., 2005; Goecks
et al., 2010) is a web-based platform for performing large-scale
data analysis. It is a completely open-source project that
supports accessible, reproducible and transparent computational
research and is available through the use of free public servers,
private local installations and by launching instances in the
Cloud. At the heart of Galaxy is its ability to integrate disparate
data sources and analysis tools into a unified interface. Galaxy
comes prepackaged with a default set of analysis tools, but
additional tools can be defined locally or installed from a
community-curated resource known as the Galaxy ToolShed
(https://usegalaxy.org/toolshed). When a tool is executed within
Galaxy, all of the user's selections and parameters are recorded,
providing provenance and enabling reproducible data analysis.
When executing a tool installed from the Galaxy ToolShed, not
only are input parameters recorded, but specific tool and
dependency versions are also controlled; this enables reproducibility
across time and between different Galaxy instances.
One weakness in this reproducibility is the reliance of many
tools on built-in reference data, such as reference genome
sequences or short-read mapper indexes (see Supplementary
Figure S1). Until now, Galaxy administrators have been
responsible for downloading, building and installing these important
reference data. For example, to make the UCSC hg19 build of
the human reference genome available to the Burrows-Wheeler
Aligner (BWA) short-read mapper (Li and Durbin, 2009), a
Galaxy administrator would need to (i) download the reference
genome FASTA file, (ii) make it available as a reference genome
via the all_fasta table (optional), (iii) build BWA alignment
indexes via proper command-line calls, (iv) register the location
and availability of the indexes within the bwa_indexes data
table (by adding an additional entry to the
tool-data/bwa_index.loc file on disk) and (v) finally, restart the Galaxy server.
Although not technically challenging, each one of the previously
mentioned manual steps is prone to error and lacks any
provenance; any incorrectness or incompleteness of built-in data will
have a severe impact on the correctness of a subsequent analysis.
Worse yet, it may not even be apparent that something has been
configured incorrectly, creating a situation where invalid results
are trusted.
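Steps (iii) and (iv) of the manual procedure can be sketched in Python as follows. This is a minimal illustration rather than Galaxy code: the helper names, the example paths and the four-column tab-separated layout used for the .loc entry are assumptions made for the example. `bwa index -a bwtsw` is the BWA indexing call for large genomes.

```python
import shlex


def bwa_index_command(fasta_path, algorithm="bwtsw"):
    """Return the argument list for building BWA indexes (step iii)."""
    return ["bwa", "index", "-a", algorithm, fasta_path]


def loc_entry(value, dbkey, name, path):
    """Format one tab-separated .loc line (step iv).

    The column order (value, dbkey, name, path) is an assumption
    made for this sketch.
    """
    fields = [value, dbkey, name, path]
    for field in fields:
        # A stray tab or newline would silently corrupt the table.
        if "\t" in field or "\n" in field:
            raise ValueError("field contains a tab or newline: %r" % field)
    return "\t".join(fields) + "\n"


def append_loc_entry(loc_path, line):
    """Append a formatted entry to a .loc file on disk."""
    with open(loc_path, "a") as handle:
        handle.write(line)


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the shape of each step.
    print(shlex.join(bwa_index_command("/data/hg19/hg19.fa")))
    print(loc_entry("hg19", "hg19", "Human (hg19)", "/data/hg19/hg19.fa"), end="")
```

Even in this small form, a typo in a path or a misplaced column would go unnoticed until a tool run fails or, worse, succeeds against the wrong data.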
Data Manager tools lift the technical burden of ensuring
the reproducibility and provenance of built-in reference data
from the Galaxy administrator and turn it into an
automated point-and-click process. A new menu option,
Manage local data, has been added to the Galaxy administrator
interface. Accessing this option enables an administrator to run
Data Manager tools, inspect the results of individual Data
Manager executions and view the current state of Galaxy's
built-in data registries. Running a Data Manager uses the
same familiar interface as a standard Galaxy tool, allowing an
administrator to configure the Data Manager with desired
options (e.g. dbkey, source reference FASTA file, indexing
algorithm). On completion of a Data Manager tool run, the Data
Manager framework parses the output for new data table entries
and values. These values are enabled in real time and persisted to
disk. Restarting the Galaxy server is not required for enabling
the new entries; however, the new entries will remain after a
restart. Although the Data Manager framework removes the
need for manual curation of reference data, it is compatible
with any previously existing policy or process in use for a Galaxy
installation.
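The parsing step described above can be pictured with a short Python sketch. It is not the framework's implementation: the function name, the in-memory table structure and the JSON layout (a top-level "data_tables" key mapping table names to lists of entries) are assumptions made for illustration.

```python
import json


def merge_new_entries(tables, tool_output_json):
    """Add entries from a Data Manager-style JSON output to in-memory tables.

    `tables` maps a table name to a list of entry dicts. Returns the
    number of entries that were newly added.
    """
    payload = json.loads(tool_output_json)
    added = 0
    for table_name, entries in payload.get("data_tables", {}).items():
        # Accept a single entry as well as a list of entries.
        if isinstance(entries, dict):
            entries = [entries]
        table = tables.setdefault(table_name, [])
        for entry in entries:
            # Skip exact duplicates so that re-running is harmless.
            if entry not in table:
                table.append(entry)
                added += 1
    return added
```

Because the tables are updated in memory, the new entries become visible immediately; writing them back to disk (not shown) is what would let them survive a restart.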
A Galaxy Data Manager is composed of two primary components: a
Data Manager tool and a Data Manager configuration. Similar to
standard Galaxy tools, the Data Manager tool component is responsible for
defining the user configurable components as well as the command line
and scripts used to generate the actual underlying data (e.g. download a
FASTA genome, run the BWA binary to build index files). The Data
Manager configuration component instructs the framework on how to
process the output of the Data Manager tool into new entries in
Galaxy's built-in data registry. Although a Data Manager can add any
number of new entries to any number of data tables, the most common
case is to add a single new entry to a single data table. Whereas the
examples listed here involve reference genomes or reference genome
indexes, it is worth noting that any type of preconfigured Galaxy data,
such as BLAST databases and protein or pathway domain databases, can
be incorporated into a Data Manager.
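The division of labour between the two components can be sketched as follows. This is a hypothetical model, not Galaxy's actual configuration format: the class name, field names and column list are assumptions chosen to mirror the common case of one new entry in one data table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataManagerConfig:
    """Hypothetical stand-in for the configuration component."""
    table: str       # name of the target data table
    columns: tuple   # ordered column names expected in the tool output


def build_entry(config, tool_output):
    """Turn one tool output record into a (table, row) pair.

    `tool_output` stands in for what the tool component produced,
    e.g. after downloading a FASTA file and building indexes.
    """
    missing = [c for c in config.columns if c not in tool_output]
    if missing:
        raise KeyError("tool output lacks columns: %s" % ", ".join(missing))
    row = tuple(tool_output[c] for c in config.columns)
    return config.table, row
```

Keeping the column mapping in the configuration rather than in the tool means the same tool output could, in principle, be routed to a different table without changing the script that generates the data.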
Data Manager tools
Data Manager tools were implemented as an extension to
standard Galaxy tools. A Galaxy tool can be loosely defined as being
composed of two parts: (i) an XML-based tool description that defines the
input parameters and settings, the manner in which to assemble a
command line to be executed, the output files generated by the command line
and on-screen hel (...truncated)