https://bioinformatics.oxfordjournals.org/content/30/13/1917.full.pdf

Wrangling Galaxy’s reference data

Daniel Blankenberg 1,2, James E. Johnson 0, The Galaxy Team 1, James Taylor 1,3,4 and Anton Nekrutenko 1,2

Associate Editor: John Hancock

0 Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA
1 http://www.galaxyproject.org
2 Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA 16802, USA
3 Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, USA
4 Department of Biology

Summary: The Galaxy platform has developed into a fully featured collaborative workbench, with goals of inherently capturing provenance to enable reproducible data analysis, and of making it straightforward to run one's own server. However, many Galaxy platform tools rely on the presence of reference data, such as alignment indexes, to function efficiently. Until now, the building of this cache of data for Galaxy has been an error-prone manual process lacking reproducibility and provenance. The Galaxy Data Manager framework is an enhancement that changes the management of Galaxy's built-in data cache from a manual procedure to an automated graphical user interface (GUI) driven process, which contains the same openness, reproducibility and provenance that is afforded to Galaxy's analysis tools. Data Manager tools allow the Galaxy administrator to download, create and install additional datasets for any type of reference data in real time.

Availability and implementation: The Galaxy Data Manager framework is implemented in Python and has been integrated as part of the core Galaxy platform. Individual Data Manager tools can be defined locally or installed from a ToolShed, allowing the Galaxy community to define additional Data Manager tools as needed, with full versioning and dependency support.

Contact: . or

Supplementary information: Supplementary data is available at Bioinformatics online.
1 INTRODUCTION

Galaxy (Blankenberg et al., 2010; Giardine et al., 2005; Goecks et al., 2010) is a web-based platform for performing large-scale data analysis. It is a completely open-source project that supports accessible, reproducible and transparent computational research and is available through the use of free public servers, private local installations and by launching instances in the Cloud. At the heart of Galaxy is its ability to integrate disparate data sources and analysis tools into a unified interface. Galaxy comes prepackaged with a default set of analysis tools, but additional tools can be defined locally or installed from a community-curated resource known as the Galaxy ToolShed (https://usegalaxy.org/toolshed).

When a tool is executed within Galaxy, all of the user's selections and parameters are recorded, providing provenance and enabling reproducible data analysis. When executing a tool installed from the Galaxy ToolShed, not only are input parameters recorded, but specific tool and dependency versions are also controlled; this enables reproducibility across time and between different Galaxy instances. One weakness in this reproducibility is the reliance of many tools on built-in reference data, such as reference genome sequences or short-read mapper indexes (see Supplementary Figure S1). Until now, Galaxy administrators have been responsible for downloading, building and installing these important reference data.
For example, to make the UCSC hg19 build of the human reference genome available to the Burrows-Wheeler Aligner (BWA) short-read mapper (Li and Durbin, 2009), a Galaxy administrator would need to (i) download the reference genome FASTA file, (ii) make it available as a reference genome via the all_fasta table (optional), (iii) build BWA alignment indexes via proper command-line calls, (iv) register the location and availability of the indexes within the bwa_indexes data table (by adding an additional entry to the tool-data/bwa_index.loc file on disk) and (v) finally, restart the Galaxy server. Although not technically challenging, each of these manual steps is prone to error and lacks any provenance; any incorrectness or incompleteness of built-in data will have a severe impact on the correctness of a subsequent analysis. Worse yet, it may not even be apparent that something has been configured incorrectly, creating a situation where invalid results are trusted.

Data Manager tools remove the technical burdens of ensuring the reproducibility and provenance of built-in reference data from the hands of the Galaxy administrator and make it an automated point-and-click process. A new menu option, "Manage local data", has been added to the Galaxy administrator interface. Accessing this option enables an administrator to run Data Manager tools, inspect the results of individual Data Manager executions and view the current state of Galaxy's built-in data registries.

Running a Data Manager uses the same familiar interface as a standard Galaxy tool, allowing an administrator to configure the Data Manager with desired options (e.g. dbkey, source reference FASTA file, indexing algorithm). On completion of a Data Manager tool run, the Data Manager framework parses the output for new data table entries and values. These values are enabled in real time and persisted to disk.
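The final step described above (parsing a Data Manager's output for new data table entries, enabling them immediately and persisting them to disk) can be sketched roughly as follows. This is a minimal illustration and not Galaxy's actual implementation: the `data_tables` output layout, the column order in `COLUMNS`, the `LOC_FILES` mapping, the example paths and the function name `register_entries` are all assumptions made for the example; only the `bwa_indexes` table name and the `bwa_index.loc` file name come from the text.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical column order for one data table row (assumption for illustration).
COLUMNS = ["value", "dbkey", "name", "path"]

# Data table name -> .loc file name; this pairing is the one named in the text.
LOC_FILES = {"bwa_indexes": "bwa_index.loc"}


def register_entries(manager_output, tables, loc_dir):
    """Add new entries to the in-memory tables and append them to .loc files.

    manager_output: dict with an assumed {"data_tables": {table: [entry, ...]}} layout.
    tables: dict of table name -> list of rows, standing in for the live registry.
    loc_dir: directory holding the tab-separated .loc files.
    Returns the number of rows that were actually added.
    """
    added = 0
    for table_name, entries in manager_output.get("data_tables", {}).items():
        rows = tables.setdefault(table_name, [])
        loc_path = Path(loc_dir) / LOC_FILES[table_name]
        with loc_path.open("a", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            for entry in entries:
                row = [entry[column] for column in COLUMNS]
                if row in rows:
                    continue  # already registered; do not duplicate it
                rows.append(row)      # visible immediately, no restart needed
                writer.writerow(row)  # persisted, so it survives a restart
                added += 1
    return added


if __name__ == "__main__":
    # Example output of a hypothetical BWA index Data Manager run.
    output = {
        "data_tables": {
            "bwa_indexes": [
                {
                    "value": "hg19",
                    "dbkey": "hg19",
                    "name": "Human (hg19)",
                    "path": "/data/hg19/bwa/hg19.fa",
                }
            ]
        }
    }
    live_tables = {}
    with tempfile.TemporaryDirectory() as loc_dir:
        print(register_entries(output, live_tables, loc_dir))
        print((Path(loc_dir) / "bwa_index.loc").read_text())
```

Updating the in-memory table and appending to the file in the same pass mirrors the two properties the text highlights: new entries are usable at once, and they remain after a restart.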
Restarting the Galaxy server is not required for enabling the new entries; however, the new entries will remain after a restart. Although the Data Manager framework negates the need for manual curation of reference data, it is compatible with any previously existing policy or process in use for a Galaxy installation.

A Galaxy Data Manager is composed of two primary components: a Data Manager tool and a Data Manager configuration. Similar to standard Galaxy tools, the Data Manager tool component is responsible for defining the user-configurable components as well as the command line and scripts used to generate the actual underlying data (e.g. download a FASTA genome, run the BWA binary to build index files). The Data Manager configuration component instructs the framework on how to process the output of the Data Manager tool into new entries in Galaxy's built-in data registry. Although a Data Manager can add any number of new entries to any number of data tables, the most common case is to add a single new entry to a single data table. Whereas the examples listed here involve reference genomes or reference genome indexes, it is worth noting that any type of preconfigured Galaxy data, such as BLAST databases and protein or pathway domain databases, can be incorporated into a Data Manager.

Data Manager tools

Data Manager tools were implemented as an extension to standard Galaxy tools. A Galaxy tool can be loosely defined as being composed of two parts: (i) an XML-based tool description that defines the input parameters and settings, the manner in which to assemble a command line to be executed, the output files generated by the command line and on-screen hel (...truncated)