Conveyor: a workflow engine for bioinformatic analyses
BIOINFORMATICS
ORIGINAL PAPER
Genome analysis
Vol. 27 no. 7 2011, pages 903–911
doi:10.1093/bioinformatics/btr040
Advance Access publication January 28, 2011
Burkhard Linke1,∗ , Robert Giegerich2 and Alexander Goesmann1
1 Bioinformatics Resource Facility, Center for Biotechnology and 2 Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany
Associate Editor: Alex Bateman
Received on October 26, 2010; revised on January 7, 2011; accepted
on January 19, 2011
1 INTRODUCTION
Workflows have become an important part of bioinformatics
in recent years (e.g. Romano, 2007 and
Smedley et al., 2008). Applications like Galaxy (Goecks et al.,
2010), Taverna (Hull et al., 2006), Pegasus (Deelman et al., 2005)
and Kepler (Altintas et al., 2004) offer an easy way to access
local and remote resources and perform automatic analyses to test
hypotheses or process data. In many cases, they have become
a reasonable alternative to writing simple software tools such as Perl
scripts, especially for users without an in-depth computer science
background. Libraries like Ruffus (Goodstadt, 2010) add workflow
functionality to programming languages, providing methods to
define and build workflows within one's own applications.
∗ To whom correspondence should be addressed.
A workflow is built from several linked steps that consume inputs,
process and convert data, and produce results. The simplest
workflows are linear chains of steps that convert input data to the
required output. More complex setups may also include loops,
branches, parallel and conditional processing, reading from various
sources and writing different outputs in various formats.
For creating reusable workflow components, a processing step
may be composed of nested processing steps, allowing the user to
build complex pipelines for higher level analysis.
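The idea of linked and nested processing steps can be illustrated with a minimal Python sketch. It is not taken from Conveyor or any of the engines cited above; the helper `chain` and the step names (`strip_whitespace`, `to_upper`, `reverse_complement`) are hypothetical and chosen only for illustration. Each step is modelled as a plain function, a linear workflow is a left-to-right composition of steps, and a composed workflow can itself be reused as a single step.

```python
from functools import reduce
from typing import Callable, Iterable, List

# A processing step consumes one input item and produces one result.
Step = Callable[[str], str]


def chain(*steps: Step) -> Step:
    """Compose steps left to right into a single step (a linear workflow)."""
    def run(item: str) -> str:
        return reduce(lambda value, step: step(value), steps, item)
    return run


# Three hypothetical elementary steps operating on a DNA string.
def strip_whitespace(seq: str) -> str:
    return "".join(seq.split())


def to_upper(seq: str) -> str:
    return seq.upper()


def reverse_complement(seq: str) -> str:
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


# A nested step: 'normalize' is itself a small workflow ...
normalize = chain(strip_whitespace, to_upper)
# ... and is reused as one component of a larger pipeline.
pipeline = chain(normalize, reverse_complement)


def run_workflow(workflow: Step, inputs: Iterable[str]) -> List[str]:
    """Apply the workflow to every input item, one item at a time."""
    return [workflow(item) for item in inputs]
```

Calling `run_workflow(pipeline, ["ac g\ntt"])` would normalize the input to `ACGTT` and then return its reverse complement. Branches, loops and conditional processing need a richer representation than function composition, which is why engines typically describe workflows as graphs.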
Many workflow engines act as wrappers, using existing command
line utilities or enacting and orchestrating existing web services as
basic modules for their processing steps. As a result, adding new
processing steps by wrapping existing applications often
requires little or no programming effort.
Of course, this comes at a price. Passing data between processing
steps depends on a common data format, especially in the case
of distributed processing nodes. Most analysis tools available as
command line applications or web services consume simple
text formats, e.g. the FASTA format for DNA and amino acid
sequences. These formats are in turn used by the workflow engines
to exchange data between processing nodes. Integrating other
processing nodes or input sources requires explicit data conversion
prior to processing. This often leads to the loss of information;
e.g. gene features annotated in EMBL or GenBank entries cannot
retain all qualifiers after converting them to FASTA format. An
analysis done by Wassink et al. (2009) shows that most tasks
used in publicly available Taverna workflows are dedicated to
data conversion. To some extent, this problem is solved by meta
information provided with types that allow the definition of type
hierarchies and interfaces. The BioMoby (Wilkinson et al., 2008)
data type management is an example of an ontology-based approach
to data type handling; nonetheless, it requires extra effort from the
developer and/or maintainer. Other attempts to define common data
types, such as BioXSD (Kalaš et al., 2010) or those proposed by Seibel et al. (2006),
have been made, but none of them has been successfully adopted by
the community yet. The situation is even worse if legacy data from
applications are to be integrated into a workflow. Accessing data that are,
for example, stored in a relational database or available only through a
local application requires special processing steps; passing the data between
distributed processing nodes may not be possible at all.
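The loss of information caused by converting annotated entries to FASTA can be made concrete with a small Python sketch. The classes `Feature` and `AnnotatedRecord` and the functions `to_fasta` and `from_fasta` are hypothetical stand-ins, not a real EMBL or GenBank parser, and the example entry is made up. The sketch only assumes that a FASTA entry consists of a header line followed by sequence lines.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    """A simplified annotated feature, e.g. a CDS with its qualifiers."""
    kind: str
    start: int
    end: int
    qualifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class AnnotatedRecord:
    """A simplified stand-in for an EMBL or GenBank entry."""
    identifier: str
    description: str
    sequence: str
    features: List[Feature] = field(default_factory=list)


def to_fasta(record: AnnotatedRecord, width: int = 60) -> str:
    """Write a record as FASTA; only header and sequence fit the format."""
    header = f">{record.identifier} {record.description}"
    lines = [record.sequence[i:i + width]
             for i in range(0, len(record.sequence), width)]
    return "\n".join([header, *lines]) + "\n"


def from_fasta(text: str) -> AnnotatedRecord:
    """Read a single FASTA entry back; features cannot be recovered."""
    header, *body = text.strip().splitlines()
    identifier, _, description = header[1:].partition(" ")
    return AnnotatedRecord(identifier, description, "".join(body))


# A made-up entry with one annotated feature.
record = AnnotatedRecord(
    identifier="X00001",
    description="hypothetical example entry",
    sequence="ATGGCTAAATAA",
    features=[Feature("CDS", 1, 12, {"gene": "exA", "product": "example protein"})],
)

# Passing the record through FASTA keeps the sequence but drops all features.
round_tripped = from_fasta(to_fasta(record))
```

After the round trip, `round_tripped.sequence` still matches the original, while `round_tripped.features` is empty: the gene and product qualifiers have no place in the exchange format and are gone for every downstream step.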
Another problem arises from the nature of web services used in
processing steps. Although they offer an elegant and easy way to
provide and consume useful services, users have to be aware of
the pitfalls of web services if they rely on them for an analytical
workflow. A service may become unavailable without prior notice
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email:
ABSTRACT
Motivation: The rapidly increasing amounts of data available from
new high-throughput methods have made data processing without
automated pipelines infeasible. As was pointed out in several
publications, integration of data and analytic resources into workflow
systems provides a solution to this problem, simplifying the task
of data analysis. Various applications for defining and running
workflows in the field of bioinformatics have been proposed and
published, e.g. Galaxy, Mobyle, Taverna, Pegasus or Kepler. One
of the main aims of such workflow systems is to enable scientists
to focus on analysing their datasets instead of taking care of
data management, job management or monitoring the execution
of computational tasks. The currently available workflow systems
achieve this goal, but fundamentally differ in their way of executing
workflows.
Results: We have developed the Conveyor software library, a
multitiered generic workflow engine for composition, execution and
monitoring of complex workflows. It features an open, extensible
system architecture and concurrent program execution to exploit
resources available on modern multicore CPU hardware. It offers the
ability to build complex workflows with branches, loops and other
control structures. Two example use cases illustrate the application of
the versatile Conveyor engine to common bioinformatics problems.
Availability: The Conveyor application, including client and server, is
available at http://conveyor.cebitec.uni-bielefeld.de.
Contact: ; .
Supplementary information: Supplementary data are available at
Bioinformatics online.
2 APPROACH AND DESIGN
To overcome these limitations, we have developed Conveyor
as a novel software library offering its functionality to other
applications. A client-server setup based on the library is used to
separate the design of a workflow from its execution. Additionally,
a command line tool allows workflows to be executed in batch
mode. As a software library, Conveyor may also be embedded in
other applications, providing a data processing layer that is easy
to update and maintain.
Workflows in Conveyor are represented by directed graphs,
composed of nodes for processing and input/output steps, and edges
moving data between nodes. This design allows simple pipelines
built from concatenated processing steps, and also complex graphs
with branches, loops and parallel flow of data. Data are passed one
by one between nodes, creating a stream of input data. This model
enables the parallel processing of nodes, using multiple CPU cores
if available. For a first impressio (...truncated)