Conveyor: a workflow engine for bioinformatic analyses
Burkhard Linke (1), Robert Giegerich (0) and Alexander Goesmann (1)
Associate Editor: Alex Bateman
(0) Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany
(1) Bioinformatics Resource Facility, Center for Biotechnology
Motivation: The rapidly increasing amounts of data available from new high-throughput methods have made data processing without automated pipelines infeasible. As several publications have pointed out, integrating data and analytic resources into workflow systems addresses this problem and simplifies the task of data analysis. Various applications for defining and running workflows in the field of bioinformatics have been proposed and published, e.g. Galaxy, Mobyle, Taverna, Pegasus or Kepler. One of the main aims of such workflow systems is to enable scientists to focus on analysing their datasets instead of taking care of data management, job management or monitoring the execution of computational tasks. The currently available workflow systems achieve this goal, but differ fundamentally in the way they execute workflows.
Results: We have developed the Conveyor software library, a multitiered generic workflow engine for composition, execution and monitoring of complex workflows. It features an open, extensible system architecture and concurrent program execution to exploit the resources available on modern multicore CPU hardware. It offers the ability to build complex workflows with branches, loops and other control structures. Two example use cases illustrate the application of the versatile Conveyor engine to common bioinformatics problems.
Availability: The Conveyor application, including client and server, is available at http://conveyor.cebitec.uni-bielefeld.de.
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author 2011. Published by Oxford University Press. All rights reserved.
1 INTRODUCTION
Workflows have become an important aspect of bioinformatics in
recent years (e.g. Romano, 2007; Smedley et al., 2008).
Applications like Galaxy (Goecks et al.,
2010), Taverna (Hull et al., 2006), Pegasus (Deelman et al., 2005)
and Kepler (Altintas et al., 2004) offer an easy way to access
local and remote resources and perform automatic analyses to test
hypotheses or process data. In many cases, they have become
a reasonable alternative to writing simple software tools like Perl
scripts, especially for users without an in-depth computer science
background. Libraries like Ruffus (Goodstadt, 2010) add workflow
functionality to programming languages, providing methods to
define and build workflows within one's own applications.
A workflow is built from several linked steps that consume inputs,
process and convert data, and produce results. The simplest
workflows are linear chains of steps that convert input data to the
required output. More complex setups may also include loops,
branches, parallel and conditional processing, reading from various
sources and writing different outputs in various formats.
To create reusable workflow components, a processing step
may be composed of nested processing steps, allowing the user to
build complex pipelines for higher-level analysis.
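The notions of linked steps, linear chains and nested steps can be illustrated with a minimal Python sketch. It is not taken from Conveyor or any of the cited systems; the names `Step`, `chain`, `strip_whitespace`, `to_upper` and `gc_content` are hypothetical and chosen for illustration only. A step is modelled as a function from an input stream to an output stream, and a chain of steps is itself a step, so it can be nested inside a larger workflow.

```python
from typing import Callable, Iterable, Iterator

# Assumption for this sketch: a step maps a stream of items to a stream of items.
Step = Callable[[Iterable], Iterator]


def chain(*steps: Step) -> Step:
    """Link steps into a linear workflow; the result is itself a step."""
    def run(items: Iterable) -> Iterator:
        for step in steps:
            items = step(items)  # output of one step feeds the next
        return iter(items)
    return run


def strip_whitespace(seqs):
    for s in seqs:
        yield s.strip()


def to_upper(seqs):
    for s in seqs:
        yield s.upper()


def gc_content(seqs):
    for s in seqs:
        yield (s.count("G") + s.count("C")) / len(s)


# A reusable composite step built from two nested steps.
normalize = chain(strip_whitespace, to_upper)

# The composite is used like any other step in a larger workflow.
workflow = chain(normalize, gc_content)

results = list(workflow([" acgt\n", "GGCC", "aatt "]))
```

Because `chain` returns a value of the same shape as its arguments, composite steps can be nested to any depth without the enclosing workflow needing to know how they are built.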
Many workflow engines act as wrappers around existing command-line
utilities, or enact and orchestrate existing web services, as
basic modules for their processing steps. As a result, adding new
processing steps by wrapping existing applications often
requires little or no programming effort.
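As a rough illustration of why wrapping requires so little code, the following Python sketch turns an arbitrary command line into a processing step that passes text via standard input and standard output. The helper `command_line_step` is hypothetical, and the wrapped program is only a stand-in (a Python one-liner that upper-cases its input) so that the example does not depend on any particular analysis tool being installed.

```python
import subprocess
import sys
from typing import Callable, List


def command_line_step(argv: List[str]) -> Callable[[str], str]:
    """Wrap an external program as a step: text in on stdin, text out on stdout."""
    def run(text: str) -> str:
        completed = subprocess.run(
            argv, input=text, capture_output=True, text=True, check=True
        )
        return completed.stdout
    return run


# Stand-in for a real analysis tool (an assumption of this sketch):
# a one-liner that upper-cases whatever it reads from stdin.
STAND_IN_TOOL = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

uppercase_step = command_line_step(STAND_IN_TOOL)
fasta_out = uppercase_step(">seq1\nacgt\n")
```

The wrapper knows nothing about the data it forwards; it only moves text, which is also why such steps end up tied to simple text formats.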
Of course, this comes at a price. Passing data between processing
steps depends on a common data format, especially in the case
of distributed processing nodes. Most analysis tools available as
command-line applications or web services consume simple
text formats, e.g. the FASTA format for DNA and amino acid
sequences. These formats are in turn used by the workflow engines
to exchange data between processing nodes. Integrating other
processing nodes or input sources requires explicit data conversion
prior to processing. This often leads to loss of information;
e.g. gene features annotated in EMBL or GenBank entries cannot
retain all their qualifiers once converted to FASTA format. An
analysis done by Wassink et al. (2009) shows that most tasks
used in publicly available Taverna workflows are dedicated to
data conversion. To some extent, this problem is addressed by meta
information attached to types, which allows the definition of type
hierarchies and interfaces. The BioMoby (Wilkinson et al., 2008)
data type management is an example of an ontology-based approach
to data type handling; nonetheless, it requires extra effort by the
developer and/or maintainer. Other attempts to define common data
types, like BioXSD (Kalaš et al., 2010) or the proposal of Seibel
et al. (2006), were made, but none of them has been adopted by
the community yet. The situation is even worse if legacy data from
applications are to be integrated into a workflow. Accessing data
that are, e.g., stored in a relational database or available only
through a local application requires special processing steps;
passing the data between distributed processing nodes may not be
possible at all.
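The loss of information caused by conversion to a simple text format can be made concrete with a small Python sketch. The classes `Feature` and `AnnotatedRecord`, the functions `to_fasta` and `from_fasta`, and the example record are all hypothetical; they merely stand in for an annotated EMBL or GenBank entry and are not real parsers for those formats.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    kind: str                     # e.g. "CDS"
    start: int
    end: int
    qualifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class AnnotatedRecord:
    identifier: str
    sequence: str
    features: List[Feature] = field(default_factory=list)


def to_fasta(record: AnnotatedRecord) -> str:
    """Serialise to FASTA: only an identifier line and the sequence fit."""
    return f">{record.identifier}\n{record.sequence}\n"


def from_fasta(text: str) -> AnnotatedRecord:
    """Read a single-record FASTA string back into the richer type."""
    header, *lines = text.strip().splitlines()
    return AnnotatedRecord(identifier=header[1:], sequence="".join(lines))


# Made-up example entry with one annotated feature.
record = AnnotatedRecord(
    "example1",
    "ATGAAATAG",
    [Feature("CDS", 1, 9, {"gene": "hypA", "product": "hypothetical protein"})],
)

# Passing the record through FASTA, as a text-based step boundary would.
round_tripped = from_fasta(to_fasta(record))
```

After the round trip the sequence is intact, but the feature and its qualifiers are gone, which is the situation a downstream step faces when only FASTA is exchanged between nodes.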
Another problem arises from the nature of web services used in
processing steps. Although they offer an elegant and easy way to
provide and consume useful services, users have to be aware of
the pitfalls of web services if they rely on them for an analytical
workflow. A service may become unavailable without prior notice
from the provider. For plain web services, only the syntax of input and
output data types is defined, for example by using a WSDL file. They
completely lack information about semantics and whether a data type
is compatible with another type used in a different service. Passing
large amounts of data to or from web services adds a noticeable
processing overhead. Last but not least, external web services should
not be used with confidential data, since the exact service provider
and the means of transferring data (unsecured or secured by HTTPS) are
unknown in many workflow systems. When working with
confidential data, all services used in a workflow system should be
deployed within the internal network only.
Workflow systems also attempt to offer an alternative way to
run more complex analyses and thus replace small software tools,
e.g. Perl scripts. Controlling the flow of data in workflows is
essential for these applications. This includes branching the flow
to apply alternative processing based on intermediate results, and
enabling or disabling complete processing branches so that
workflows can be reused.
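Both kinds of flow control mentioned here, branching on intermediate results and switching whole branches on or off, can be sketched in a few lines of Python. The sketch is illustrative only and does not reflect the design of any of the systems discussed; `branch`, `Workflow`, `add_branch`, `set_enabled` and the length threshold of 6 are assumptions made for the example.

```python
from typing import Any, Callable, Dict

Step = Callable[[Any], Any]


def branch(predicate: Callable[[Any], bool], if_true: Step, if_false: Step) -> Step:
    """Route an item to one of two steps based on an intermediate result."""
    def run(item: Any) -> Any:
        return if_true(item) if predicate(item) else if_false(item)
    return run


class Workflow:
    """Holds named processing branches that can be enabled or disabled."""

    def __init__(self) -> None:
        self._branches: Dict[str, Step] = {}
        self._enabled: Dict[str, bool] = {}

    def add_branch(self, name: str, step: Step, enabled: bool = True) -> None:
        self._branches[name] = step
        self._enabled[name] = enabled

    def set_enabled(self, name: str, enabled: bool) -> None:
        self._enabled[name] = enabled

    def run(self, item: Any) -> Dict[str, Any]:
        # Only enabled branches are executed; disabled ones are skipped entirely.
        return {
            name: step(item)
            for name, step in self._branches.items()
            if self._enabled[name]
        }


def mark_candidate(seq: str):
    return ("coding candidate", seq)


def mark_short(seq: str):
    return ("too short", seq)


# Conditional processing: the threshold of 6 is arbitrary.
classify = branch(lambda seq: len(seq) >= 6, mark_candidate, mark_short)

workflow = Workflow()
workflow.add_branch("classify", classify)
workflow.add_branch("length", len)

full = workflow.run("ATGAAATAG")

# Reuse the same workflow with one branch switched off.
workflow.set_enabled("length", False)
reduced = workflow.run("ATG")
```

Keeping the enabled flag separate from the branch definition lets the same workflow be reused for different analyses without being rebuilt.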
With high-throughput methods producing more and more data,
using web service or grid-based solutions introduces additional
bottlenecks, and service providers will likely have difficulty
providing the necessary resources. A possible solution could be
the integration of local compute infrastructures into the workfl (...truncated)