Conveyor: a workflow engine for bioinformatic analyses


BIOINFORMATICS ORIGINAL PAPER
Genome analysis
Vol. 27 no. 7 2011, pages 903–911
doi:10.1093/bioinformatics/btr040
Advance Access publication January 28, 2011

Burkhard Linke1,∗, Robert Giegerich2 and Alexander Goesmann1
1 Bioinformatics Resource Facility, Center for Biotechnology and 2 Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany
∗ To whom correspondence should be addressed.
Associate Editor: Alex Bateman
Received on October 26, 2010; revised on January 7, 2011; accepted on January 19, 2011

1 INTRODUCTION

Workflows have become an important aspect of bioinformatics in recent years (e.g. Romano, 2007 and Smedley et al., 2008). Applications like Galaxy (Goecks et al., 2010), Taverna (Hull et al., 2006), Pegasus (Deelman et al., 2005) and Kepler (Altintas et al., 2004) offer an easy way to access local and remote resources and to perform automatic analyses that test hypotheses or process data. In many cases, they have become a reasonable alternative to writing simple software tools such as Perl scripts, especially for users without an in-depth computer science background. Libraries like Ruffus (Goodstadt, 2010) add workflow functionality to programming languages, providing methods to define and build workflows within one's own applications.

A workflow is built from several linked steps that consume inputs, process and convert data and produce results. The simplest workflows are linear chains of steps that convert input data to the required output. More complex setups may also include loops, branches, parallel and conditional processing, reading from various sources and writing different outputs in various formats. To create reusable workflow components, a processing step may itself be composed of nested processing steps, allowing the user to build complex pipelines for higher level analysis.
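The notion of a linear chain of steps, and of a step composed of nested steps, can be sketched in a few lines of Python. This is only an illustration of the general idea, not the API of Conveyor or of any engine cited above. The names `compose`, `run_workflow`, `clean`, `reverse_complement` and `gc_content` are hypothetical and chosen for this example.

```python
from typing import Callable, Iterable, List

# Hypothetical type alias: a step maps one data item to another.
Step = Callable[[object], object]


def compose(*steps: Step) -> Step:
    """Nest several steps into a single reusable step."""
    def run(item):
        for step in steps:
            item = step(item)
        return item
    return run


def run_workflow(inputs: Iterable[object], steps: List[Step]) -> List[object]:
    """Run a linear chain of steps over every input item."""
    pipeline = compose(*steps)
    return [pipeline(item) for item in inputs]


def clean(seq: str) -> str:
    """Remove whitespace and convert to upper case."""
    return "".join(seq.split()).upper()


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# A nested step built from two simpler steps, reused inside a longer chain.
normalise = compose(clean, reverse_complement)
results = run_workflow(["acg t", "GGCC"], [normalise, gc_content])
```

Because `compose` returns an ordinary step, a nested step such as `normalise` can be placed in a chain exactly like a primitive one, which is the property that makes workflow components reusable.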
Many workflow engines act as wrappers around existing command line utilities, or enact and orchestrate existing web services as the basic modules for their processing steps. As a result, adding new processing steps by wrapping existing applications often requires little or no programming effort. Of course, this comes at a price. Passing data between processing steps depends on a common data format, especially in the case of distributed processing nodes. Most analysis tools available as command line applications or web services consume simple text formats, e.g. the FASTA format for DNA and amino acid sequences. These formats are in turn used by the workflow engines to exchange data between processing nodes. Integrating other processing nodes or input sources requires explicit data conversion prior to processing. This often leads to a loss of information; e.g. gene features annotated in EMBL or GenBank entries cannot retain all qualifiers after conversion to FASTA format. An analysis by Wassink et al. (2009) shows that most tasks used in publicly available Taverna workflows are dedicated to data conversion. To some extent, this problem is addressed by meta information attached to types, which allows type hierarchies and interfaces to be defined. The BioMoby (Wilkinson et al., 2008) data type management is an example of an ontology-based approach to data type handling; nonetheless, it requires extra effort from the developer and/or maintainer. Other attempts to define common data types, such as BioXSD (Kalaš et al., 2010) or the approach of Seibel et al. (2006), have been made, but none of them has been successfully adopted by the community yet. The situation is even worse if legacy data from applications are to be integrated into a workflow. Accessing data that are, for example, stored in a relational database or available only through a local application requires special processing steps; passing the data between distributed processing nodes may not be possible at all.
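The information loss described above for conversion from annotated entries to FASTA can be made concrete with a small round trip. The sketch below uses a deliberately simplified, hypothetical record model (`AnnotatedRecord`, `Feature`) rather than a real EMBL or GenBank parser, and the entry `X00001` with its qualifiers is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    """Simplified stand-in for an annotated feature (hypothetical model)."""
    kind: str
    start: int
    end: int
    qualifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class AnnotatedRecord:
    """Simplified stand-in for an EMBL/GenBank-style entry."""
    identifier: str
    description: str
    sequence: str
    features: List[Feature] = field(default_factory=list)


def to_fasta(record: AnnotatedRecord) -> str:
    """FASTA has room only for a header line and the sequence."""
    return f">{record.identifier} {record.description}\n{record.sequence}\n"


def from_fasta(text: str) -> AnnotatedRecord:
    """Parse a single-entry FASTA string back into a record."""
    header, *lines = text.strip().splitlines()
    identifier, _, description = header[1:].partition(" ")
    return AnnotatedRecord(identifier, description, "".join(lines))


# Made-up example entry with one annotated coding sequence.
record = AnnotatedRecord(
    "X00001",
    "example entry",
    "ATGAAATAG",
    [Feature("CDS", 1, 9, {"gene": "exA", "product": "hypothetical protein"})],
)

# The sequence survives the round trip; the features and qualifiers do not.
round_tripped = from_fasta(to_fasta(record))
```

After the round trip, `round_tripped.sequence` equals the original sequence while `round_tripped.features` is empty. A downstream step that needs the gene annotation would have to obtain it from somewhere else.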
Another problem arises from the nature of web services used in processing steps. Although they offer an elegant and easy way to provide and consume useful services, users have to be aware of the pitfalls of web services if they rely on them for an analytical workflow. A service may become unavailable without prior notice

© The Author 2011. Published by Oxford University Press. All rights reserved.

ABSTRACT

Motivation: The rapidly increasing amounts of data available from new high-throughput methods have made data processing without automated pipelines infeasible. As several publications have pointed out, integrating data and analytic resources into workflow systems provides a solution to this problem and simplifies the task of data analysis. Various applications for defining and running workflows in the field of bioinformatics have been proposed and published, e.g. Galaxy, Mobyle, Taverna, Pegasus or Kepler. One of the main aims of such workflow systems is to enable scientists to focus on analysing their datasets instead of taking care of data management, job management or monitoring the execution of computational tasks. The currently available workflow systems achieve this goal, but differ fundamentally in how they execute workflows.

Results: We have developed the Conveyor software library, a multitiered generic workflow engine for composition, execution and monitoring of complex workflows. It features an open, extensible system architecture and concurrent program execution to exploit the resources available on modern multicore CPU hardware. It offers the ability to build complex workflows with branches, loops and other control structures. Two example use cases illustrate the application of the versatile Conveyor engine to common bioinformatics problems.

Availability: The Conveyor application, including client and server, is available at http://conveyor.cebitec.uni-bielefeld.de.
Supplementary information: Supplementary data are available at Bioinformatics online.

2 APPROACH AND DESIGN

To overcome these limitations, we have developed Conveyor as a novel software library offering its functionality to other applications. A client-server setup based on the library is used to separate the design of a workflow from its execution. Additionally, a command line tool allows workflows to be executed in batch mode. As a software library, Conveyor may also be included in other applications, providing a data processing layer that is easy to update and maintain. Workflows in Conveyor are represented by directed graphs, composed of nodes for processing and input/output steps, and edges moving data between nodes. This design supports simple pipelines built from concatenated processing steps as well as complex graphs with branches, loops and parallel flows of data. Data items are passed one by one between nodes, creating a stream of input data. This model enables nodes to process data in parallel, using multiple CPU cores if available. For a first impressio (...truncated)