Conveyor: a workflow engine for bioinformatic analyses

https://bioinformatics.oxfordjournals.org/content/27/7/903.full.pdf

Burkhard Linke [1], Robert Giegerich [0], Alexander Goesmann [1]

Associate Editor: Alex Bateman

[0] Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany
[1] Bioinformatics Resource Facility, Center for Biotechnology

Motivation: The rapidly increasing amounts of data available from new high-throughput methods have made data processing without automated pipelines infeasible. As several publications have pointed out, integrating data and analytic resources into workflow systems solves this problem and simplifies data analysis. Various applications for defining and running workflows in bioinformatics have been proposed and published, e.g. Galaxy, Mobyle, Taverna, Pegasus and Kepler. One of the main aims of such workflow systems is to let scientists focus on analysing their datasets instead of taking care of data management, job management or monitoring the execution of computational tasks. The currently available workflow systems achieve this goal, but differ fundamentally in how they execute workflows.

Results: We have developed the Conveyor software library, a multitiered generic workflow engine for composition, execution and monitoring of complex workflows. It features an open, extensible system architecture and concurrent program execution to exploit the resources available on modern multicore CPU hardware. It offers the ability to build complex workflows with branches, loops and other control structures. Two example use cases illustrate the application of the versatile Conveyor engine to common bioinformatics problems.

Availability: The Conveyor application, including client and server, is available at http://conveyor.cebitec.uni-bielefeld.de.

Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author 2011. Published by Oxford University Press. All rights reserved.

1 INTRODUCTION

Workflows have become an important aspect of bioinformatics in recent years (e.g. Romano, 2007; Smedley et al., 2008). Applications like Galaxy (Goecks et al., 2010), Taverna (Hull et al., 2006), Pegasus (Deelman et al., 2005) and Kepler (Altintas et al., 2004) offer an easy way to access local and remote resources and to perform automatic analyses that test hypotheses or process data. In many cases, they have become a reasonable alternative to writing simple software tools such as Perl scripts, especially for users without an in-depth computer science background. Libraries like Ruffus (Goodstadt, 2010) add workflow functionality to programming languages, providing methods to define and build workflows within one's own applications.

A workflow is built from several linked steps that consume inputs, process and convert data, and produce results. The simplest workflows are linear chains of steps that convert input data to the required output. More complex setups may also include loops, branches, parallel and conditional processing, reading from various sources and writing different outputs in various formats. To create reusable workflow components, a processing step may be composed of nested processing steps, allowing the user to build complex pipelines for higher level analysis.

Many workflow engines act as wrappers around existing command line utilities, or enact and orchestrate existing web services, as the basic modules for their processing steps. As a result, adding new processing steps by wrapping existing applications often requires little or no programming effort. Of course, this comes at a price. Passing data between processing steps depends on a common data format, especially in the case of distributed processing nodes. Most analysis tools available as command line applications or web services consume simple text formats, e.g. the FASTA format for DNA and amino acid sequences.
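
The notion of a workflow as a chain of linked steps, with nested steps as reusable components, can be sketched in a few lines of Python. This is only an illustration of the general concept under simplifying assumptions, not Conveyor's API or that of any system cited above; the names `Record`, `Step`, `compose`, `min_length`, `upper_case` and `tag` are hypothetical, and the FASTA handling is deliberately minimal.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical in-memory record: (header, sequence).
Record = Tuple[str, str]
# A processing step consumes a stream of records and produces a new stream.
Step = Callable[[Iterable[Record]], Iterator[Record]]


def parse_fasta(text: str) -> Iterator[Record]:
    """Input step: turn FASTA text into (header, sequence) records."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def upper_case(records: Iterable[Record]) -> Iterator[Record]:
    """Processing step: normalise sequences to upper case."""
    return ((h, s.upper()) for h, s in records)


def min_length(n: int) -> Step:
    """Factory for a filter step that keeps sequences of at least n residues."""
    def step(records: Iterable[Record]) -> Iterator[Record]:
        return (r for r in records if len(r[1]) >= n)
    return step


def tag(label: str) -> Step:
    """Factory for a step that appends a label to every header."""
    def step(records: Iterable[Record]) -> Iterator[Record]:
        return ((f"{h} {label}", s) for h, s in records)
    return step


def compose(*steps: Step) -> Step:
    """Link steps into a linear chain; the result is itself a step."""
    def composite(records: Iterable[Record]) -> Iterator[Record]:
        return reduce(lambda data, step: step(data), steps, records)
    return composite


def to_fasta(records: Iterable[Record]) -> str:
    """Output step: serialise records back to FASTA text."""
    return "".join(f">{h}\n{s}\n" for h, s in records)


# A reusable nested component, embedded in a larger linear workflow.
clean = compose(upper_case, min_length(5))
workflow = compose(clean, tag("filtered"))

text = ">a\nacgt\n>b\nacgtac\ngt\n"
result = to_fasta(workflow(parse_fasta(text)))
```

Because `compose` returns an ordinary step, `clean` can be reused inside any other chain. Here `result` contains only record `b`, upper-cased and tagged, since record `a` is shorter than five residues.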
Workflow engines in turn use these simple text formats to exchange data between processing nodes. Integrating other processing nodes or input sources requires explicit data conversion prior to processing. This often leads to loss of information; e.g. gene features annotated in EMBL or GenBank entries cannot retain all their qualifiers after conversion to FASTA format. An analysis by Wassink et al. (2009) shows that most tasks used in publicly available Taverna workflows are dedicated to data conversion. To some extent, this problem is solved by meta information provided with types, which allows the definition of type hierarchies and interfaces. The BioMoby (Wilkinson et al., 2008) data type management is an example of an ontology-based approach to data type handling; nonetheless, it requires extra effort from the developer and/or maintainer. Other attempts to define common data types, such as BioXSD (Kalaš et al., 2010) or the proposal of Seibel et al. (2006), have been made, but none of them has been widely adopted by the community yet.

The situation is even worse if legacy data from applications are to be integrated into a workflow. Accessing data that are, e.g., stored in a relational database or available through a local application only requires special processing steps; passing the data between distributed processing nodes may not be possible at all.

Another problem arises from the nature of the web services used in processing steps. Although they offer an elegant and easy way to provide and consume useful services, users who rely on them for an analytical workflow have to be aware of their pitfalls. A service may become unavailable without prior notice from the provider. For plain web services, only the syntax of input and output data types is defined, for example in a WSDL file. They completely lack information about semantics and about whether a data type is compatible with another type used in a different service.
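
The gap between syntactic and semantic type information can be made concrete with a small Python sketch. It is a hypothetical illustration, not taken from Conveyor, WSDL or BioMoby: the two "services" are plain local functions, the protein sequence and accession are made up, and `can_link` is an assumed helper that compares type annotations before two steps are connected.

```python
from dataclasses import dataclass
from typing import get_type_hints


def reverse_complement(seq: str) -> str:
    """Stand-in for a service that expects DNA but only declares 'string'."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs.get(base, "N") for base in reversed(seq))


def fetch_protein(accession: str) -> str:
    """Stand-in for a service returning a protein, also declared as 'string'."""
    return "MKTAYIAK"  # made-up sequence for illustration


@dataclass(frozen=True)
class DnaSequence:
    residues: str


@dataclass(frozen=True)
class ProteinSequence:
    residues: str


def typed_fetch_protein(accession: str) -> ProteinSequence:
    """Same stand-in service, but the result carries its meaning in its type."""
    return ProteinSequence(fetch_protein(accession))


def typed_reverse_complement(seq: DnaSequence) -> DnaSequence:
    """Same operation, but the signature states that DNA is required."""
    return DnaSequence(reverse_complement(seq.residues))


def can_link(producer, consumer) -> bool:
    """Check whether the producer's output type fits the consumer's first input."""
    produced = get_type_hints(producer)["return"]
    consumer_hints = get_type_hints(consumer)
    expected = next(t for name, t in consumer_hints.items() if name != "return")
    return issubclass(produced, expected)


# Syntax only: both sides are strings, so the link is accepted and the
# protein is silently "reverse complemented" into meaningless output.
syntactic_ok = can_link(fetch_protein, reverse_complement)
garbage = reverse_complement(fetch_protein("P00000"))

# With semantic types, the same mismatch is detected before anything runs.
semantic_ok = can_link(typed_fetch_protein, typed_reverse_complement)
```

In this sketch `syntactic_ok` is `True` although the connection is meaningless, while `semantic_ok` is `False`. Ontology-based approaches such as BioMoby aim to provide this kind of information across services, at the cost of the extra maintenance effort noted above.
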
Passing large amounts of data to or from web services adds a noticeable processing overhead. Last but not least, external web services should not be used with confidential data, since in many workflow systems the exact service provider and the means of transferring data (unsecured or secured by HTTPS) are unknown. In a scenario involving confidential data, all services used in a workflow system should be deployed in the internal network only.

Workflow systems also attempt to offer an alternative way to run more complex analyses and thus to replace small software tools, e.g. Perl scripts. Controlling the flow of data in workflows is essential for these applications. This includes branching the flow, using alternative processing based on intermediate results, and enabling or disabling complete processing branches so that workflows can be reused.

With high-throughput methods producing more and more data, web service or grid-based solutions introduce additional bottlenecks, and service providers will likely struggle to provide the necessary resources. A possible solution could be the integration of local compute infrastructures into the workfl (...truncated)