A graph-based approach for designing extensible pipelines
Rodrigues et al. BMC Bioinformatics 2012, 13:163
http://www.biomedcentral.com/1471-2105/13/163
METHODOLOGY ARTICLE
Open Access
Maíra R Rodrigues*, Wagner CS Magalhães, Moara Machado and Eduardo Tarazona-Santos*
Abstract
Background: In bioinformatics, it is important to build extensible and low-maintenance systems that are able to deal
with the new tools and data formats that are constantly being developed. The traditional and simplest
implementation of pipelines involves hardcoding the execution steps into programs or scripts. This approach can lead
to problems when a pipeline is expanding because the incorporation of new tools is often error prone and time
consuming. Current approaches to pipeline development such as workflow management systems focus on analysis
tasks that are systematically repeated without significant changes in their course of execution, such as genome
annotation. However, pipeline composition needs to be more dynamic when each execution requires a
different combination of steps.
Results: We propose a graph-based approach to implement extensible and low-maintenance pipelines that is
suitable for pipeline applications with multiple functionalities that require different combinations of steps in each
execution. Here pipelines are composed automatically by compiling a specialised set of tools on demand, depending
on the functionality required, instead of specifying every sequence of tools in advance. We represent the connectivity
of pipeline components with a directed graph in which components are the graph edges, their inputs and outputs are
the graph nodes, and the paths through the graph are pipelines. To that end, we developed special data structures and
a pipeline system algorithm. We demonstrate the applicability of our approach by implementing a format conversion
pipeline for the fields of population genetics and genetic epidemiology, but our approach is also helpful in other fields
where multiple software tools are needed to perform comprehensive analyses, such as gene expression and
proteomics analyses. The project code, documentation and the Java executables are available under an open source
license at http://code.google.com/p/dynamic-pipeline. The system has been tested on Linux and Windows platforms.
Conclusions: Our graph-based approach enables the automatic creation of pipelines by compiling a specialised set
of tools on demand, depending on the functionality required. It also allows the implementation of extensible and
low-maintenance pipelines and contributes towards consolidating openness and collaboration in bioinformatics
systems. It is targeted at pipeline developers and is suited for implementing applications with sequential execution
steps and combined functionalities. In the format conversion application, the automatic combination of conversion
tools increased both the number of possible conversions available to the user and the extensibility of the system to
allow for future updates with new file formats.
*Correspondence: Departamento de Biologia Geral, Universidade Federal de Minas Gerais, Av.
Antonio Carlos 6627, Pampulha, Caixa Postal 486, 31270-910, Belo Horizonte,
Brazil
© 2012 Rodrigues et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
Background
In silico experiments are performed using a set of computer analysis and processing tools that are executed in a
specific order. To automate the execution of these tools,
they are usually organised in the form of a pipeline, so that
the output of one tool is automatically passed on as the
input of the next tool. In such a process, it is helpful to
have tools that are designed in a way that guarantees the
interoperability of all execution steps. Interoperability ensures that the output of a tool can be processed by the
subsequent tool even if the output format of the former
does not match the input format of the latter. Aside from
enabling task automation and data flow control, pipelines
are particularly advantageous when combining different tools increases the number of operations offered to the
user. For example, consider four
analysis tools: Blast [1], which finds sequence similarities
for a DNA sequence; CLUSTALW [2], which aligns a set
of sequences from different species; PHYLIP [3], which
finds phylogenetic relationships from sequences of different species; and PAML [4,5], which infers sites under
positive selection from a set of closely related sequences.
In addition to using each tool individually, we can combine Blast, CLUSTALW and PHYLIP in a pipeline to find
possible phylogenetic relationships for a DNA sequence.
Alternatively, we can compose a pipeline using Blast,
CLUSTALW and PAML to infer sites under positive selection. Because the output of Blast is not compatible with
the input of CLUSTALW, additional reformatting by ad
hoc scripts is required to ensure the interoperability of the
tools in the pipelines.
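As a rough illustration of how such pipelines could be composed automatically rather than scripted by hand, the following Java sketch applies the representation described in the abstract (tools as directed edges, their inputs and outputs as nodes, pipelines as paths) to the four tools above. It is not the authors' implementation: the class name PipelineGraphSketch, the data-type labels such as "blast-hits", the single "reformat-script" edge and the use of breadth-first search are all assumptions made for illustration. Any further reformatting that may be needed between CLUSTALW and PHYLIP or PAML is omitted for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; class, tool and data-type names are hypothetical.
class PipelineGraphSketch {

    // A directed edge: a tool that turns one data type (node) into another.
    static final class Tool {
        final String name, input, output;
        Tool(String name, String input, String output) {
            this.name = name;
            this.input = input;
            this.output = output;
        }
    }

    // Adjacency list: data type -> tools that accept it as input.
    private final Map<String, List<Tool>> toolsByInput = new HashMap<>();

    void addTool(String name, String input, String output) {
        toolsByInput.computeIfAbsent(input, k -> new ArrayList<>())
                    .add(new Tool(name, input, output));
    }

    // Breadth-first search from source to target. Returns the tool names on a
    // shortest path, or an empty list if target is unreachable or equals source.
    List<String> findPipeline(String source, String target) {
        Map<String, Tool> reachedBy = new HashMap<>(); // node -> edge used to reach it
        Deque<String> queue = new ArrayDeque<>();
        reachedBy.put(source, null); // the source is reached by no tool
        queue.add(source);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            if (node.equals(target)) {
                break;
            }
            for (Tool t : toolsByInput.getOrDefault(node, Collections.emptyList())) {
                if (!reachedBy.containsKey(t.output)) {
                    reachedBy.put(t.output, t);
                    queue.add(t.output);
                }
            }
        }
        // Walk back from the target along the recorded edges, then reverse.
        List<String> steps = new ArrayList<>();
        for (Tool t = reachedBy.get(target); t != null; t = reachedBy.get(t.input)) {
            steps.add(t.name);
        }
        Collections.reverse(steps);
        return steps;
    }

    // The four tools from the text plus one assumed reformatting script.
    static PipelineGraphSketch example() {
        PipelineGraphSketch g = new PipelineGraphSketch();
        g.addTool("Blast", "dna-sequence", "blast-hits");
        g.addTool("reformat-script", "blast-hits", "unaligned-sequences");
        g.addTool("CLUSTALW", "unaligned-sequences", "alignment");
        g.addTool("PHYLIP", "alignment", "phylogeny");
        g.addTool("PAML", "alignment", "selected-sites");
        return g;
    }

    public static void main(String[] args) {
        PipelineGraphSketch g = example();
        System.out.println(g.findPipeline("dna-sequence", "phylogeny"));
        System.out.println(g.findPipeline("dna-sequence", "selected-sites"));
    }
}
```

Running main prints [Blast, reformat-script, CLUSTALW, PHYLIP] and [Blast, reformat-script, CLUSTALW, PAML]. In this sketch, registering a further tool is a single addTool call, and no existing pipeline definition has to be edited.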
The traditional and simplest implementation of
pipelines involves hardcoding the execution steps into
programs or scripts. This approach leads to problems
when pipelines need to be expanded, because the addition
of new tools to such a pipeline is error prone and time
consuming. An experienced programmer is needed to
change the hard-coded steps of such pipelines and add
new tools without introducing bugs. These problems are a major concern not only
for bioinformatics laboratories that want to continuously
update their pipelines with new software developments,
but also for those who want to consolidate open and
cooperative systems [6,7].
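The maintenance cost described above can be seen in a minimal Java sketch of a hard-coded pipeline. The class HardcodedPipelineSketch and all of its methods are hypothetical stand-ins that only wrap strings; real code would invoke the external programs. The point is only that each sequence of steps is fixed in the source.

```java
// Illustrative sketch only: every method is a hypothetical stand-in that
// labels its input instead of invoking the real program.
class HardcodedPipelineSketch {
    static String runBlast(String query)        { return "hits(" + query + ")"; }
    static String reformatHits(String hits)     { return "fasta(" + hits + ")"; }
    static String runClustalw(String sequences) { return "alignment(" + sequences + ")"; }
    static String runPhylip(String alignment)   { return "tree(" + alignment + ")"; }
    static String runPaml(String alignment)     { return "sites(" + alignment + ")"; }

    // Each functionality is a fixed chain of calls written out by hand.
    static String phylogenyPipeline(String query) {
        return runPhylip(runClustalw(reformatHits(runBlast(query))));
    }

    // A second functionality repeats most of the same chain.
    static String selectionPipeline(String query) {
        return runPaml(runClustalw(reformatHits(runBlast(query))));
    }

    public static void main(String[] args) {
        System.out.println(phylogenyPipeline("query"));
        System.out.println(selectionPipeline("query"));
    }
}
```

Supporting a new tool or input format here means writing and testing another such method, or editing the existing ones, which is the error-prone step the text refers to.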
An additional level of flexibility may be achieved with
workflow management systems such as Taverna [8],
Galaxy [9] and Pegasus [10], which are well suited for analysis tasks that are systematically repeated without changes
in the course of execution, such as genome annotation
[11,12] and the tasks registered at the myExperiment
website [13]. Some workflow management systems, such as Kepler
[14] and others [15], also
support dynamic execution of workflows. Their dynamism occurs during the
mapping and execution phases of the workflow’s life cycle
[15], mainly for the instantiation of workflow components
based on a high-level workflow description and for data type
compatibility verification. In these systems, the composition of the high-level workflow description is usually
left to the user (...truncated)