A graph-based approach for designing extensible pipelines


Rodrigues et al. BMC Bioinformatics 2012, 13:163
http://www.biomedcentral.com/1471-2105/13/163

METHODOLOGY ARTICLE, Open Access

Maíra R Rodrigues*, Wagner CS Magalhães, Moara Machado and Eduardo Tarazona-Santos*

Abstract

Background: In bioinformatics, it is important to build extensible and low-maintenance systems that are able to deal with the new tools and data formats that are constantly being developed. The traditional and simplest implementation of pipelines involves hardcoding the execution steps into programs or scripts. This approach can lead to problems when a pipeline is expanding because the incorporation of new tools is often error prone and time consuming. Current approaches to pipeline development such as workflow management systems focus on analysis tasks that are systematically repeated without significant changes in their course of execution, such as genome annotation. However, more dynamism in the pipeline composition is necessary when each execution requires a different combination of steps.

Results: We propose a graph-based approach to implement extensible and low-maintenance pipelines that is suitable for pipeline applications with multiple functionalities that require different combinations of steps in each execution. Here pipelines are composed automatically by compiling a specialised set of tools on demand, depending on the functionality required, instead of specifying every sequence of tools in advance. We represent the connectivity of pipeline components with a directed graph in which components are the graph edges, their inputs and outputs are the graph nodes, and the paths through the graph are pipelines. To that end, we developed special data structures and a pipeline system algorithm.
We demonstrate the applicability of our approach by implementing a format conversion pipeline for the fields of population genetics and genetic epidemiology, but our approach is also helpful in other fields where the use of multiple software tools is necessary to perform comprehensive analyses, such as gene expression and proteomics analyses. The project code, documentation and the Java executables are available under an open source license at http://code.google.com/p/dynamic-pipeline. The system has been tested on Linux and Windows platforms.

Conclusions: Our graph-based approach enables the automatic creation of pipelines by compiling a specialised set of tools on demand, depending on the functionality required. It also allows the implementation of extensible and low-maintenance pipelines and contributes towards consolidating openness and collaboration in bioinformatics systems. It is targeted at pipeline developers and is suited for implementing applications with sequential execution steps and combined functionalities. In the format conversion application, the automatic combination of conversion tools increased both the number of possible conversions available to the user and the extensibility of the system to allow for future updates with new file formats.

*Correspondence: Departamento de Biologia Geral, Universidade Federal de Minas Gerais, Av. Antonio Carlos 6627, Pampulha, Caixa Postal 486, 31270-910, Belo Horizonte, Brazil

Background

In silico experiments are performed using a set of computer analysis and processing tools that are executed in a specific order. To automate the execution of these tools, they are usually organised in the form of a pipeline, so that the output of one tool is automatically passed on as the input of the next tool. In such a process, it is helpful to have tools that are designed in a way that guarantees the interoperability of all execution steps.
The interoperability ensures that the output of a tool is processed by the subsequent tool even if the output format of the former does not match the input format of the latter.

Aside from enabling task automation and data flow control, pipelines may be particularly advantageous if they increase the number of operations offered to the user by combining different tools. For example, consider four analysis tools: Blast [1], which finds sequence similarities for a DNA sequence; CLUSTALW [2], which aligns a set of sequences from different species; PHYLIP [3], which finds phylogenetic relationships from sequences of different species; and PAML [4,5], which infers sites under positive selection from a set of closely related sequences. In addition to using each tool for its individual functionality, we can combine Blast, CLUSTALW and PHYLIP in a pipeline to find possible phylogenetic relationships for a DNA sequence. Alternatively, we can compose a pipeline using Blast, CLUSTALW and PAML to infer sites under positive selection. Because the output of Blast is not compatible with the input of CLUSTALW, additional reformatting by ad hoc scripts is required to ensure the interoperability of the tools in the pipelines.

© 2012 Rodrigues et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The traditional and simplest implementation of pipelines involves hardcoding the execution steps into programs or scripts. This approach leads to problems when pipelines need to be expanded, because the addition of new tools to such a pipeline is error prone and time consuming.
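
To make the contrast with hardcoding concrete, the following minimal Python sketch treats the four tools of the example as edges of a directed graph whose nodes are data formats, so that a pipeline is a path between two formats. It is an illustration under stated assumptions, not the authors' implementation: the format labels, the `reformat_script` edge, the function names and the use of breadth-first search are all hypothetical choices made for this sketch.

```python
from collections import deque

# Each edge is (tool, input_format, output_format). The format names and the
# "reformat_script" edge are hypothetical labels chosen for this sketch; they
# are not identifiers taken from the paper or from the tools themselves.
TOOLS = [
    ("Blast", "dna_sequence", "blast_hits"),
    ("reformat_script", "blast_hits", "sequence_set"),
    ("CLUSTALW", "sequence_set", "alignment"),
    ("PHYLIP", "alignment", "phylogeny"),
    ("PAML", "alignment", "selection_sites"),
]


def build_graph(tools):
    """Map each input format to the (tool, output_format) edges leaving it."""
    graph = {}
    for tool, src, dst in tools:
        graph.setdefault(src, []).append((tool, dst))
    return graph


def compose_pipeline(tools, start, goal):
    """Return the shortest list of tools leading from start to goal, or None.

    Breadth-first search is an assumption of this sketch; the source text
    does not specify which search strategy the authors' algorithm uses.
    """
    graph = build_graph(tools)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        fmt, path = queue.popleft()
        if fmt == goal:
            return path
        for tool, nxt in graph.get(fmt, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [tool]))
    return None  # no chain of tools connects the two formats


if __name__ == "__main__":
    print(compose_pipeline(TOOLS, "dna_sequence", "phylogeny"))
    print(compose_pipeline(TOOLS, "dna_sequence", "selection_sites"))
```

In this sketch, adding a tool means appending one tuple to `TOOLS`; every pipeline that can use the new edge becomes available without editing any existing sequence of steps, which is the kind of extensibility that hard-coded scripts lack.
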
An experienced programmer is needed to change the hard-coded steps of such pipelines to include new tools in the pipeline while maintaining bug-free functioning. These problems are a major concern not only for bioinformatics laboratories that want to continuously update their pipelines with new software developments, but also for those who want to consolidate open and cooperative systems [6,7].

An additional level of flexibility may be achieved by workflow management systems such as Taverna [8], Galaxy [9] and Pegasus [10] that are well suited for analysis tasks that are systematically repeated without changes in the course of execution, such as genome annotation [11,12] and the tasks registered at the myExperiment website [13]. Some workflow management systems also support dynamic execution of workflows, such as Kepler [14] and others [15], where dynamism occurs during the mapping and execution phases of the workflow’s life cycle [15] mainly for the instantiation of workflow components based on a high-level workflow description and data type compatibility verification. In these systems, the composition of the high-level workflow description is usually left to the user (...truncated)