Performing statistical analyses on quantitative data in Taverna workflows: An example using R and maxdBrowse to identify differentially-expressed genes from microarray data (pdf)

Article PDF cannot be displayed. You can download it here:

http://www.biomedcentral.com/content/pdf/1471-2105-9-334.pdf

Performing statistical analyses on quantitative data in Taverna workflows: An example using R and maxdBrowse to identify differentially-expressed genes from microarray data

BMC Bioinformatics Software Performing statistical analyses on quantitative data in Taverna workflows: An example using R and maxdBrowse to identify differentially-expressed genes from microarray data Peter Li 2 Juan I Castrillo 1 Giles Velarde 2 Ingo Wassink 0 Stian Soiland- Reyes 5 Stuart Owen 5 David Withers 5 Tom Oinn 4 Matthew R Pocock 3 Carole A Goble 5 Stephen G Oliver 1 Douglas B Kell 2 0 Human Media Interaction group, Electrical Engineering , Mathematics and Computer Science , University of Twente , Drienerlolaan 5, 7500 AE, Enschede , The Netherlands 1 Department of Biochemistry , Sanger Building , University of Cambridge , 80 Tennis Court Road, Cambridge, CB2 1GA , UK 2 Manchester Centre for Integrative Systems Biology and School of Chemistry , Manchester Interdisciplinary Biocentre , University of Manchester , 131 Princess St, Manchester, M1 7DN , UK 3 School of Computing Science, University of Newcastle , NE1 7RU , UK 4 EMBL European Bioinformatics Institute , Hinxton, Cambridge, CB10 1SD , UK 5 School of Computer Science , Kilburn Building , University of Manchester , Oxford Road, Manchester, M13 9PL , UK Background: There has been a dramatic increase in the amount of quantitative data derived from the measurement of changes at different levels of biological complexity during the post-genomic era. However, there are a number of issues associated with the use of computational tools employed for the analysis of such data. For example, computational tools such as R and MATLAB require prior knowledge of their programming languages in order to implement statistical analyses on data. Combining two or more tools in an analysis may also be problematic since data may have to be manually copied and pasted between separate user interfaces for each tool. Furthermore, this transfer of data may require a reconciliation step in order for there to be interoperability between computational tools. Results: Developments in the Taverna workflow system have enabled pipelines to be constructed and enacted for generic and ad hoc analyses of quantitative data. Here, we present an example of such a workflow involving the statistical identification of differentially-expressed genes from microarray data followed by the annotation of their relationships to cellular processes. This workflow makes use of customised maxdBrowse web services, a system that allows Taverna to query and retrieve gene expression data from the maxdLoad2 microarray database. These data are then analysed by R to identify differentially-expressed genes using the Taverna RShell processor which has been developed for invoking this tool when it has been deployed as a service using the RServe library. In addition, the workflow uses Beanshell scripts to reconcile mismatches of data between services as well as to implement a form of user interaction for selecting subsets of microarray data for analysis as part of the workflow execution. A new plugin system in the Taverna - software architecture is demonstrated by the use of renderers for displaying PDF files and CSV formatted data within the Taverna workbench. Conclusion: Taverna can be used by data analysis experts as a generic tool for composing ad hoc analyses of quantitative data by combining the use of scripts written in the R programming language with tools exposed as services in workflows. When these workflows are shared with colleagues and the wider scientific community, they provide an approach for other scientists wanting to use tools such as R without having to learn the corresponding programming language to analyse their own data. Background The advent of the post-genomic era in biology has led to a dramatic increase in the amount of multi-dimensional, quantitative data that must be analysed by the bioinformatician. This is especially true in the case of genomescale analyses of the transcriptome, proteome and metabolome, particularly when such measurements have been made in parallel using high throughput technologies involving microarrays and mass spectrometry techniques [1,2]. Analyses of these data rely on the performance of in silico experiments, involving the inductive detection of patterns in the data to which some phenotypic significance can be attributed [3]. Such analyses usually rely on statistical testing and linking the results of these tests with information stored in biological databases to summarise and develop conclusions. For example, the analysis of gene expression data generated from microarray experiments consists of a number of steps. The process begins with the normalization and standardization of transcript data, followed by statistical evaluation, and finally, interpretation of the statistical results via the annotation of genes with information relating to their biological function [4]. There are a number of issues associated with the use of computational tools in the analysis of quantitative data. Firstly, learning how to use such tools for statistical analyses can require significant time and effort. This is especially true for mathematical tools such as MATLAB [5] and R [6] which require prior knowledge of their programming languages and the functions within them in order to implement statistical algorithms. Secondly, there is the overhead of transferring data between computational resources during each step of a data analysis pipeline which is made more difficult due to the inconsistent nature of the user interface to the tools. For example, a user may access R from the command line whilst the querying of online sequence databases is made through the use of a web browser. Piping the output of one resource to another will therefore require intermediate staging of the data so that they may be passed manually amongst multiple tools [7]. Thirdly, the interoperability of computational tools can be awkward due to the heterogeneity of data in bioinformatics. The output data provided by a database service may be incompatible as input to the next analysis service both in terms of its structure and its semantics. In these cases, data have to be reconciled by a transformation step in order for them to be consumable by the next service. In silico experiments on bioinformatics data may be realised as workflows consisting of a pre-defined series of tasks that are related to one another by the flow of data between them. Such workflows can be constructed and enacted using applications such as Kepler [8], Triana [9] and Pegasus [10] that automatically direct the flow of data between the information repositories and computational tools responsible for performing the tasks within an in silico experiment. These workflow systems enable the use of distributed resources which have been deployed using web services, a distributed computing architecture that uses existing Internet communication and data exchange standards to support interoperable application-to-application interaction over a network [11]. Web service-enabled resources p (...truncated)