Reusable, extensible, and modifiable R scripts and Kepler workflows for comprehensive single set ChIP-seq analysis


Cormier et al. BMC Bioinformatics (2016) 17:270
DOI 10.1186/s12859-016-1125-3

SOFTWARE, Open Access

Nathan Cormier†, Tyler Kolisnik† and Mark Bieda*

Abstract

Background: There has been an enormous expansion of use of chromatin immunoprecipitation followed by sequencing (ChIP-seq) technologies. Analysis of large-scale ChIP-seq datasets involves a complex series of steps and production of several specialized graphical outputs. A number of systems have emphasized custom development of ChIP-seq pipelines. These systems are primarily based on custom programming of a single, complex pipeline or supply libraries of modules and do not produce the full range of outputs commonly produced for ChIP-seq datasets. It is desirable to have more comprehensive pipelines, in particular ones addressing common metadata tasks, such as pathway analysis, and pipelines producing standard complex graphical outputs. It is advantageous if these are highly modular systems, available as both turnkey pipelines and individual modules, that are easily comprehensible, modifiable and extensible to allow rapid alteration in response to new analysis developments in this growing area. Furthermore, it is advantageous if these pipelines allow data provenance tracking.

Results: We present a set of 20 ChIP-seq analysis software modules implemented in the Kepler workflow system; most (18/20) were also implemented as standalone, fully functional R scripts. The set consists of four full turnkey pipelines and 16 component modules. The turnkey pipelines in Kepler allow data provenance tracking. Implementation emphasized use of common R packages and widely-used external tools (e.g., MACS for peak finding), along with custom programming. This software presents comprehensive solutions and easily repurposed code blocks for ChIP-seq analysis and pipeline creation.
Tasks include mapping raw reads, peak finding via MACS, summary statistics, peak location statistics, summary plots centered on the transcription start site (TSS), gene ontology, pathway analysis, and de novo motif finding, among others.

Conclusions: These pipelines range from those performing a single task to those performing full analyses of ChIP-seq data. The pipelines are supplied as both Kepler workflows, which allow data provenance tracking, and, in the majority of cases, as standalone R scripts. These pipelines are designed for ease of modification and repurposing.

Keywords: Scientific workflows, ChIP-seq analysis, Software packages, Bioconductor

* Correspondence:
† Equal contributors
Department of Biochemistry and Molecular Biology, University of Calgary Cumming School of Medicine, Rm HSC1151, 3330 Hospital Dr. NW, Calgary, AB T2N4N1, Canada

© 2016 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a standard approach for localizing proteins bound to DNA, usually transcription factors or histones, including modified histones. The rapidly decreasing cost of sequencing has led to an explosion in the number of ChIP-seq datasets. This is the standard approach used by the large-scale ENCODE [1] and modENCODE projects [2]. A comprehensive ChIP-seq analysis is complex and consists of many steps. The steps involved in basic ChIP-seq analysis have been discussed previously [3]. Here, we focus on developing pipelines for analyzing single experiments (optionally with a matched control track). Briefly, first the sequence reads are aligned to a reference genome, then peaks are predicted, and finally a rich analysis of the peak data follows. The analysis of the peaks can be quite complex, encompassing several distinct and independent functions, ranging from motif analysis to pathway analysis. For a full analysis of ChIP-seq data, it is desirable to have a range of outputs from a pipeline, including informative plots to visualize the data.

Generally, ChIP-seq analysis represents an area of complex, multistep data analysis with continuing evolution of analysis options and goals. Under these conditions, the virtues of modifiability, extensibility, and comprehensibility leading to easily reproducible research become important [4]. Modifiability and extensibility are important due to changes in analysis goals (e.g., different types of graphical output) and changes in analysis methodologies or addition of new methodologies (e.g., addition of pathway analyses). With this evolution of pipelines over time, it becomes important to have a software design approach that promotes comprehensibility, because external, written descriptions of functionality can quickly become outdated. Finally, a central scientific value is reproducibility of research results. For complex computational analyses, replication has emerged as a difficult issue for several reasons, as discussed in [5].
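
As an illustration only (this is not the authors' R or Kepler code), the three-stage structure described above (align reads, predict peaks, analyze peaks) can be sketched in Python as a chain of small modules. Each step takes explicit parameters, and the runner records every step name and parameter set as a simple provenance log. The `Step` class, the `run_pipeline` function, and the toy `align`, `call_peaks`, and `summarize` modules are hypothetical stand-ins; exact-match alignment and bin counting are assumed here only to keep the example self-contained, in place of a real aligner and MACS.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Step:
    """One pipeline module: a name, a function, and its explicit parameters."""
    name: str
    func: Callable[[Any, Dict[str, Any]], Any]
    params: Dict[str, Any] = field(default_factory=dict)


def run_pipeline(steps: List[Step], data: Any) -> Tuple[Any, List[Dict[str, Any]]]:
    """Run the steps in order; return the final result and a provenance log."""
    log = []
    for step in steps:
        data = step.func(data, step.params)
        log.append({"step": step.name, "params": dict(step.params)})
    return data, log


def align(reads, params):
    # Toy exact-match alignment (assumption): position of each read in the
    # reference; reads that do not occur in the reference are dropped.
    hits = (params["reference"].find(read) for read in reads)
    return [pos for pos in hits if pos >= 0]


def call_peaks(positions, params):
    # Toy peak caller (assumption): count read starts per fixed-width bin and
    # report the start coordinate of every bin with at least min_count reads.
    counts = {}
    for pos in positions:
        bin_start = (pos // params["bin_size"]) * params["bin_size"]
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return sorted(b for b, n in counts.items() if n >= params["min_count"])


def summarize(peaks, params):
    # Minimal downstream analysis: summary statistics for the peak set.
    return {"n_peaks": len(peaks), "peak_starts": peaks}


steps = [
    Step("align", align, {"reference": "AAAACCCCGGGGTTTTACGT"}),
    Step("call_peaks", call_peaks, {"bin_size": 5, "min_count": 2}),
    Step("summarize", summarize),
]
reads = ["CCCC", "CCCG", "CCGG", "ACGT", "TTAA"]
result, log = run_pipeline(steps, reads)
```

In this sketch, changing an analysis method means replacing a single `Step`, and the returned log shows which parameters produced a given result, which is the kind of modifiability and parameter visibility the text argues for.
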
Reproducible scientific analyses are supported by systems that feature straightforward distribution of the software and clear display of input values (input parameters). Virtues of various systems to enable reproducible and easily understood analyses have recently been described [6].

There has been significant pipeline development previously in this area and the generation of a large series of tools. Table 1 compares our software to current ChIP-seq analysis packages. This table indicates functions from the perspective of single set ChIP-seq analysis, which is the goal of our pipelines. Importantly, several of the other pipelines provide support for other types of analyses, such as cross-dataset comparisons, cross-species comparisons, ChIP-chip vs ChIP-seq comparisons, and integration of gene expression microarray information. We do not include in the comparison some other pipelines that are oriented toward different tasks, as these pipelines lack most of the functions listed; seqminer [7], according to the authors, is oriented toward analysis based on predefined genomic regions; the ENCODE pipeline [8] includes no downstream analysis; and chipseq [9] includes minimal downstream analysis. We also do not include Sole-Search [10], as this package does not appear to be available currently. Examination of Table 1 indicates that one major difference is that our software provides both complete “turnkey” pipelines and also a set o (...truncated)