Reusable, extensible, and modifiable R scripts and Kepler workflows for comprehensive single set ChIP-seq analysis
Cormier et al. BMC Bioinformatics (2016) 17:270
DOI 10.1186/s12859-016-1125-3
SOFTWARE
Open Access
Nathan Cormier†, Tyler Kolisnik† and Mark Bieda*
Abstract
Background: The use of chromatin immunoprecipitation followed by sequencing (ChIP-seq) has expanded
enormously. Analysis of large-scale ChIP-seq datasets involves a complex series of steps and
the production of several specialized graphical outputs. A number of systems have emphasized custom development
of ChIP-seq pipelines. These systems either rely on custom programming of a single, complex pipeline or
supply libraries of modules, and they do not produce the full range of outputs commonly generated for ChIP-seq
datasets. It is desirable to have more comprehensive pipelines, in particular ones addressing common metadata
tasks, such as pathway analysis, and pipelines producing standard complex graphical outputs. It is advantageous if
these are highly modular systems, available as both turnkey pipelines and individual modules, that are easily
comprehensible, modifiable and extensible to allow rapid alteration in response to new analysis developments in
this growing area. Furthermore, it is advantageous if these pipelines allow data provenance tracking.
Results: We present a set of 20 ChIP-seq analysis software modules implemented in the Kepler workflow system;
most (18/20) were also implemented as standalone, fully functional R scripts. The set consists of four full turnkey
pipelines and 16 component modules. The turnkey pipelines in Kepler allow data provenance tracking.
Implementation emphasized use of common R packages and widely-used external tools (e.g., MACS for peak
finding), along with custom programming. This software presents comprehensive solutions and easily repurposed
code blocks for ChIP-seq analysis and pipeline creation. Tasks include mapping raw reads, peakfinding via MACS,
summary statistics, peak location statistics, summary plots centered on the transcription start site (TSS), gene
ontology, pathway analysis, and de novo motif finding, among others.
Conclusions: These pipelines range from those performing a single task to those performing full analyses of
ChIP-seq data. The pipelines are supplied as both Kepler workflows, which allow data provenance tracking, and, in
the majority of cases, as standalone R scripts. These pipelines are designed for ease of modification and
repurposing.
Keywords: Scientific workflows, ChIP-seq analysis, Software packages, Bioconductor
Background
Chromatin immunoprecipitation followed by sequencing
(ChIP-seq) is a standard approach for localizing proteins
bound to DNA, usually transcription factors or histones,
including modified histones. The rapidly decreasing cost
of sequencing has led to an explosion in the number of
ChIP-seq datasets. ChIP-seq is the standard approach used
by the large-scale ENCODE [1] and modENCODE [2] projects.
* Correspondence:
†Equal contributors
Department of Biochemistry and Molecular Biology, University of Calgary
Cumming School of Medicine, Rm HSC1151, 3330 Hospital Dr. NW, Calgary,
AB T2N4N1, Canada
© 2016 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
A comprehensive ChIP-seq analysis is complex and
consists of many steps. The steps involved in basic
ChIP-seq analysis have been discussed previously [3].
Here, we focus on developing pipelines for analyzing single experiments (optionally with a matched control
track). Briefly, first the sequence reads are aligned to a
reference genome, then peaks are predicted, and finally a
rich analysis of the peak data follows. The analysis of the
peaks can be quite complex, encompassing several distinct and independent functions, ranging from motif
analysis to pathway analysis. For a full analysis of ChIP-seq data, it is desirable to have a range of outputs from a
pipeline, including informative plots to visualize the
data.
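The three stages outlined above (alignment, peak prediction, downstream analysis) can be illustrated with a minimal Python sketch in which each stage is an independent function that can be run alone or chained into a turnkey run. This is a toy illustration only and is not the authors' R or Kepler code: the reference string, the reads, exact-substring "alignment", and fixed-bin thresholding are all assumptions chosen for brevity and do not reflect how a real aligner or MACS works.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical toy inputs: a tiny "reference genome" and a handful of reads.
REFERENCE = "ACGTTGCAAGGCTTACCGATAGGCTAACGT"
READS = ["GCTTA", "CTTAC", "TTACC", "ACGTT", "TTTTT"]

Peak = Tuple[int, int, int]  # (start, end, read_count)


def align_reads(reads: List[str], reference: str) -> List[int]:
    """Stage 1 (toy): map each read to its first exact match in the reference."""
    positions = []
    for read in reads:
        pos = reference.find(read)
        if pos >= 0:  # reads with no exact match are treated as unmapped
            positions.append(pos)
    return positions


def call_peaks(positions: List[int], bin_size: int, min_reads: int) -> List[Peak]:
    """Stage 2 (toy): report fixed-width bins holding at least min_reads read starts."""
    counts = Counter(pos // bin_size for pos in positions)
    return [
        (b * bin_size, (b + 1) * bin_size, n)
        for b, n in sorted(counts.items())
        if n >= min_reads
    ]


def summarize_peaks(peaks: List[Peak]) -> Dict[str, float]:
    """Stage 3 (toy): basic summary statistics over the called peaks."""
    if not peaks:
        return {"n_peaks": 0, "mean_width": 0.0, "max_reads": 0}
    widths = [end - start for start, end, _ in peaks]
    return {
        "n_peaks": len(peaks),
        "mean_width": sum(widths) / len(widths),
        "max_reads": max(n for _, _, n in peaks),
    }


def run_pipeline(reads: List[str], reference: str,
                 bin_size: int = 10, min_reads: int = 2) -> Dict[str, float]:
    """Turnkey run: chain the three independent stages in order."""
    positions = align_reads(reads, reference)
    peaks = call_peaks(positions, bin_size, min_reads)
    return summarize_peaks(peaks)


if __name__ == "__main__":
    print(run_pipeline(READS, REFERENCE))
```

Because each stage takes and returns plain values, any one of them could be swapped for a different method without touching the others, which is the kind of modularity the pipelines described here aim for.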
Generally, ChIP-seq analysis represents an area of
complex, multistep data analysis with continuing evolution of analysis options and goals. Under these conditions, the virtues of modifiability, extensibility, and
comprehensibility leading to easily reproducible research
become important [4]. Modifiability and extensibility are
important due to changes in analysis goals (e.g., different
types of graphical output) and changes in analysis methodologies or addition of new methodologies (e.g.,
addition of pathway analyses). With this evolution of
pipelines over time, it becomes important to have a software design approach that promotes comprehensibility,
because external, written descriptions of functionality
can quickly become outdated. Finally, a central scientific
value is reproducibility of research results. For complex
computational analyses, replication has emerged as a difficult issue for several reasons, as discussed in [5]. Reproducible scientific analyses are supported by systems
that feature straightforward distribution of the software
and clear display of input values (input parameters). Virtues of various systems to enable reproducible and easily
understood analyses have recently been described [6].
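As a rough illustration of what "clear display of input values" can mean in practice, the following Python sketch writes a run's parameters and a checksum of its input into a single human-readable record. It is not the provenance mechanism used by Kepler or by the authors' scripts; the function name `build_run_record` and the parameter names and values are hypothetical and chosen only for the example.

```python
import hashlib
import json
from typing import Any, Dict


def build_run_record(params: Dict[str, Any], input_bytes: bytes) -> str:
    """Return a JSON record of the run parameters plus a SHA-256 input checksum."""
    record = {
        "parameters": params,
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }
    # sort_keys makes the record deterministic, so two identical runs
    # produce byte-identical records that can be compared directly.
    return json.dumps(record, indent=2, sort_keys=True)


# Hypothetical parameter names and values, for illustration only.
params = {"genome": "hg19", "peak_caller": "MACS", "p_value_cutoff": 1e-5}
record = build_run_record(params, b"@read1\nACGT\n+\nIIII\n")

if __name__ == "__main__":
    print(record)
```

Storing such a record alongside the outputs lets a reader see exactly which inputs and settings produced a given result, and the checksum detects a changed input file even when the file name is unchanged.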
Previous work in this area has produced many pipelines and a large collection of
tools. Table 1 compares our software to current ChIP-seq analysis packages. This table indicates functions
from the perspective of single set ChIP-seq analysis,
which is the goal of our pipelines. Importantly, several of
the other pipelines provide support for other types of
analyses, such as cross-dataset comparisons, cross-species comparisons, ChIP-chip vs ChIP-seq comparisons, and integration of gene expression microarray information. We do not include in the comparison some
other pipelines that are oriented toward different tasks,
as these pipelines lack most of the functions listed; seqminer [7], according to the authors, is oriented toward
analysis based on predefined genomic regions; the ENCODE pipeline [8] includes no downstream analysis;
and chipseq [9] includes minimal downstream analysis.
We also do not include Sole-Search [10], as this package
does not appear to be available currently. Examination
of Table 1 indicates that one major difference is that our
software provides both complete “turnkey” pipelines and
also a set o (...truncated)