Empirical methods for controlling false positives and estimating confidence in ChIP-Seq peaks
David A Nix
1
Samir J Courdy
1
Kenneth M Boucher
0
0
Oncological Sciences, University of Utah, Salt Lake City, Utah, 84105, USA
1
Huntsman Cancer Institute, Departments of Research Informatics, University of Utah, Salt Lake City, Utah, 84105, USA
Background: High throughput signature sequencing holds many promises, one of which is the ready identification of in vivo transcription factor binding sites, histone modifications, changes in chromatin structure and patterns of DNA methylation across entire genomes. In these experiments, chromatin immunoprecipitation is used to enrich for particular DNA sequences of interest and signature sequencing is used to map the regions to the genome (ChIP-Seq). Elucidation of these sites of DNA-protein binding/modification is proving instrumental in reconstructing networks of gene regulation and chromatin remodelling that direct development, response to cellular perturbation, and neoplastic transformation.
Results: Here we present a package of algorithms and software that makes use of control input data to reduce false positives and estimate confidence in ChIP-Seq peaks. Several different methods were compared using two simulated spike-in datasets. Use of control input data and a normalized difference score were found to more than double the recovery of ChIP-Seq peaks at a 5% false discovery rate (FDR). Moreover, both a binomial p-value/q-value and an empirical FDR were found to predict the true FDR within 2-3 fold and are more reliable estimators of confidence than a global Poisson p-value. These methods were then used to reanalyze Johnson et al.'s neuron-restrictive silencer factor (NRSF) ChIP-Seq data without relying on extensive qPCR validated NRSF sites and the presence of NRSF binding motifs for setting thresholds.
Conclusion: The methods developed and tested here show considerable promise for reducing false positives and estimating confidence in ChIP-Seq data without any prior knowledge of the chIP target. They are part of a larger open source package freely available from http://useq.sourceforge.net/.
Background
Chromatin immunoprecipitation (chIP) is a
well-characterized technique for enriching regions of DNA that are
marked with a modification (e.g. methylation), display a
particular structure (e.g. DNase hypersensitivity), or are
bound by a protein (e.g. transcription factor, polymerase,
modified histone), in vivo, across an entire genome [1].
Chromatin is typically prepared by fixing live cells with a
DNA-protein cross-linker, lysing the cells, and randomly
fragmenting the DNA. An antibody that selectively binds
the target of interest is then used to immunoprecipitate
the target and any associated nucleic acid. The cross-linker
is then reversed and DNA fragments of approximately
200-500 bp in size are isolated. The final chIP DNA
sample contains primarily background input DNA plus a
small amount (<1%) of additional immunoprecipitated
target DNA.
Several methods have been used to identify sequences
enriched in chIP samples (e.g. SAGE, ChIP-PET,
ChIP-chip [2-4]). One of the most recent utilizes high
throughput signature sequencing to sequence the ends of a
portion of the DNA fragments in the chIP sample. In a typical
ChIP-Seq experiment, millions of short (e.g. 26 bp)
sequences are read from the ends of the chIP DNA. The
reads are mapped to a reference genome and enriched
regions identified by looking for locations with a
'significant' accumulation of mapped reads. Calculating
significance would be rather straightforward if the distribution
of mapped reads were random in the absence of chIP (e.g.
sequencing of input DNA). This does not appear to be
true. The method of DNA fragmentation, preferential
amplification in PCR, lack of independence in
observations, the degree of repetitiveness, and error in the
sequencing and alignment process are just a few of the
known sources of systematic bias that confound naive
expectation estimates.
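To make the naive expectation concrete, the following is a minimal sketch of the p-value one would compute for a single window if mapped reads really were distributed uniformly at random. It is an illustration only, not the method developed in this paper. The function names, the 200 bp window, the read count and the effective genome size are all hypothetical values chosen for the example.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam) by summing the upper tail.

    Assumes lam > 0 and small enough that exp(-lam) does not underflow,
    which holds for typical per-window expected read counts.
    """
    if k <= 0:
        return 1.0
    # Poisson pmf at k, computed in log space to avoid overflow in k!
    term = math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
    total = 0.0
    i = k
    # Terms shrink once i exceeds lam; stop when they no longer change the sum.
    while i <= lam or term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return min(1.0, total)


def naive_window_pvalue(window_count, total_reads, window_bp, effective_genome_bp):
    """P-value for a window under the (unrealistic) uniform-read assumption.

    All parameters are hypothetical inputs for illustration:
    the expected count per window is total_reads * window_bp / effective_genome_bp.
    """
    lam = total_reads * window_bp / effective_genome_bp
    return poisson_upper_tail(window_count, lam)


if __name__ == "__main__":
    # Hypothetical example: 1 million reads, 200 bp window, 2 Gb mappable genome.
    p = naive_window_pvalue(10, 1_000_000, 200, 2_000_000_000)
    print(f"naive Poisson p-value: {p:.3e}")
```

Because the expected count per window comes from a single genome-wide rate, any local bias of the kind listed above inflates or deflates this p-value, which is why such estimates can be misleading.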
Several methods have been developed to identify and
estimate confidence in ChIP-Seq peaks. Johnson et al. used an
ad hoc masking method based on their control input data
and prior qPCR validated regions to set a threshold and
assign confidence in their NRSF binding peaks [5].
Robertson et al. estimated global Poisson p-values for
windowed data using a rate set to 90% of the bp size of the
genome. To estimate FDRs, a background model of
binding peaks was generated by randomizing their STAT1 data
and choosing a threshold that produced a 0.1% FDR [6].
Mikkelsen et al. took a remapping strategy that involved
aligning every 27-mer in the mouse genome back onto
itself to define unique and repetitive regions. For each
ChIP-Seq dataset, "nominal" p-values were calculated by
randomly assigning each read to a "unique region" and
comparing the observed randomized 1 kb window sums
to the real 1 kb window sums [7]. Mikkelsen et al. also
employed a Hidden Markov Model that awaits
description. Fejes et al. mention a Monte Carlo based FDR (...truncated)