BART: a transcription factor prediction tool with query gene sets or epigenomic profiles
Bioinformatics, 34(16), 2018, 2867–2869
doi: 10.1093/bioinformatics/bty194
Advance Access Publication Date: 28 March 2018
Applications Note
Data and text mining
Zhenjia Wang1, Mete Civelek1,2, Clint L. Miller1,2,3,4,
Nathan C. Sheffield1,2,3,4, Michael J. Guertin1,4 and
Chongzhi Zang1,2,3,4,5,*
1Center for Public Health Genomics, 2Department of Biomedical Engineering, 3Department of Public Health
Sciences, 4Department of Biochemistry and Molecular Genetics and 5Cancer Center, University of Virginia,
Charlottesville, VA 22908, USA
*To whom correspondence should be addressed.
Associate Editor: Jonathan Wren
Received on December 18, 2017; revised on March 9, 2018; editorial decision on March 18, 2018; accepted on March 27, 2018
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail:
Abstract
Summary: Identification of functional transcription factors that regulate a given gene set is an important problem in gene regulation studies. Conventional approaches for identifying transcription
factors, such as DNA sequence motif analysis, are unable to predict functional binding of specific
factors and are not sensitive enough to detect factors binding at distal enhancers. Here, we present
binding analysis for regulation of transcription (BART), a novel computational method and software package for predicting functional transcription factors that regulate a query gene set or associate with a query genomic profile, based on more than 6000 existing ChIP-seq datasets for over
400 factors in human or mouse. This method demonstrates the advantage of utilizing publicly available data for functional genomics research.
Availability and implementation: BART is implemented in Python and available at http://faculty.virginia.edu/zanglab/bart.
Contact:
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Transcriptional regulation of gene expression plays a critical role in
many cellular processes, including cancer development and progression (Bradner et al., 2017; Lambert et al., 2018). Identification of
functional transcription factors is essential for understanding gene
regulatory mechanisms in such processes. In gene expression profiling studies, ontology and pathway analyses (Huang et al., 2009;
McLean et al., 2010; Subramanian et al., 2005) can identify functional annotations of differentially expressed genes; however, this
approach is unable to predict transcription factors that regulate
those gene sets. Most existing methods for cis-regulatory prediction
rely upon detecting overrepresented DNA sequence motifs near
gene promoters to identify sequence-specific DNA-binding factors
(Boeva, 2016). Such methods are limited by the context-specific nature of transcription factor activity and by multiple factors sharing
similar motifs (Jolma et al., 2013). Moreover, most cis-regulatory
events in mammalian genomes occur at distal enhancers, which
cover much larger regions than promoters but lack direct assignment to target genes; these regions are usually difficult to capture by
motif scanning alone (Shlyueva et al., 2014).
Several methods have been developed to overcome these limitations of motif-based, promoter-biased approaches using comprehensive epigenomic information (Dozmorov, 2017), such as DNaseI
hypersensitive sites (Sheffield et al., 2013). Model-based analysis of
regulation of gene expression (MARGE) is a method developed for
modeling differential gene expression using a compendium of public
H3K27ac ChIP-seq datasets (Wang et al., 2016). By quantifying the
regulatory potential of the active enhancer histone mark H3K27ac on
each gene in the genome from each ChIP-seq dataset, MARGE uses
a semi-supervised learning approach to predict a genome-wide
cis-regulatory profile for any query gene set. Leveraging over 6000
transcription factor ChIP-seq datasets available in the public domain
(Mei et al., 2017), we have developed binding analysis for regulation
of transcription (BART), a new method for prediction of functional
transcription factors by associating ChIP-seq binding information
with MARGE-predicted genomic cis-regulatory regions.
2 Materials and methods
Fig. 1. BART workflow. (A) A cis-regulatory profile is generated from the query gene
set by MARGE or from a ChIP-seq dataset by genomic mapping. Yellow bars indicate union DNaseI hypersensitive sites (UDHS). (B) Each transcription factor binding profile from a ChIP-seq dataset is converted to a binary string showing presence or absence at each UDHS.
(C) Top: Each ROC curve shows how well the query cis-regulatory profile from A predicts a transcription factor binding profile from B; Bottom:
The area under the ROC curve (AUC) is calculated for all datasets. (D) AUCs are
grouped by factor, and a Wilcoxon test is performed for each factor compared
with all datasets as background. In this example, cumulative distributions show
significantly higher AUCs for TF_a (red). (E) The Wilcoxon test statistic is calculated
for each transcription factor from each dataset in the background for Z-score
calculation. (F) BART outputs a ranked list of all transcription factors.
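The scoring steps in panels B to D of Fig. 1 can be illustrated with a minimal Python sketch. This is not the BART source code: the function names, the toy data, the choice to compare each factor against the datasets of all other factors, and the normal approximation without tie correction are all assumptions made for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def auc(scores, binding):
    """AUC for predicting a binary binding profile from per-site scores.

    Uses the rank-sum identity, so no explicit ROC curve is built.
    Assumes the profile contains at least one 1 and at least one 0.
    """
    ranks = average_ranks(scores)
    n_pos = sum(binding)
    n_neg = len(binding) - n_pos
    rank_sum = sum(r for r, b in zip(ranks, binding) if b)
    u = rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def rank_sum_z(group, others):
    """Z-score of a Wilcoxon rank-sum test (normal approximation, no tie correction)."""
    ranks = average_ranks(list(group) + list(others))
    n1, n2 = len(group), len(others)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean) / sd


# Hypothetical toy data: scores at six candidate sites (the query profile)
# and two binary binding profiles (datasets) for each of two factors.
site_scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3]
binding_profiles = {
    "TF_a": [[1, 1, 0, 1, 0, 0], [1, 1, 0, 0, 0, 0]],
    "TF_b": [[0, 0, 1, 0, 1, 1], [0, 1, 1, 0, 0, 1]],
}

# One AUC per dataset, grouped by factor.
aucs = {
    tf: [auc(site_scores, profile) for profile in profiles]
    for tf, profiles in binding_profiles.items()
}

# Compare each factor's AUCs with those of all remaining datasets.
z_scores = {}
for tf, own in aucs.items():
    rest = [a for other, vals in aucs.items() if other != tf for a in vals]
    z_scores[tf] = rank_sum_z(own, rest)

# Factors ranked from most to least associated with the query profile.
ranked = sorted(z_scores, key=z_scores.get, reverse=True)
print(ranked)
```

In this toy example the factor whose binding sites coincide with the high-scoring sites receives the larger Z-score. With real data the same calculation would run over all UDHS and thousands of ChIP-seq datasets, where a vectorized implementation would be preferable.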
3 Results and discussion
We tested BART on several gene sets derived from genes differentially expressed after activation or inhibition of known transcription
factors, including ESR1, AR, NR3C1, PPARG, NOTCH1 and
POU5F1 (Wang et al., 2016). BART ranked the true functional
factor first (1/454) among the candidates for four of the six gene sets;
for the ESR1 and NR3C1 gene sets, the true factor ranked No. 2 and No. 47, respectively
(Supplementary Fig. S4). The highest ranked factor predicted from
NR3C1 target genes is NR2A1, another nuclear receptor. The correct predictions are robust to randomness in
MARGE outputs (Supplementary Fig. S5). These results indicate
that BART can successfully predict transcription factors that regulate a given gene set. To evaluate the performance of BART, we
compared it with four other transcription factor prediction tools
that take a gene set as query: the ENCODE ChIP-seq
Significance Tool (Auerbach et al., 2013), HOMER (Heinz et al.,
2010), iRegulon (Janky et al., 2014) and Pscan (Zambelli et al.,
2009) (Supplementary Table S1). In predicting the true factor
for the six gene sets, BART outperformed the other methods in five
cases, the exception being NR3C1 (Supplementary Table S2).
BART can identify transcription factors that regulate any gene set
or associate with any genomic profile, and thereby provides functional interpretations of differential gene expression analyses. BART makes
predictions based only on direct binding information from public ChIP-seq
data, as an orthogonal approach to conventional DNA sequence
motif search. It focuses on transcription factor binding at open chromatin regions in the genome represented by UDHS, most of which are
locat (...truncated)