ZipHiC: a novel Bayesian framework to identify enriched interactions and experimental biases in Hi-C data.
Bioinformatics, 38(14), 2022, 3523–3531
https://doi.org/10.1093/bioinformatics/btac387
Advance Access Publication Date: 9 June 2022
Original Paper
Genome analysis
Itunu G. Osuntoki 1,2,*, Andrew Harrison 1, Hongsheng Dai 1, Yanchun Bao 1 and
Nicolae Radu Zabet 3,4,*
1 Department of Mathematical Sciences, University of Essex, Colchester CO4 3SQ, UK
2 Statistics, Modelling and Economics Department, UK Health Security Agency, London NW9 5EQ, UK
3 School of Life Sciences, University of Essex, Colchester CO4 3SQ, UK
4 Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London E1 2AT, UK
*To whom correspondence should be addressed.
Associate Editor: Peter Robinson
Received on October 8, 2021; revised on May 23, 2022; editorial decision on June 4, 2022; accepted on June 7, 2022
Abstract
Motivation: Several computational and statistical methods have been developed to analyze data generated through
3C-based methods, especially Hi-C. Most of the existing methods do not account for dependency in Hi-C data.
Results: Here, we present ZipHiC, a novel statistical method to explore Hi-C data focusing on the detection of
enriched contacts. ZipHiC implements a Bayesian method based on a hidden Markov random field (HMRF) model
and the Approximate Bayesian Computation (ABC) to detect interactions in two-dimensional space based on a Hi-C
contact frequency matrix. ZipHiC uses data on the sources of biases related to the contact frequency matrix, allows
borrowing information from neighbours using the Potts model and improves computation speed using the ABC
model. In addition to outperforming existing tools on both simulated and real data, our model also provides insights
into different sources of biases that affect Hi-C data. We show that some datasets display higher biases from DNA
accessibility or transposable element content. Furthermore, our analysis in Drosophila melanogaster showed that
approximately half of the detected significant interactions connect promoters with other parts of the genome indicating a functional biological role. Finally, we found that the micro-C datasets display higher biases from DNA accessibility compared to a similar Hi-C experiment, but this can be corrected by ZipHiC.
Availability and implementation: The R scripts are available at https://github.com/igosungithub/HMRFHiC.git.
Contact:
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Distant regulatory elements and their target genes are often separated by large genomic distances. For a regulatory element
to activate its target gene, the two need to come into 3D proximity (Bonev
and Cavalli, 2016; Hua et al., 2021). This indicates that the spatial
organization of the genome is intimately related to genome regulation and a better understanding of the 3D organization of the genome is important in disentangling the contribution of different
factors to gene regulation. One of the recently developed genome-wide proximity ligation assays is the Hi-C technique (Lieberman-Aiden et al., 2009), which is a chromosome conformation capture
(3C)-based method. Hi-C is able to detect interactions (short-range
and long-range) within and between chromosomes at high resolutions. While in mammalian systems, resolutions of 5 Kb have been
achieved (Rao et al., 2014), in smaller genomes, such as Drosophila,
sub-kilobase pair resolutions were obtained from Hi-C experiments
(Chathoth and Zabet, 2019; Cubeñas-Potts et al., 2017; Eagen
et al., 2017). In addition, datasets generated by Hi-C are highly reproducible between replicates and often highly conserved between
tissues (Ghavi-Helm et al., 2014). Recent technological advances
have pushed the resolution of conformation capture methods to
base pair resolution in mammalian systems (Hua et al., 2021).
© The Author(s) 2022. Published by Oxford University Press.
The data generated by a Hi-C experiment can be represented as
a matrix of contact frequencies between pairs of regions along the
genome. These matrices are associated with biases (Yaffe and
Tanay, 2011), such as the restriction fragment length, GC content
of trimmed ligation junctions and mappability, but many additional
factors may also contribute to the contact counts. Correcting
for these biases is important, and several
methods have been proposed that take them into account (Hu
et al., 2013; Imakaev et al., 2012; Servant et al., 2015; Yaffe and
Tanay, 2011).
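As a minimal sketch of this representation, the Python snippet below bins a handful of made-up intra-chromosomal read-pair positions into a symmetric contact frequency matrix. The helper name `contact_matrix`, the 10 kb bin size and the toy coordinates are illustrative assumptions and are not taken from ZipHiC or from any of the cited tools.

```python
import numpy as np

def contact_matrix(pairs, bin_size, n_bins):
    """Count read pairs per pair of genomic bins (hypothetical helper)."""
    m = np.zeros((n_bins, n_bins), dtype=int)
    for pos_a, pos_b in pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        m[i, j] += 1
        if i != j:
            m[j, i] += 1  # mirror off-diagonal contacts to keep the matrix symmetric
    return m

# Toy read pairs (made-up coordinates in bp), binned at 10 kb.
toy_pairs = [(1_000, 12_000), (3_000, 15_000), (25_000, 27_000), (2_000, 38_000)]
counts = contact_matrix(toy_pairs, bin_size=10_000, n_bins=4)
print(counts)
```

Each entry `counts[i, j]` is the raw contact frequency between bins `i` and `j`; these raw counts still carry the biases listed above until a correction step is applied.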
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
The Iterative Correction and Eigenvector decomposition (ICE)
has been the most widely used method to account for biases associated with Hi-C data, owing to its simplicity and to being parameter-free: it assumes equal visibility across all regions of the genome
(Imakaev et al., 2012). This equal visibility assumption means
that all regions can be probed by the method with the same probability.
However, this assumption does not always hold, because the visibility of
regions can vary (Imakaev et al., 2012; Servant et al., 2015). In addition, ICE is computationally intensive because the Hi-C interaction
matrix is of size $O(N^2)$, where $N$ is the number of genomic regions.
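The equal visibility idea can be illustrated with a simplified balancing loop, sketched below in Python. This is an assumption-laden toy version, not the published ICE implementation: the function name `iterative_correction`, the stopping rule and the 3 x 3 example with planted biases are all hypothetical. Each pass touches every entry of the matrix, which is where the $O(N^2)$ cost per iteration comes from.

```python
import numpy as np

def iterative_correction(counts, n_iter=200, tol=1e-10):
    """Simplified ICE-style balancing of a symmetric, non-negative matrix.

    Returns the balanced matrix and the accumulated per-bin bias vector,
    so that balanced = counts / outer(bias, bias).
    """
    m = counts.astype(float).copy()
    bias = np.ones(m.shape[0])
    for _ in range(n_iter):
        coverage = m.sum(axis=1)
        nonzero = coverage > 0
        # Relative coverage of each non-empty bin; empty bins keep a factor of 1.
        step = np.ones_like(coverage)
        step[nonzero] = coverage[nonzero] / coverage[nonzero].mean()
        m /= np.outer(step, step)
        bias *= step
        if np.abs(step - 1).max() < tol:
            break
    return m, bias

# Toy example: a matrix with equal row sums, distorted by planted per-bin biases.
base = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 2.0]])
true_bias = np.array([1.0, 2.0, 0.5])
observed = base * np.outer(true_bias, true_bias)

balanced, est_bias = iterative_correction(observed)
print(balanced.sum(axis=1))    # row sums are equal after balancing
print(est_bias / est_bias[0])  # proportional to the planted biases
```

In this toy setting the estimated biases are only identified up to a common scale factor, which is why the last line compares ratios rather than absolute values.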
The study of Rao et al. (2014) generated one of the highest-resolution maps of the 3D organization of the human genome using
in situ Hi-C, which probes the 3D architecture of genomes through DNA–
DNA proximity ligation in intact nuclei. This revealed that the
human genome is organized into sub-compartments globally and
contains about 10 000 chromatin loops (Rao et al., 2014). To account for biases in Hi-C data, Rao et al. (2014) adopt the matrix-balancing approach proposed in Knight and Ruiz (2013). In particular, peaks
are called only when a pair of regions of the genome shows elevated
contact frequency relative to the local background; i.e. peaks are
called when the peak pixel is enriched as compared to other pixels in
its neighbourhood.
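The notion of enrichment relative to a local background can be sketched as a ratio between a pixel and the mean of the pixels surrounding it. The Python snippet below is a deliberately simplified illustration and is considerably simpler than the peak-calling procedure of Rao et al. (2014); the function name `local_enrichment`, the square window and the threshold of 2 are assumptions made for this example only.

```python
import numpy as np

def local_enrichment(m, i, j, width=1):
    """Ratio of pixel (i, j) to the mean of its surrounding square window.

    The window excludes the pixel itself and is clipped at the matrix edges.
    """
    r0, r1 = max(i - width, 0), min(i + width + 1, m.shape[0])
    c0, c1 = max(j - width, 0), min(j + width + 1, m.shape[1])
    window = m[r0:r1, c0:c1].astype(float)
    n_background = window.size - 1
    background = (window.sum() - m[i, j]) / n_background
    return m[i, j] / background if background > 0 else np.inf

# Toy matrix: uniform background of 2 contacts with one enriched pixel pair.
toy = np.full((5, 5), 2.0)
toy[1, 3] = toy[3, 1] = 10.0

ratio = local_enrichment(toy, 1, 3)
is_peak = ratio >= 2.0  # arbitrary illustrative threshold
print(ratio, is_peak)
```

A pixel in the uniform part of the toy matrix has a ratio of 1, so only the planted pixel would be called under this illustrative threshold.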
Other methods take into account potential dependence among
pairs of regions of the genome (Jin et al., 2013). To accurately identify chromatin interactions and loops with high sensitivity and
resolution, Jin et al. (2013) used data filtering techniques based on the strand
orientation of Hi-C paired-end reads. This also allows detection of
interactions over short genomic distances between restriction fragments,
and their analysis shows the effects of GC content and mappability
on the observed contact frequency. Interestingly, there seems to be a
linear relationship between average trans-contact frequency and
mappability (Jin et al., 2013).
Loci that are in close 1D proximi (...truncated)