Loop detection using Hi-C data with HiCExplorer
GigaScience, 2022, 11, 1–9
DOI: 10.1093/gigascience/giac061
TECH NOTE
Joachim Wolff 1,2,*, Rolf Backofen 2,3 and Björn Grüning 2
1 Friedrich Miescher Institut for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
2 Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany
3 Signalling Research Centres CIBSS, University of Freiburg, Schänzlestr. 18, 79104 Freiburg, Germany
* Correspondence author. Joachim Wolff. Friedrich Miescher Institut for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland, E-mail:
Abstract
Background: Chromatin loops are an essential factor in the structural organization of the genome; however, their detection in Hi-C interaction matrices is a challenging and compute-intensive task. The approach presented here, integrated into the HiCExplorer
software, is a chromatin loop detection algorithm that applies a strict candidate selection based on continuous negative binomial
distributions and performs a Wilcoxon rank-sum test to detect enriched Hi-C interactions.
Results: HiCExplorer’s loop detection has a high detection rate and accuracy. It is the fastest available CPU implementation and utilizes
all threads offered by modern multicore platforms.
Conclusions: HiCExplorer’s method to detect loops by using a continuous negative binomial function combined with the donut approach from HiCCUPS leads to reliable and fast computation of loops. All the loop-calling algorithms investigated provide differing
results, which intersect by ∼50% at most. The tested in situ Hi-C data contain a large amount of noise; achieving better agreement
between loop calling algorithms will require cleaner Hi-C data and therefore future improvements to the experimental methods that
generate the data.
Keywords: Hi-C, Hi-C loop detection, DNA loops
Introduction
Many algorithms are currently available for loop detection in Hi-C data. HiCCUPS uses a donut algorithm, which considers each element of a Hi-C interaction matrix as a candidate peak and tests if the region around it is significantly different from the neighboring
interactions. HiCCUPS is part of the software Juicer,¹ and the implementation requires a general-purpose GPU (GPGPU), which imposes a barrier for users without access to Nvidia GPUs. However, an experimental CPU-based implementation has also been
released. Algorithms such as iterative correction and eigenvector
decomposition (ICE) [1] or Knight–Ruiz (KR) [2] are widely used in
Hi-C data analysis for balancing Hi-C matrices, but the loop detection algorithm of HiCCUPS uses a different approach. HiCCUPS
employs a Poisson model, which is a distribution for discrete data,
to detect regions of interest. After balancing a Hi-C interaction
matrix, the data are no longer discrete but continuous. In order
to work with the Poisson distribution, the balancing of the values
is reverted. This procedure is methodologically questionable, as
it manipulates the data to fit the requirements of a
particular distribution, rather than fitting the distribution that
is most probable or suitable. Moreover, the raw Hi-C data tend to be
overdispersed relative to the Poisson distribution, which suggests
that Poisson is not the best choice. HOMER [3] creates a relative contact
matrix per chromosome and scans these for locally dense regions.
HOMER does not support standard file formats for Hi-C matrices
like cool [4], which forces the user to create all data from scratch,
a time-consuming process and a potential source of errors and
inaccuracies. Chromosight [5] detects loops based on a pattern-matching algorithm. Cooltools² uses a reimplementation of the
HiCCUPS algorithm; Fit-Hi-C 2 [6] detects significant Hi-C contacts
and provides a merging algorithm to detect DNA loops. Peakachu
[7] uses a random forest approach trained on CTCF or H3K27ac
data. Chromosight, cooltools, Peakachu, and HiCExplorer support
the cooler file format. HOMER, Fit-Hi-C 2, and Peakachu do not use parallelization to improve runtime and run only
on a single core.
¹ https://github.com/aidenlab/juicer
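The overdispersion argument against the Poisson model can be illustrated with a small numerical check. The following is a minimal sketch on simulated counts, not on Hi-C data: the helper name `dispersion_index` and all parameter values are assumptions chosen for illustration. For a Poisson sample the variance-to-mean ratio is close to 1, whereas a negative binomial sample with the same mean yields a clearly larger ratio.

```python
import numpy as np


def dispersion_index(counts):
    """Return the sample variance divided by the sample mean.

    Values close to 1 are compatible with a Poisson model; values well
    above 1 indicate overdispersion.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean()
    if mean == 0:
        raise ValueError("dispersion index is undefined for an all-zero sample")
    return counts.var(ddof=1) / mean


rng = np.random.default_rng(0)

# Simulated Poisson counts with mean 5 (variance equals the mean).
poisson_counts = rng.poisson(lam=5.0, size=10_000)

# Simulated negative binomial counts. NumPy parameterizes by (n, p) with
# mean n(1-p)/p and variance n(1-p)/p^2, so n=2, p=2/7 gives mean 5 and
# variance 17.5 (illustrative values only).
negbin_counts = rng.negative_binomial(n=2, p=2 / 7, size=10_000)

print("Poisson sample:          ", round(dispersion_index(poisson_counts), 2))
print("Negative binomial sample:", round(dispersion_index(negbin_counts), 2))
```

Applied to the raw counts of a single genomic distance, the same ratio gives a quick indication of whether a Poisson assumption is plausible for that distance.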
Here we present an algorithm that can detect Hi-C loops. It
is based on a continuous negative binomial distribution and is
highly parallelized, assigning one thread per chromosome and
parallelizing further using multiple threads within a chromosome. This approach makes full use of the resources available in
the latest generation of multicore CPU platforms.
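The per-chromosome level of this parallelization scheme can be sketched as follows. This is an illustration only and not HiCExplorer's implementation: `count_enriched_pixels` is a hypothetical stand-in for the real per-chromosome work, and the toy matrices and threshold are assumptions. Note that in CPython a thread pool only speeds up work that releases the global interpreter lock; a process pool is the usual alternative for pure-Python workloads.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def count_enriched_pixels(matrix, threshold):
    """Hypothetical per-chromosome worker: count upper-triangle pixels above a threshold."""
    return int((np.triu(matrix) > threshold).sum())


def process_chromosomes(matrices, threshold, max_workers=4):
    """Submit one task per chromosome and collect the results by chromosome name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(count_enriched_pixels, matrix, threshold)
            for name, matrix in matrices.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Toy symmetric matrices standing in for per-chromosome interaction matrices.
matrices = {
    "chr1": np.array([[5, 1], [1, 7]]),
    "chr2": np.array([[0, 9], [9, 0]]),
}
result = process_chromosomes(matrices, threshold=4)
print(result)
```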
Methods
According to Rao et al. [8], most of the anchor points of detected
loops lie within a range of 2 Mb. This insight can be used to decrease the search space in a biologically meaningful way and
also to reduce the computational burden, while at the same time
maintaining a low memory footprint. Moreover, interaction pairs
separated by very short genomic distances, corresponding to points in the Hi-C matrix close to the main diagonal, already have high interaction counts. In many cases, it is
unlikely that these pairs are enriched in the context
of their neighborhood. This observation can be explained by the high interaction count
between 2 loci; they are closer in 1-dimensional
² https://github.com/open2c/cooltools
Received: February 28, 2021. Revised: June 23, 2021
© The Author(s) 2022. Published by Oxford University Press GigaScience. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided
the original work is properly cited.
Algorithm
A strict candidate selection is critical to reducing the computational complexity of the loop detection algorithm. A maximum
loop size can be defined to restrict the search space to take the
previously mentioned observation from Rao et al. [8] into account.
In Hi-C, the primary data structure is the symmetrical n × n interaction count matrix (ICM):
$$
\mathrm{ICM} =
\begin{bmatrix}
ic_{0,0} & \cdots & ic_{0,n} \\
\vdots & \cdots & \vdots \\
ic_{n,0} & \cdots & ic_{n,n}
\end{bmatrix}
\tag{1}
$$
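The restriction of the search space on such a symmetric matrix can be sketched as a distance window on the upper triangle. This is a minimal sketch under stated assumptions, not the HiCExplorer code: the function name `candidate_mask`, the `min_loop_distance` parameter (motivated by the remark about pairs close to the main diagonal), and the toy bin size and distances are all hypothetical.

```python
import numpy as np


def candidate_mask(n_bins, bin_size, min_loop_distance, max_loop_distance):
    """Boolean mask of bin pairs (i, j), i < j, whose genomic distance
    (j - i) * bin_size lies within [min_loop_distance, max_loop_distance]."""
    i, j = np.indices((n_bins, n_bins))
    distance = (j - i) * bin_size  # negative below the diagonal, so excluded
    return (distance >= min_loop_distance) & (distance <= max_loop_distance)


# Toy symmetric interaction count matrix with 6 bins.
rng = np.random.default_rng(1)
upper = np.triu(rng.integers(0, 50, size=(6, 6)))
icm = upper + np.triu(upper, k=1).T

# Illustrative values: 10 kb bins, candidates between 20 kb and 40 kb apart.
mask = candidate_mask(
    n_bins=6, bin_size=10_000, min_loop_distance=20_000, max_loop_distance=40_000
)
candidates = np.argwhere(mask)  # (i, j) index pairs that remain in the search space
print(len(candidates), "candidate pixels out of", icm.size)
```

Because the matrix is symmetric, only the upper triangle needs to be searched, and a real implementation would work on a sparse band rather than on dense index arrays.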
And third, similar to HOMER’s normalization, a correction for
differently occurring ligation events is offered:
$$
\mathit{exp\_ligation}_{d} = \mathit{exp\_nonzero}_{i,j} \cdot \frac{(\mathit{row}_{ICM}(i)) \cdot (\mathit{row}_{ICM}(j))}{(ICM)}
$$
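One way to read the ligation correction numerically is sketched below. It rests on assumptions that are not stated in this section: `row_ICM(i)` is taken to be the sum of row i, `(ICM)` the sum of all matrix entries, and `exp_nonzero` the mean of the non-zero entries at genomic distance d = |j - i|. The function names are hypothetical and the sketch is not the HiCExplorer implementation.

```python
import numpy as np


def expected_nonzero_per_distance(icm):
    """Assumed definition: mean of the non-zero entries on each diagonal d >= 0
    (0.0 where a diagonal has no non-zero entry)."""
    n = icm.shape[0]
    expected = np.zeros(n)
    for d in range(n):
        diagonal = np.diagonal(icm, offset=d)
        nonzero = diagonal[diagonal != 0]
        if nonzero.size:
            expected[d] = nonzero.mean()
    return expected


def expected_ligation(icm):
    """Expected matrix with the ligation correction, under the assumptions above."""
    icm = np.asarray(icm, dtype=float)
    total = icm.sum()
    if total == 0:
        raise ValueError("interaction count matrix has no contacts")
    exp_nonzero = expected_nonzero_per_distance(icm)
    row_sums = icm.sum(axis=1)
    i, j = np.indices(icm.shape)
    # exp_nonzero at distance |j - i|, scaled by the coverage of both bins.
    return exp_nonzero[np.abs(j - i)] * row_sums[i] * row_sums[j] / total


# Toy symmetric matrix: row sums are 6, 10, 10 and the total is 26.
toy_icm = np.array([[4, 2, 0], [2, 6, 2], [0, 2, 8]])
expected = expected_ligation(toy_icm)
print(expected)
```

Dividing the observed matrix element-wise by such an expected matrix, wherever the expected value is non-zero, gives the observed/expected values used in the subsequent steps.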
Candidate selection per genomic distance
To detect enriched Hi-C interactions, the observed/expected normalized Hi-C data are fitted per genomic distance d independently
to a continuous negative binomial distribution. Supplementary
Fig. S1 shows the value density distribution of different genomic
distances and provides evidence for the chosen distribution assumption. The negative binomial distribution, rather than the Poisson distribution, is used because the raw data for the genomic distances of chromosome 1 of the GM12878 cell line at 10 kb resolution indicate
overdispersion [11] in a majority of the distances (80.1%); therefor (...truncated)