Loop detection using Hi-C data with HiCExplorer



GigaScience, 2022, 11, 1–9
DOI: 10.1093/gigascience/giac061
TECH NOTE

Joachim Wolff 1,2,*, Rolf Backofen 2,3 and Björn Grüning 2

1 Friedrich Miescher Institut for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
2 Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany
3 Signalling Research Centres CIBSS, University of Freiburg, Schänzlestr. 18, 79104 Freiburg, Germany
* Correspondence author: Joachim Wolff, Friedrich Miescher Institut for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland.

Abstract

Background: Chromatin loops are an essential factor in the structural organization of the genome; however, their detection in Hi-C interaction matrices is a challenging and compute-intensive task. The approach presented here, integrated into the HiCExplorer software, shows a chromatin loop detection algorithm that applies a strict candidate selection based on continuous negative binomial distributions and performs a Wilcoxon rank-sum test to detect enriched Hi-C interactions.

Results: HiCExplorer’s loop detection has a high detection rate and accuracy. It is the fastest available CPU implementation and utilizes all threads offered by modern multicore platforms.

Conclusions: HiCExplorer’s method to detect loops by using a continuous negative binomial function combined with the donut approach from HiCCUPS leads to reliable and fast computation of loops. All the loop-calling algorithms investigated provide differing results, which intersect by ∼50% at most. The tested in situ Hi-C data contain a large amount of noise; achieving better agreement between loop calling algorithms will require cleaner Hi-C data and therefore future improvements to the experimental methods that generate the data.

Keywords: Hi-C, Hi-C loop detection, DNA loops

Introduction

Many algorithms are currently available for loop detection in Hi-C data.
HiCCUPS uses a donut algorithm, which treats every element of a Hi-C interaction matrix as a candidate peak and tests whether the region around it differs significantly from the neighboring interactions. HiCCUPS is part of the software Juicer (footnote 1), and the implementation requires a general-purpose GPU (GPGPU), which imposes a barrier for users without access to Nvidia GPUs. However, an experimental CPU-based implementation has also been released. Algorithms such as iterative correction and eigenvector decomposition (ICE) [1] or Knight–Ruiz (KR) [2] are widely used in Hi-C data analysis for balancing Hi-C matrices, but the loop detection algorithm of HiCCUPS uses a different approach. HiCCUPS employs a Poisson model, which is a distribution for discrete data, to detect regions of interest. After balancing a Hi-C interaction matrix, the data are no longer discrete but continuous. In order to work with the Poisson distribution, the balancing of the values is reverted. This procedure is methodologically questionable, as it manipulates the data to fit the requirements of a particular distribution rather than fitting the distribution that is most probable or suitable. Moreover, raw Hi-C data tend to be overdispersed relative to a Poisson distribution, which suggests that Poisson is not the best choice.

Footnote 1: https://github.com/aidenlab/juicer

HOMER [3] creates a relative contact matrix per chromosome and scans these for locally dense regions. HOMER does not support standard file formats for Hi-C matrices like cool [4], which forces the user to create all data from scratch, a time-consuming process and a potential source of errors and inaccuracies. Chromosight [5] detects loops based on a pattern-matching algorithm. Cooltools (footnote 2) uses a reimplementation of the HiCCUPS algorithm; Fit-Hi-C 2 [6] detects significant Hi-C contacts and provides a merging algorithm to detect DNA loops. Peakachu [7] uses a random forest approach trained on CTCF or H3K27ac data.
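The overdispersion argument against the Poisson model can be made concrete with a small numerical check. The following is a minimal sketch, not taken from HiCExplorer or HiCCUPS: it computes the variance-to-mean ratio of count data, which is close to 1 for Poisson-distributed counts and clearly above 1 for overdispersed counts. The function name `dispersion_index` and the synthetic samples are assumptions made for illustration; no real Hi-C data are used.

```python
import numpy as np


def dispersion_index(counts):
    """Return the variance-to-mean ratio of a 1-D array of counts.

    A ratio near 1 is what a Poisson model predicts; a ratio well
    above 1 indicates overdispersion.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean()
    if mean == 0:
        raise ValueError("mean is zero; the dispersion index is undefined")
    return counts.var(ddof=1) / mean


rng = np.random.default_rng(0)

# Synthetic stand-ins for raw contact counts at one genomic distance
# (illustrative only, not real Hi-C data). Both samples have mean 20.
poisson_like = rng.poisson(lam=20, size=10_000)
# NumPy's negative_binomial(n, p) has mean n * (1 - p) / p = 20 here.
overdispersed = rng.negative_binomial(n=5, p=0.2, size=10_000)

print("Poisson sample:          ", dispersion_index(poisson_like))
print("Negative binomial sample:", dispersion_index(overdispersed))
```

Applied to the raw counts of a single genomic distance, a ratio well above 1 would point toward a negative binomial rather than a Poisson model.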
Chromosight, cooltools, Peakachu, and HiCExplorer support the cooler file format. HOMER, Fit-Hi-C 2, and Peakachu do not utilize parallelization techniques to improve runtime, running only on a single core.

Here we present an algorithm that can detect Hi-C loops. It is based on a continuous negative binomial distribution and is highly parallelized, assigning one thread per chromosome and parallelizing further using multiple threads within a chromosome. This approach makes full use of the resources available in the latest generation of multicore CPU platforms.

Methods

According to Rao et al. [8], most of the anchor points of detected loops lie within a range of 2 Mb. This insight can be used to decrease the search space in a biologically meaningful way and also to reduce the computational burden, while at the same time maintaining a low memory footprint. Moreover, interaction pairs whose loci are very close to each other in genomic distance, corresponding to points in the Hi-C matrix close to the main diagonal, already have high interaction counts. It is, in many cases, unlikely that these pairs contribute enrichments in the context of their neighborhood. This observation can be explained by the high interaction count between 2 loci; they are closer in 1-dimensional [...]

Footnote 2: https://github.com/open2c/cooltools

Received: February 28, 2021. Revised: June 23, 2021. © The Author(s) 2022. Published by Oxford University Press GigaScience. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Algorithm

A strict candidate selection is critical to reducing the computational complexity of the loop detection algorithm.
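The distance restriction motivated in the Methods can be sketched as a boolean mask over the interaction matrix that keeps only bin pairs inside a chosen genomic distance band. This is an illustrative sketch and not HiCExplorer's implementation: the function `candidate_mask`, its parameter names, and the 100 kb lower bound are assumptions made for the example; only the 2 Mb upper bound comes from the text.

```python
import numpy as np


def candidate_mask(n_bins, bin_size,
                   max_loop_distance=2_000_000,
                   min_loop_distance=100_000):
    """Mark bin pairs (i, j), i < j, whose genomic distance lies in a band.

    max_loop_distance follows the 2 Mb observation cited in the text;
    min_loop_distance is a hypothetical cutoff that skips pairs close
    to the main diagonal.
    """
    i, j = np.indices((n_bins, n_bins))
    distance = (j - i) * bin_size  # negative below the main diagonal
    return (distance >= min_loop_distance) & (distance <= max_loop_distance)


# Example: a chromosome with 1,000 bins at 10 kb resolution.
mask = candidate_mask(n_bins=1000, bin_size=10_000)
upper_triangle = 1000 * 1001 // 2
print("fraction of the upper triangle searched:", mask.sum() / upper_triangle)
```

Because the matrix is symmetrical, only the upper triangle needs to be searched, and the band reduces the number of candidate pairs further. A dense mask is used here for clarity; for a large matrix, iterating over diagonals would avoid allocating the full n × n arrays.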
A maximum loop size can be defined to restrict the search space, taking the previously mentioned observation from Rao et al. [8] into account. In Hi-C, the primary data structure is the symmetrical n × n interaction count matrix (ICM):

$$
ICM = \begin{bmatrix} ic_{00} & \cdots & ic_{0n} \\ \vdots & \cdots & \vdots \\ ic_{n0} & \cdots & ic_{nn} \end{bmatrix} \tag{1}
$$

[...]

And third, similar to HOMER’s normalization, a correction for different occurring ligation events is offered:

$$
exp\_ligation_{d} = exp\_nonzero_{i,j} \cdot \frac{\sum row_{ICM}(i) \cdot \sum row_{ICM}(j)}{\sum ICM}
$$

Candidate selection per genomic distance

To detect enriched Hi-C interactions, the observed/expected normalized Hi-C data are fitted per genomic distance d independently to a continuous negative binomial distribution. Supplementary Fig. S1 shows the value density distribution of different genomic distances and provides evidence for the chosen distribution assumption. The negative binomial function, rather than the Poisson distribution, is used because the raw data of the genomic distances of chromosome 1 of the GM12878 cell line at 10 kb indicate overdispersion [11] in a majority of the distances (80.1%); therefor (...truncated)