ZipHiC: a novel Bayesian framework to identify enriched interactions and experimental biases in Hi-C data. (pdf)

Bioinformatics, 38(14), 2022, 3523–3531
https://doi.org/10.1093/bioinformatics/btac387
Advance Access Publication Date: 9 June 2022
Original Paper

Genome analysis

ZipHiC: a novel Bayesian framework to identify enriched interactions and experimental biases in Hi-C data

Itunu G. Osuntoki (1,2,*), Andrew Harrison (1), Hongsheng Dai (1), Yanchun Bao (1) and Nicolae Radu Zabet (3,4,*)

(1) Department of Mathematical Sciences, University of Essex, Colchester CO4 3SQ, UK
(2) Statistics, Modelling and Economics Department, UK Health Security Agency, London NW9 5EQ, UK
(3) School of Life Sciences, University of Essex, Colchester CO4 3SQ, UK
(4) Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London E1 2AT, UK

*To whom correspondence should be addressed.
Associate Editor: Peter Robinson
Received on October 8, 2021; revised on May 23, 2022; editorial decision on June 4, 2022; accepted on June 7, 2022

Abstract

Motivation: Several computational and statistical methods have been developed to analyze data generated through 3C-based methods, especially Hi-C. Most of the existing methods do not account for dependency in Hi-C data.

Results: Here, we present ZipHiC, a novel statistical method to explore Hi-C data focusing on the detection of enriched contacts. ZipHiC implements a Bayesian method based on a hidden Markov random field (HMRF) model and Approximate Bayesian Computation (ABC) to detect interactions in two-dimensional space based on a Hi-C contact frequency matrix. ZipHiC uses data on the sources of biases related to the contact frequency matrix, allows borrowing information from neighbours using the Potts model and improves computation speed using the ABC model. In addition to outperforming existing tools on both simulated and real data, our model also provides insights into different sources of biases that affect Hi-C data.
We show that some datasets display higher biases from DNA accessibility or Transposable Elements content. Furthermore, our analysis in Drosophila melanogaster showed that approximately half of the detected significant interactions connect promoters with other parts of the genome, indicating a functional biological role. Finally, we found that the micro-C datasets display higher biases from DNA accessibility compared to a similar Hi-C experiment, but this can be corrected by ZipHiC.

Availability and implementation: The R scripts are available at https://github.com/igosungithub/HMRFHiC.git.

Contact:

Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author(s) 2022. Published by Oxford University Press.

1 Introduction

Distant regulatory elements and their target genes are often separated by large genomic distances. For a regulatory element to activate its target gene, the two need to come into 3D proximity (Bonev and Cavalli, 2016; Hua et al., 2021). This indicates that the spatial organization of the genome is intimately related to genome regulation, and that a better understanding of the 3D organization of the genome is important in disentangling the contribution of different factors to gene regulation.

One of the recently developed genome-wide proximity ligation assays is the Hi-C technique (Lieberman-Aiden et al., 2009), which is a chromosome conformation capture (3C)-based method. Hi-C is able to detect interactions (short-range and long-range) within and between chromosomes at high resolutions. While in mammalian systems resolutions of 5 Kb have been achieved (Rao et al., 2014), in smaller genomes, such as Drosophila, sub-kilobase pair resolutions were obtained from Hi-C experiments (Chathoth and Zabet, 2019; Cubeñas-Potts et al., 2017; Eagen et al., 2017). In addition, datasets generated by Hi-C are highly reproducible between replicates and often highly conserved between tissues (Ghavi-Helm et al., 2014).
Recent technological advances have pushed the resolution of conformation capture methods to base pair resolution in mammalian systems (Hua et al., 2021). The data generated by a Hi-C experiment can be represented as a matrix of contact frequencies between pairs of regions along the genome. These matrices are associated with biases (Yaffe and Tanay, 2011), such as the restriction fragment length, GC content of trimmed ligation junctions and mappability, but many additional factors may also contribute to the contact counts. Correcting for these biases is important, and several methods have been proposed that take them into account (Hu et al., 2013; Imakaev et al., 2012; Servant et al., 2015; Yaffe and Tanay, 2011).

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

The Iterative Correction and Eigenvector decomposition (ICE) has been the most widely used method to account for biases associated with Hi-C data, owing to its simplicity and because it is parameter-free: it assumes equal visibility across all regions of the genome (Imakaev et al., 2012). Under this equal-visibility assumption, all regions can be probed by the method with the same probability. However, this assumption does not always hold, because the visibility of regions can vary (Imakaev et al., 2012; Servant et al., 2015). In addition, ICE is computationally intensive because the Hi-C interaction matrix is of size $O(N^2)$, where $N$ is the number of genomic regions.

The study of Rao et al. (2014) generated one of the highest-resolution maps of the 3D organization of the human genome using in situ Hi-C, which probes the 3D architecture of genomes through DNA–DNA proximity ligation in intact nuclei.
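
To make the equal-visibility assumption behind ICE concrete, the following minimal Python sketch rescales a symmetric contact matrix until every bin has approximately the same total coverage. It is an illustration under stated assumptions, not the implementation of Imakaev et al. (2012) or of ZipHiC: the function name ice_balance, its parameters max_iter and tol, and the toy matrix are hypothetical, and the filtering of low-coverage bins that practical pipelines typically apply is omitted.

```python
import numpy as np


def ice_balance(counts, max_iter=200, tol=1e-6):
    """Rescale a symmetric contact matrix so all non-empty bins have similar coverage.

    Returns the balanced matrix and the accumulated per-bin bias vector, so that
    counts == balanced * outer(bias, bias) up to floating-point error.
    (Hypothetical helper for illustration only.)
    """
    m = np.asarray(counts, dtype=float).copy()
    bias = np.ones(m.shape[0])
    for _ in range(max_iter):
        coverage = m.sum(axis=1)          # total contacts seen by each bin
        nonzero = coverage > 0
        if not nonzero.any():             # nothing to balance
            break
        # Relative visibility of each bin; empty bins are left unchanged.
        scale = coverage / coverage[nonzero].mean()
        scale[~nonzero] = 1.0
        # Divide entry (i, j) by scale[i] * scale[j] and track the total correction.
        m /= np.outer(scale, scale)
        bias *= scale
        if np.abs(scale - 1.0).max() < tol:
            break
    return m, bias


# Toy example: a distance-decaying signal distorted by made-up per-bin biases.
signal = np.array([[10, 4, 2, 1],
                   [4, 10, 4, 2],
                   [2, 4, 10, 4],
                   [1, 2, 4, 10]], dtype=float)
toy_bias = np.array([0.5, 1.0, 2.0, 1.5])
observed = signal * np.outer(toy_bias, toy_bias)

balanced, estimated_bias = ice_balance(observed)
print("coverage before:", observed.sum(axis=1))
print("coverage after: ", balanced.sum(axis=1))
```

The dense N × N array in this sketch also makes the quadratic cost noted above explicit: every iteration touches all entries of the interaction matrix.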
The map of Rao et al. (2014) revealed that the human genome is organized into sub-compartments globally and contains about 10 000 chromatin loops. To account for biases in Hi-C data, Rao et al. (2014) adopt the matrix-balancing method proposed in Knight and Ruiz (2013). In their approach, peaks are called only when a pair of regions of the genome shows elevated contact frequency relative to the local background; i.e. peaks are called when the peak pixel is enriched compared to other pixels in its neighbourhood.

Other methods take into account potential dependence among pairs of regions of the genome (Jin et al., 2013). To accurately identify chromatin interactions and loops with high sensitivity and resolution, Jin et al. (2013) used data filtering techniques based on the strand orientation of Hi-C paired-end reads. This also allows detection of short genomic distance interactions between restriction fragments, and their analysis shows the effects of GC content and mappability on the observed contact frequency. Interestingly, there seems to be a linear relationship between average trans-contact frequency and mappability (Jin et al., 2013). Loci that are in close 1D proximi (...truncated)