Hi-C analysis: from data generation to integration (pdf)

Article PDF cannot be displayed. You can download it here:

https://link.springer.com/content/pdf/10.1007%2Fs12551-018-0489-1.pdf

Hi-C analysis: from data generation to integration

Biophysical Reviews https://doi.org/10.1007/s12551-018-0489-1 REVIEW Hi-C analysis: from data generation to integration Koustav Pal 1 & Mattia Forcato 2 & Francesco Ferrari 1,3 Received: 12 October 2018 / Accepted: 3 December 2018 # The Author(s) 2018 Abstract In the epigenetics field, large-scale functional genomics datasets of ever-increasing size and complexity have been produced using experimental techniques based on high-throughput sequencing. In particular, the study of the 3D organization of chromatin has raised increasing interest, thanks to the development of advanced experimental techniques. In this context, Hi-C has been widely adopted as a high-throughput method to measure pairwise contacts between virtually any pair of genomic loci, thus yielding unprecedented challenges for analyzing and handling the resulting complex datasets. In this review, we focus on the increasing complexity of available Hi-C datasets, which parallels the adoption of novel protocol variants. We also review the complexity of the multiple data analysis steps required to preprocess Hi-C sequencing reads and extract biologically meaningful information. Finally, we discuss solutions for handling and visualizing such large genomics datasets. Keywords Chromatin 3D architecture . Epigenomics . Computational biology . High-throughput sequencing . Chromosome conformation capture The total length of DNA contained in a human cell would be 2 m long if completely stretched, i.e., considering the cumulative size of 6 billion nucleotides composing a diploid genome. However, such a long polymer must fit into a nucleus with an average diameter of 10 μm, i.e., five orders of magnitude shorter than the genome (Marti-Renom and Mirny 2011). This is not only a structural challenge, but also a functional one, as the genome must be densely packed, while at the same time preserving its function, i.e., being accessible to factors regulating transcription and replication. This is achieved thanks to the fact that the DNA inside the cell is never naked, but always associated to many proteins with a structural and functional role. The complex of DNA and Mattia Forcato and Francesco Ferrari contributed equally to this work. * Mattia Forcato * Francesco Ferrari 1 IFOM, the FIRC Institute of Molecular Oncology, Milan, Italy 2 Department of Life Sciences, University of Modena and Reggio Emilia, Modena, Italy 3 Institute of Molecular Genetics, National Research Council, Pavia, Italy associated proteins is named chromatin and its 3D organization inside the nucleus is not random but tightly regulated (Cavalli and Misteli 2013). Our knowledge of chromatin 3D organization has greatly increased over the past 20 years thanks to the development of novel experimental techniques, including high-resolution and high-throughput imaging techniques (Huang et al. 2010; Zane et al. 2017) and other molecular biology techniques. Among the latter, chromosome conformation capture (3C) (Dekker et al. 2002) and its high-throughput derivatives have been the most prominent ones. 3C allows probing physical interaction between non-adjacent genomic loci. The technique is based on cross-linking of DNA and associated proteins to stabilize chromatin 3D structure, then digesting DNA with restriction enzymes. The loose DNA fragment ends are then re-ligated, so as to obtain hybrid molecules, which may contain two fragments of DNA that were not adjacent but indeed far apart in the original linear genomic sequence. The fact that they are ligated together at the end of the process indicates some degree of physical proximity at the beginning of the experimental procedure. By analyzing the resulting hybrid molecules, we can assess the physical interaction between distant genomic loci (Belton et al. 2012). This can be assessed with PCR, using a pair of primers specifically designed to target predefined regions, as per the original 3C protocol. However, other high-throughput derivatives of 3C based on Biophys Rev microarrays hybridization (Dostie et al. 2006; Simonis et al. 2006) or high-throughput sequencing have been proposed subsequently. Among them, 4C allows detecting pairwise interactions between one target anchor point and potentially any other genomic region (van de Werken et al. 2012), whereas 5C allows probing multiple pairwise interactions between predesigned anchor points (Phillips-Cremins et al. 2013). HiC is the most comprehensive and high-throughput derivative, allowing us to score contact frequency between virtually any pair of genomic loci (Lieberman-Aiden et al. 2009). This results in very large and complex datasets, especially for large genomes, as the number of possible pairwise interactions increases exponentially with the genome length. As such in this review on big-data challenges in epigenomics, we will focus especially on datasets obtained from mammalian genomes, as well as on data analysis solutions used in this context. Hi-C data availability: increasing size and resolution Hi-C data allows examining the genome 3D organization at multiple scales (Rocha et al. 2015; Fraser et al. 2015). On a large scale, the genome is organized in distinct Bcompartments.^ Namely, active (BA^) and inactive (BB^) compartments have been identified from Hi-C contact maps analysis, and they correlate with the presence of active or inactive chromatin domains, respectively. The active compartment includes genomic regions characterized by transcription or epigenetic marks associated to open chromatin. Instead the inactive compartment covers regions with compact heterochromatin and gene expression silencing epigenetic marks (LiebermanAiden et al. 2009). When analyzing local patterns in the contact matrix instead, the topologically associating domains (TADs) emerge as a key feature, i.e., regions characterized by high intradomain contact frequency, and reduced interdomain contacts (Sexton et al. 2012; Dixon et al. 2012; Nora et al. 2012). On an even finer scale, Hi-C data have been used to identify specific points of contact between distant chromatin regions. Sometimes interactions are called chromatin loops, when referring to intrachromosomal (cis) contacts (Jin et al. 2013; Rao et al. 2014). This level of analysis is especially challenging for the resolution limit of Hi-C data. Hi-C data resolution is primarily defined by (1) the restriction enzymes used in the experimental procedure and by (2) the sequencing depth. Over the years, we have witnessed an attempt to increase the resolution of Hi-C data by working on these parameters, resulting in available datasets characterized by increasing size and resolution, reaching very high numbers of sequenced reads, especially for mammalian genomes. In addition, specific protocol variations have been proposed with the aim of improving the resolution. The classical Hi-C technique involves restriction digestion of a formaldehyde cross-linked genome with sequence specific restriction enzymes, followed (...truncated)