Hi-C analysis: from data generation to integration
Biophysical Reviews
https://doi.org/10.1007/s12551-018-0489-1
REVIEW
Koustav Pal 1 & Mattia Forcato 2 & Francesco Ferrari 1,3
Received: 12 October 2018 / Accepted: 3 December 2018
© The Author(s) 2018
Abstract
In the epigenetics field, large-scale functional genomics datasets of ever-increasing size and complexity have been produced
using experimental techniques based on high-throughput sequencing. In particular, the study of the 3D organization of chromatin
has raised increasing interest, thanks to the development of advanced experimental techniques. In this context, Hi-C has been
widely adopted as a high-throughput method to measure pairwise contacts between virtually any pair of genomic loci, thus
yielding unprecedented challenges for analyzing and handling the resulting complex datasets. In this review, we focus on the
increasing complexity of available Hi-C datasets, which parallels the adoption of novel protocol variants. We also review the
complexity of the multiple data analysis steps required to preprocess Hi-C sequencing reads and extract biologically meaningful
information. Finally, we discuss solutions for handling and visualizing such large genomics datasets.
Keywords: Chromatin 3D architecture · Epigenomics · Computational biology · High-throughput sequencing · Chromosome
conformation capture
The DNA contained in a human cell would be about
2 m long if completely stretched, i.e., considering the cumulative size of the 6 billion nucleotides composing a diploid genome. However, such a long polymer must fit into a nucleus
with an average diameter of 10 μm, i.e., five orders of magnitude shorter than the genome (Marti-Renom and Mirny
2011). This is not only a structural challenge, but also a functional one, as the genome must be densely packed, while at the
same time preserving its function, i.e., being accessible to
factors regulating transcription and replication. This is
achieved because the DNA inside the cell is
never naked, but always associated with many proteins with a
structural and functional role. The complex of DNA and
associated proteins is named chromatin, and its 3D organization inside the nucleus is not random but tightly regulated
(Cavalli and Misteli 2013).
Mattia Forcato and Francesco Ferrari contributed equally to this work.
* Mattia Forcato
* Francesco Ferrari
1 IFOM, the FIRC Institute of Molecular Oncology, Milan, Italy
2 Department of Life Sciences, University of Modena and Reggio Emilia, Modena, Italy
3 Institute of Molecular Genetics, National Research Council, Pavia, Italy
Our knowledge of chromatin 3D organization has greatly
increased over the past 20 years thanks to the development of
novel experimental techniques, including high-resolution and
high-throughput imaging techniques (Huang et al. 2010; Zane
et al. 2017) and other molecular biology techniques. Among
the latter, chromosome conformation capture (3C) (Dekker
et al. 2002) and its high-throughput derivatives have been
the most prominent ones. 3C allows probing physical interaction between non-adjacent genomic loci. The technique is
based on cross-linking of DNA and associated proteins to
stabilize chromatin 3D structure, then digesting DNA with
restriction enzymes. The loose DNA fragment ends are then
re-ligated, so as to obtain hybrid molecules, which may contain two fragments of DNA that were not adjacent but indeed
far apart in the original linear genomic sequence. The fact that
they are ligated together at the end of the process indicates
some degree of physical proximity at the beginning of the
experimental procedure. By analyzing the resulting hybrid
molecules, we can assess the physical interaction between
distant genomic loci (Belton et al. 2012). This can be assessed
with PCR, using a pair of primers specifically designed to
target predefined regions, as per the original 3C protocol.
However, other high-throughput derivatives of 3C based on
microarray hybridization (Dostie et al. 2006; Simonis et al.
2006) or high-throughput sequencing have been proposed
subsequently. Among them, 4C allows detecting pairwise interactions between one target anchor point and potentially any
other genomic region (van de Werken et al. 2012), whereas 5C
allows probing multiple pairwise interactions between
predesigned anchor points (Phillips-Cremins et al. 2013). Hi-C is the most comprehensive and high-throughput derivative,
allowing us to score contact frequency between virtually any
pair of genomic loci (Lieberman-Aiden et al. 2009). This results in very large and complex datasets, especially for large
genomes, as the number of possible pairwise interactions increases quadratically with the genome length. As such, in this
review on big-data challenges in epigenomics, we will focus
especially on datasets obtained from mammalian genomes, as
well as on data analysis solutions used in this context.
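To give a sense of scale for the number of possible pairwise interactions mentioned above, the following minimal sketch counts the unordered pairs of fixed-size genomic bins at a few bin sizes. The function names and the chosen bin sizes are illustrative assumptions, and the haploid genome length of 3 billion bp is simply half of the 6 billion nucleotides quoted earlier for the diploid genome. For n bins there are n(n + 1)/2 unordered pairs (including each bin paired with itself), so the count grows with the square of the number of bins.

```python
def n_bins(genome_length_bp: int, bin_size_bp: int) -> int:
    """Number of fixed-size bins needed to cover the genome (ceiling division)."""
    return -(-genome_length_bp // bin_size_bp)


def n_pairs(bins: int) -> int:
    """Number of unordered bin pairs, including each bin paired with itself."""
    return bins * (bins + 1) // 2


# Assumption: ~3 billion bp haploid genome, i.e. half of the 6 billion
# nucleotides quoted in the text for a diploid human genome.
HAPLOID_BP = 3_000_000_000

# Illustrative bin sizes only; they are not taken from the review.
for bin_size in (1_000_000, 100_000, 10_000, 1_000):
    bins = n_bins(HAPLOID_BP, bin_size)
    print(f"{bin_size:>9} bp bins: {bins:>9} bins, {n_pairs(bins):.3e} pairs")
```

Under these assumptions, a tenfold reduction in bin size increases the number of bin pairs roughly a hundredfold, which is why contact matrices for mammalian genomes become very large at fine resolution.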
Hi-C data availability: increasing size and resolution
Hi-C data allows examining the genome 3D organization at
multiple scales (Rocha et al. 2015; Fraser et al. 2015). On a
large scale, the genome is organized in distinct "compartments." Namely, active ("A") and inactive ("B") compartments have been identified from Hi-C contact maps analysis,
and they correlate with the presence of active or inactive chromatin domains, respectively. The active compartment includes
genomic regions characterized by transcription or epigenetic
marks associated with open chromatin. In contrast, the inactive compartment covers regions with compact heterochromatin and
epigenetic marks associated with gene silencing (Lieberman-Aiden et al. 2009). When analyzing local patterns in the contact matrix instead, the topologically associating domains
(TADs) emerge as a key feature, i.e., regions characterized
by high intradomain contact frequency, and reduced
interdomain contacts (Sexton et al. 2012; Dixon et al. 2012;
Nora et al. 2012). On an even finer scale, Hi-C data have been
used to identify specific points of contact between distant
chromatin regions. These interactions are sometimes called chromatin loops when they occur between loci on the same chromosome (intrachromosomal, or cis, contacts)
(Jin et al. 2013; Rao et al. 2014). This level of analysis is
especially challenging because of the resolution limit of Hi-C data.
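The definition of TADs given above (high intradomain and reduced interdomain contact frequency) can be illustrated with a minimal sketch on a toy contact matrix. The function name `domain_contrast`, the toy matrix, and the domain boundaries are all hypothetical and chosen only for illustration. This is not a TAD-calling algorithm: it assumes the domain boundaries are already known and simply compares the two average contact frequencies.

```python
def domain_contrast(matrix, boundaries):
    """Return (mean intradomain, mean interdomain) contact frequency.

    matrix: square, symmetric list of lists of contact counts.
    boundaries: bin indices where each domain starts, e.g. [0, 3] means
                one domain covering bins 0-2 and one covering bins 3-end.
    """
    n = len(matrix)
    starts = set(boundaries)
    # Label every bin with the index of the domain it belongs to.
    labels = []
    domain = -1
    for i in range(n):
        if i in starts:
            domain += 1
        labels.append(domain)
    intra, inter = [], []
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle, diagonal excluded
            target = intra if labels[i] == labels[j] else inter
            target.append(matrix[i][j])
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy 6x6 matrix (made-up values): two 3-bin domains with 10 contacts
# between bins of the same domain and 2 contacts between domains.
toy = [
    [0 if i == j else (10 if (i < 3) == (j < 3) else 2) for j in range(6)]
    for i in range(6)
]
print(domain_contrast(toy, [0, 3]))  # (10.0, 2.0)
```

In real data the contrast is far less sharp than in this toy matrix, because contact frequency also decays with genomic distance within and across domains.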
Hi-C data resolution is primarily defined by (1) the restriction enzymes used in the experimental procedure and by (2)
the sequencing depth. Over the years, we have witnessed an
attempt to increase the resolution of Hi-C data by working on
these parameters, resulting in available datasets characterized
by increasing size and resolution, reaching very high numbers
of sequenced reads, especially for mammalian genomes. In
addition, specific protocol variations have been proposed with
the aim of improving the resolution.
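A back-of-the-envelope sketch can show how the two parameters named above, the restriction enzyme and the sequencing depth, bound the achievable resolution. It rests on two simplifying assumptions that are not from the review: a uniform random base composition, under which a recognition site of k bp is expected about once every 4^k bp, and read pairs spread evenly over all bin pairs, which real Hi-C data do not satisfy. The function names, the genome length of 3 billion bp, and the depth of one billion read pairs are illustrative choices.

```python
def expected_fragment_length(site_length_bp: int) -> int:
    """Expected spacing (bp) between recognition sites of the given length,
    assuming a uniform random sequence with four equally likely bases."""
    return 4 ** site_length_bp


def mean_reads_per_bin_pair(n_read_pairs: float,
                            genome_length_bp: int,
                            bin_size_bp: int) -> float:
    """Average read pairs per unordered bin pair, assuming (unrealistically)
    that read pairs are spread evenly over all bin pairs."""
    bins = -(-genome_length_bp // bin_size_bp)  # ceiling division
    pairs = bins * (bins + 1) // 2
    return n_read_pairs / pairs


# Enzymes with 6 bp versus 4 bp recognition sites (illustrative lengths).
for k in (6, 4):
    print(f"{k} bp site: one cut every ~{expected_fragment_length(k)} bp")

# Assumed values: 3e9 bp genome and 1e9 sequenced read pairs.
for bin_size in (1_000_000, 100_000, 10_000):
    mean = mean_reads_per_bin_pair(1e9, 3_000_000_000, bin_size)
    print(f"{bin_size:>9} bp bins: {mean:.4f} read pairs per bin pair")
```

Under these assumptions, the enzyme sets a lower limit on the size of the fragments that can be distinguished, while the sequencing depth determines how sparsely the contact matrix is populated at a given bin size.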
The classical Hi-C technique involves restriction digestion
of a formaldehyde cross-linked genome with sequence-specific
restriction enzymes, followed (...truncated)