Reconstructing spatial organizations of chromosomes through manifold learning
Nucleic Acids Research
Guangxiang Zhu 2
Wenxuan Deng 1
Hailin Hu 0
Rui Ma 2
Sai Zhang 2
Jinglin Yang 2
Jian Peng 4
Tommy Kaplan 3
Jianyang Zeng 2
0 School of Medicine, Tsinghua University , Beijing 100084 , China
1 Department of Biostatistics, Yale University , New Haven, CT , USA
2 Institute for Interdisciplinary Information Sciences, Tsinghua University , Beijing 100084 , China
3 School of Computer Science and Engineering, The Hebrew University of Jerusalem , Jerusalem, 91904 , Israel
4 Department of Computer Science, University of Illinois at Urbana-Champaign , Urbana, IL , USA
Decoding the spatial organizations of chromosomes has crucial implications for studying eukaryotic gene regulation. Recently, chromosomal conformation capture based technologies, such as Hi-C, have been widely used to uncover the interaction frequencies of genomic loci in a high-throughput and genome-wide manner and provide new insights into the folding of three-dimensional (3D) genome structure. In this paper, we develop a novel manifold learning based framework, called GEM (Genomic organization reconstructor based on conformational Energy and Manifold learning), to reconstruct the three-dimensional organizations of chromosomes by integrating Hi-C data with biophysical feasibility. Unlike previous methods, which explicitly assume specific relationships between Hi-C interaction frequencies and spatial distances, our model directly embeds the neighboring affinities from Hi-C space into 3D Euclidean space. Extensive validations demonstrated that GEM not only greatly outperformed other state-of-the-art modeling methods but also provided physically and physiologically valid 3D representations of the organizations of chromosomes. Furthermore, we for the first time apply the modeled chromatin structures to recover long-range genomic interactions missing from original Hi-C data.
INTRODUCTION
The three-dimensional (3D) organizations of chromosomes
in the nucleus are closely related to diverse genomic
functions, such as transcription regulation, DNA replication
and genome integrity (1–4). Therefore, decoding the 3D
genomic architecture has important implications in
revealing the underlying mechanisms of gene activities.
Unfortunately, our current understanding of 3D genome
folding and the related cellular functions remains largely
limited. In recent years, proximity ligation based
chromosome conformation capture (3C) (5,6) and its extended
methods, such as Hi-C (7) and chromatin interaction
analysis by paired-end tag sequencing (ChIA-PET) (8), have
provided a revolutionary tool to study the 3D
organizations of chromosomes at different resolutions in various cell
types, organisms and species by measuring the interaction
frequencies between genomic loci nearby in space.
To gain better mechanistic insights into the 3D folding
of the genome, it is necessary to reconstruct
the 3D spatial arrangements of chromosomes based on
the interaction frequencies derived from 3C-based data.
Indeed, the modeling results of 3D genome structure can shed
light on the relationship between complex chromatin
structure and its regulatory functions in controlling genomic
activities (1–4). However, the modeling of 3D chromatin
structure is not a trivial task, as it is often complicated by
uncertainty and sparsity in experimental data, as well as the
highly dynamic and stochastic nature of chromatin structure itself.
Generally speaking, in the 3D genome structure modeling
problem, we are given Hi-C data, which can be represented by
a matrix where each element represents the interaction
frequency of a pair of genomic loci, and our goal is to
reconstruct the 3D organization of genome structure and obtain
the 3D spatial coordinates of all genomic loci. In practice,
besides Hi-C data, other known constraints, such
as the shape and size of the nucleus, can also be integrated to
achieve more reliable modeling results and further enhance
the physical and biological relevance of the reconstructed
genomic structure (9,10).
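The input and output of the modeling problem described above can be sketched in a few lines of Python. This is only an illustration of the data shapes, not the GEM method itself: the interaction counts, the candidate coordinates and the helper name pairwise_distances are all made up for this example, and NumPy is assumed to be available.

```python
import numpy as np


def pairwise_distances(coords):
    """Return the N x N matrix of Euclidean distances between the rows of coords."""
    # Broadcasting gives an N x N x 3 array of coordinate differences.
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Input: a symmetric Hi-C interaction-frequency matrix for 4 loci.
# Entry F[i, j] is the interaction frequency of loci i and j (made-up counts).
F = np.array([
    [0, 9, 4, 1],
    [9, 0, 8, 3],
    [4, 8, 0, 7],
    [1, 3, 7, 0],
], dtype=float)

# Output: one (x, y, z) coordinate per locus (arbitrary placeholder values,
# standing in for whatever a reconstruction method would produce).
coords = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
])

# Spatial distances implied by the candidate structure.
D = pairwise_distances(coords)
```

A reconstruction method takes a matrix like F and returns an array like coords; the distance matrix D is what one would then compare against the Hi-C data or against extra constraints such as the nuclear size.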
In recent years, numerous computational methods have
been developed to reconstruct the 3D organizations of
chromosomes (5,7,11–28). Most of these approaches, such as
the multidimensional scaling (MDS) (29,30) based method,
ChromSDE (17), ShRec3D (18) and miniMDS (27),
depended heavily on the formula F ∝ 1/D^α to represent the
conversion from interaction frequencies F to spatial distances
D (where α is a constant). Instead of using the above
inverse-proportion relationship, BACH (16) employed a
Poisson distribution to define the relation between Hi-C
interaction frequencies, spatial distances and other genomic
features (e.g., fragment length, GC content and
mappability score). After converting Hi-C interaction frequencies
into distances, these previous modeling approaches applied
various strategies to recon (...truncated)