Integrative reconstruction of cancer genome karyotypes using InfoGenomeR
ARTICLE
https://doi.org/10.1038/s41467-021-22671-6
OPEN
Yeonghun Lee1 & Hyunju Lee1✉
Annotation of structural variations (SVs) and base-level karyotyping in cancer cells remains
challenging. Here, we present Integrative Framework for Genome Reconstruction (InfoGenomeR), a graph-based framework that can reconstruct individual SVs into karyotypes based
on whole-genome sequencing data, by integrating SVs, total copy number alterations, allele-specific copy numbers, and haplotype information. Using whole-genome sequencing data
sets of patients with breast cancer, glioblastoma multiforme, and ovarian cancer, we
demonstrate the analytical potential of InfoGenomeR. We identify recurrent derivative
chromosomes derived from chromosomes 11 and 17 in breast cancer samples, with homogeneously staining regions for CCND1 and ERBB2, and double minutes and breakage-fusion-bridge cycles in glioblastoma multiforme and ovarian cancer samples, respectively. Moreover,
we show that InfoGenomeR can discriminate private and shared SVs between primary and
metastatic cancer sites that could contribute to tumour evolution. These findings indicate that
InfoGenomeR can guide targeted therapies by unravelling cancer-specific SVs on a genome-wide scale.
1 School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea. ✉email:
NATURE COMMUNICATIONS | (2021)12:2467 | https://doi.org/10.1038/s41467-021-22671-6 | www.nature.com/naturecommunications
Cancer cells acquire numerous changes in their DNA,
ranging from point mutations to DNA rearrangements,
that ultimately result in a complex cancer-associated
genome. Recurrent chromosomal structural variations (SVs) have
been linked to tumorigenesis, including simple SVs such as tandem duplications, deletions, inversions and insertions, which
have been extensively studied1,2, as well as more complex SVs
such as translocations, fold-back inversions, chromothripsis,
homogeneously staining regions (HSRs, representing repetitive
gene amplification) and double minutes (DMs, extrachromosomal DNA)3,4. Traditional karyotyping techniques, such
as G-banding and fluorescence in situ hybridisation (FISH), can
reveal the presence of complex SVs in derivative chromosomes
(by-products of the recombination of multiple chromosomes with
intact centromeres) or marker chromosomes (abnormal chromosomes with unidentified genomic segments)5. However, owing
to their limited resolution (~5 Mb), standard karyotyping techniques cannot be used to accurately identify complex SVs in
derivative or marker chromosomes.
High-throughput sequencing has advanced our understanding
of SVs by resolving the genomic changes at the single-base level.
Early-stage methods have been developed to detect SVs using
discordant and split reads from sequencing data6–9; however,
these methods have limited detection ability for SV breakpoints in
local genomic windows. Recently, several methods10–19 that
integrate genomic information, such as cancer purity and ploidy,
total copy number alterations (CNAs), allele-specific CNAs and
haplotype information, have been developed to identify SVs. They
use a graph-based representation for rearranged cancer genomes
but do not analyse the actual karyotypes of linear and/or circular
chromosomes and therefore do not produce karyotypic topologies such as
HSRs, DMs or chromothripsis. Global reconstruction of genome
karyotypes in cancers may help uncover the mechanisms
underlying cancer development and evolution.
In this article, we present a method to reconstruct cancer
genome karyotypes based on complex topology analysis, providing a haplotype graph-based representation. Our graph-based framework, named Integrative Framework for Genome
Reconstruction (InfoGenomeR), uses a breakpoint graph to
model the connectivity among genomic segments on a genome-wide scale, taking as input SV calls, unmapped reads, read-depth
information and single nucleotide polymorphisms (SNPs).
Furthermore, the InfoGenomeR tool classifies the rearrangement topologies and derives the cancer genome karyotypes
from the haplotype graphical output (Supplementary Fig. 1).
We show the analytical potential of our method by comparing it
with existing tools using simulation data and cancer cell line
data. Moreover, using whole-genome sequencing (WGS) data from The Cancer Genome
Atlas (TCGA)20–22 and the European Genome–phenome Archive
(EGA)23, we show that InfoGenomeR can reconstruct the
karyotypes of cancer cells, distinguish between private and
shared SVs in primary and metastatic cancer cells, and thereby reveal
tumour evolution.
Results
InfoGenomeR reconstructs candidate genome karyotypes. First,
InfoGenomeR evaluates all reads in WGS data sets, generates
initial SV calls using the tools DELLY26, Manta7 and novoBreak8
(Fig. 1a), and performs initial CN segmentation using BIC-seq224.
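As an illustration only, the way SV and CN breakpoints can be combined into a breakpoint graph may be sketched as follows. This is a minimal sketch under stated assumptions, not InfoGenomeR's implementation: the function names, the single-chromosome input and the `(position, side)` node encoding are hypothetical, and the balance check assumes the common formulation in which the copy number of a segment equals the summed multiplicity of the reference and SV edges at each of its non-telomeric ends.

```python
from collections import defaultdict


def build_breakpoint_graph(chrom_len, breakpoints, sv_calls):
    """Cut one chromosome at the breakpoints and list the three edge types.

    breakpoints: positions p meaning a cut between base p and base p + 1.
    sv_calls:    pairs of nodes joined by a novel adjacency.
    Nodes are segment ends, encoded as (position, "L") or (position, "R").
    (Hypothetical encoding chosen for this sketch.)
    """
    cuts = sorted(set(breakpoints))
    starts = [1] + [p + 1 for p in cuts]
    ends = cuts + [chrom_len]
    # Segment edges join the two ends of each genomic segment.
    segment_edges = [((s, "L"), (e, "R")) for s, e in zip(starts, ends)]
    # Reference edges join neighbouring segments as in the reference genome.
    reference_edges = [((p, "R"), (p + 1, "L")) for p in cuts]
    # SV edges join segment ends that are adjacent only in the cancer genome.
    sv_edges = [tuple(sv) for sv in sv_calls]
    return segment_edges, reference_edges, sv_edges


def unbalanced_nodes(segment_cn, adjacency_multiplicity, telomeres):
    """Return nodes whose segment CN differs from the incident edge multiplicity."""
    incident = defaultdict(int)
    for (u, v), m in adjacency_multiplicity.items():
        incident[u] += m
        incident[v] += m  # an edge joining a node to itself counts twice
    bad = []
    for (left, right), cn in segment_cn.items():
        for node in (left, right):
            if node not in telomeres and incident[node] != cn:
                bad.append(node)
    return bad


# Toy example: a tandem duplication of bases 201-400 on a 1,000-base chromosome.
seg_e, ref_e, sv_e = build_breakpoint_graph(
    1000, [200, 400], [((400, "R"), (201, "L"))]
)
segment_cn = dict(zip(seg_e, [1, 2, 1]))          # duplicated segment has CN 2
multiplicity = {ref_e[0]: 1, ref_e[1]: 1, sv_e[0]: 1}
telomeres = {(1, "L"), (1000, "R")}
print(unbalanced_nodes(segment_cn, multiplicity, telomeres))  # prints []
```

In this toy case every internal node is balanced when the SV edge has multiplicity 1; setting its multiplicity to 0 leaves both ends of the duplicated segment unbalanced, which is the kind of inconsistency the integer programming step described below has to resolve.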
Then, it constructs an initial breakpoint graph of local genomic
segments using the initial SV and CN breakpoints. The breakpoint graph is composed of nodes, segment edges, reference
edges, and SV edges. The following three-step iterations update
the initial breakpoint graph. In each iteration, (i) local genomic
segments are refined, (ii) integer CNs of genomic segments are
estimated using purity and ploidy (ABSOLUTE25) and (iii) the
integer programming of the CN balance condition26 determines
the edge multiplicities of the breakpoint graph and removes zero-multiplicity SVs. Each subsequent iteration restarts from the SV set with
zero-multiplicity SVs removed: CN segmentation is repeated without the
previous false-positive SV breakpoints, and integer CNs of segments
are recalculated. Iterations are performed until the graph converges
(no zero-multiplicity SV is observed). The iterations are run in two rounds that differ in the segmentation parameter: CN segments are merged with their
neighbouring CN segments more often in the second round
than in the first. At the intermediate step
between the first and second rounds of iterations, the discordant or
unmapped reads, which do not pair properly, are remapped to the
sequences of candidate adjacencies from unbalanced nodes
(Fig. 1b). Then, candidate adjacencies supported by their reads are
generated, and the second-round iterations finalise the breakpoint
graph. Next, integer CNs are divided into allele-specific CNs (ASCNs) using negative
binomial models for the different depths of heterozygous SNPs, and
the expectation–maximisation (EM) algorithm is used to estimate the model parameters. Integer programming under the CN balance condition with the ASCNs constructs the allele-specific breakpoint
graph and then the imbalanced heterozygous SNP sequences are
phased (Fig. 1c). Genomic segments with balanced heterozygous
SNPs are phased using a hidden Markov model (BEAGLE27), and
the fi (...truncated)