Integrative reconstruction of cancer genome karyotypes using InfoGenomeR

ARTICLE OPEN https://doi.org/10.1038/s41467-021-22671-6
NATURE COMMUNICATIONS | (2021)12:2467 | https://doi.org/10.1038/s41467-021-22671-6 | www.nature.com/naturecommunications

Yeonghun Lee 1 & Hyunju Lee 1✉
1 School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea. ✉email:

Annotation of structural variations (SVs) and base-level karyotyping in cancer cells remains challenging. Here, we present Integrative Framework for Genome Reconstruction (InfoGenomeR), a graph-based framework that can reconstruct individual SVs into karyotypes based on whole-genome sequencing data, by integrating SVs, total copy number alterations, allele-specific copy numbers, and haplotype information. Using whole-genome sequencing data sets of patients with breast cancer, glioblastoma multiforme, and ovarian cancer, we demonstrate the analytical potential of InfoGenomeR. We identify recurrent derivative chromosomes derived from chromosomes 11 and 17 in breast cancer samples, with homogeneously staining regions for CCND1 and ERBB2, and double minutes and breakage-fusion-bridge cycles in glioblastoma multiforme and ovarian cancer samples, respectively. Moreover, we show that InfoGenomeR can discriminate private and shared SVs between primary and metastatic cancer sites that could contribute to tumour evolution. These findings indicate that InfoGenomeR can guide targeted therapies by unravelling cancer-specific SVs on a genome-wide scale.

Cancer cells acquire numerous changes in their DNA, ranging from point mutations to DNA rearrangements, that ultimately result in a complex cancer-associated genome. 
Recurrent chromosomal structural variations (SVs) have been linked to tumorigenesis, including simple SVs such as tandem duplications, deletions, inversions and insertions, which have been extensively studied1,2, as well as more complex SVs such as translocations, fold-back inversions, chromothripsis, homogeneously staining regions (HSRs, representing repetitive gene amplification) and double minutes (DMs, extrachromosomal DNA)3,4. Traditional karyotyping techniques, such as G-banding and fluorescent in situ hybridisation (FISH), can reveal the presence of complex SVs in derivative chromosomes (by-products of the recombination of multiple chromosomes with intact centromeres) or marker chromosomes (abnormal chromosomes with unidentified genomic segments)5. However, owing to their limited resolution (~5 Mb), standard karyotyping techniques cannot accurately identify complex SVs in derivative or marker chromosomes. High-throughput sequencing has advanced our understanding of SVs by resolving genomic changes at the single-base level. Early-stage methods were developed to detect SVs using discordant and split reads from sequencing data6–9; however, these methods have limited ability to detect SV breakpoints because they examine only local genomic windows. Recently, several methods10–19 that integrate genomic information, such as cancer purity and ploidy, total copy number alterations (CNAs), allele-specific CNAs and haplotype information, have been developed to identify SVs. They use a graph-based representation for rearranged cancer genomes but do not analyse the actual karyotypes of linear and/or circular chromosomes, and thus do not produce karyotypic topologies such as HSRs, DMs, or chromothripsis. Global reconstruction of genome karyotypes in cancers may help uncover the mechanisms underlying cancer development and evolution. 
In this article, we present a method to reconstruct cancer genome karyotypes based on complex topology analysis, providing a haplotype graph-based representation. Our graph-based framework, named Integrative Framework for Genome Reconstruction (InfoGenomeR), uses a breakpoint graph to model the connectivity among genomic segments on a genome-wide scale, taking SV calls, unmapped reads, read-depth information and single nucleotide polymorphisms (SNPs) as input. Furthermore, the InfoGenomeR tool classifies the rearrangement topologies and derives the cancer genome karyotypes from the haplotype graphical output (Supplementary Fig. 1). We show the analytical potential of our method by comparing it with existing tools using simulation data and cancer cell line data. Moreover, using whole-genome sequencing (WGS) data from The Cancer Genome Atlas (TCGA)20–22 and European Genome–phenome Archive (EGA)23, we show that InfoGenomeR can reconstruct the karyotypes of cancer cells, distinguish between private and shared SVs in primary and metastatic cancer cells, and reveal tumour evolution.

Results

InfoGenomeR reconstructs candidate genome karyotypes. First, InfoGenomeR evaluates all reads in WGS data sets, generates initial SV calls using the tools DELLY26, Manta7 and novoBreak8 (Fig. 1a), and performs initial CN segmentation using BIC-seq224. Then, it constructs an initial breakpoint graph of local genomic segments using the initial SV and CN breakpoints. The breakpoint graph is composed of nodes connected by segment edges, reference edges and SV edges. The following three-step iterations update the initial breakpoint graph. In each iteration, (i) local genomic segments are refined, (ii) integer CNs of genomic segments are estimated using purity and ploidy (ABSOLUTE25) and (iii) the integer programming of the CN balance condition26 determines the edge multiplicities of the breakpoint graph and removes zero-multiplicity SVs. 
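The role of the CN balance condition in step (iii) can be illustrated with a minimal sketch. It assumes that the condition requires, at every non-telomeric node, the integer CN of the segment to equal the summed multiplicities of the reference and SV edges incident to that node; the exact formulation used by InfoGenomeR (ref. 26) may differ. The sketch does not solve the integer program. It only checks a given multiplicity assignment and prunes zero-multiplicity SV edges. All names (Edge, node_imbalance, unbalanced_nodes, drop_zero_multiplicity_svs, the "segment:head"/"segment:tail" node labels) and the toy data are hypothetical and are not taken from the InfoGenomeR code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    kind: str          # "reference" or "sv"
    u: str             # node label, e.g. "B:head" (hypothetical naming)
    v: str
    multiplicity: int  # how many times this adjacency is used


def node_imbalance(segment_cn, edges):
    """Return {node: segment CN minus summed multiplicity of incident edges}."""
    incident = defaultdict(int)
    for e in edges:
        incident[e.u] += e.multiplicity
        incident[e.v] += e.multiplicity  # a self-loop (u == v) counts twice
    return {
        f"{seg}:{end}": cn - incident[f"{seg}:{end}"]
        for seg, cn in segment_cn.items()
        for end in ("head", "tail")
    }


def unbalanced_nodes(segment_cn, edges, telomeres=frozenset()):
    """List nodes that violate the assumed balance condition, ignoring telomeres."""
    imbalance = node_imbalance(segment_cn, edges)
    return sorted(n for n, d in imbalance.items() if d != 0 and n not in telomeres)


def drop_zero_multiplicity_svs(edges):
    """Remove SV edges whose multiplicity is zero; reference edges are kept."""
    return [e for e in edges if not (e.kind == "sv" and e.multiplicity == 0)]


# Toy chromosome A-B-C in which segment B is tandemly duplicated (CN 2).
segment_cn = {"A": 1, "B": 2, "C": 1}
edges = [
    Edge("reference", "A:tail", "B:head", 1),
    Edge("reference", "B:tail", "C:head", 1),
    Edge("sv", "B:tail", "B:head", 1),  # tandem duplication of B
    Edge("sv", "A:tail", "C:head", 0),  # SV call assigned zero multiplicity
]
telomeres = {"A:head", "C:tail"}

kept = drop_zero_multiplicity_svs(edges)
print(unbalanced_nodes(segment_cn, kept, telomeres))  # prints []
```

In this toy case the tandem-duplication edge accounts for the extra copy of B, so every internal node is balanced, while the zero-multiplicity SV edge is discarded. Omitting the duplication edge would leave B:head and B:tail unbalanced, which is the kind of node the text later uses as a starting point for candidate adjacencies.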
Each iteration restarts with the SV set from which zero-multiplicity SVs have been removed: CN segmentation is performed without the previous false-positive SV breakpoints, and integer CNs of segments are recalculated. Iterations are performed until the graph converges (no zero-multiplicity SV is observed). The iterations are divided into first and second rounds that differ in the segmentation parameter, and CN segments are merged with their neighbouring CN segments more often in the second-round iterations than in the first-round iterations. At the intermediate step between the first and second rounds of iterations, the discordant or unmapped reads, which do not pair properly, are remapped to the sequences of candidate adjacencies from unbalanced nodes (Fig. 1b). Then, candidate adjacencies supported by their reads are generated, and the second-round iterations finalise the breakpoint graph. Next, integer CNs are divided into allele-specific CNs (ASCNs) using negative binomial models for the different depths of heterozygous SNPs, and the expectation–maximisation (EM) algorithm is used to estimate the parameters. Integer programming under the CN balance condition with the ASCNs constructs the allele-specific breakpoint graph, and then the imbalanced heterozygous SNP sequences are phased (Fig. 1c). Genomic segments with balanced heterozygous SNPs are phased using a hidden Markov model (BEAGLE27), and the ﬁ (...truncated)