Phasing analysis of lung cancer genomes using a long read sequencer
https://doi.org/10.1038/s41467-022-31133-6
Yoshitaka Sakamoto1,6, Shuhei Miyake1,6, Miho Oka1,2,6, Akinori Kanai1, Yosuke Kawai3, Satoi Nagasawa1,
Yuichi Shiraishi4, Katsushi Tokunaga3, Takashi Kohno5, Masahide Seki1, Yutaka Suzuki1 ✉ &
Ayako Suzuki1 ✉
Chromosomal backgrounds of cancerous mutations still remain elusive. Here, we conduct the
phasing analysis of non-small cell lung cancer specimens of 20 Japanese patients. By the
combined use of short and long read sequencing data, we obtain long phased blocks of
834 kb in N50 length with >99% concordance rate. By analyzing the obtained phasing
information, we reveal that several cancer genomes harbor regions in which mutations are
unevenly distributed to either of two haplotypes. Large-scale chromosomal rearrangement
events, which resemble chromothripsis events but have smaller scales, occur on only one
chromosome, and these events account for the observed biased distributions. Interestingly,
the events are characteristic of EGFR mutation-positive lung adenocarcinomas. Further
integration of long read epigenomic and transcriptomic data reveals that haploid chromosomes are not always in equivalent transcriptomic/epigenomic conditions. Distinct chromosomal backgrounds are responsible for later cancerous aberrations in a haplotype-specific
manner.
1 Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan. 2 Ono
Pharmaceutical Co., Ltd, Ibaraki, Japan. 3 Genome Medical Science Project (Toyama), National Center for Global Health and Medicine, Tokyo, Japan.
4 Division of Genome Analysis Platform Development, National Cancer Center Research Institute, Tokyo, Japan. 5 Division of Genome Biology, National
Cancer Center Research Institute, Tokyo, Japan. 6These authors contributed equally: Yoshitaka Sakamoto, Shuhei Miyake, Miho Oka. ✉email: ;
NATURE COMMUNICATIONS | (2022)13:3464 | https://doi.org/10.1038/s41467-022-31133-6 | www.nature.com/naturecommunications
Large-scale cancer genome studies have revealed numerous
cancer-related mutations and identified key driver genes1.
Several relevant drug targets and biomarkers have been
identified, such as EGFR and BRAF 2–5. So far, most studies have
been conducted using short read sequencers. Therefore, our
current knowledge has been limited mainly to mutations that
occur in small-scale regions of genomes; the so-called single
nucleotide variants (SNVs) and short insertions and deletions
(indels).
Recently, larger genomic structural variants (SVs) have been
identified in the genomes of various cancer types. These SVs are
expected to have no less biological and clinical relevance. For
example, both chromosomal inversions and translocations
generate oncogenic fusion genes, such as BCR-ABL6,
EML4-ALK 7, and KIF5B-RET 8. In tumor-suppressor genes, such
as TP53, RB1, and PTEN, large deletions frequently occur, thereby
inactivating the expression and functions of these genes9. The
Pan-Cancer Analysis of Whole Genomes Consortium has also
focused on large-scale genomic aberrations in addition to SNVs.
The consortium reported the SV signatures of 38 cancer
subtypes10. Despite the potential relevance of SVs, conventional
detection methods are based on short read sequencing data11 and
have limited power for the precise detection of SVs. In fact,
the conventional analytical methodology may infer the presence
of SVs but can only partially reveal their complete structures. To
achieve a more direct and precise detection of SVs, long read
sequencing should be employed to interrogate various
aspects of cancer genomes.
For this purpose, experimental and bioinformatics procedures
for long read sequencing have recently made substantial
progress. Although the fidelity of existing long read sequencing
technologies remains ~90% for a single-pass read, several efforts
have been collectively made to improve sequence accuracy12. For
example, circular consensus sequencing has been developed as a
means to construct more accurate sequences with 99% identity on
the PacBio platform13. Recently, Oxford Nanopore Technologies
(ONT) has announced the release of a Q20 chemistry and basecalling system that enables single-pass sequencing with more than
99% accuracy. It is now realistic to use long read sequencers to
systematically analyze a wider range of cancerous mutations, such
as SNVs, relatively large-scale SVs and chromosomal-level rearrangements. In fact, several reports on long
read analysis of cancer genomes have recently revealed that, occasionally, newly
discovered SVs demonstrate complex patterns of genomic
aberrations14–16.
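The accuracy figures quoted above can be related to Phred quality scores through the standard definition Q = -10 log10(p), where p is the per-base error probability. The sketch below only illustrates that arithmetic; the function names are hypothetical and do not come from any basecaller or from this study.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to per-base accuracy (1 - error probability)."""
    return 1.0 - 10 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert per-base accuracy to a Phred quality score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)
```

Under this definition, Q20 corresponds to an error probability of 1 in 100 (99% accuracy), whereas ~90% single-pass accuracy corresponds to roughly Q10.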
Another unique advantage of employing long read sequencing
for cancer genome analysis lies in its potential to reveal chromosomal contexts in which cancerous mutations are harbored16.
Long read sequences directly capture the relationship between two mutations detected in the same read at the
single-molecule level. This so-called “haplotype phasing analysis”
can reveal on which of the two chromosomes of a diploid genome
a particular cancer-associated event occurred, at single-molecule
and haplotype resolution17. Each haplotype may reside
in a distinct condition, which might be due to their differential
DNA methylation or other epigenomic statuses possibly caused
by the original lineage-specific regulations or other cancerous
aberrant regulations at later steps18. Therefore, the resulting mutation patterns might serve as the
footprints of the cancer genome evolution and could contain
essential information for elucidating the causes and effects of
mutations in the same cancer genomes. It is possible that a better
understanding of such chromosomal contexts of cancerous
mutations will shed new light on cancerous events for patient
cases whose molecular etiology remains unexplained by previous
short read sequencing and provide novel therapeutic insights.
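The idea that a single long read links two variants can be sketched as follows. This is a toy illustration under simplifying assumptions (error-free allele calls, exactly two heterozygous sites, and reads represented as dictionaries mapping a site position to the observed allele). It is not the phasing pipeline used in this study, and all names are hypothetical.

```python
from collections import Counter


def phase_pair(reads, site_a, site_b):
    """Decide whether the alt alleles at two sites lie on the same haplotype.

    reads: iterable of dicts mapping site position -> observed allele
           ("ref" or "alt"). Only reads covering both sites are informative.
    Returns "cis" (same haplotype), "trans" (opposite haplotypes),
    or None when the evidence is tied or absent.
    """
    counts = Counter()
    for read in reads:
        if site_a in read and site_b in read:
            counts[(read[site_a], read[site_b])] += 1
    # alt/alt and ref/ref reads support cis; mixed reads support trans.
    cis = counts[("alt", "alt")] + counts[("ref", "ref")]
    trans = counts[("alt", "ref")] + counts[("ref", "alt")]
    if cis == trans:
        return None
    return "cis" if cis > trans else "trans"


# Hypothetical reads: four span both sites, one covers only the first site.
reads = [
    {100: "alt", 5000: "alt"},
    {100: "ref", 5000: "ref"},
    {100: "alt", 5000: "alt"},
    {100: "alt"},
    {100: "ref", 5000: "alt"},
]
print(phase_pair(reads, 100, 5000))
```

A majority vote over spanning reads is used here only to tolerate occasional discordant reads; real phasing tools model read errors and chain many sites into blocks.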
In this study, we conduct a phasing analysis of cancer genomes
combining short and long read sequencing technologies. We use
whole-genome sequencing (WGS) data obtained from Japanese
non-small cell lung cancer patients, where we identify a series of
complex SVs14. We have further increased sequencing depth for
accurate phasing analysis and performed epigenome and transcriptome analyses. In this way, we examine cancerous mutations
from the perspective of their chromosomal backgrounds. Here, we
demonstrate that the obtained phasing results provide essential
information for understanding the history of mutations and their
possible causes.
Results
Phasing analysis of a lung cancer genome. We performed our
phasing analysis using the long and short read WGS data
obtained from 20 non-small cell lung cancer specimens of Japanese patients. W (...truncated)