DNA 5-methylcytosine detection and methylation phasing using PacBio circular consensus sequencing
Article
https://doi.org/10.1038/s41467-023-39784-9
Received: 20 November 2022
Accepted: 22 June 2023
Peng Ni1,2,3,8, Fan Nie1,2,3,8, Zeyu Zhong1,3, Jinrui Xu1,3, Neng Huang1,3,
Jun Zhang1,3, Haochen Zhao1,3, You Zou1,3, Yuanfeng Huang4, Jinchen Li4,5,
Chuan-Le Xiao6, Feng Luo7 & Jianxin Wang1,2,3
Long-read single-molecule sequencing technologies, such as PacBio circular consensus sequencing (CCS) and nanopore sequencing, are advantageous in
detecting DNA 5-methylcytosine in CpGs (5mCpGs), especially in repetitive
genomic regions. However, existing methods for detecting 5mCpGs using
PacBio CCS have limited accuracy and robustness. Here, we present ccsmeth, a deep-learning method to detect DNA 5mCpGs using CCS reads. We sequence
polymerase-chain-reaction treated and M.SssI-methyltransferase treated DNA
of one human sample using PacBio CCS for training ccsmeth. Using long
(≥10 Kb) CCS reads, ccsmeth achieves 0.90 accuracy and 0.97 Area Under the
Curve on 5mCpG detection at single-molecule resolution. At the genome-wide
site level, ccsmeth achieves >0.90 correlations with bisulfite sequencing and
nanopore sequencing using only 10× reads. Furthermore, we develop a
Nextflow pipeline, ccsmethphase, to detect haplotype-aware methylation
using CCS reads, and then sequence a Chinese family trio to validate it.
ccsmeth and ccsmethphase can be robust and accurate tools for detecting
DNA 5-methylcytosines.
5-methylcytosine (5mC), the most common form of DNA methylation,
is involved in regulating many biological processes1. In humans, most
5mCs occur at CpG sites, which are associated with embryonic development, diseases, and aging2,3. Bisulfite sequencing (BS-seq) is now the
most widely used methodology for profiling 5mC methylation4. In
bisulfite-treated genomic DNA, unmethylated cytosines are converted
to uracils, while methylated cytosines are unchanged5. Thus, the
methylation status of a segment of DNA can be determined at single-nucleotide resolution. However, bisulfite treatment damages the DNA,
which leads to DNA degradation and the loss of sequencing
diversity6. Recently, two bisulfite-free methods, ten-eleven translocation-assisted pyridine borane sequencing7 (TAPS) and enzymatic
methyl-seq8 (EM-seq), were developed; both are reported to
have more uniform coverage and higher unique mapping rates than
BS-seq. Like BS-seq, TAPS and EM-seq can be applied to both short-read sequencing and long-read sequencing9–11. However, all these
methods need extra laboratory techniques, which lead to
extra sequencing costs.
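The bisulfite readout described above can be illustrated with a minimal sketch. It assumes an idealized, error-free conversion on a single strand and relies on the fact that uracils are read as thymines after amplification; the function names `bisulfite_convert` and `call_methylation` and the toy sequence are hypothetical and are not part of any BS-seq tool.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate ideal bisulfite treatment followed by amplification.

    Unmethylated C is converted to U and then read as T;
    methylated C (at the given 0-based positions) stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, converted_read):
    """Call each reference cytosine as methylated (True) or not (False).

    Assumes the read is already aligned base-by-base to the reference.
    """
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, converted_read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True    # protected from conversion, so methylated
        elif read_base == "T":
            calls[i] = False   # converted, so unmethylated
    return calls


reference = "ACGTCCGA"   # toy sequence (assumption)
methylated = {1}         # only the cytosine at position 1 is methylated
read = bisulfite_convert(reference, methylated)
calls = call_methylation(reference, read)
print(read, calls)
```

Real pipelines must also handle incomplete conversion, sequencing errors, and strand-specific alignment, none of which this sketch attempts.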
1School of Computer Science and Engineering, Central South University, Changsha 410083, China. 2Xiangjiang Laboratory, Changsha 410205, China. 3Hunan
Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China. 4Bioinformatics Center, National Clinical Research Centre for
Geriatric Disorders, Department of Geriatrics, Xiangya Hospital, Central South University, Changsha 410000, China. 5Centre for Medical Genetics & Hunan
Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha 410000, China. 6State Key Laboratory of Ophthalmology,
Zhongshan Ophthalmic Center, Sun Yat-sen University, #7 Jinsui Road, Tianhe District, Guangzhou, China. 7School of Computing, Clemson University,
Clemson, SC 29634-0974, USA. 8These authors contributed equally: Peng Ni, Fan Nie.
Nature Communications | (2023)14:4054
Two major long-read sequencing technologies, PacBio single-molecule real-time (SMRT) sequencing and nanopore sequencing of
Oxford Nanopore Technologies (ONT), can directly sequence native
DNA without PCR amplification12,13. DNA base modifications alter
polymerase kinetics in SMRT sequencing and affect the electrical
current signals near the modified bases in nanopore sequencing13.
Thus, DNA base modifications can be directly detected from native
DNA reads of SMRT and nanopore sequencing without extra laboratory techniques12,13. For nanopore sequencing, computational methods
for 5mC detection either apply statistical tests to compare current
signals of native DNA reads with an unmodified control (Tombo14), or
use pre-trained Hidden Markov models (nanopolish15) and deep neural
network models (Megalodon16, DeepSignal17) without a control dataset. Previous studies have shown that methods using pre-trained
models achieve high accuracies for DNA 5mC detection from human
nanopore reads18,19.
Pulse signals in SMRT sequencing, which are associated with the
nucleotides in which the polymerization reaction is occurring13,20,
include the interpulse duration (IPD) and the pulse width (PW). IPD
represents the time duration between two consecutive sequenced
bases. PW represents the time duration of a base being sequenced20.
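As a rough illustration of these two definitions, the sketch below derives PW and IPD from a list of pulses. It assumes each pulse is given as a hypothetical `(start, end)` pair in arbitrary time units, ordered along the read, and it takes IPD to be the gap between the end of one pulse and the start of the next; the function name `pulse_metrics` is illustrative and not taken from any PacBio software.

```python
from typing import List, Optional, Tuple


def pulse_metrics(
    pulses: List[Tuple[int, int]],
) -> Tuple[List[Optional[int]], List[int]]:
    """Return (ipd, pw) for an ordered list of (start, end) pulses.

    pw[i]  = duration of pulse i (end - start).
    ipd[i] = gap between the end of pulse i-1 and the start of pulse i;
             None for the first pulse, which has no predecessor.
    """
    pw = [end - start for start, end in pulses]
    ipd: List[Optional[int]] = [None]
    for i in range(1, len(pulses)):
        ipd.append(pulses[i][0] - pulses[i - 1][1])
    return ipd, pw


# Three toy pulses (assumed values, arbitrary time units).
pulses = [(0, 3), (10, 12), (15, 20)]
ipd, pw = pulse_metrics(pulses)
print(ipd, pw)
```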
Besides the sequenced nucleotides, base modifications also
influence pulse signals. Using the differences in pulse signals between
modified and unmodified bases, methods for detecting 5mC and other
base modifications from SMRT data have been developed21. However,
due to the low signal-to-noise ratio, reliable calling of 5mC using
early-version SMRT data requires high coverage of reads (up to
250×)12,13. Based on the fact that unmethylated CpGs in vertebrates
often range over long hypomethylated regions, Suzuki et al. proposed
AgIn, which improved the confidence of 5mCpG detection by combining the IPD features of neighboring CpGs from SMRT data22.
Recently, the PacBio circular consensus sequencing (CCS) technique
was presented23, in which subreads generated from a circularized
template in a single zero-mode waveguide (ZMW) are used to call a
consensus sequence (CCS/HiFi read) with high accuracy. Using the new
CCS technique, Tse et al. developed a convolutional neural network
(CNN)-based method, called holistic kinetic model (HK model), for
genome-wide 5mCpG detection in humans24. For a CCS read, the HK
model first calculates the mean IPD and PW values of each base after
aligning the subreads of the CCS read to the reference genome. Then,
for each CpG site in the CCS read, the HK model organizes the mean
IPD values, the mean PW values, and the sequence context surrounding the CpG into a feature matrix. Finally, the HK model feeds the
feature matrix into the CNN-based model to obtain a methylation probability for the CpG24. The HK model achieves above 90% sensitivity and
specificity on 5mCpG detection at the read level (i.e., at single-molecule
resolution). However, the HK model requires relatively high CCS subread depth (at least 20× passed subreads for one CCS) for accurate
5mCpG detection, which limits the insert size in library preparation
and, in turn, the length of CCS reads. Following the HK model, PacBio
proposed another CNN-based method, primrose25, which has been
claimed to have 85% read-level accuracy on 5mCpG detection (...truncated)