DNA 5-methylcytosine detection and methylation phasing using PacBio circular consensus sequencing

https://www.nature.com/articles/s41467-023-39784-9.pdf

Article
https://doi.org/10.1038/s41467-023-39784-9

Received: 20 November 2022
Accepted: 22 June 2023

Peng Ni (1,2,3,8), Fan Nie (1,2,3,8), Zeyu Zhong (1,3), Jinrui Xu (1,3), Neng Huang (1,3), Jun Zhang (1,3), Haochen Zhao (1,3), You Zou (1,3), Yuanfeng Huang (4), Jinchen Li (4,5), Chuan-Le Xiao (6), Feng Luo (7) & Jianxin Wang (1,2,3)

Long single-molecule sequencing technologies, such as PacBio circular consensus sequencing (CCS) and nanopore sequencing, are advantageous in detecting DNA 5-methylcytosine in CpGs (5mCpGs), especially in repetitive genomic regions. However, existing methods for detecting 5mCpGs using PacBio CCS are less accurate and robust. Here, we present ccsmeth, a deep-learning method to detect DNA 5mCpGs using CCS reads. We sequence polymerase-chain-reaction treated and M.SssI-methyltransferase treated DNA of one human sample using PacBio CCS for training ccsmeth. Using long (≥10 Kb) CCS reads, ccsmeth achieves 0.90 accuracy and 0.97 Area Under the Curve on 5mCpG detection at single-molecule resolution. At the genome-wide site level, ccsmeth achieves >0.90 correlations with bisulfite sequencing and nanopore sequencing using only 10× reads. Furthermore, we develop a Nextflow pipeline, ccsmethphase, to detect haplotype-aware methylation using CCS reads, and then sequence a Chinese family trio to validate it. ccsmeth and ccsmethphase can be robust and accurate tools for detecting DNA 5-methylcytosines.

5-methylcytosine (5mC), the most common form of DNA methylation, is involved in regulating many biological processes [1]. In humans, most 5mCs occur at CpG sites, which are associated with embryonic development, diseases, and aging [2,3]. Bisulfite sequencing (BS-seq) is now the most widely used methodology for profiling 5mC methylation [4].
In bisulfite-treated genomic DNA, unmethylated cytosines are converted to uracils, while methylated cytosines are unchanged [5]. Thus, the methylation status of a segment of DNA can be obtained at single-nucleotide resolution. However, bisulfite treatment damages the DNA, which leads to DNA degradation and the loss of sequencing diversity [6]. Recently, two bisulfite-free methods, ten-eleven translocation-assisted pyridine borane sequencing [7] (TAPS) and enzymatic methyl-seq [8] (EM-seq), were also developed; both are reported to have more uniform coverage and higher unique mapping rates than BS-seq. Like BS-seq, TAPS and EM-seq can be applied to both short-read sequencing and long-read sequencing [9–11]. However, all these methods need extra laboratory techniques, which leads to extra sequencing costs.

Affiliations:
1 School of Computer Science and Engineering, Central South University, Changsha 410083, China.
2 Xiangjiang Laboratory, Changsha 410205, China.
3 Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China.
4 Bioinformatics Center, National Clinical Research Centre for Geriatric Disorders, Department of Geriatrics, Xiangya Hospital, Central South University, Changsha 410000, China.
5 Centre for Medical Genetics & Hunan Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha 410000, China.
6 State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, #7 Jinsui Road, Tianhe District, Guangzhou, China.
7 School of Computing, Clemson University, Clemson, SC 29634-0974, USA.
8 These authors contributed equally: Peng Ni, Fan Nie.

Nature Communications | (2023) 14:4054

Two major long-read sequencing technologies, PacBio single-molecule real-time (SMRT) sequencing and nanopore sequencing of Oxford Nanopore Technologies (ONT), can directly sequence native DNA without PCR amplification [12,13].
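As a rough illustration of the bisulfite principle described earlier in this section (unmethylated cytosines become uracils, which are read as thymines after amplification, while methylated cytosines stay unchanged), the following Python sketch converts a toy sequence and recovers the methylation status by comparing the converted read with the reference. The function names, the toy sequence, and the assumptions of complete conversion, forward-strand reads, and error-free sequencing are illustrative only and are not taken from the article.

```python
from typing import Dict, Set


def bisulfite_convert(seq: str, methylated_positions: Set[int]) -> str:
    """Simulate ideal bisulfite treatment followed by sequencing.

    Assumption: conversion is complete, so every unmethylated C is read as T,
    and every methylated C (0-based index in methylated_positions) stays C.
    """
    converted = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            converted.append("T")  # unmethylated C -> U, read as T
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference: str, converted_read: str) -> Dict[int, bool]:
    """Infer methylation at each reference C from an aligned converted read.

    Assumption: the read is on the forward strand, has the same length as the
    reference, and is already aligned base by base.
    """
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, converted_read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True   # C retained: methylated
        elif read_base == "T":
            calls[i] = False  # C read as T: unmethylated
    return calls


reference = "ACGTCGCA"
read = bisulfite_convert(reference, methylated_positions={1})
calls = call_methylation(reference, read)
```

Real BS-seq analysis must additionally handle incomplete conversion, both strands, and true C-to-T variants, none of which this sketch models.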
DNA base modifications alter polymerase kinetics in SMRT sequencing and affect the electrical current signals near the modified bases in nanopore sequencing [13]. Thus, DNA base modifications can be directly detected from native DNA reads of SMRT and nanopore sequencing without extra laboratory techniques [12,13]. For nanopore sequencing, computational methods for 5mC detection either apply statistical tests to compare current signals of native DNA reads with an unmodified control (Tombo [14]), or use pre-trained Hidden Markov models (nanopolish [15]) and deep neural network models (Megalodon [16], DeepSignal [17]) without a control dataset. Previous studies have shown that methods using pre-trained models achieve high accuracies for DNA 5mC detection from human nanopore reads [18,19].

Pulse signals in SMRT sequencing, which are associated with the nucleotides at which the polymerization reaction is occurring [13,20], include the interpulse duration (IPD) and the pulse width (PW). IPD represents the time duration between two consecutive sequenced bases. PW represents the time duration of a base being sequenced [20]. Besides the sequenced nucleotides, base modifications also influence pulse signals. Using the differences in pulse signals between modified and unmodified bases, methods for detecting 5mC and other base modifications from SMRT data have been developed [21]. However, due to the low signal-to-noise ratio, the reliable calling of 5mC using early-version SMRT data requires high coverage of reads (up to 250×) [12,13]. Based on the fact that unmethylated CpGs in vertebrates often range over long hypomethylated regions, Suzuki et al. proposed AgIn, which improved the confidence of 5mCpG detection by combining the IPD features of neighboring CpGs from SMRT data [22].
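To make the IPD and PW definitions above concrete, the following Python sketch derives both values from a list of pulses. It is a minimal illustration, not the instrument's actual signal processing: the `Pulse` class, its field names, and the arbitrary time units are hypothetical, and IPD is taken here as the gap between the end of one pulse and the start of the next.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Pulse:
    """One incorporation event (hypothetical representation)."""
    base: str    # sequenced nucleotide
    start: int   # pulse start time, arbitrary units
    end: int     # pulse end time, arbitrary units


def pulse_features(pulses: List[Pulse]) -> List[Tuple[str, Optional[int], int]]:
    """Return (base, IPD, PW) for each pulse, in sequencing order.

    PW  = duration of the pulse itself (end - start).
    IPD = time between the previous pulse's end and this pulse's start;
          undefined (None) for the first pulse.
    """
    features = []
    previous_end = None
    for pulse in pulses:
        pw = pulse.end - pulse.start
        ipd = None if previous_end is None else pulse.start - previous_end
        features.append((pulse.base, ipd, pw))
        previous_end = pulse.end
    return features


pulses = [Pulse("A", 0, 3), Pulse("C", 5, 9), Pulse("G", 20, 22)]
features = pulse_features(pulses)
```

In this toy trace, the long gap before the final G gives it a much larger IPD than the preceding C; kinetic differences of this kind are what modification detection methods exploit.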
Recently, the PacBio circular consensus sequencing (CCS) technique was presented [23], in which subreads generated from a circularized template in a single zero-mode waveguide (ZMW) are used to call a consensus sequence (CCS/HiFi read) with high accuracy. Using the new CCS technique, Tse et al. developed a convolutional neural network (CNN)-based method, called the holistic kinetic model (HK model), for genome-wide 5mCpG detection in humans [24]. For a CCS read, the HK model first calculates the mean IPD and PW values of each base after aligning the subreads of the CCS read to the reference genome. Then, for each CpG site in the CCS read, the HK model organizes the mean IPD values, the mean PW values, and the sequence context surrounding the CpG into a feature matrix. Finally, the HK model feeds the feature matrix into the CNN-based model to get a methylation probability of the CpG [24]. The HK model achieves above 90% sensitivity and specificity on 5mCpG detection at read level (i.e., at single-molecule resolution). However, the HK model requires relatively high CCS subread depth (at least 20× passed subreads for one CCS) for accurate 5mCpG detection, which limits the insert size in library preparation and, in turn, the length of CCS reads. Following the HK model, PacBio proposed another CNN-based method, primrose [25], which has been claimed to have 85% read-level accuracy on 5mCpG detection (...truncated)