Linear assembly of a human centromere on the Y chromosome

Nature Biotechnology, Mar 2018

The human genome reference sequence remains incomplete owing to the challenge of assembling long tracts of near-identical tandem repeats in centromeres. We implemented a nanopore sequencing strategy to generate high-quality reads that span hundreds of kilobases of highly repetitive DNA in a human Y chromosome centromere. Combining these data with short-read variant validation, we assembled and characterized the centromeric region of a human Y chromosome.

Brief Communications (OPEN). © 2018 Nature America, Inc., part of Springer Nature. All rights reserved.

Miten Jain (1,5), Hugh E Olsen (1,5), Daniel J Turner (2), David Stoddart (2), Kira V Bulazel (3), Benedict Paten (1), David Haussler (1), Huntington F Willard (3,4), Mark Akeson (1) & Karen H Miga (1,3)

Centromeres facilitate spindle attachment and ensure proper chromosome segregation during cell division. Normal human centromeres are enriched with AT-rich ~171-bp tandem repeats known as alpha satellite DNA [1]. Most alpha satellite DNAs are organized into higher order repeats (HORs), in which chromosome-specific alpha satellite repeat units, or monomers, are reiterated as a single repeat structure hundreds or thousands of times with high (>99%) sequence conservation to form extensive arrays [2]. Characterizing both the sequence composition of individual HOR structures and the extent of repeat variation is crucial to understanding kinetochore assembly and centromere identity [3–5]. However, no sequencing technology (including single-molecule real-time (SMRT) sequencing or synthetic long-read technologies) or combination of sequencing technologies has been able to assemble centromeric regions because extremely high-quality, long reads are needed to confidently traverse low-copy sequence variants. As a result, human centromeric regions remain absent from even the most complete chromosome assemblies.

Here we apply nanopore long-read sequencing to produce high-quality reads that span hundreds of kilobases of highly repetitive DNA (Supplementary Fig. 1). We focus on the haploid satellite array present on the Y centromere (DYZ3), as it is particularly suitable for assembly owing to its tractable size, well-characterized HOR structure, and previous physical mapping data [6–8].

We devised a transposase-based method that we named ‘longboard strategy’ to produce high-read coverage of full-length bacterial artificial chromosome (BAC) DNA with nanopore sequencing (MinION sequencing device, Mk1B, Oxford Nanopore Technologies). In our longboard strategy, we linearize the circular BAC with a single cut site, then add sequencing adaptors (Fig. 1a). The BAC DNA passes through the pore, resulting in complete, end-to-end sequence coverage of the entire insert. Plots of read length versus megabase yield revealed an increase in megabase yield for full-length BAC DNA sequences (Fig. 1b and Supplementary Fig. 2). We present more than 3,500 full-length ‘1D’ reads (that is, one strand of the DNA is sequenced) from ten BACs (two control BACs from Xq24 and Yp11.2; eight BACs in the DYZ3 locus [9]; Supplementary Table 1).

Correct assembly across the centromeric locus requires overlap among a few sequence variants, meaning that accuracy of base-calls is important. Individual reads (MinION R9.4 chemistry, Albacore v1.1.1) provide insufficient sequence identity (median alignment identity of 84.8% for control BAC, RP11-482A22 reads) to ensure correct repeat assembly [10]. To improve overall base quality, we produced a consensus sequence from 10 iterations of 60 randomly sampled alignments of full-length 1D reads that spanned the full insert length for each BAC (Fig. 1c).
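The sampling-and-consensus idea described above can be illustrated with a toy sketch. This is not the authors' pipeline: it assumes the reads are already aligned to a common coordinate system (equal-length strings, with "-" for gaps), it uses a simple per-column majority vote in place of a real multiple-alignment consensus tool, and it merges the per-iteration consensuses by a second majority vote, a step the text does not specify. The names `column_consensus`, `sampled_consensus`, and `simulate_reads` are hypothetical.

```python
import random
from collections import Counter


def column_consensus(aligned_reads):
    """Majority vote per alignment column (ties go to the first-seen symbol)."""
    consensus = []
    for column in zip(*aligned_reads):
        symbol, _ = Counter(column).most_common(1)[0]
        consensus.append(symbol)
    return "".join(consensus)


def sampled_consensus(aligned_reads, sample_size=60, iterations=10, seed=0):
    """Build one consensus per random sample, then vote across the samples.

    sample_size=60 and iterations=10 mirror the numbers quoted in the text;
    merging the iterations by a second vote is an assumption of this sketch.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(aligned_reads))
    per_iteration = [
        column_consensus(rng.sample(aligned_reads, k)) for _ in range(iterations)
    ]
    # Drop gap symbols from the final call to obtain an ungapped sequence.
    return column_consensus(per_iteration).replace("-", "")


def simulate_reads(truth, n_reads, error_rate, rng):
    """Toy error model: independent substitutions only (no indels)."""
    reads = []
    for _ in range(n_reads):
        read = [
            rng.choice([b for b in "ACGT" if b != base])
            if rng.random() < error_rate
            else base
            for base in truth
        ]
        reads.append("".join(read))
    return reads
```

In this toy model, reads with a per-base error rate near 15% (comparable to the 84.8% median identity quoted above) still produce an accurate consensus, because errors at any one column rarely agree across 60 reads. Real nanopore errors include indels and systematic biases, which is why a separate polishing and validation step is still needed.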
To polish sequences, we realigned full-length nanopore reads to each BAC-derived consensus (99.2% observed for control BAC, RP11-482A22; and an observed range of 99.4–99.8% for vector sequences in DYZ3-containing BACs). To provide a truth set of array sequence variants and to evaluate any inherent nanopore sequence biases, we used Illumina BAC resequencing (Online Methods).

We used eight BAC-polished sequences (e.g., 209 kb for RP11-718M18; Fig. 1d) to guide the ordered assembly of BACs from p-arm to q-arm, which includes an entire Y centromere. We ordered the DYZ3-containing BACs using 16 Illumina-validated HOR variants, resulting in 365 kb of assembled alpha satellite DNA (Fig. 2a and Supplementary Data 1). The centromeric locus contains a 301-kb array that is composed of the DYZ3 HOR, with a 5.8-kb consensus sequence, repeated in a head-to-tail orientation without repeat inversions or transposable element interruptions [6,11,12]. The assembled length of the RP11 DYZ3 array is consistent with estimates for 96 individuals from the same Y haplogroup (R1b) (Supplementary Fig. 3; mean: 315 kb; median: 350 kb) [13,14]. This finding is in agreement with pulsed-field gel electrophoresis (PFGE) DYZ3 size estimates from previous physical maps, and from a Y-haplogroup matched cell line (Supplementary Fig. 4).

Pairwise comparisons among the 52 HORs in the assembled DYZ3 array revealed limited sequence divergence between copies (mean 99.7% pairwise identity). In agreement with a previous assessment of sequence variation within the DYZ3 array [6], we detected instances of a 6.0-kb HOR structural variant and provide evidence for seven copies within the RP11 DYZ3 array that were present in two clusters separated by 110 kb, as roughly predicted by previous restriction map estimates [8].
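As a rough illustration of what a mean pairwise identity among HOR copies measures, the sketch below compares every pair of copies and averages the fraction of matching positions. It assumes the copies have already been aligned to equal length, which the text does not describe, and the function names are hypothetical. With 52 copies there are 52 × 51 / 2 = 1,326 such pairs.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Fraction of aligned positions at which two equal-length copies agree."""
    if len(a) != len(b):
        raise ValueError("copies must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def mean_pairwise_identity(copies):
    """Average identity over all unordered pairs of repeat copies."""
    pairs = list(combinations(copies, 2))
    if not pairs:
        raise ValueError("need at least two copies")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Tiny usage example with three made-up 10-bp "copies".
toy_copies = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAA"]
toy_mean = mean_pairwise_identity(toy_copies)
```

In practice, identity over real HOR copies would be computed from gapped alignments, and the handling of indels (such as the 6.0-kb structural variant) changes the result; this sketch ignores both.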
Sequence characterization of the DYZ3 array revealed nine HOR haplotypes, defined by linkage between variant bases that are (...truncated)

(1) UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California, USA. (2) Oxford Nanopore Technologies, Oxford, UK. (3) Duke Institute for Genome Sciences and Policy, Duke University, Durham, North Carolina, USA. (4) Geisinger National, Bethesda, Maryland, USA. (5) These authors contributed equally to this work. Correspondence should be addressed to K.H.M.

Received 8 August 2017; accepted 22 February 2018; published online 19 March 2018; doi:10.1038/nbt.4109

Nature Biotechnology, volume 36, number 4, April 2018, p. 321

[Figure 1. Panel a: circular BAC (vector, transposome complex); linearize BAC and addition of transposase adaptors; ligation of sequencing adaptors and tether attachment; MinION sequencing. Panel b: number of bases (Mb) versus read length (kb). Panels c and d: RPCI-11 BAC insert; sampled reads (n = 60); polishing; final high-quality consensus sequence (RP11-718M18, 209 kb).]

Figure 1 BAC-based longboard nanopore sequencing strategy on the MinION. (a) Optimized strategy to cut each circular BAC once with transposase results in a linear and complete DNA fragment of the BAC for nanopore sequencing. (b) Yield plot of BAC DNA (RP11-648J18). (c) High-quality BAC consensus sequences were generated by multiple alignment of 60 full-length 1D reads (shown as blue and yellow for both orientations), sampled at rand (...truncated)


This is a preview of a remote PDF: https://www.nature.com/articles/nbt.4109.pdf
Article home page: https://www.nature.com/articles/nbt.4109

Miten Jain, Hugh E Olsen, Daniel J Turner, David Stoddart, Kira V Bulazel, Benedict Paten, David Haussler, Huntington F Willard, Mark Akeson, Karen H Miga. Linear assembly of a human centromere on the Y chromosome. Nature Biotechnology, 2018, vol. 36, no. 4, pp. 321-323. DOI: 10.1038/nbt.4109