Linear assembly of a human centromere on the Y chromosome
Brief Communications
© 2018 Nature America, Inc., part of Springer Nature. All rights reserved.
Miten Jain1,5, Hugh E Olsen1,5, Daniel J Turner2,
David Stoddart2, Kira V Bulazel3, Benedict Paten1,
David Haussler1, Huntington F Willard3,4, Mark Akeson1
& Karen H Miga1,3
The human genome reference sequence remains incomplete
owing to the challenge of assembling long tracts of near-identical tandem repeats in centromeres. We implemented a
nanopore sequencing strategy to generate high-quality reads
that span hundreds of kilobases of highly repetitive DNA in a
human Y chromosome centromere. Combining these data with
short-read variant validation, we assembled and characterized
the centromeric region of a human Y chromosome.
Centromeres facilitate spindle attachment and ensure proper chromosome segregation during cell division. Normal human centromeres
are enriched with AT-rich ~171-bp tandem repeats known as alpha
satellite DNA1. Most alpha satellite DNAs are organized into higher
order repeats (HORs), in which chromosome-specific alpha satellite
repeat units, or monomers, are reiterated as a single repeat structure
hundreds or thousands of times with high (>99%) sequence conservation to form extensive arrays2. Characterizing both the sequence
composition of individual HOR structures and the extent of repeat
variation is crucial to understanding kinetochore assembly and centromere identity3–5. However, no sequencing technology (including
single-molecule real-time (SMRT) sequencing or synthetic long-read
technologies) or a combination of sequencing technologies has been
able to assemble centromeric regions because extremely high-quality,
long reads are needed to confidently traverse low-copy sequence variants. As a result, human centromeric regions remain absent from even
the most complete chromosome assemblies.
Here we apply nanopore long-read sequencing to produce high-quality reads that span hundreds of kilobases of highly repetitive
DNA (Supplementary Fig. 1). We focus on the haploid satellite array
present on the Y centromere (DYZ3), as it is particularly suitable for
assembly owing to its tractable size, well-characterized HOR structure, and previous physical mapping data6–8.
We devised a transposase-based method that we named ‘longboard
strategy’ to produce high-read coverage of full-length bacterial artificial chromosome (BAC) DNA with nanopore sequencing (MinION
sequencing device, Mk1B, Oxford Nanopore Technologies). In our
longboard strategy, we linearize the circular BAC with a single cut
site, then add sequencing adaptors (Fig. 1a). The BAC DNA passes
through the pore, resulting in complete, end-to-end sequence coverage
of the entire insert. Plots of read length versus megabase yield
revealed an increase in megabase yield for full-length BAC DNA
sequences (Fig. 1b and Supplementary Fig. 2). We present more
than 3,500 full-length ‘1D’ reads (that is, one strand of the DNA is
sequenced) from ten BACs (two control BACs from Xq24 and Yp11.2;
eight BACs in the DYZ3 locus9; Supplementary Table 1).
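
The read length versus megabase yield plot described above can be approximated by summing sequenced bases within read-length bins. The following is a minimal sketch, not the authors' plotting code: the function name `yield_by_length_bin`, the 10-kb bin size, and the example read lengths are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def yield_by_length_bin(read_lengths_bp, bin_size_bp=10_000):
    """Sum sequenced bases per read-length bin.

    Returns a dict mapping each bin's start (bp) to its yield in megabases.
    The bin size is an illustrative assumption, not a value from the paper.
    """
    totals = defaultdict(int)
    for length in read_lengths_bp:
        bin_start = (length // bin_size_bp) * bin_size_bp
        totals[bin_start] += length
    return {start: bases / 1e6 for start, bases in sorted(totals.items())}


# Hypothetical read lengths (bp): a few short fragments plus three reads
# close to the full length of a ~209-kb BAC.
example_lengths = [8_000, 9_500, 12_000, 205_000, 208_000, 209_000]
yields = yield_by_length_bin(example_lengths)

for bin_start, megabases in yields.items():
    print(f"{bin_start // 1000:>4} kb bin: {megabases:.4f} Mb")
```

In this toy input, the three full-length reads contribute far more bases than the short fragments, which is the kind of peak at full BAC length that such a plot would be expected to show.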
Correct assembly across the centromeric locus requires overlap
among a few sequence variants, meaning that accuracy of base-calls
is important. Individual reads (MinION R9.4 chemistry, Albacore
v1.1.1) provide insufficient sequence identity (median alignment
identity of 84.8% for control BAC, RP11-482A22 reads) to ensure correct repeat assembly10. To improve overall base quality, we produced
a consensus sequence from 10 iterations of 60 randomly sampled
alignments of full-length 1D reads that spanned the full insert length
for each BAC (Fig. 1c). To polish sequences, we realigned full-length
nanopore reads to each BAC-derived consensus (99.2% identity observed for
control BAC, RP11-482A22; and an observed range of 99.4–99.8% identity
for vector sequences in DYZ3-containing BACs). To provide a truth
set of array sequence variants and to evaluate any inherent nanopore sequence biases, we used Illumina BAC resequencing (Online
Methods). We used eight BAC-polished sequences (e.g., 209 kb for
RP11-718M18; Fig. 1d) to guide the ordered assembly of BACs from
p-arm to q-arm, which includes an entire Y centromere.
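
The sampling step described above (10 iterations of 60 randomly sampled full-length reads per BAC) can be sketched as follows. This is only an illustration of the sampling, under stated assumptions: the function name `sample_read_subsets`, the fixed seed, and the synthetic read identifiers are hypothetical, sampling without replacement within each iteration is assumed, and the alignment and consensus-calling steps themselves are not shown.

```python
import random


def sample_read_subsets(read_ids, n_iterations=10, sample_size=60, seed=0):
    """Draw `n_iterations` random subsets of `sample_size` reads each.

    Each subset is sampled without replacement from the full-length reads;
    the same read may appear in more than one subset. The seed is an
    illustrative assumption to make the sketch reproducible.
    """
    if len(read_ids) < sample_size:
        raise ValueError("fewer full-length reads than the requested sample size")
    rng = random.Random(seed)
    return [rng.sample(read_ids, sample_size) for _ in range(n_iterations)]


# Hypothetical identifiers for full-length 1D reads from one BAC.
read_ids = [f"read_{i}" for i in range(400)]
subsets = sample_read_subsets(read_ids)

# Each subset would then be aligned and collapsed into one consensus,
# using tools that are not part of this sketch.
print(len(subsets), "subsets of", len(subsets[0]), "reads")
```

Drawing several independent subsets rather than one gives multiple consensus estimates per BAC, which is one plausible reason to iterate.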
We ordered the DYZ3-containing BACs using 16 Illumina-validated
HOR variants, resulting in 365 kb of assembled alpha satellite DNA
(Fig. 2a and Supplementary Data 1). The centromeric locus contains
a 301-kb array that is composed of the DYZ3 HOR, with a 5.8-kb consensus sequence, repeated in a head-to-tail orientation without repeat
inversions or transposable element interruptions6,11,12. The assembled
length of the RP11 DYZ3 array is consistent with estimates for 96 individuals from the same Y haplogroup (R1b) (Supplementary Fig. 3;
mean: 315 kb; median: 350 kb)13,14. This finding is in agreement with
pulsed-field gel electrophoresis (PFGE) DYZ3 size estimates from
previous physical maps, and from a Y-haplogroup matched cell line
(Supplementary Fig. 4).
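
As a rough consistency check on the figures reported in this section, the 301-kb array length can be compared with the HOR copy number it implies. The sketch below uses only values stated in the text (301 kb, a 5.8-kb HOR, seven copies of a 6.0-kb variant, 52 HORs in total). Treating every non-variant copy as exactly 5.8 kb is a simplifying assumption, and the variable names are illustrative.

```python
ARRAY_KB = 301.0        # assembled DYZ3 array length reported in the text
HOR_KB = 5.8            # consensus DYZ3 HOR length
VARIANT_HOR_KB = 6.0    # length of the HOR structural variant
N_VARIANT = 7           # copies of the 6.0-kb variant reported in the array

# Copies implied if every repeat were the 5.8-kb consensus HOR.
approx_copies = ARRAY_KB / HOR_KB
n_total = round(approx_copies)

# Length implied by a mix of canonical and variant copies
# (simplification: all canonical copies are exactly 5.8 kb).
mixed_kb = (n_total - N_VARIANT) * HOR_KB + N_VARIANT * VARIANT_HOR_KB

print(f"implied copies: {approx_copies:.1f} (rounded to {n_total})")
print(f"length with {N_VARIANT} variant copies: about {mixed_kb:.0f} kb")
```

The rounded copy number matches the 52 HORs compared in the text, and the mixed-length estimate of about 303 kb lies within roughly 1% of the reported 301 kb.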
Pairwise comparisons among the 52 HORs in the assembled DYZ3
array revealed limited sequence divergence between copies (mean
99.7% pairwise identity). In agreement with a previous assessment
of sequence variation within the DYZ3 array6, we detected instances
of a 6.0-kb HOR structural variant and provide evidence for seven
copies within the RP11 DYZ3 array that were present in two clusters
separated by 110 kb, as roughly predicted by previous restriction map
estimates8. Sequence characterization of the DYZ3 array revealed nine
HOR haplotypes, defined by linkage between variant bases that are
1UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California, USA. 2Oxford Nanopore Technologies, Oxford, UK. 3Duke Institute for Genome
Sciences and Policy, Duke University, Durham, North Carolina, USA. 4Geisinger National, Bethesda, Maryland, USA. 5These authors contributed equally to this work.
Correspondence should be addressed to K.H.M.
Received 8 August 2017; accepted 22 February 2018; published online 19 March 2018; doi:10.1038/nbt.4109
Nature Biotechnology, volume 36, number 4, April 2018, page 321
[Figure 1 panel labels. (a) Circular BAC with vector; transposome complex; linearize BAC and addition of transposase adaptors; ligation of sequencing adaptors and tether attachment; MinION sequencing. (b) Number of bases (Mb) versus read length (kb). (c) RPCI-11 BAC insert; sampled reads (n = 60); polishing; final high-quality consensus sequence. (d) RP11-718M18, 209 kb.]
Figure 1 BAC-based longboard nanopore sequencing strategy on the MinION. (a) Optimized strategy to cut each circular BAC once with transposase results
in a linear and complete DNA fragment of the BAC for nanopore sequencing. (b) Yield plot of BAC DNA (RP11-648J18). (c) High-quality BAC consensus
sequences were generated by multiple alignment of 60 full-length 1D reads (shown as blue and yellow for both orientations), sampled at rand (...truncated)