A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome
Chapman et al. Genome Biology
Jarrod A Chapman 0
Martin Mascher
Aydın Buluç
Kerrie Barry 0
Evangelos Georganas
Adam Session 1
Veronika Strnadova
Jerry Jenkins 0
Sunish Sehgal
Leonid Oliker
Jeremy Schmutz 0
Katherine A Yelick
Uwe Scholz
Robbie Waugh
Jesse A Poland
Gary J Muehlbauer
Nils Stein
Daniel S Rokhsar 0 1
0 Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA 94598, USA
1 Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
Polyploid species have long been thought to be recalcitrant to whole-genome assembly. By combining high-throughput sequencing, recent developments in parallel computing, and genetic mapping, we derive, de novo, a sequence assembly representing 9.1 Gbp of the highly repetitive 16 Gbp genome of hexaploid wheat, Triticum aestivum, and assign 7.1 Gbp of this assembly to chromosomal locations. The genome representation and accuracy of our assembly are comparable to, or even exceed, those of a chromosome-by-chromosome shotgun assembly. Our assembly and mapping strategy uses only short-read sequencing technology and is applicable to any species for which a mapping population can be constructed.
Background
The feasibility of whole-genome shotgun (WGS) assembly
of large and complex eukaryotic genomes was once
a much-debated question [1,2]. The advent of
next-generation sequencing and the comparative ease and
speed with which WGS assemblies can be constructed for
mammalian and many other genomes allowed sequencing
projects to move beyond these concerns, accepting
high-quality draft genomes with nearly complete gene spaces.
Some genomes, however, are larger and more complex
than the typical mammalian genome, including those of
salamanders (>20 gigabases (Gbp)) [3], hexaploid wheat
(16 Gbp) [4,5], and conifers (20 Gbp) [6]. To mitigate
some of the computational challenges of genome assembly
from short next-generation sequencing reads for these
more complex genomes, various divide-and-conquer
strategies have been developed. These strategies include
chromosome sorting and capture [5], large-insert-clone
pooling [6,7], and large-clone tiling paths [5,8]. While each
approach reduces the sequence assembly problem to a set
of smaller, more tractable problems, all of them require
substantial resource development in advance of sequencing.
Many of the arguments against whole-genome shotgun
assembly [2] remain valid today. WGS assemblies are often
rough drafts consisting of numerous, small contigs with
gaps of unknown size between them. Abundant
transposable elements that often form nested structures are
prone to collapse in WGS assembly, resulting in
under-representation and mis-assembly of repetitive
sequences in the final assembly [9]. The experiences
derived from sequencing large and highly repetitive plant
genomes have made it clear that while WGS assemblies
are typically able to deliver a rough draft of the
nonrepetitive portion of a genome, true reference sequences
with high contiguity and near-complete genome
representation are only accessible following the paradigm of
clone-by-clone sequencing [10].
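The collapse of repeated elements described above can be illustrated with a toy k-mer count. The following is a minimal sketch, not the assembly method used in this work: the sequences, the repeat, and the choice of k = 8 are all invented for illustration, and the number of distinct k-mers is used only as a rough proxy for the sequence that a k-mer-based assembler could represent once identical copies are merged.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer (substring of length k) occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical toy genome: three identical copies of a 12-bp "transposon"
# separated by unique 6-bp flanks (all sequences are made up).
repeat = "ACGTTGCAAGCT"
genome = "TTAGGC" + repeat + "CCATGA" + repeat + "GGTACC" + repeat + "AATTCG"

k = 8
counts = kmer_counts(genome, k)
total = len(genome) - k + 1   # k-mer positions in the genome
distinct = len(counts)        # distinct k-mers that survive collapsing

# k-mers lying entirely inside the repeat occur once per copy, so the
# three copies contribute the same k-mers and are counted only once.
print(f"k-mer positions: {total}, distinct k-mers: {distinct}")
print(f"copies of first repeat k-mer: {counts[repeat[:k]]}")
```

In this toy case the k-mers internal to the repeat are seen three times each but contribute only once to the distinct set, which mirrors how repeated elements longer than the assembly word size tend to be under-represented.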
Despite their shortcomings, WGS approaches for large
genomes [11] have important advantages that include
(1) simplicity of library preparation and (2) uniformity of
coverage. However, for very large (>10 Gbp), complex, or
polyploid genomes, substantial computational resources
may be required simply to manage the volume of data,
and to address the challenge of resolving near-identical
genomic sequences that are longer than the scale set
by read length and pairing information. While the
human WGS assembly [12] and other chromosome-scale
mammalian assemblies (for example, mouse [13]) are
computational tours de force, they ultimately rely on
non-sequence data such as physical maps to assemble
the chromosomes. The largest WGS assemblies that
have been attempted to date (Norway spruce [6], white
spruce [14] and loblolly pine [15], all approximately
20 Gbp) remain highly fragmented and are not yet
organized into chromosomes. Importantly, whole-genome
assemblies of polyploid genomes have not yet been
attempted. Instead, artificial diploids in the case of
autopolyploids such as potato [16] or the progenitor species
of allopolyploids such as wheat [17,18] and rapeseed
[19] have been sequenced.
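The difficulty of resolving near-identical sequences that are longer than the read length can be made concrete with a small calculation. The sketch below is an illustrative assumption rather than part of the published analysis: it takes two hypothetical, pre-aligned copies of a locus that differ at a single site and reports the fraction of read start positions at which a read would be identical in both copies. It compares reads position by position only and ignores pairing information and sequencing error.

```python
def ambiguous_read_fraction(copy_a, copy_b, read_len):
    """Fraction of read start positions at which the read drawn from
    copy_a is identical to the read at the same position in copy_b."""
    if len(copy_a) != len(copy_b):
        raise ValueError("copies must be aligned and of equal length")
    n_starts = len(copy_a) - read_len + 1
    if n_starts <= 0:
        raise ValueError("read_len exceeds sequence length")
    same = sum(
        copy_a[i:i + read_len] == copy_b[i:i + read_len]
        for i in range(n_starts)
    )
    return same / n_starts


# Hypothetical 40-bp locus and a second copy with one substitution
# at 0-based position 20 (sequences invented for illustration).
copy_a = "ACGTTGCAAGCTTTAGGCCCATGAGGTACCAATTCGGATC"
copy_b = copy_a[:20] + "G" + copy_a[21:]

for read_len in (5, 30):
    frac = ambiguous_read_fraction(copy_a, copy_b, read_len)
    print(f"read length {read_len}: {frac:.2f} of reads cannot be assigned")
```

With 5-bp reads most positions yield reads that cannot be assigned to either copy, whereas 30-bp reads always span the distinguishing site in this toy locus. The same reasoning applies when identical stretches exceed the span of read pairs.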
Hexaploid bread wheat (Triticum aestivum L., 1C = 16
Gbp, 2n = 6x = 42) is one of the most important
agricultural crops, along with rice and maize. It is widely
believed, however, that the hexaploid wheat genome is
recalcitrant to WGS assembly and genome-wide physical
mapping due to a high repeat content and potential
difficulties in separating homeologous loci in the different
subgenomes, which are not problems with the diploid
rice [20] and maize [21] genomes. An early attempt at a
WGS assembly resulted in a highly fragmented and
genetically unanchored assembly (...truncated)