Transcriptome-based exon capture enables highly cost-effective comparative genomic data collection at moderate evolutionary scales
Ke Bi (0), Dan Vanderpool (1), Sonal Singhal (0,2), Tyler Linderoth (0,2), Craig Moritz (0,2), Jeffrey M Good (1)
0 Museum of Vertebrate Zoology, University of California, Berkeley, 3101 Valley Life Sciences Building, Berkeley, CA 94720-3160, USA
1 Division of Biological Sciences, University of Montana, Missoula, MT 59812, USA
2 Department of Integrative Biology, University of California, Berkeley, 1005 Valley Life Sciences Building, Berkeley, CA 94720-3140, USA
Background: To date, exon capture has largely been restricted to species with fully sequenced genomes, which has precluded its application to lineages that lack high-quality genomic resources. We developed a novel strategy for designing array-based exon capture in chipmunks (Tamias) based on de novo transcriptome assemblies. We evaluated the performance of our approach across specimens from four chipmunk species.
Results: We selectively targeted 11,975 exons (~4 Mb) on custom capture arrays, and enriched over 99% of the targets in all libraries. The percentage of aligned reads was highly consistent (24.4-29.1%) across all specimens, including when multiplexing up to 20 barcoded individuals on a single array. Base coverage among specimens and within targets in each species library was uniform, and the performance of targets among independent exon captures was highly reproducible. There was no decrease in coverage among chipmunk species, which showed up to 1.5% sequence divergence in coding regions. We did observe a decline in capture performance for a subset of targets designed from a much more divergent ground squirrel genome (30 My); however, over 90% of these targets were still recovered. Final assemblies yielded over ten thousand orthologous loci (~3.6 Mb), in which we identified thousands of fixed and polymorphic SNPs among species.
Conclusions: Our study demonstrates the potential of a transcriptome-enabled, multiplexed exon capture method to create thousands of informative markers for population genomic and phylogenetic studies in non-model species across the tree of life.
Background
High-throughput, next generation sequencing (NGS)
technologies and associated bioinformatics tools have
fundamentally changed the scale at which DNA
sequence data can be gathered and analyzed [1]. NGS
allows for a massive amount of sequence data to be
affordably and quickly obtained. In principle, these
approaches can be implemented without prior genomic
knowledge of the focal species, thus offering
tremendous potential for addressing various novel and
longstanding evolutionary questions previously hampered by
technology and cost [2].
NGS allows researchers to investigate genome-wide
molecular, structural, and regulatory mechanisms
underlying adaptation, diversification, and speciation [3]. NGS
also enables comparative genome scans for
polymorphism, which can then be used to infer demography and
selection [4]. Molecular phylogenetics also benefits from
the increasing accessibility of NGS. Large-scale,
multilocus data (i.e., hundreds to thousands of loci), combined
with improved analytical tools for inferring gene trees,
provide unprecedented opportunities for resolving
species phylogenies [5]. Toward this end, a core challenge of
population genomic and phylogenetic studies is obtaining
a reliable set of orthologous loci from a sufficient number
of individuals across populations or species spanning a
range of divergences [6]. Even though the cost of NGS
continues to fall, most evolutionary labs cannot sequence
whole genomes or a large portion of genomic regions
from samples spanning divergent clades. Moreover,
whole-genome data are simply not necessary to answer
many research questions. In this context, genome
partitioning and targeted re-sequencing of a consistent subset
of genomic regions will remain the most cost-effective
and analytically straightforward approach for most
evolutionary applications. Genome partitioning with targeted
DNA capture allows for the selective NGS of thousands
of genomic regions [7], facilitating rapid assays of genetic
variation. Compared to partitioning methods that search
for anonymous markers (e.g., restriction site associated
DNA tags, or RADtags [8]), DNA capture is expected to
be more efficient for finding orthologous markers among
divergent genomes [6,9,10]. When applied to exonic
regions, DNA capture can also provide information on
gene function and evolution. Exon capture involves the
hybridization of genomic libraries to short
oligonucleotide baits complementary to complete or partial exomes
printed on a microarray [7] or attached to magnetic
beads in solution [11]. The captured exon-containing
DNA fragments of individual or pooled genomic libraries
are then eluted from the array and the target-enriched
eluate is sequenced using an NGS platform. To date, the
design of exon capture has relied heavily on existing
high-quality genomic resources (e.g., [12]). However, the
genomes of most organisms of ecological and evolutionary
interest have yet to be sequenced, which has largely
impeded the expansion of DNA capture across the tree
of life.
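As a rough illustration of how short oligonucleotide baits can be laid out over a target exon, the following Python sketch tiles overlapping, reverse-complementary baits across a sequence. It is only a minimal sketch: the function names, the 60-base bait length and the 20-base step are hypothetical choices for illustration and are not the array design parameters used in this study.

```python
# Minimal sketch of bait tiling; names and default values are hypothetical.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def tile_baits(target, bait_len=60, step=20):
    """Tile overlapping baits across `target`.

    Returns a list of (start, bait) pairs, where `start` is the 0-based
    position of the tiled window on the target and `bait` is the reverse
    complement of that window. A final window is added at the 3' end if
    the regular tiling would leave trailing bases uncovered.
    """
    if bait_len <= 0 or step <= 0:
        raise ValueError("bait_len and step must be positive")
    if len(target) <= bait_len:
        # Target shorter than one bait: use a single bait spanning it.
        return [(0, reverse_complement(target))]
    last_start = len(target) - bait_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, reverse_complement(target[s:s + bait_len])) for s in starts]


if __name__ == "__main__":
    exon = "ACGT" * 26 + "A"  # a made-up 105-base target
    for start, bait in tile_baits(exon):
        print(start, bait)
```

Overlapping windows mean that each base of the target is represented by more than one bait wherever the step is shorter than the bait length, which is one common way to buffer against uneven hybridization along a target.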
Figure 1 An overall workflow of this study. The Tamias phylogenetic tree is modified from [13] by replacing the outgroup species with T. striatus. Tamias species that were not investigated in the present study are not shown.
In this study, we propose a series of methods (Figure 1)
aimed at adapting exon capture-based NGS to organisms
without pre-existing reference genomes. Here we
focused on array-based capture but note that the same
general principles should directly extend to an
in-solution approach. We tested our methods on North American
chipmunks of the genus Tamias.
Tamias are the focus of a comprehensive set of studies
that aim to understand their evolutionary history,
patterns of hybridization, and gene introgression (e.g.,
[13,14]). There is no reference genome currently
available for this group; at the onset of our study the most
closely related genomic resource was a low-coverage
(2X) draft genome of the thirteen-lined ground squirrel
(Ictidomys tridecemlineatus), which is around 30 million
years (My) divergent from Tamias. The house mouse
(Mus musculus) and rat (Rattus norvegicus) have the
closest high-quality reference genomes, but last shared a
common ancestor with chipmunks around 70 My ago. In
this context, we developed genomic resources by first
sequencing multi-tissue transcriptomes from one
chipmunk species (the alpine chipmunk, Tamias alpinus),
and then designed arrays by targeting a subset of exons
from the annotated transcripts. Furthermore, to test how
increased divergence affects capture efficiency, we
included anonymous genomic targets from the
thirteen-lined ground squirrel on this array. We then tested the
feasibility of this approach by using these arrays to
capture sequence from four chipmunk species, spanning the
range of genetic divergence in this genus. Up to 20
individually indexed genomic libraries from each species
wer (...truncated)