Phylogenetic Properties of 50 Nuclear Loci in Medicago (Leguminosae) Generated Using Multiplexed Sequence Capture and Next-Generation Sequencing
October
Filipe de Sousa 0 1 2
Yann J. K. Bertrand 0 1 2
Stephan Nylinder 0 1 2
Bengt Oxelman 0 1 2
Jonna S. Eriksson 0 1 2
Bernard E. Pfeil 0 1 2
0 1 Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden, 2 Department of Botany, Swedish Museum of Natural History, Stockholm, Sweden
Data Availability Statement: All relevant data are within the paper and its Supporting Information files, and all sequence files are available at the European Nucleotide Archive with accession numbers ERS511665, ERS511666, ERS511667, ERS511668, ERS511669.
Academic Editor: Sven Buerki, Royal Botanic Gardens, Kew, United Kingdom
Next-generation sequencing technology has increased the capacity to generate molecular data for plant biological research, including phylogenetics, and can potentially contribute to resolving complex phylogenetic problems. The evolutionary history of Medicago L. (Leguminosae: Trifolieae) remains unresolved due to incongruence between published phylogenies. Identifying the processes causing this genealogical incongruence is essential for inferring a correct species phylogeny of the genus and requires that more molecular data, preferably from low-copy nuclear (LCN) genes, are obtained across different species. Here we report the development of 50 novel LCN markers in Medicago and assess the phylogenetic properties of each marker. We used the genomic resources available for Medicago truncatula Gaertn., hybridisation-based gene enrichment (sequence capture) techniques and next-generation sequencing to generate sequences. This approach proves to be a cost-effective alternative to amplicon sequencing in phylogenetic studies at the genus or tribe level and allows for an increase in the number and size of targeted loci. Substitution rate estimates for each of the 50 loci are provided, and an overview of the variation in substitution rates among a large number of low-copy nuclear genes in plants is presented for the first time. Aligned sequences of major species lineages of Medicago and its sister genus are made available and can be used in further probe development for sequence capture of the same markers.
Funding: This work was supported by grants from
the Swedish Research Council, the Royal Swedish
Academy of Sciences (grant 2009-5206), Lars
Hiertas Minne fund, The Royal Physiographic Society
in Lund, Helge Ax:son Johnsons fund, and the
Competing Interests: The authors have declared that no competing interests exist.
The development and rapidly growing capacity of next-generation sequencing (NGS) have
greatly increased the amount of data generated for research in plant biology. Large datasets of
molecular sequences are now being collected across various model and non-model organisms
by sequencing whole genomes, transcriptomes, or through enrichment of multiple genes at
either specific or anonymous loci [1]. Systematic biology is also set to benefit from these
developments, with several projects having already used NGS to obtain data [2–5]. However, the
application of NGS in phylogenetics is still in its infancy and far from routine, partly because
there has been no consensus on the choice of sampling strategy [6].
Whole genome sequencing has been used to explore individual variation at the genomic
level in plants [7–8] but, due to its high cost, is not expected to be widely applied in plant
phylogenetic research in the near future. Anonymous-locus approaches, such as
restriction-site-associated DNA (RAD) tags [9], have been successfully used to resolve species
relationships [3], [10], but do not always result in good overlap among samples, which may
compromise the overall cost efficiency of these methods. Furthermore, anonymous loci are likely
to have higher levels of paralogy and a short phylogenetic span [11]. Genome skimming approaches [12] can
be used to sequence the high-copy fraction of plant genomes (cpDNA, mtDNA, rDNA) and, to
some extent, to identify nuclear loci, but in the latter case the amount of information obtained
is limited and highly dependent on sequencing depth and genome size.
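The dependence of genome skimming on sequencing depth and genome size can be illustrated with the standard mean-coverage approximation, depth = (number of reads × read length) / target size. The sketch below is only a back-of-the-envelope illustration and is not part of the methods of this study: the function names, the run size and the genome sizes are hypothetical round numbers, and reads are assumed to be distributed uniformly over the nuclear genome, which real shotgun data only approximate.

```python
import math


def expected_depth(n_reads: int, read_length: int, target_size: int) -> float:
    """Mean per-site depth, assuming reads are spread uniformly over the target."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return n_reads * read_length / target_size


def reads_needed(depth: float, read_length: int, target_size: int) -> int:
    """Smallest number of reads that reaches the requested mean depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(depth * target_size / read_length)


if __name__ == "__main__":
    # Hypothetical skimming run: 2 million reads of 100 bp for one sample.
    n_reads, read_length = 2_000_000, 100

    # Hypothetical nuclear genome sizes in base pairs (illustrative only).
    for genome_size in (250_000_000, 500_000_000, 1_000_000_000):
        depth = expected_depth(n_reads, read_length, genome_size)
        print(f"genome {genome_size:>13,} bp -> mean nuclear depth {depth:.2f}x")
```

Under these assumptions the same sequencing effort gives a mean nuclear depth that halves each time the genome size doubles, which is why the amount of nuclear information recovered from a skimming run is limited for larger genomes.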
Hybridisation-based enrichment (or sequence capture), on the other hand, appears to have
great potential to address these challenges: loci of interest, or those with suitable
properties for analysis, are selected a priori, generating larger and more informative data sets
than other genomic sampling strategies [13]. Sequence capture has already been used in
phylogenetics and phylogeography, in both plants and animals [2], [4–5], [14–17], and is likely to
replace PCR as the main target enrichment method in plant sciences [1], [18]. One or more
genomes or transcriptomes are necessary for probe design prior to sequence capture, but for
groups where a close reference is lacking, protocol modifications can be made to capture targets
that are not phylogenetically close to the reference [19]. Hybridisation-based enrichment can
also overcome the problem of degraded genomic DNA, which is often encountered in
herbarium and museum material [20–21]. Multiplexing of indexed DNA libraries for
sequence capture requires significantly less work and time than obtaining the same data via
PCR amplification of the targets, while also reducing sequencing costs when combined with
NGS platforms such as Illumina [22]. Multiplexing requires that the total size of the target is not
excessive, otherwise the read depth (number of reads covering a particular site) might be insufficient for
proper contig assembly and variant calling. Furthermore, keeping the targets to moderate sizes
while generating longer sequences (long loci rather than SNP/RAD-tag data) produces more
informative data per locus. Generating large alignments may involve a significant amount of
manual work, but enables the inference of better resolved and more robust gene trees and, consequently,
the correct assessment of gene tree incongruence, for which SNP and RAD-tag data are severely
limited. The cost per base of sequence is vastly lower in NGS than in Sanger sequencing [23]
but the overall investment, especially for sample preparation, is still considerable. Therefore,
instead of relying solely on exploratory sampling of new loci, it is also worth
sampling characterised markers that have already been tested for both ease of recovery with
sequence capture methods and suitable sequence variability. Targeting previously employed loci
is especially important in phylogenetics and phylogeography, which require homologous
molecular data from multiple individuals [6], because newly produced sequences can easily be
incorporated into pre-existing phylogenies. As more researchers use the same loci across many
taxa, large phylogenies can be inferred using data sets with much lower proportions of missing
data than is typically the case at presen (...truncated)