SASI-Seq: sample assurance Spike-Ins, and highly differentiating 384 barcoding for Illumina sequencing

[Figure: SASI-Seq overview. Labels: Barcode sequences (384 choices); FC, FB, FA, phiX, R; PCR + clean up; 214 bp, 397 bp, 568 bp; Seed incoming samples with unique combinations of spike-in fragments; Library prep +/- size selection; Sequence and identify spikes.]

Quail et al. BMC Genomics 2014, 15:110
http://www.biomedcentral.com/1471-2164/15/110

METHODOLOGY ARTICLE Open Access

Michael A Quail1*, Miriam Smith1, David Jackson1, Steven Leonard1, Thomas Skelly2, Harold P Swerdlow1, Yong Gu1 and Peter Ellis1

Abstract

Background: A minor but significant fraction of samples subjected to next-generation sequencing methods are either mixed up or cross-contaminated. These events can lead to false or inconclusive results. We have therefore developed SASI-Seq, a process whereby a set of uniquely barcoded DNA fragments is added to samples destined for sequencing. From the final sequencing data, one can verify that all the reads derive from the original sample(s) and not from contaminants or other samples.

Results: By adding a mixture of three uniquely barcoded amplicons, of different sizes spanning the range of insert sizes one would normally use for Illumina sequencing, at a spike-in level of approximately 0.1%, we demonstrate that these fragments remain intimately associated with the sample. They can be detected following even the tightest size selection regimes or exome enrichment and can report the occurrence of sample mix-ups and cross-contamination. As a consequence of this work, we have designed a set of 384 eleven-base Illumina barcode sequences that are at least 5 changes apart from each other, allowing for single-error correction and very low levels of barcode misallocation due to sequencing error.

Conclusion: SASI-Seq is a simple, inexpensive and flexible tool that enables sample assurance, allows deconvolution of sample mix-ups and reports levels of cross-contamination between samples throughout NGS workflows.

Keywords: Next-generation sequencing, Indexing, Barcode, Illumina, Sample assurance, Spike-in, Contamination, Sample identity

* Correspondence: 1 Wellcome Trust Sanger Institute, Hinxton CB10 1SA, Cambs, UK. Full list of author information is available at the end of the article.

© 2014 Quail et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

As NGS matures and sequence yields increase, the scale of sequencing projects being undertaken is ever increasing. There are now many sequencing projects tackling thousands, or tens of thousands, of samples; e.g., the UK10K project (www.uk10k.org) and the malaria genome consortium [1]. Large sample numbers from both case and control sets are commonly being sequenced in order to detect rare alleles that are associated with disease. Sample contamination and mix-ups are a serious problem, and can interfere with the sensitive statistical methods being used to determine such causal variants [2-7]. Whilst laboratories can implement elaborate tracking procedures involving barcoding and automated handling, sample swaps, plate swaps, and cross-contamination can still occur [8,9].

Recent analyses using coxI phylogenetic relationships suggest that up to 5% error may exist in sequence database entries [10], but do not have the power to determine the cause of that error. In the human genome project, clone identity could be verified by cross-matching in-silico digestion patterns of the final sequence against DNA fingerprinting information generated during physical map construction [11]. In the 1000 Genomes Project [12], sample identity was verified by comparison of sequence variation to the HapMap database information for the corresponding sample, and bioinformatics tools were written to assess levels of cross-contamination (e.g. ContEST [13] and subsequently VerifyBAM [7]). These approaches, however, are expensive, requiring significant work that may preclude their use for larger sequencing projects and fast turn-around clinical sequencing projects. Furthermore, they are sometimes not sensitive enough to identify a sample unambiguously, and because they report only the bulk properties of a sample they would not be able to report minor cross-contamination events.

Thus, we have conceived SASI-Seq (Sample Assurance Spike-In sequencing), whereby uniquely barcoded DNA fragments are spiked into samples at the outset. A given SASI tag will stay intimately associated with a sample as it is processed through library preparation and sequencing set-up (Figure 1). The sequence of that tag will be read at the same time that a sample is sequenced, thus allowing unambiguous identification of a sample by virtue of its reported SASI tag sequence. The spike-in can be done at low levels that would nonetheless generate a large enough number of reads to enable identification of minor contaminants.
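
To make the idea of identifying a sample from its reported tags more concrete, the following Python sketch shows one way tag reads might be assigned and summarised. It is an illustration under stated assumptions, not the authors' pipeline: the four 11-base tag sequences are toy values (not taken from the 384-barcode set), the read counts are invented, and the helpers `hamming`, `assign_tag` and `unexpected_fraction` are hypothetical names introduced here.

```python
from collections import Counter

# Toy 11-base tags, NOT taken from the paper's 384-barcode set.
# They are chosen so that every pair differs at 5 or more positions.
TAGS = {
    "AAAAAAAAAAA": "tag_A",
    "CCCCCAAAAAA": "tag_B",
    "AAAAAACCCCC": "tag_C",
    "GGGGGGGGGGG": "tag_X",
}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_tag(read_barcode: str, max_mismatches: int = 1):
    """Return the name of the tag within max_mismatches of the read, else None.

    With tags at least 5 changes apart, a read cannot lie within one
    mismatch of two different tags, so a match is unambiguous.
    """
    for seq, name in TAGS.items():
        if hamming(read_barcode, seq) <= max_mismatches:
            return name
    return None


def unexpected_fraction(tag_counts, expected) -> float:
    """Fraction of assigned spike-in reads whose tag was not seeded into this sample."""
    total = sum(tag_counts.values())
    if total == 0:
        raise ValueError("no spike-in reads were assigned")
    foreign = sum(n for tag, n in tag_counts.items() if tag not in expected)
    return foreign / total


# Invented spike-in barcode reads from one sample seeded with tag_A, tag_B and tag_C.
# The tag_B reads carry a single sequencing error in the last base.
reads = (
    ["AAAAAAAAAAA"] * 40
    + ["CCCCCAAAAAT"] * 30
    + ["AAAAAACCCCC"] * 29
    + ["GGGGGGGGGGG"] * 1
)
counts = Counter(tag for tag in map(assign_tag, reads) if tag is not None)
contamination = unexpected_fraction(counts, expected={"tag_A", "tag_B", "tag_C"})
print(counts, contamination)
```

In this toy run, one of the 100 assigned spike-in reads carries a tag that was not seeded into the sample, which would flag roughly 1% cross-contamination. For scale, at a spike-in level of about 0.1%, a hypothetical sample yielding 10 million reads would contribute on the order of 10,000 spike-in reads, which is why even a low spike-in level leaves enough reads to see minor contaminants.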

The idea of spiked-in fragments is not new: ERCC RNA spike-ins [14] are routinely used to normalise RNA expression levels between different experiments, combinations of primer pairs specifying control fragments of defined length have been advocated for genotyping studies [15], and Illumina includes optional spike-in fragments within its TruSeq kits to diagnose the efficiency of library preparation steps. The present method, however, has much wider utility, providing assurance that a sequence has come from the correct sample. Without such assurance, sample swaps and cross-contamination often go unnoticed, resulting in erroneous or confusing results, both of which could be disastrous for clinical sequencing applications.

With the introduction of massively parallel next-generation sequencing technologies came the realisation that a single sequencing run often yielded too many reads, particularly for smaller genomes and amplicons. Methods were developed to multiplex samples, involving the addition of a different unique short barcode sequence to each sample during library preparation. The samples could subsequently be mixed and sequenced together, and the reads correctly attributed to the appropriate sample by binning reads containing the same barcode sequence. This practice was first reported for Roche 454 sequencing [16,17], and soon after for the Illumina platform [18]. As s (...truncated)