SASI-Seq: sample assurance Spike-Ins, and highly differentiating 384 barcoding for Illumina sequencing

Article PDF: http://www.biomedcentral.com/content/pdf/1471-2164-15-110.pdf

Michael A Quail 1, Miriam Smith 1, David Jackson 1, Steven Leonard 1, Thomas Skelly 0, Harold P Swerdlow 1, Yong Gu 1, Peter Ellis 1

0 Leidos Biomedical Research, Frederick National Laboratory for Cancer Research, Bldg. 427, 21702-1201 Frederick, MD, USA
1 Wellcome Trust Sanger Institute, Hinxton CB10 1SA, Cambs, UK

Open Access

Background: A minor but significant fraction of samples subjected to next-generation sequencing methods is either mixed up or cross-contaminated. These events can lead to false or inconclusive results. We have therefore developed SASI-Seq, a process whereby a set of uniquely barcoded DNA fragments is added to samples destined for sequencing. From the final sequencing data, one can verify that all the reads derive from the original sample(s) and not from contaminants or other samples.

Results: By adding a mixture of three uniquely barcoded amplicons, of different sizes spanning the range of insert sizes one would normally use for Illumina sequencing, at a spike-in level of approximately 0.1%, we demonstrate that these fragments remain intimately associated with the sample. They can be detected following even the tightest size selection regimes or exome enrichment and can report the occurrence of sample mix-ups and cross-contamination. As a consequence of this work, we have designed a set of 384 eleven-base Illumina barcode sequences that are at least 5 changes apart from each other, allowing for single-error correction and very low levels of barcode misallocation due to sequencing error.

Conclusion: SASI-Seq is a simple, inexpensive and flexible tool that enables sample assurance, allows deconvolution of sample mix-ups and reports levels of cross-contamination between samples throughout NGS workflows.
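The link between a minimum separation of 5 changes and single-error correction can be illustrated with a small sketch. It assumes that "changes" means substitutions (Hamming distance), which the text here does not state explicitly. The barcodes below are toy sequences invented for illustration, not members of the published 384 set, and the function names are hypothetical. If every pair of barcodes differs at 5 or more positions, a tag read with one substitution error is at distance 1 from its true barcode and at distance 4 or more from every other, so it can be assigned without ambiguity.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign_barcode(read_tag, barcodes, max_mismatches=1):
    """Return the unique barcode within max_mismatches of read_tag, else None."""
    hits = [b for b in barcodes if hamming(read_tag, b) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Toy 11-base barcodes (assumption: invented for this example only).
TOY_BARCODES = [
    "A" * 11,
    "C" * 5 + "A" * 6,
    "A" * 6 + "C" * 5,
    "G" * 11,
]

if __name__ == "__main__":
    print("minimum pairwise distance:", min_pairwise_distance(TOY_BARCODES))
    # One substitution error is still assigned to the correct barcode.
    print(assign_barcode("AAAAATAAAAA", TOY_BARCODES))
    # Two errors exceed the single-error threshold, so the tag is left unassigned.
    print(assign_barcode("CCAAAAAAAAA", TOY_BARCODES))
```

Rejecting tags with more than one mismatch, rather than assigning them to the nearest barcode, is a conservative choice: it discards some reads but keeps misallocation low, which matches the aim stated above.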
* Correspondence: 1 Wellcome Trust Sanger Institute, Hinxton CB10 1SA, Cambs, UK. Full list of author information is available at the end of the article.

Background

As NGS matures and sequence yields increase, the scale of sequencing projects being undertaken is ever increasing. There are now many sequencing projects tackling thousands, or tens of thousands, of samples; e.g., the UK10K project (www.uk10k.org) and the malaria genome consortium [1]. Large sample numbers from both case and control sets are commonly being sequenced in order to detect rare alleles that are associated with disease. Sample contamination and mix-ups are a serious problem, and can interfere with the sensitive statistical methods being used to determine such causal variants [2-7]. Whilst laboratories can implement elaborate tracking procedures involving barcoding and automated handling, sample swaps, plate swaps, and cross-contamination can still occur [8,9]. Recent analyses using coxI phylogenetic relationships suggest that up to 5% error may exist in sequence database entries [10], but do not have the power to determine the cause of that error.

In the Human Genome Project, clone identity could be verified by cross-matching in-silico digestion patterns of the final sequence against DNA fingerprinting information generated during physical map construction [11]. In the 1000 Genomes Project [12], sample identity was verified by comparison of sequence variation to the HapMap database information for the corresponding sample, and bioinformatics tools were written to assess levels of cross-contamination (e.g. ContEST [13] and subsequently VerifyBAM [7]). These approaches, however, are expensive, requiring significant work that may preclude their use for larger sequencing projects and fast turn-around clinical sequencing projects.
Furthermore, they are sometimes not sensitive enough to unambiguously identify a sample, and because they report only the bulk properties of a sample they would not be able to report minor cross-contamination events.

Thus, we have conceived SASI-Seq (Sample Assurance Spike-In sequencing), whereby uniquely barcoded DNA fragments are spiked into samples at the onset. A given SASI tag will stay intimately associated with a sample as it is processed through library preparation and sequencing set-up (Figure 1). The sequence of that tag will be read at the same time that a sample is sequenced, thus allowing unambiguous identification of a sample by virtue of its reported SASI tag sequence. The spike-in can be done at low levels that would nonetheless generate a large enough number of reads to enable identification of minor contaminants.

The idea of spiked-in fragments is not new: ERCC RNA spike-ins [14] are routinely used to normalise RNA expression levels between different experiments, combinations of primer pairs specifying control fragments of defined length have been advocated for genotyping studies [15], and Illumina include optional spike-in fragments within their TruSeq kits to diagnose the efficiency of library preparation steps. The present method, however, has much wider utility, providing assurance that a sequence has come from the correct sample. Without such assurance, sample swaps and cross-contamination often go unnoticed, resulting in erroneous or confusing results, both of which could be disastrous for clinical sequencing applications.

With the introduction of massively parallel next-generation sequencing technologies came the realisation that a single sequencing run often yielded too many reads, particularly for smaller genomes and amplicons. Methods were developed to multiplex samples, involving the addition of a different unique short barcode sequence to each sample during library preparation.
Subsequently, the samples could be mixed, sequenced together, and the reads correctly attributed to the appropriate sample by binning reads containing the same barcode sequence. This practice was first reported for Roche 454 sequencing [16,17], and soon after for the Illumina platform [18]. As sequencing yields have risen higher, the degree of multiplexing has also risen, with Kozarewa and Turner (2011) reporting a set of 96 barcodes [19], Caporaso et al. (2012) describing a set of 2167 barcodes [20] and Costea et al. (2013) developing the software tool TagGD that can design up to 20,000-plex barcode sets [21], for use in Illumina sequencing. These

[Figure 1 panel labels: Seed incoming samples with unique combinations of spike-in fragments; Library prep +/- size selection; Sequence and identify spikes]

Figure 1 Diagrammatic representation of the SASI-Seq process. Amplicons of a reference sequence (here we use PhiX174) are generated with unique barcodes at their 5′ end. Sets of amplicons with different barcodes are added to each sample that is destined for sequencing. The SASI fragments stay with the sample through library prep and can be detected after sequencing. SASI-Seq thus verifies which sample the sequence d (...truncated)