SASI-Seq: sample assurance Spike-Ins, and highly differentiating 384 barcoding for Illumina sequencing
[Figure: SASI-Seq workflow. Steps: (1) seed incoming samples with unique combinations of spike-in fragments; (2) library prep +/- size selection; (3) sequence and identify spikes. Graphic labels: barcode sequences (384 choices); FA, FB, FC and R on phiX; PCR + clean up; fragment sizes 214 bp, 397 bp and 568 bp.]
Quail et al. BMC Genomics 2014, 15:110
http://www.biomedcentral.com/1471-2164/15/110
METHODOLOGY ARTICLE
Open Access
SASI-Seq: sample assurance Spike-Ins, and highly
differentiating 384 barcoding for Illumina
sequencing
Michael A Quail1*, Miriam Smith1, David Jackson1, Steven Leonard1, Thomas Skelly2, Harold P Swerdlow1,
Yong Gu1 and Peter Ellis1
Abstract
Background: A minor but significant fraction of samples subjected to next-generation sequencing methods are
either mixed-up or cross-contaminated. These events can lead to false or inconclusive results. We have therefore
developed SASI-Seq, a process whereby a set of uniquely barcoded DNA fragments is added to samples destined
for sequencing. From the final sequencing data, one can verify that all the reads derive from the original sample(s)
and not from contaminants or other samples.
Results: By adding a mixture of three uniquely barcoded amplicons, of different sizes spanning the range of insert
sizes one would normally use for Illumina sequencing, at a spike-in level of approximately 0.1%, we demonstrate
that these fragments remain intimately associated with the sample. They can be detected following even the
tightest size selection regimes or exome enrichment and can report the occurrence of sample mix-ups and
cross-contamination.
As a consequence of this work, we have designed a set of 384 eleven-base Illumina barcode sequences that are at
least 5 changes apart from each other, allowing for single-error correction and very low levels of barcode
misallocation due to sequencing error.
Conclusion: SASI-Seq is a simple, inexpensive and flexible tool that enables sample assurance, allows deconvolution of
sample mix-ups and reports levels of cross-contamination between samples throughout NGS workflows.
Keywords: Next-generation sequencing, Indexing, Barcode, Illumina, Sample assurance, Spike-in, Contamination,
Sample identity
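The barcode property stated in the abstract (every pair of eleven-base barcodes differs at 5 or more positions) can be illustrated with a short sketch. A code with minimum Hamming distance d can in principle correct up to floor((d - 1) / 2) errors, so d = 5 leaves a safety margin when only single errors are corrected. The four toy barcodes and the function names below are hypothetical and are not taken from the published set of 384.

```python
from itertools import combinations

# Hypothetical 11-base barcodes for illustration only (NOT the published set).
TOY_BARCODES = [
    "AAAAAAAAAAA",
    "CCCCCAAAAAA",
    "AAAAACCCCCA",
    "GGGAAGGGAAG",
]


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes) -> int:
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign_barcode(read_tag: str, barcodes, max_errors: int = 1):
    """Return the single barcode within max_errors mismatches of read_tag.

    Returns None when no barcode, or more than one barcode, is that close,
    so ambiguous or heavily corrupted index reads are left unassigned.
    """
    hits = [b for b in barcodes if hamming(read_tag, b) <= max_errors]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    print(min_pairwise_distance(TOY_BARCODES))            # 5 for this toy set
    print(assign_barcode("AAAAAAAAAAT", TOY_BARCODES))    # one error: corrected
    print(assign_barcode("AAAAAAAAATT", TOY_BARCODES))    # two errors: None
```

With a minimum distance of 5, an index read carrying one sequencing error is still at least 4 changes away from every other barcode, which is why single-error correction carries little risk of misallocation.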
* Correspondence:
1 Wellcome Trust Sanger Institute, Hinxton CB10 1SA, Cambs, UK
Full list of author information is available at the end of the article
Background
As NGS matures and sequence yields increase, the scale
of sequencing projects being undertaken is ever increasing.
There are now many sequencing projects tackling thousands, or tens of thousands of samples; e.g., the UK10K
project (www.uk10k.org) and the malaria genome consortium [1]. Large sample numbers from both case and
control sets are commonly being sequenced in order to
detect rare alleles that are associated with disease.
Sample contamination and mix-ups are a serious problem,
and can interfere with the sensitive statistical methods
being used to determine such causal variants [2-7]. Whilst
laboratories can implement elaborate tracking procedures
involving barcoding and automated handling, sample
swaps, plate swaps, and cross-contamination can still
occur [8,9]. Recent analyses using coxI phylogenetic
relationships suggest that up to 5% error may exist in
sequence database entries [10], but do not have the
power to determine the cause of that error. In the Human
Genome Project, clone identity could be verified by cross-matching in silico digestion patterns of the final sequence
against DNA fingerprinting information generated during
physical map construction [11]. In the 1000 Genomes
Project [12], sample identity was verified by comparing
sequence variation with the HapMap database information for the corresponding sample, and bioinformatics
tools were written to assess levels of cross-contamination
(e.g. ContEst [13] and subsequently verifyBamID [7]).
These approaches, however, are expensive, requiring significant work that may preclude their use for larger
sequencing projects and fast turn-around clinical sequencing projects. Furthermore, they are sometimes not sensitive enough to identify a sample unambiguously, and because they
report only the bulk properties of a sample they
cannot report minor cross-contamination
events.
© 2014 Quail et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication
waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise
stated.
Thus, we have conceived SASI-Seq (Sample Assurance
Spike-In sequencing) whereby uniquely barcoded DNA
fragments are spiked into samples at the outset. A given
SASI tag will stay intimately associated with a sample as it
is processed through library preparation and sequencing
set-up (Figure 1). The sequence of that tag will be read at
the same time that a sample is sequenced, thus allowing
unambiguous identification of a sample by virtue of its
reported SASI tag sequence. The spike-in can be done at
low levels that would nonetheless generate a large enough
number of reads to enable identification of minor contaminants. The idea of spiked-in fragments is not new: ERCC
RNA spike-ins [14] are routinely used to normalise RNA
expression levels between different experiments; combinations of primer pairs specifying control fragments of
defined length have been advocated for genotyping studies
[15]; and Illumina includes optional spike-in fragments within
its TruSeq kits to diagnose the efficiency of library preparation steps.
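The identification step described above can be sketched as a simple tally of spike-in reads per SASI tag. As illustrative arithmetic, a spike-in level of about 0.1% in a library of 10 million reads (an assumed figure) would yield roughly 0.001 × 10,000,000 = 10,000 spike reads, enough for a contaminant at the 1% level to contribute on the order of 100 of them. The function name, tag identifiers and read counts below are hypothetical, and the sketch assumes that any contaminating sample was itself spiked at a comparable level.

```python
from collections import Counter


def summarise_spikes(tag_reads, expected_tags):
    """Summarise spike-in reads for one sample.

    tag_reads     -- iterable of SASI tag identifiers, one per spike-in read
                     (hypothetical input; read-to-tag matching is not shown)
    expected_tags -- set of tag identifiers that were added to this sample
    """
    counts = Counter(tag_reads)
    total = sum(counts.values())
    # Counter returns 0 for tags that were never observed.
    expected = sum(counts[t] for t in expected_tags)
    return {
        "total_spike_reads": total,
        # Expected tags with no reads suggest a sample swap.
        "missing_expected_tags": sorted(t for t in expected_tags if counts[t] == 0),
        # Tags that were never added to this sample suggest contamination.
        "unexpected_tags": {t: n for t, n in counts.items() if t not in expected_tags},
        "unexpected_fraction": (total - expected) / total if total else None,
    }


if __name__ == "__main__":
    # Made-up counts: three expected tags plus a minor foreign tag.
    reads = ["tagA"] * 3300 + ["tagB"] * 3400 + ["tagC"] * 3200 + ["tagX"] * 100
    print(summarise_spikes(reads, {"tagA", "tagB", "tagC"}))
```

A sample whose expected tags are all missing while a different combination dominates would point to a mix-up, whereas a small unexpected fraction alongside the correct tags would point to cross-contamination.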
The present method, however, has much wider utility,
providing assurance that a sequence has come from the
correct sample. Without such assurance, sample swaps
and cross-contamination often go unnoticed, resulting in
erroneous or confusing results, both of which could be
disastrous for clinical sequencing applications.
With the introduction of massively parallel next-generation sequencing technologies came the realisation
that a single sequencing run often yielded too many reads,
particularly for smaller genomes and amplicons. Methods
were developed to multiplex samples, involving the
addition of a different unique short barcode sequence to
each sample during library preparation. The barcoded samples
could then be mixed and sequenced together, and the reads correctly
attributed to the appropriate sample by binning reads
containing the same barcode sequence. This practice was
first reported for Roche 454 sequencing [16,17], and soon
after for the Illumina platform [18]. As s (...truncated)