SASI-Seq: sample assurance Spike-Ins, and highly differentiating 384 barcoding for Illumina sequencing
Michael A Quail [1], Miriam Smith [1], David Jackson [1], Steven Leonard [1], Thomas Skelly [0], Harold P Swerdlow [1], Yong Gu [1], Peter Ellis [1]
[0] Leidos Biomedical Research, Frederick National Laboratory for Cancer Research, Bldg. 427, 21702-1201 Frederick, MD, USA
[1] Wellcome Trust Sanger Institute, Hinxton CB10 1SA, Cambs, UK
Open Access
Background: A minor but significant fraction of samples subjected to next-generation sequencing methods are
either mixed-up or cross-contaminated. These events can lead to false or inconclusive results. We have therefore
developed SASI-Seq, a process whereby a set of uniquely barcoded DNA fragments is added to samples destined
for sequencing. From the final sequencing data, one can verify that all the reads derive from the original sample(s)
and not from contaminants or other samples.
Results: By adding a mixture of three uniquely barcoded amplicons, of different sizes spanning the range of insert
sizes one would normally use for Illumina sequencing, at a spike-in level of approximately 0.1%, we demonstrate
that these fragments remain intimately associated with the sample. They can be detected following even the
tightest size selection regimes or exome enrichment and can report the occurrence of sample mix-ups and
cross-contamination.
As a consequence of this work, we have designed a set of 384 eleven-base Illumina barcode sequences that are at
least 5 changes apart from each other, allowing for single-error correction and very low levels of barcode
misallocation due to sequencing error.
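To illustrate why a minimum separation of 5 changes permits single-error correction with little misallocation, the following is a minimal sketch and not the authors' design software. The three eleven-base barcodes are toy sequences invented for the example (the actual 384-member set is not reproduced here), and the function names are hypothetical. If a tag carries e errors relative to its true barcode, it is still at least 5 - e changes away from every other barcode, so with a tolerance of one mismatch at least four errors are needed before a read can be assigned to the wrong sample.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign_barcode(read_tag, barcodes, max_mismatches=1):
    """Return the single barcode within max_mismatches of read_tag, else None.

    With a minimum pairwise distance of 5 and max_mismatches=1, at most one
    barcode can ever match, so the assignment is unambiguous.
    """
    hits = [b for b in barcodes if hamming(read_tag, b) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Toy barcodes (assumption: invented for illustration, not from the paper).
TOY_BARCODES = ["ACGTACGTACG", "TGCATGCATGC", "ACGTACTCCGA"]

if __name__ == "__main__":
    print(min_pairwise_distance(TOY_BARCODES))
    # One sequencing error in the last base is corrected.
    print(assign_barcode("ACGTACGTACC", TOY_BARCODES))
    # Two errors exceed the tolerance, so the read is left unassigned.
    print(assign_barcode("ACGTACGTAGC", TOY_BARCODES))
```

Rejecting reads with two or more mismatches, rather than assigning them to the nearest barcode, trades a small loss of reads for a lower risk of misallocation.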
Conclusion: SASI-Seq is a simple, inexpensive and flexible tool that enables sample assurance, allows deconvolution of
sample mix-ups and reports levels of cross-contamination between samples throughout NGS workflows.
Background
As NGS matures and sequence yields increase, the scale
of sequencing projects being undertaken is ever increasing.
There are now many sequencing projects tackling
thousands, or tens of thousands of samples; e.g., the UK10K
project (www.uk10k.org) and the malaria genome
consortium [1]. Large sample numbers from both case and
control sets are commonly being sequenced in order to
detect rare alleles that are associated with disease.
Sample contamination and mix-ups are a serious problem,
and can interfere with the sensitive statistical methods
being used to determine such causal variants [2-7]. Whilst
laboratories can implement elaborate tracking procedures
involving barcoding and automated handling, sample
swaps, plate swaps, and cross-contamination can still
occur [8,9]. Recent analyses using coxI phylogenetic
relationships suggest that up to 5% error may exist in
sequence database entries [10], but do not have the
power to determine the cause of that error. In the human
genome project, clone identity could be verified by
crossmatching in-silico digestion patterns of the final sequence
against DNA fingerprinting information generated during
physical map construction [11]. In the 1000 Genomes
Project [12], sample identity was verified by comparison
of sequence variation to the HapMap database
information for the corresponding sample and bioinformatics
tools were written to assess levels of cross-contamination
(e.g. ContEST [13] and subsequently VerifyBAM [7]).
These approaches, however, are expensive and require
significant work, which may preclude their use for larger
sequencing projects and fast turn-around clinical
sequencing projects. Furthermore, they are sometimes not
sensitive enough to identify a sample unambiguously, and
because they report only the bulk properties of a sample
they cannot report minor cross-contamination
events.
Thus, we have conceived SASI-Seq (Sample Assurance
Spike-In sequencing) whereby uniquely barcoded DNA
fragments are spiked into samples at the outset. A given
SASI tag will stay intimately associated with a sample as it
is processed through library preparation and sequencing
set-up (Figure 1). The sequence of that tag will be read at
the same time that a sample is sequenced, thus allowing
unambiguous identification of a sample by virtue of its
reported SASI tag sequence. The spike-in can be done at
low levels that would nonetheless generate a large enough
number of reads to enable identification of minor
contaminants. The idea of spiked-in fragments is not new: ERCC
RNA spike-ins [14] are routinely used to normalise RNA
expression levels between different experiments;
combinations of primer pairs specifying control fragments of
defined length have been advocated for genotyping studies
[15]; and Illumina include optional spike-in fragments within
their TruSeq kits to diagnose the efficiency of library
preparation steps.
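The identification step described above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the authors' pipeline: the tag names, read counts and function names are hypothetical, and the estimate assumes that a contaminating sample carries its own SASI tags at a similar spike-in level. At a spike-in level of approximately 0.1%, a hypothetical run of 10 million reads would yield roughly 10,000 spike reads, so a 1% contaminant would still contribute on the order of 100 tag reads.

```python
from collections import Counter


def expected_spike_reads(total_reads, spike_fraction=0.001):
    """Approximate number of reads expected from the spike-in fragments."""
    return total_reads * spike_fraction


def summarise_tags(tag_counts, expected_tags):
    """Compare the SASI tags observed in a sample with those seeded into it.

    tag_counts: mapping of tag name to number of reads carrying that tag.
    expected_tags: set of tag names that were added to this sample.
    """
    total = sum(tag_counts.values())
    if total == 0:
        raise ValueError("no spike-in reads were detected")
    expected = sum(n for tag, n in tag_counts.items() if tag in expected_tags)
    unexpected = {t: n for t, n in tag_counts.items() if t not in expected_tags}
    return {
        "spike_reads": total,
        "expected_fraction": expected / total,
        # Fraction of tag reads from tags seeded into other samples.
        "unexpected_fraction": (total - expected) / total,
        "unexpected_tags": unexpected,
    }


# Hypothetical counts: three seeded tags plus one tag from another sample.
observed = Counter({"tag_017": 3300, "tag_102": 3300, "tag_256": 3300,
                    "tag_311": 100})
seeded = {"tag_017", "tag_102", "tag_256"}

if __name__ == "__main__":
    print(expected_spike_reads(10_000_000))
    print(summarise_tags(observed, seeded))
```

A sample swap would appear as an expected fraction near zero, whereas cross-contamination appears as a small but non-zero unexpected fraction together with the identity of the offending tags.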
The present method, however, has much wider utility,
providing assurance that a sequence has come from the
correct sample. Without such assurance, sample swaps
and cross-contamination often go unnoticed, resulting in
erroneous or confusing results, both of which could be
disastrous for clinical sequencing applications.
With the introduction of massively parallel
next-generation sequencing technologies came the realisation
that a single sequencing run often yielded too many reads,
particularly for smaller genomes and amplicons. Methods
were developed to multiplex samples, involving the
addition of a different unique short barcode sequence to
each sample during library preparation. Subsequently, they
could be mixed, sequenced together and the reads correctly
attributed to the appropriate sample by binning reads
containing the same barcode sequence. This practice was
first reported for Roche 454 sequencing [16,17], and soon
after for the Illumina platform [18]. As sequencing yields
have risen higher, the degree of multiplexing has also
risen, with Kozarewa and Turner (2011) reporting a set of
96 barcodes [19], Caporaso et al. (2012) describing a set
of 2167 barcodes [20] and Costea et al. (2013) developing
the software tool TagGD that can design up to 20,000-plex
barcode sets [21], for use in Illumina sequencing. These
[Figure 1 workflow panels: seed incoming samples with unique combinations of spike-in fragments; library prep +/- size selection; sequence and identify spikes]
Figure 1 Diagrammatic representation of the SASI-Seq process. Amplicons of a reference sequence (here we use PhiX174) are generated
with unique barcodes at their 5′ end. Sets of amplicons with different barcodes are added to each sample that is destined for sequencing. The
SASI fragments stay with the sample through library prep and can be detected after sequencing. SASI-Seq thus verifies which sample the sequence
d (...truncated)