Illumina mate-paired DNA sequencing-library preparation using Cre-Lox recombination
Filip Van Nieuwerburgh (2), Ryan C. Thompson (1), Jessica Ledesma (0), Dieter Deforce (2), Terry Gaasterland (1), Phillip Ordoukhanian (0), Steven R. Head (0)
0 Next Generation Sequencing Core, The Scripps Research Institute, La Jolla, CA 92037, USA
1 Laboratory of Computational Genomics, Marine Biology Research Division, Scripps Institution of Oceanography, UCSD
2 Laboratory of Pharmaceutical Biotechnology, Ghent University, Harelbekestraat 72, 9000 Ghent, Belgium
Standard Illumina mate-paired libraries are constructed from 3- to 5-kb DNA fragments by blunt-end circularization. Sequencing reads that pass through the junction of the two joined ends of a 3- to 5-kb DNA fragment are not easy to identify and pose problems during mapping and de novo assembly. Longer read lengths increase the possibility that a read will cross the junction. To solve this problem, we developed a mate-paired protocol for use with Illumina sequencing technology that uses Cre-Lox recombination instead of blunt-end circularization. In this method, a LoxP sequence is incorporated at the junction site. This sequence allows screening reads for junctions without using a reference genome. Junction reads can be trimmed or split at the junction. Moreover, the location of the LoxP sequence in the reads distinguishes mate-paired reads from spurious paired-end reads. We tested this new method by preparing and sequencing a mate-paired library with an insert size of 3 kb from Saccharomyces cerevisiae. We present an analysis of the library quality statistics and a new bioinformatics tool called DeLoxer that can be used to analyze an Illumina Cre-Lox mate-paired data set. We also demonstrate how the resulting data significantly improve a de novo assembly of the S. cerevisiae genome.
Paired-end and mate-paired sequencing libraries both
provide, in addition to sequence information,
information about the physical distance between the
two reads in the reference genome. The ability to map
reads to a reference using distance information is useful
to resolve larger structural rearrangements (insertions,
deletions, inversions). Distance information also has a
major impact on the overall success of de novo assembly
with short reads, helping to assemble across repetitive
regions: if one read cannot be mapped because it falls in
a highly repetitive region, but the paired read is unique,
the distance information can be used to map both reads.
When the two reads of a pair can be mapped to two
different contiguous sequences from an assembly (contigs),
they specify the contigs' order, orientation and
approximate distance in the genome. This ability greatly facilitates
de novo genome assembly of complex organisms. The
two terms differ mainly in insert size: mate-paired
typically indicates a longer insert than paired-end,
with mate-paired insert sizes measuring
between 2 and 20 kb.
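The way a read pair constrains the distance between two contigs can be illustrated with a small calculation. The sketch below is only an illustration under simplifying assumptions: the pair has already been oriented so that the read on contig A points toward contig B, the insert size is known exactly, and the function name and parameters are hypothetical rather than part of any tool described here.

```python
def estimate_gap(insert_size, contig_a_len, start_on_a, end_on_b):
    """Estimate the gap between contig A and contig B from one read pair.

    insert_size  -- expected outer distance between the two read ends (bp)
    contig_a_len -- length of contig A (bp)
    start_on_a   -- 0-based start of the read that maps to contig A
    end_on_b     -- 0-based exclusive end of the read that maps to contig B

    A negative result suggests that the two contigs overlap.
    """
    span_on_a = contig_a_len - start_on_a  # bases from the read start to the end of A
    span_on_b = end_on_b                   # bases from the start of B to the read end
    return insert_size - span_on_a - span_on_b


# A 3-kb insert with 1000 bp covered on contig A and 500 bp on contig B
# leaves roughly 1500 bp for the gap between the contigs.
gap = estimate_gap(3000, 10000, 9000, 500)
```

In practice an assembler would combine many pairs and use the insert-size distribution rather than a single value.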
Illumina mate-paired libraries
Illumina mate-paired libraries are constructed from 3- to
5-kb DNA fragments by blunt-end circularization followed by
a secondary fragmentation step (1). A biotin molecule
on the circularization junction is used to enrich for
fragments containing the junction. Still, a typical Illumina
mate-paired library will have fragments that lack the
junction and map as paired-end reads with short inserts.
When sequencing a mate-paired library, Illumina
recommends a read length no longer than 36 bases. Although
short reads are not ideal in de novo assembly of genomes
with a high repeat content or when looking for structural
variations, the 36-bp limit aims to decrease the possibility
that a sequence read will pass through the junction of
the two joined ends of a 3- to 5-kb DNA fragment.
When using standard mapping software like the Illumina
pipeline, such junction reads are discarded, since they
would not align to the reference sequence. To map
junction reads, specifically adapted software like the
Novoalign mate-paired algorithm can be used to detect
junction reads and split the read at the junction.
Junction reads are problematic for de novo assembly
software, where they can reduce the performance of the
assembly. To further reduce the number of junction reads,
Illumina recommends a final library size range selection
of 400-600 bp, which is larger than a typical paired-end
library of 200-300 bp. Increasing the size range of the
library in the mate-paired protocol minimizes the
number of sequence reads that will pass through a
junction.
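The effect of read length and library fragment size on the number of junction reads can be made concrete with a rough model. The sketch below assumes that every library fragment contains exactly one junction, that the junction is uniformly distributed along the fragment, and that one read is taken from each end. These are simplifying assumptions for illustration, not measured properties of the library, and the function name is hypothetical.

```python
def junction_read_fraction(read_len, fragment_len):
    """Approximate fraction of read pairs in which a read reaches the junction.

    Under the uniform-position assumption, the two reads together cover
    2 * read_len of the fragment, so the junction falls inside a read
    with probability 2 * read_len / fragment_len (capped at 1).
    """
    if read_len <= 0 or fragment_len <= 0:
        raise ValueError("read_len and fragment_len must be positive")
    return min(1.0, 2 * read_len / fragment_len)


# Compare the recommended 36-base reads with 100-base reads on a 500-bp fragment.
short_reads = junction_read_fraction(36, 500)
long_reads = junction_read_fraction(100, 500)
```

Under this model, lengthening the reads or shortening the library fragments both raise the expected fraction of junction reads, which is the trade-off described above.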
Roche GS-FLX paired-end libraries
In Roche GS-FLX library preparation, the 3- to 20-kb
DNA fragments are circularized by a Cre recombinase-mediated
recombination event between LoxP sites, which
are added to both ends of the fragment by ligating
circularization adapters (2). The resulting circularized DNA
molecules bear one recombined biotinylated
circularization adapter sequence at the junction site. This LoxP
sequence makes it possible to detect the junction
computationally in paired reads and to split them at the junction
without mapping to a reference sequence.
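Reference-free junction detection of this kind can be sketched as a plain string search. The example below uses the canonical 34-bp loxP site as a stand-in for the recombined adapter; the actual adapter sequence produced by a given protocol may be longer or differ, so the constant is an assumption. Only exact matches are handled, and the helper names are hypothetical.

```python
# Canonical 34-bp loxP site (assumed stand-in for the recombined adapter).
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_at_junction(read, junction=LOXP):
    """Split a read at the first exact junction match on either strand.

    Returns (before, after) with the junction sequence removed,
    or None if the read contains no exact match.
    """
    for site in (junction, revcomp(junction)):
        start = read.find(site)
        if start != -1:
            return read[:start], read[start + len(site):]
    return None


# A read that crosses the junction yields its two genomic flanks.
flanks = split_at_junction("ACGTACGTAC" + LOXP + "TTGGCCAA")
```

A practical tool would also need to tolerate sequencing errors and to recognize a junction that is only partially contained at the end of a read; neither is attempted here.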
Illumina mate-paired libraries using Cre-Lox
To sequence Illumina mate-paired libraries with a read
length >36 bp without running into the problem of a
high percentage of unusable junction reads, we adapted
the Illumina mate-paired protocol to use Cre-Lox
recombination instead of blunt-end circularization. In this way, a
LoxP sequence is incorporated between both joined
ends at the junction site. This sequence allows screening
for junction reads and makes it possible to trim or split
those reads at the junction. We tested this new method by
preparing and sequencing a mate-paired library with an
insert size of 3 kb from Saccharomyces cerevisiae DNA.
We present an analysis of the library quality statistics:
ratio of mate-paired reads versus paired-end reads,
number of junction reads, fragment size statistics, yield
of usable mate-paired bases and library diversity. We
show that all of the read pairs identified as mate-pairs
map to the reference genome with a mean distance of
3 kb. We also present a bioinformatics tool that can be
used to analyze an Illumina Cre-Lox mate-paired data set
and to produce FASTQ files containing categorized
mate-paired, paired-end and LoxP-negative reads, which
are split or trimmed at the junction site to eliminate LoxP
adapter sequences. Finally, we show how the sequencing
data resulting from the library improves a de novo
assembly of the S. cerevisiae genome.
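The three read categories can be illustrated with a simplified rule based on where the LoxP sequence falls in a junction read. The sketch below is not the logic of the tool presented here; it rests on one geometric assumption: the part of a junction read before the LoxP site and the other read of the pair lie on opposite sides of the junction (a mate pair), whereas the part after the LoxP site lies on the same side as the other read (a short-insert paired-end pair). The function name, the labels and the rule of keeping the longer flank are hypothetical.

```python
def categorize_pair(junction_read, mate, lox_start, lox_len=34):
    """Categorize a read pair from the LoxP offset in one of its reads.

    junction_read -- the read that was searched for the LoxP sequence
    mate          -- the other read of the pair
    lox_start     -- 0-based offset of LoxP in junction_read, or None
    lox_len       -- length of the LoxP sequence (34 bp assumed)

    Returns (category, trimmed_read, mate).
    """
    if lox_start is None:
        # No junction seen: the pair cannot be classified without mapping.
        return "loxp_negative", junction_read, mate
    before = junction_read[:lox_start]
    after = junction_read[lox_start + lox_len:]
    if len(before) >= len(after):
        # Flank before LoxP and the mate span the junction.
        return "mate_paired", before, mate
    # Flank after LoxP lies on the same side of the junction as the mate.
    return "paired_end", after, mate


# LoxP near the end of the read: the long leading flank pairs with the mate.
result = categorize_pair("A" * 40 + "N" * 34 + "C" * 10, "GGGG", 40)
```

Keeping only the longer flank is one possible design choice; a tool could instead retain both flanks and emit them to separate output files.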
The Illumina Cre-Lox mate-paired library preparation
protocol presented here is similar to the Illumina Mate
Pair Library v2 Sample Preparation Guide for 2-5 kb
libraries (1). The first part of this protocol was modified
to allow for Cre-Lox recombination instead of blunt-end
circularization. The protocol was also changed to achieve
a higher yield of DNA that can be used in the PCR library
amplification step. Doing so allows for using fewer PCR
cycles, increasing library diversity and reducing PCR bias.
Instead of nebulization o (...truncated)