Sequence verification of synthetic DNA by assembly of sequencing reads
Published online 4 October 2012
Nucleic Acids Research, 2013, Vol. 41, No. 1 e25
doi:10.1093/nar/gks908
Mandy L. Wilson1, Yizhi Cai1, Regina Hanlon1, Samantha Taylor1, Bastien Chevreux2,
João C. Setubal1, Brett M. Tyler3 and Jean Peccoud1,4,*
1Virginia Bioinformatics Institute, Virginia Tech, Washington Street MC 0477, Blacksburg, VA 24061, USA,
2DSM Nutritional Products Ltd., Department for Human Nutrition & Health, P.O. Box 2676, CH-4002 Basel,
Switzerland, 3Center for Genome Research and Biocomputing, 3021 Agriculture and Life Sciences Building,
Oregon State University, Corvallis, OR 97331-7303 and 4ICTAS Center for Systems Biology of Engineered
Tissues, MC 0193 Virginia Tech, Blacksburg, VA 24061, USA
Received July 26, 2012; Revised September 6, 2012; Accepted September 7, 2012
ABSTRACT
Gene synthesis attempts to assemble user-defined
DNA sequences with base-level precision. Verifying
the sequences of construction intermediates and
the final product of a gene synthesis project is a
critical part of the workflow, yet it is the part that has
received the least attention. Sequence validation is
equally important for other kinds of curated clone
collections. Ensuring that the physical sequence of
a clone matches its published sequence is a
common quality control step performed at least
once over the course of a research project.
GenoREAD is a web-based application that breaks
the sequence verification process into two steps:
the assembly of sequencing reads and the alignment
of the resulting contig with a reference
sequence. GenoREAD can determine if a clone
matches its reference sequence. Its sophisticated
reporting features help identify and troubleshoot
problems that arise during the sequence verification
process. GenoREAD has been experimentally
validated on thousands of gene-sized constructs
from an ORFeome project, and on longer sequences
including whole plasmids and synthetic chromosomes.
Comparing GenoREAD results with those
from manual analysis of the sequencing data
demonstrates that GenoREAD tends to be conservative
in its diagnoses. GenoREAD is available at
www.genoread.org.
INTRODUCTION
Gene synthesis (1,2) is the process of manufacturing
user-defined DNA sequences with base-level precision.
The limitations of the chemistries used at different steps
of the process require scientists to verify the physical
sequence of the clones they produce at the different
stages of the assembly process. The rapid development
and commercial success of new high-throughput
sequencing technologies call for a careful analysis of the
technology best suited to meet the sequence verification
needs of gene synthesis operators. Differences in throughput, price structure and access to sequencing resources
should be considered in relation to the gene synthesis
facility's throughput, the nature of the sequences it produces
and other technical and economic constraints. Since the
verification of thousands of 1-kb building blocks is very
different from the verification of a small number of 100-kb
synthetic fragments, different sequencing technologies are
used at different stages of synthetic genomics projects (3).
In this fast-evolving landscape of sequencing technologies,
Sanger sequencing remains the most commonly used
technology for sequence verification (4,5). While more expensive per base than newer sequencing technologies,
Sanger is less expensive per run, making it more relevant
to clone verification than it might be for a traditional genome-sized sequence verification project. Sanger
remains the most cost-effective sequencing technology for
most gene synthesis projects focused on assembling sequences that do not exceed a few kilobases in length.
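The trade-off between per-base and per-run cost can be illustrated with a short calculation. The sketch below is only an illustration of the arithmetic, not a pricing model: the two platforms and every number in it are hypothetical placeholders, and it assumes that each construct is sequenced in its own run or runs, without pooling samples.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """A sequencing platform described by hypothetical run parameters."""
    name: str
    cost_per_run: float  # arbitrary cost units, NOT real prices
    bases_per_run: int   # usable bases produced by one run

    def cost_per_base(self) -> float:
        return self.cost_per_run / self.bases_per_run

    def cost_to_cover(self, bases_needed: int) -> float:
        # Runs are bought whole, so the number of runs is rounded up.
        runs = math.ceil(bases_needed / self.bases_per_run)
        return runs * self.cost_per_run


# Placeholder values (assumptions chosen for illustration only).
sanger_like = Platform("Sanger-like", cost_per_run=5.0, bases_per_run=1_000)
high_throughput = Platform(
    "high-throughput", cost_per_run=1_000.0, bases_per_run=1_000_000_000
)

construct_length = 3_000  # a construct of a few kilobases

for platform in (sanger_like, high_throughput):
    print(
        platform.name,
        platform.cost_per_base(),
        platform.cost_to_cover(construct_length),
    )
```

Under these placeholder values the Sanger-like platform has the higher cost per base but the lower total cost for a single short construct, because the high-throughput run must be paid for in full even when almost none of its capacity is used.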
The need to verify the sequence of clones and plasmids
is not limited to gene synthesis; it also applies to
any plasmid containing inserts with known sequences,
*To whom correspondence should be addressed. Tel: +1 540 231 0403; Fax: +1 540 231 2606; Email:
Present addresses:
Yizhi Cai, Johns Hopkins University School of Medicine, High Throughput Biology Center, Baltimore, MD 21205, USA.
João C. Setubal, Department of Biochemistry, University of São Paulo, São Paulo, SP 05508-000, Brazil.
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
© The Author(s) 2012. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
achieved by developing automated and validated
sequence verification pipelines that can quickly and predictably analyse large collections of sequencing data with
minimal user input. The Joint BioEnergy Institute
Inventory of Composable Elements (JBEI-ICE) is an
open-source software platform for managing collections
of biological parts (11); it includes a feature called
SequenceChecker that visually aligns sequencing data
with the plasmid’s reference sequence in order to
detect discrepancies. SequenceChecker neither
resolves conflicting reads nor determines the
sequence verification status of the clone.
CloneQC is a web-based application (12) developed to
automate the sequence verification of the large number of
clones generated by the Synthetic Yeast 2.0 project
(13,14). CloneQC allows users to upload two archives containing the trace files and the reference sequences. The
sequencing reads are automatically matched with the corresponding reference sequence using BLAST (15). The
forward and reverse reads are then more precisely
aligned with the reference sequence using ClustalW (16).
CloneQC then combines the alignment
results with the quality of the reads to assign one
of several quality statuses to the clone (Pass, Fail,
Check, Fixable). CloneQC was the first tool to propose
a rigorous algorithm for the verification of clones generated
in the context of a large-scale DNA synthesis operation.
Its major limitation is that it cannot handle the verification
of clones longer than the span of two Sanger sequencing
reads, or about 2000 bp.
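The 2000-bp ceiling follows from read length. As a rough sketch, assuming a usable Sanger read of about 1000 bp (the figure implied by the limit above) and a hypothetical minimum overlap between neighbouring reads, the number of reads needed to tile a construct end to end can be estimated as follows. The function name and the overlap parameter are illustrative and are not taken from CloneQC.

```python
import math


def reads_needed(construct_len: int, read_len: int = 1000, min_overlap: int = 0) -> int:
    """Estimate how many reads are needed to tile a construct end to end.

    read_len and min_overlap are assumptions for illustration; real usable
    read lengths vary with trace quality.
    """
    if construct_len <= 0:
        raise ValueError("construct_len must be positive")
    if not 0 <= min_overlap < read_len:
        raise ValueError("min_overlap must be in [0, read_len)")
    if construct_len <= read_len:
        return 1
    # The first read covers read_len bases; every further read adds only
    # the bases that do not overlap with its neighbour.
    new_bases_per_read = read_len - min_overlap
    return 1 + math.ceil((construct_len - read_len) / new_bases_per_read)


print(reads_needed(2_000))                    # two end-to-end reads
print(reads_needed(96_000, min_overlap=100))  # a 96-kb construct
```

With these assumptions, 2000 bp is the longest clone that two reads can cover. Anything longer needs additional overlapping reads, which must be combined before they can be compared with the reference sequence.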
In this article, we describe GenoREAD, a new sequence
verification application that breaks down the analysis
process into two distinct steps: the assembly of the
sequencing reads into a contig, and the alignment of the
contig with the reference sequence. This approach allows
GenoREAD to verify the sequence of short and long
genetic constructs. The application workflow has been
used on thousands of gene-sized constructs, as well as
longer sequences, such as the complete sequences of
plasmids and a 96-kb synthetic chromosome. GenoREAD
provides sophisticated reporting capabilities that can help
users uncover vario (...truncated)