Sequence verification of synthetic DNA by assembly of sequencing reads

Article PDF: https://nar.oxfordjournals.org/content/41/1/e25.full.pdf

Published online 4 October 2012
Nucleic Acids Research, 2013, Vol. 41, No. 1 e25
doi:10.1093/nar/gks908

Mandy L. Wilson1, Yizhi Cai1, Regina Hanlon1, Samantha Taylor1, Bastien Chevreux2, João C. Setubal1, Brett M. Tyler3 and Jean Peccoud1,4,*

1 Virginia Bioinformatics Institute, Virginia Tech, Washington Street MC 0477, Blacksburg, VA 24061, USA
2 DSM Nutritional Products Ltd., Department for Human Nutrition & Health, P.O. Box 2676, CH-4002 Basel, Switzerland
3 Center for Genome Research and Biocomputing, 3021 Agriculture and Life Sciences Building, Oregon State University, Corvallis, OR 97331-7303, USA
4 ICTAS Center for Systems Biology of Engineered Tissues, MC 0193 Virginia Tech, Blacksburg, VA 24061, USA

Received July 26, 2012; Revised September 6, 2012; Accepted September 7, 2012

ABSTRACT

Gene synthesis attempts to assemble user-defined DNA sequences with base-level precision. Verifying the sequences of construction intermediates and the final product of a gene synthesis project is a critical part of the workflow, yet one that has received the least attention. Sequence validation is equally important for other kinds of curated clone collections. Ensuring that the physical sequence of a clone matches its published sequence is a common quality control step performed at least once over the course of a research project. GenoREAD is a web-based application that breaks the sequence verification process into two steps: the assembly of sequencing reads and the alignment of the resulting contig with a reference sequence. GenoREAD can determine if a clone matches its reference sequence. Its sophisticated reporting features help identify and troubleshoot problems that arise during the sequence verification process. GenoREAD has been experimentally validated on thousands of gene-sized constructs from an ORFeome project, and on longer sequences including whole plasmids and synthetic chromosomes. Comparing GenoREAD results with those from manual analysis of the sequencing data demonstrates that GenoREAD tends to be conservative in its diagnosis. GenoREAD is available at www.genoread.org.

INTRODUCTION

Gene synthesis (1,2) is the process of manufacturing user-defined DNA sequences with base-level precision. The limitations of the chemistries used at different steps of the process require scientists to verify the physical sequence of the clones they produce at each stage of the assembly process. The rapid development and commercial success of new high-throughput sequencing technologies call for a careful analysis of which technology is best suited to the sequence verification needs of gene synthesis operators. Differences in throughput, price structure and access to sequencing resources should be weighed against the throughput of the gene synthesis facility, the nature of the sequences it produces and other technical and economic constraints. Since the verification of thousands of 1-kb building blocks is very different from the verification of a small number of 100-kb synthetic fragments, different sequencing technologies are used at different stages of synthetic genomics projects (3).

In this fast-evolving landscape of sequencing technologies, Sanger sequencing remains the most commonly used technology for sequence verification (4,5). Although more expensive per base than newer sequencing technologies, Sanger sequencing is less expensive per run, making it more relevant to clone verification than it would be to a traditional genome-sized sequence verification project. Sanger remains the most cost-effective sequencing technology for most gene synthesis projects focused on assembling sequences that do not exceed a few kilobases in length. The need to verify the sequence of clones and plasmids is not limited to gene synthesis; it also applies to any plasmid containing inserts with known sequences,

*To whom correspondence should be addressed.
Tel: +1 540 231 0403; Fax: +1 540 231 2606; Email:

Present addresses:
Yizhi Cai, Johns Hopkins University School of Medicine, High Throughput Biology Center, Baltimore, MD 21205, USA.
João C. Setubal, Department of Biochemistry, University of São Paulo, São Paulo, SP 05508-000, Brazil.

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

© The Author(s) 2012. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted, distribution, and reproduction in any medium, provided the original work is properly cited.

achieved by developing automated and validated sequence verification pipelines that can quickly and predictably analyse large collections of sequencing data with minimal user input.

The Joint BioEnergy Institute Inventory of Composable Elements (JBEI-ICE) is an open-source software platform for managing collections of biological parts (11); it includes a feature called SequenceChecker that visually aligns sequencing data with the plasmid’s reference sequence to detect discrepancies. SequenceChecker neither resolves conflicting reads nor determines the sequence verification status of the clone.

CloneQC is a web-based application (12) developed to automate the sequence verification of the large number of clones generated by the Synthetic Yeast 2.0 project (13,14). CloneQC allows users to upload two archives containing the trace files and the reference sequences. The sequencing reads are automatically matched with the corresponding reference sequence using BLAST (15). The forward and reverse reads are then more precisely aligned with the reference sequence using ClustalW (16).
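
The two-stage matching described for CloneQC (a coarse search that assigns each read to a reference, then a finer alignment of the read against that reference) can be illustrated with a small sketch. This is not CloneQC's code: shared k-mer counting stands in for BLAST, a basic semi-global dynamic-programming alignment stands in for ClustalW, and every function name, scoring value and sequence below is a hypothetical choice made for illustration.

```python
# Toy two-stage read matching. All names, scores and sequences are
# hypothetical; k-mer counting and a simple DP replace BLAST and ClustalW.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k=8):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_reference(read, references, k=8):
    """Stage 1 (stand-in for BLAST): choose the reference sharing most k-mers."""
    read_kmers = kmers(read, k)
    shared = {name: len(read_kmers & kmers(ref, k))
              for name, ref in references.items()}
    return max(shared, key=shared.get)


def fit_alignment_score(read, ref, match=1, mismatch=-1, gap=-2):
    """Stage 2 (stand-in for ClustalW): align the whole read to a window of ref.

    Reference bases before and after the read's footprint cost nothing;
    mismatches and gaps inside the footprint are penalized.
    """
    prev = [0] * (len(ref) + 1)           # the read may start anywhere in ref
    for i in range(1, len(read) + 1):
        curr = [prev[0] + gap]            # read bases aligned to no ref base
        for j in range(1, len(ref) + 1):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            curr.append(max(prev[j - 1] + step,   # match or mismatch
                            prev[j] + gap,        # gap in the reference
                            curr[j - 1] + gap))   # gap in the read
        prev = curr
    return max(prev)                      # the read may end anywhere in ref


def verify_read(read, references, is_reverse=False):
    """Orient the read, assign it to a reference, and score the alignment."""
    oriented = reverse_complement(read) if is_reverse else read
    name = best_reference(oriented, references)
    return name, fit_alignment_score(oriented, references[name])


# Made-up reference sequences and reads, for demonstration only.
references = {
    "cloneA": "ATGGCTAGCTAGGATCCGATCGATCGGCTAAGCTTGGCATGCAAGCTTAA",
    "cloneB": "ATGCCCGGGTTTAAACCCGGGAAATTTCCCGGGTTTAAACCCTGATAG",
}
forward_read = references["cloneA"][2:30]
reverse_read = reverse_complement(references["cloneA"][20:48])

print(verify_read(forward_read, references))
print(verify_read(reverse_read, references, is_reverse=True))
```

With these default scores, a read that matches its reference perfectly scores exactly its own length, so any lower score signals at least one discrepancy inside the read's footprint. A real pipeline would also take base-call quality into account, which this sketch omits.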
CloneQC then considers the alignment results together with the quality of the read to assign one of several quality statuses to the clone (Pass, Fail, Check, Fixable). CloneQC was the first tool to propose a rigorous algorithm for the verification of clones generated in the context of a large-scale DNA synthesis operation. Its major limitation is that it cannot verify clones longer than the span of two Sanger sequencing reads, or about 2000 bp.

In this article, we describe GenoREAD, a new sequence verification application that breaks the analysis process into two distinct steps: the assembly of the sequencing reads into a contig, and the alignment of the contig with the reference sequence. This approach allows GenoREAD to verify the sequence of both short and long genetic constructs. The application workflow has been used on thousands of gene-sized constructs, as well as longer sequences, such as the complete sequences of plasmids and a 96-kb synthetic chromosome. GenoREAD provides sophisticated reporting capabilities that can help users uncover vario (...truncated)