A new approach for detecting low-level mutations in next-generation sequence data (pdf)

Article PDF cannot be displayed. You can download it here:

http://genomebiology.com/content/pdf/gb-2012-13-5-r34.pdf

A new approach for detecting low-level mutations in next-generation sequence data

Li and Stoneking Genome Biology 2012, 13:R34 http://genomebiology.com/2012/13/5/R34 METHOD Open Access A new approach for detecting low-level mutations in next-generation sequence data Mingkun Li* and Mark Stoneking Abstract We propose a new method that incorporates population re-sequencing data, distribution of reads, and strand bias in detecting low-level mutations. The method can accurately identify low-level mutations down to a level of 2.3%, with an average coverage of 500×, and with a false discovery rate of less than 1%. In addition, we also discuss other problems in detecting low-level mutations, including chimeric reads and sample cross-contamination, and provide possible solutions to them. Background Next-generation sequencing (NGS) is now widely used in biological and medical studies. Most re-sequencing studies have the goal of identifying homozygous or heterozygous mutations in diploid genomes (that is, mutations present at 50% or 100% frequency in sequence reads), and use this information to study genome evolution, infer population history, or identify causal genes/ mutations in disease-association studies [1,2]. However, some applications require the identification of low-level mutations (LLMs) that are present at frequencies well below 50% within the population of molecules that is typically sequenced in an NGS study; examples include heteroplasmic mutations in mitochondrial DNA (mtDNA) genomes [3], somatic mutations in tumors [4], or mutations in pooled DNA samples [5]. Challenges in detecting true LLMs come from sequencing error, library contamination, PCR artifacts, and so on. Sequencing error is the most common problem; for instance, the Illumina Genome Analyzer, which is one of the most popular NGS platforms, has an average error rate of 0.01 [6]. Moreover, sequencing error is unevenly distributed along the genome and may be influenced by the sequence context, position on the read, and molecule structure, resulting in sequencing error ‘hot spots’ where the error rate can be ten-fold greater (or more) than the genome average [3,7-10]. Unfortunately, those features resulting in sequencing error hot spots have not been fully characterized, thus * Correspondence: Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, D04103, Leipzig, Germany making it difficult to distinguish sequencing errors from true LLMs [10]. Detecting ‘true’ mutations involves genotype estimation (that is, the mutation frequency is expected to be 0%, 50%, or 100% for diploid data), and methods exist to provide accurate inference at a coverage of around 20× [2,11]. By contrast, even though much higher sequencing depth is typically obtained for NGS studies designed to detect LLMs (often ≥1,000×), the challenge remains to distinguish LLMs from sequencing errors [12]. Recently, several attempts have been made, either by modifying the sequencing library protocol [13,14], or using control data or population data to identify the erroneous base call [15-20]. However, most of these computational methods require some parameters to be set, such as the expected haplotype number, one or more threshold(s) to define the real LLM, and/or which part of the reads to use; hence, these are subjective and can be difficult to implement. We analyzed PhiX 174 and mtDNA sequencing data, and identified sequencing error hot spots, even under a stringent quality filter, that cannot be explained by the sequence context. However, we find that sequencing error is strand-dependent, position-dependent, and the same sequencing error hot spot repeatedly showed up among different individuals. Based on these features, we have developed a new approach to distinguish LLMs from sequencing errors, which makes use of population re-sequencing data to estimate the sequencing error profile, and gives an understandable Phred-like quality score to present the reliability of the minor allele at each position. The workflow for the method is outlined © 2012 Li and Stoneking; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Li and Stoneking Genome Biology 2012, 13:R34 http://genomebiology.com/2012/13/5/R34 in Figure 1. This approach thus provides the investigator the flexibility of applying different discovery strategies, that is, a higher false positive rate with a lower false negative rate, or a lower false positive rate with a higher false negative rate. We apply our approach to simulated data, artificial mixtures, and a dataset of complete mtDNA genome sequences, and we demonstrate the method can accurately identify LLMs down to a level of 2% (with an average coverage of 500×) with a low false discovery rate (< 1%). Our method outperformed other existing software in detecting LLMs, especially at positions where the error allele count is low. Results Sequencing error along the genome Under the quality filter we used (details in Materials and methods), the average genome-wide error rate (minor allele frequency) is 0.0009 for the PhiX174 genome, and 0.00167 for the mtDNA genome. This difference could be caused by heteroplasmy and/or alignment problems with mtDNA. Generally, the sequencing error rate fluctuated along the genome with some striking peaks (drops when converted to Phred quality score; Figure S1 in Additional file 1), with two peaks in the PhiX174 Page 2 of 15 genome corresponding to true ‘polymorphic’ positions (mixture of two different alleles) in this PhiX174 strain (positions 1401,1644). Outliers in the mtDNA genome are positions 309 to 311, 514, and 3,106 to 3,107, which are either caused by alignment problems or true length heteroplasmies. Normally, positions with the highest sequencing error rate cause the most problems in distinguishing LLMs; hence, we retrieved the 30 positions with the highest error rate on the PhiX174 genome to visualize the distribution of error rates along reads, as well as 30 positions with the lowest error rate for comparison (Figure 2). First, an obvious error rate difference was observed between the two strands: positions with high error rates were mostly dominated by reads mapped to one specific strand whereas reads mapped to the other strand showed a normal error rate. Additionally, the error rate also varied among different parts of the reads; although error rate tended to increase when closer to the end, the trend is much weaker on the reads from the lowerror strand (Figure 2). We used WebLogo [21] to identify possible conserved motifs preceding the sequencing error hot spots (Figure S2A in Additional file 1). Although ‘GGT’ was found to Figure 1 Workflow of the pipeline. For each position in the target region, samples having the same consensus nucleotide are used as reference samples (...truncated)