A new approach for detecting low-level mutations in next-generation sequence data
Li and Stoneking Genome Biology 2012, 13:R34
http://genomebiology.com/2012/13/5/R34
METHOD
Open Access
A new approach for detecting low-level
mutations in next-generation sequence data
Mingkun Li* and Mark Stoneking
Abstract
We propose a new method that incorporates population re-sequencing data, distribution of reads, and strand bias
in detecting low-level mutations. The method can accurately identify low-level mutations down to a level of 2.3%,
with an average coverage of 500×, and with a false discovery rate of less than 1%. In addition, we also discuss
other problems in detecting low-level mutations, including chimeric reads and sample cross-contamination, and
provide possible solutions to them.
Background
Next-generation sequencing (NGS) is now widely used
in biological and medical studies. Most re-sequencing
studies have the goal of identifying homozygous or heterozygous mutations in diploid genomes (that is, mutations present at 50% or 100% frequency in sequence
reads), and use this information to study genome evolution, infer population history, or identify causal genes/
mutations in disease-association studies [1,2]. However,
some applications require the identification of low-level
mutations (LLMs) that are present at frequencies well
below 50% within the population of molecules that is
typically sequenced in an NGS study; examples include
heteroplasmic mutations in mitochondrial DNA
(mtDNA) genomes [3], somatic mutations in tumors [4],
or mutations in pooled DNA samples [5].
Challenges in detecting true LLMs come from sequencing error, library contamination, PCR artifacts, and so
on. Sequencing error is the most common problem; for
instance, the Illumina Genome Analyzer, which is one
of the most popular NGS platforms, has an average
error rate of 0.01 [6]. Moreover, sequencing error is
unevenly distributed along the genome and may be
influenced by the sequence context, position on the
read, and molecule structure, resulting in sequencing
error ‘hot spots’ where the error rate can be ten-fold
greater (or more) than the genome average [3,7-10].
Unfortunately, those features resulting in sequencing
error hot spots have not been fully characterized, thus
* Correspondence:
Department of Evolutionary Genetics, Max Planck Institute for Evolutionary
Anthropology, D04103, Leipzig, Germany
making it difficult to distinguish sequencing errors from
true LLMs [10].
Detecting ‘true’ mutations involves genotype estimation (that is, the mutation frequency is expected to be
0%, 50%, or 100% for diploid data), and methods exist
to provide accurate inference at a coverage of around
20× [2,11]. By contrast, even though much higher
sequencing depth is typically obtained for NGS studies
designed to detect LLMs (often ≥1,000×), the challenge
remains to distinguish LLMs from sequencing errors
[12]. Recently, several attempts have been made, either
by modifying the sequencing library protocol [13,14], or
using control data or population data to identify the
erroneous base call [15-20]. However, most of these
computational methods require some parameters to be
set, such as the expected haplotype number, one or
more threshold(s) to define the real LLM, and/or which
part of the reads to use; hence, these are subjective and
can be difficult to implement.
We analyzed PhiX 174 and mtDNA sequencing data,
and identified sequencing error hot spots, even under a
stringent quality filter, that cannot be explained by the
sequence context. However, we find that sequencing
error is strand-dependent, position-dependent, and the
same sequencing error hot spot repeatedly showed up
among different individuals. Based on these features, we
have developed a new approach to distinguish LLMs
from sequencing errors, which makes use of population
re-sequencing data to estimate the sequencing error
profile, and gives an understandable Phred-like quality
score to present the reliability of the minor allele at
each position. The workflow for the method is outlined
© 2012 Li and Stoneking; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
Li and Stoneking Genome Biology 2012, 13:R34
http://genomebiology.com/2012/13/5/R34
in Figure 1. This approach thus provides the investigator
the flexibility of applying different discovery strategies,
that is, a higher false positive rate with a lower false
negative rate, or a lower false positive rate with a higher
false negative rate. We apply our approach to simulated
data, artificial mixtures, and a dataset of complete
mtDNA genome sequences, and we demonstrate the
method can accurately identify LLMs down to a level of
2% (with an average coverage of 500×) with a low false
discovery rate (< 1%). Our method outperformed other
existing software in detecting LLMs, especially at positions where the error allele count is low.
Results
Sequencing error along the genome
Under the quality filter we used (details in Materials and
methods), the average genome-wide error rate (minor
allele frequency) is 0.0009 for the PhiX174 genome, and
0.00167 for the mtDNA genome. This difference could
be caused by heteroplasmy and/or alignment problems
with mtDNA. Generally, the sequencing error rate fluctuated along the genome with some striking peaks
(drops when converted to Phred quality score; Figure S1
in Additional file 1), with two peaks in the PhiX174
Page 2 of 15
genome corresponding to true ‘polymorphic’ positions
(mixture of two different alleles) in this PhiX174 strain
(positions 1401,1644). Outliers in the mtDNA genome
are positions 309 to 311, 514, and 3,106 to 3,107, which
are either caused by alignment problems or true length
heteroplasmies.
Normally, positions with the highest sequencing error
rate cause the most problems in distinguishing LLMs;
hence, we retrieved the 30 positions with the highest
error rate on the PhiX174 genome to visualize the distribution of error rates along reads, as well as 30 positions with the lowest error rate for comparison (Figure
2). First, an obvious error rate difference was observed
between the two strands: positions with high error rates
were mostly dominated by reads mapped to one specific
strand whereas reads mapped to the other strand
showed a normal error rate. Additionally, the error rate
also varied among different parts of the reads; although
error rate tended to increase when closer to the end,
the trend is much weaker on the reads from the lowerror strand (Figure 2).
We used WebLogo [21] to identify possible conserved
motifs preceding the sequencing error hot spots (Figure
S2A in Additional file 1). Although ‘GGT’ was found to
Figure 1 Workflow of the pipeline. For each position in the target region, samples having the same consensus nucleotide are used as
reference samples (...truncated)