An ensemble approach to accurately detect somatic mutations using SomaticSeq
Fang et al. Genome Biology (2015) 16:197
DOI 10.1186/s13059-015-0758-2
SOFTWARE
Open Access
Li Tai Fang1†, Pegah Tootoonchi Afshar2†, Aparna Chhibber1, Marghoob Mohiyuddin1, Yu Fan3,
John C. Mu1, Greg Gibeling1, Sharon Barr1, Narges Bani Asadi1, Mark B. Gerstein4, Daniel C. Koboldt5,
Wenyi Wang3, Wing H. Wong6,7 and Hugo Y.K. Lam1*
Abstract
SomaticSeq is a somatic mutation detection pipeline that implements a stochastic boosting algorithm to
produce highly accurate somatic mutation calls for both single nucleotide variants and small insertions and deletions.
The workflow currently incorporates five state-of-the-art somatic mutation callers, and extracts over 70 individual
genomic and sequencing features for each candidate site. A training set is provided to an adaptively boosted decision
tree learner to create a classifier for predicting mutation statuses. We validate our results with both synthetic and real
data. We report that SomaticSeq is able to achieve better overall accuracy than any individual tool incorporated.
Background
Cancers are diseases of the genome. Somatic single
nucleotide variants (SNVs) and small insertions and deletions (indels) are common drivers of carcinogenesis.
Therefore, accurately detecting somatic mutations is a
key analysis in cancer research. The challenge and complexity of cancer sequencing analysis lie in the heterogeneous nature of tumor samples, in addition to the
cross-contamination between tumor and matched normal
samples.
A somatic tool that performs well for one tumor may
perform poorly for another, as reported in a number
of comparative studies [1, 2]. For instance, MuTect is a
somatic SNV caller that applies a Bayesian classifier to
detect somatic mutations [3]. It is sensitive in detecting low variant allele frequency (VAF) somatic variants.
It also incorporates a series of filters to penalize candidate variants that have characteristics corresponding
to sequencing artifacts to increase precision. However,
MuTect applies severe penalties to somatic variant candidates if the variant reads are also found in the matched
normal. While this approach filters out most germline
variant false positives, it adversely affects sensitivity in
some cancer types where it is not possible to obtain a clean
normal sample, e.g., liquid cancers.
*Correspondence:
† Equal contributors
1 Bina Technologies, Roche Sequencing, Redwood City, CA 94065, USA
SomaticSniper was developed with the aforementioned
issue in mind [4]. It applies a Bayesian model to detect
genotype change between the normal and tumor tissues, taking into account the prior probability of somatic
mutation. Thus, it is far more tolerant of impure normal
samples, at the expense of calling many more germline variants as somatic. It is also less sensitive toward low VAF
mutations. Another Bayesian approach is JointSNVMix2,
which jointly analyses paired tumor–normal digital allelic
count data [5]. It has very high sensitivity in many different
settings, but tends to be lower in precision.
A different statistical approach is to use Fisher’s exact
test (FET) to detect genotype change, as VarScan2
and VarDict do. VarScan2 reads data from both tumor and
normal samples simultaneously and classifies sequence
variants by somatic status [6]. At high enough sequencing
depth, even a slight change in VAF between the normal
and tumor may result in statistical significance by FET,
thus calling many germline variants as somatic mutations.
On the other hand, VarScan2 does not rely on situation-specific filters,
which may not apply appropriately in all situations, so it will not miss
clear mutations because of them. VarDict is specifically designed to
detect important but challenging variants that tend to be
missed or ignored by other callers. It applies a series of
false positive filters to increase precision [7]. It can handle
ultra-deep sequencing with depth up to hundreds of thousands, where most algorithms would either fail or perform
poorly.
Full list of author information is available at the end of the article
© 2015 Fang et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any
medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons
license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
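The depth dependence of Fisher’s exact test described for FET-based callers can be illustrated with a small sketch. This is not the implementation used by VarScan2 or VarDict; the function name `fisher_exact_two_sided` and all read counts are assumptions chosen for illustration. The same tumor and normal VAFs (10 % versus 5 %) are tested at 40× and at 4000× depth.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Rows: tumor, normal. Columns: variant reads, reference reads.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def weight(x):
        # Number of ways to place x of the col1 variant reads in the tumor row
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one
    extreme = sum(w for w in (weight(x) for x in range(lo, hi + 1)) if w <= observed)
    return extreme / total_tables


# Hypothetical read counts: tumor VAF 10 %, normal VAF 5 % in both cases
p_low_depth = fisher_exact_two_sided(4, 36, 2, 38)           # 40x per sample
p_high_depth = fisher_exact_two_sided(400, 3600, 200, 3800)  # 4000x per sample

print(f"40x:   p = {p_low_depth:.3g}")
print(f"4000x: p = {p_high_depth:.3g}")
```

At 40× the p-value is about 0.68, whereas at 4000× it falls many orders of magnitude below 0.05 although the VAFs are identical. A germline variant with slightly unbalanced allele counts can therefore appear somatic at high depth.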
Given the unique characteristics of each algorithm,
integrating them is often desirable to ensure mutations are comprehensively captured [8–10]. On the other
hand, combining the false positives from all the different algorithms can easily overwhelm the results. Accurately distinguishing true somatic mutations from the
false positives is thus essential in accurate interpretation.
Simple rule-based filters can often remove the majority of false positives due to sequencing artifacts, using criteria such as
Database of Single-Nucleotide Polymorphisms (dbSNP)
sites, extreme strand bias, nearby homopolymers, low
mapping quality, proximity to end of reads, proximity
to indels, and extremely low or high read depth [4, 6].
However, hard filters also significantly reduce the sensitivity and permanently remove certain mutations from ever
being detected due to their locations within the genome.
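How a hard filter permanently discards a true mutation can be sketched as follows. The `Candidate` fields, the thresholds, and the four example sites are hypothetical and are not taken from any of the callers discussed; the `truly_somatic` flag is known only because this is a toy example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    chrom: str
    pos: int
    in_dbsnp: bool
    mapping_quality: float
    depth: int
    truly_somatic: bool  # ground truth, available only in this toy example


# Hypothetical thresholds, not taken from any caller
MIN_MAPPING_QUALITY = 20.0
MIN_DEPTH, MAX_DEPTH = 8, 1000


def passes_hard_filters(c: Candidate) -> bool:
    """Return True only if the candidate survives every rule."""
    if c.in_dbsnp:
        return False
    if c.mapping_quality < MIN_MAPPING_QUALITY:
        return False
    return MIN_DEPTH <= c.depth <= MAX_DEPTH


candidates = [
    Candidate("chr1", 1_000, False, 60.0, 50, True),     # clean true mutation
    Candidate("chr1", 2_000, True, 60.0, 50, True),      # true mutation at a dbSNP site
    Candidate("chr2", 3_000, False, 5.0, 50, False),     # low mapping quality artifact
    Candidate("chr2", 4_000, False, 60.0, 5000, False),  # extreme depth artifact
]

kept = [c for c in candidates if passes_hard_filters(c)]
lost_true = [c for c in candidates if c.truly_somatic and not passes_hard_filters(c)]

print("kept:", [(c.chrom, c.pos) for c in kept])
print("true mutations lost:", [(c.chrom, c.pos) for c in lost_true])
```

Both artifacts are removed, but so is the true mutation at the dbSNP site. No amount of supporting evidence can recover it, because the rule depends only on its location.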
Previously, Kim et al. built a combined caller using logistic
regression with a feature-weighted linear stacking (FWLS)
model to improve somatic SNV prediction accuracy [11].
The model considers the degree of consensus of three
callers in addition to a series of associated features. It calculates a probability value (0 ≤ P ≤ 1) for each mutation
candidate; however, which cut-off value to choose is not
always obvious. Since the study did not perform somatic
indel analysis, its performance on non-substitution variants is unclear.
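The difficulty of choosing a cut-off for a probability output can be seen in a toy calculation. The probabilities and truth labels below are made up for illustration and do not come from the study by Kim et al.; `precision_recall` is a hypothetical helper.

```python
def precision_recall(probs, labels, cutoff):
    """Precision and recall when candidates with P >= cutoff are called somatic."""
    called = [p >= cutoff for p in probs]
    tp = sum(1 for c, y in zip(called, labels) if c and y)
    fp = sum(1 for c, y in zip(called, labels) if c and not y)
    fn = sum(1 for c, y in zip(called, labels) if not c and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up classifier outputs (0 <= P <= 1) and truth labels (1 = true somatic)
probs = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

for cutoff in (0.85, 0.50, 0.20):
    precision, recall = precision_recall(probs, labels, cutoff)
    print(f"cutoff {cutoff:.2f}: precision {precision:.2f}, recall {recall:.2f}")
```

In this toy set, a strict cut-off of 0.85 gives perfect precision but recovers only two of the five true mutations, while a lenient cut-off of 0.20 recovers all five at the cost of two false positives. Neither extreme is clearly correct, and the best value depends on the application.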
To address these problems, we propose SomaticSeq. It implements a machine-learning algorithm that accurately identifies both somatic SNVs and
indels from tumor–normal pairs. It maximizes its sensitivity by combining SNV calls from the five previously
described algorithms that complement each other, i.e.,
MuTect, SomaticSniper, VarScan2, JointSNVMix2, and
VarDict. It combines somatic indel calls from Indelocator
[12], VarScan2, and VarDict. For each mutation call, we
generate up to 72 features by SAMtools, HaplotypeCaller,
and the callers themselves. We have implemented the
Adaptive Boosting model in R using the ada package [13],
which constructs a classifier consisting of an ensemble of
decision trees from a training set. The classifier is then
applied to a target s (...truncated)