An ensemble approach to accurately detect somatic mutations using SomaticSeq (pdf)

Article PDF cannot be displayed. You can download it here:

http://genomebiology.com/content/pdf/s13059-015-0758-2.pdf

An ensemble approach to accurately detect somatic mutations using SomaticSeq

Fang et al. Genome Biology (2015) 16:197 DOI 10.1186/s13059-015-0758-2 SOFTWAR E Open Access An ensemble approach to accurately detect somatic mutations using SomaticSeq Li Tai Fang1† , Pegah Tootoonchi Afshar2† , Aparna Chhibber1 , Marghoob Mohiyuddin1 , Yu Fan3 , John C. Mu1 , Greg Gibeling1 , Sharon Barr1 , Narges Bani Asadi1 , Mark B. Gerstein4 , Daniel C. Koboldt5 , Wenyi Wang3 , Wing H. Wong6,7 and Hugo Y.K. Lam1* Abstract SomaticSeq is an accurate somatic mutation detection pipeline implementing a stochastic boosting algorithm to produce highly accurate somatic mutation calls for both single nucleotide variants and small insertions and deletions. The workflow currently incorporates five state-of-the-art somatic mutation callers, and extracts over 70 individual genomic and sequencing features for each candidate site. A training set is provided to an adaptively boosted decision tree learner to create a classifier for predicting mutation statuses. We validate our results with both synthetic and real data. We report that SomaticSeq is able to achieve better overall accuracy than any individual tool incorporated. Background Cancers are diseases of the genome. Somatic single nucleotide variants (SNVs) and small insertions and deletions (indels) are common drivers of carcinogenesis. Therefore, accurately detecting somatic mutations is a key analysis in cancer research. The challenge and complexity of cancer sequencing analysis lie in the heterogeneous nature of tumor samples, in addition to the cross-contamination between tumor and matched normal samples. A somatic tool that performs well for one tumor may perform poorly for another, as reported in a number of comparative studies [1, 2]. For instance, MuTect is a somatic SNV caller that applies a Bayesian classifier to detect somatic mutations [3]. It is sensitive in detecting low variant allele frequency (VAF) somatic variants. It also incorporates a series of filters to penalize candidate variants that have characteristics corresponding to sequencing artifacts to increase precision. However, MuTect applies severe penalties to somatic variant candidates if the variant reads are also found in the matched normal. While this approach filters out most germline variant false positives, it adversely affects sensitivity in *Correspondence: † Equal contributors 1 Bina Technologies, Roche Sequencing, Redwood City, CA 94065, USA some cancer types where it is not possible to obtain a clean normal sample, e.g., liquid cancers. SomaticSniper was developed with the aforementioned issue in mind [4]. It applies a Bayesian model to detect genotype change between the normal and tumor tissues, taking into account the prior probability of somatic mutation. Thus, it is far more tolerant of impure normal samples at the expense of calling a lot more germline variants as somatic. It is also less sensitive toward low VAF mutations. Another Bayesian approach is JointSNVMix2, which jointly analyses paired tumor–normal digital allelic count data [5]. It has very high sensitivity in many different settings, but tends to be lower in precision. A different statistical approach is using Fisher’s exact test (FET) to detect genotype change, such as VarScan2 and VarDict. VarScan2 reads data from both tumor and normal samples simultaneously and classifies sequence variants by somatic status [6]. At high enough sequencing depth, even a slight change in VAF between the normal and tumor may result in statistical significance by FET, thus calling many germline variants as somatic mutations. On the other hand, VarScan2 will not miss clear mutations due to situation-specific filters that may not appropriately apply in all situations. VarDict is specifically designed to detect important but challenging variants that tend to be missed or ignored by other callers. It applies a series of false positive filters to increase precision [7]. It can handle Full list of author information is available at the end of the article © 2015 Fang et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons. org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Fang et al. Genome Biology (2015) 16:197 ultra-deep sequencing with depth up to hundreds of thousands, where most algorithms would either fail or perform poorly. Given the unique characteristics of each algorithm, integrating them is often desirable to ensure mutations are comprehensively captured [8–10]. On the other hand, combining the false positives from all the different algorithms can easily overwhelm the results. Accurately distinguishing true somatic mutations from the false positives is thus essential in accurate interpretation. Simple rule-based filters can often remove the majority of false positives due to sequencing artifacts, e.g., Database of Single-Nucleotide Polymorphisms (dbSNP) sites, extreme strand bias, nearby homopolymers, low mapping quality, proximity to end of reads, proximity to indels, and extremely low or high read depth [4, 6]. However, hard filters also significantly reduce the sensitivity and permanently remove certain mutations from ever being detected due to their locations within the genome. Previously, Kim et al. built a combined caller using logistic regression with a feature-weighted linear stacking (FWLS) model to improve somatic SNV prediction accuracy [11]. The model considers the degree of consensus of three callers in addition to a series of associated features. It calculates a probability value (0 ≤ P ≤ 1) for each mutation candidate; however, which cut-off value to choose is not always obvious. Since the study did not perform somatic indel analysis, its performance on non-substitution variants is unclear. To address these aforementioned problems, we propose SomaticSeq. It implements a machine-learning algorithm that accurately identifies both somatic SNVs and indels from tumor–normal pairs. It maximizes its sensitivity by combining SNV calls from the five previously described algorithms that complement each other, i.e., MuTect, SomaticSniper, VarScan2, JointSNVMix2, and VarDict. It combines somatic indel calls from Indelocator Page 2 of 13 [12], VarScan2, and VarDict. For each mutation call, we generate up to 72 features by SAMtools, HaplotypeCaller, and the callers themselves. We have implemented the Adaptive Boosting model in R using the ada package [13], which constructs a classifier consisting of an ensemble of decision trees from a training set. The classifier is then applied to a target s (...truncated)