An Iterative Leave-One-Out Approach to Outlier Detection in RNA-Seq Data

RESEARCH ARTICLE

Nysia I. George1, John F. Bowyer2, Nathaniel M. Crabtree3, Ching-Wei Chang1*

1 Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, FDA, Jefferson, Arkansas, United States of America
2 Division of Neurotoxicology, National Center for Toxicological Research, FDA, Jefferson, Arkansas, United States of America
3 Joint Bioinformatics Graduate Program, University of Arkansas at Little Rock and University of Arkansas for Medical Sciences, Little Rock, Arkansas, United States of America

OPEN ACCESS

Citation: George NI, Bowyer JF, Crabtree NM, Chang C-W (2015) An Iterative Leave-One-Out Approach to Outlier Detection in RNA-Seq Data. PLoS ONE 10(6): e0125224. doi:10.1371/journal.pone.0125224

Academic Editor: Christophe Antoniewski, CNRS UMR7622 & University Paris 6 Pierre-et-Marie-Curie, FRANCE

Received: November 21, 2014
Accepted: March 22, 2015
Published: June 3, 2015

Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Data Availability Statement: All relevant data for the Wang et al. dataset are within the paper and the Supporting Information files. All relevant data for the main dataset (the control group of sample size 16) are in the GEO database (GSE62368).

Funding: The authors received no specific funding for this work.

Competing Interests: The authors have declared that no competing interests exist.

Abstract

The discrete data structure and large sequencing depth of RNA sequencing (RNA-seq) experiments can often generate outlier read counts in one or more RNA samples within a homogeneous group. Thus, how to identify and manage outlier observations in RNA-seq data is an emerging topic of interest. One of the main objectives in these research efforts is to develop statistical methodology that effectively balances the impact of outlier observations and achieves maximal power for statistical testing. To reach that goal, strengthening the accuracy of outlier detection is an important precursor. Current outlier detection algorithms for RNA-seq data are executed within a testing framework and may be sensitive to sparse data and heavy-tailed distributions. Therefore, we propose a univariate algorithm that utilizes a probabilistic approach to measure the deviation between an observation and the distribution generating the remaining data, and we implement it within an iterative leave-one-out design strategy. Analyses of real and simulated RNA-seq data show that the proposed methodology has higher outlier detection rates for both non-normalized and normalized negative binomial distributed data.

Introduction

The rise of RNA sequencing (RNA-seq) as a competing tool for differential expression analysis has launched considerable efforts to develop methods that effectively model and analyze count data produced by RNA-seq experiments. Unlike microarray experiments, which produce continuous probe intensities, RNA-seq measures RNA content through digital expression profiling by counting the number of sequencing reads that map to a particular feature (e.g. exon, gene, or transcript). Given the dynamic range of RNA-seq data and practically no ceiling for quantification, extremely high counts (i.e. outliers) for a given feature are often present in one or more RNA samples within an experimental group. The presence of outliers substantially limits the power of differential testing [1,2]. RNA-seq counts are influenced by a number of decisions that must be made to generate expression data from total RNA.
As a result, outlier read counts may arise from one of many stages, including biological harvesting of RNA, design implementation, and data processing techniques. For example, in animal studies, in order to obtain sufficient levels of RNA to be sequenced, multiple needle punctures might be necessary to acquire enough of the tissue to be sampled from a relatively small body size. In this case, the possibility of collecting a sample from non-target tissues increases, which could potentially affect read counts in a subset of features. On the data preprocessing side, the selected mapping pipeline and library construction also affect read counts. Since outliers may have biological or technical origins, accurately detecting outliers may help a researcher pinpoint their source and ensure data quality.

Normalization is often the initial step to correct for artifacts in measured expression data (see [3] for an overview of different normalization methods). Typically, a scaling normalization method is implemented when the downstream analysis requires count-based statistical analysis. The primary goal of scaling factor normalization is to minimize between-sample variability for invariant genes by adjusting the sequencing depth of each replicate sample. edgeR [4] computes a scaling factor for each sample using the trimmed mean of M-values (log ratio of counts in a sample to counts in a reference sample) [5]. Alternatively, DESeq2 [6] uses the median of the ratio of counts for a sample to the geometric mean of counts over all samples [7]. Despite the advantages of normalization, normalization procedures cannot adjust for all sources of unknown variation as is evidenced by the fact that both edgeR and DESeq2 incorporate outlier detection methods to improve the robustness of differential analysis.
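As an illustration of the median-of-ratios scaling described above, the following is a minimal sketch in Python. It is not the DESeq2 implementation: the function name median_of_ratios_size_factors and the input layout (a list of features, each holding one count per sample) are assumptions made for this example, and skipping features with any zero count is one common convention rather than something stated in the text.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """Return one scaling factor per sample (hypothetical helper).

    counts: list of features, each a list of raw read counts, one per sample.
    """
    n_samples = len(counts[0])

    # Log of the geometric mean of each feature across all samples.
    # Assumption: features with any zero count are skipped, because their
    # geometric mean is zero and the ratio would be undefined.
    kept_rows = []
    log_geo_means = []
    for row in counts:
        if all(c > 0 for c in row):
            kept_rows.append(row)
            log_geo_means.append(sum(math.log(c) for c in row) / n_samples)

    if not kept_rows:
        raise ValueError("no feature has positive counts in every sample")

    # For each sample, take the median (over features) of the ratio
    # count / geometric mean, computed on the log scale for stability.
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(row[j]) - lgm
            for row, lgm in zip(kept_rows, log_geo_means)
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors
```

Dividing each raw count by the factor of its sample puts the samples on a common scale. Because the factor is a median over features, a small number of extreme counts in one sample has little effect on that sample's factor, which is also why scaling alone does not remove such counts.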
To date, only a handful of existing R packages identify count-based outliers in RNA-seq data analysis. Zhou et al. introduced a robust method of down-weighting extreme values that could be used within existing testing frameworks. In their work, an observation with a large Pearson residual from a fitted negative binomial generalized linear model is assigned a smaller Huber weight [8]. The resulting method, denoted herein as edgeR-robust, can be implemented in edgeR. DESeq2 employs Cook’s distance [9] to measure the degree of influence of a single observation on fitted coefficients of a linear model. Although Huber’s estimate presents a robust approach to down-weight deviant expressions, its sensitivity to extreme outliers can potentially hinder accurate outlier detection in skewed data. Cook’s distance uses a regression-based method to identify influential observations, and thus also may not be optimal for sparse or skewed data. edgeR-robust and Cook’s distance are both carried out within their respective testing methodologies. Thus, we consider isolating read count outliers outside of a test-based strategy.

In this work, outlier detection is implemented via a univariate algorithm that is built on two concepts: probabilities associated with the assumed null (...truncated)