An Iterative Leave-One-Out Approach to Outlier Detection in RNA-Seq Data
RESEARCH ARTICLE
Nysia I. George1, John F. Bowyer2, Nathaniel M. Crabtree3, Ching-Wei Chang1*
1 Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, FDA, Jefferson,
Arkansas, United States of America, 2 Division of Neurotoxicology, National Center for Toxicological
Research, FDA, Jefferson, Arkansas, United States of America, 3 Joint Bioinformatics Graduate Program,
University of Arkansas at Little Rock and University of Arkansas for Medical Sciences, Little Rock, Arkansas,
United States of America
OPEN ACCESS
Citation: George NI, Bowyer JF, Crabtree NM, Chang C-W (2015) An Iterative Leave-One-Out Approach to Outlier Detection in RNA-Seq Data. PLoS ONE 10(6): e0125224. doi:10.1371/journal.pone.0125224
Academic Editor: Christophe Antoniewski, CNRS UMR7622 & University Paris 6 Pierre-et-Marie-Curie, FRANCE
Received: November 21, 2014
Accepted: March 22, 2015
Abstract
The discrete data structure and large sequencing depth of RNA sequencing (RNA-seq) experiments can often generate outlier read counts in one or more RNA samples within a homogeneous group. Thus, how to identify and manage outlier observations in RNA-seq data
is an emerging topic of interest. One of the main objectives in these research efforts is to develop statistical methodology that effectively balances the impact of outlier observations
and achieves maximal power for statistical testing. To reach that goal, strengthening the accuracy of outlier detection is an important precursor. Current outlier detection algorithms for
RNA-seq data are executed within a testing framework and may be sensitive to sparse data
and heavy-tailed distributions. Therefore, we propose a univariate algorithm that uses a
probabilistic approach to measure the deviation between an observation and the distribution
generating the remaining data, and we implement it within an iterative leave-one-out design
strategy. Analyses of real and simulated RNA-seq data show that the proposed methodology has higher outlier detection rates for both non-normalized and normalized negative binomial distributed data.
Published: June 3, 2015
Copyright: This is an open access article, free of all
copyright, and may be freely reproduced, distributed,
transmitted, modified, built upon, or otherwise used
by anyone for any lawful purpose. The work is made
available under the Creative Commons CC0 public
domain dedication.
Data Availability Statement: All relevant data for the
Wang et al. dataset are within the paper and the
Supporting Information files. All relevant data for the
main dataset (the control group of sample size 16) are
in the GEO database (GSE62368).
Funding: The authors received no specific funding
for this work.
Competing Interests: The authors have declared
that no competing interests exist.
Introduction
The rise of RNA sequencing (RNA-seq) as a competing tool for differential expression analysis
has launched considerable efforts to develop methods that effectively model and analyze count
data produced by RNA-seq experiments. Unlike microarray experiments, which produce continuous probe intensities, RNA-seq measures RNA content through digital expression profiling
by counting the number of sequencing reads that map to a particular feature (e.g. exon, gene,
or transcript). Given the dynamic range of RNA-seq data and the practical absence of a ceiling for quantification, extremely high counts (i.e. outliers) for a given feature are often present in one or more
RNA samples within an experimental group. The presence of outliers substantially limits the
power of differential testing [1,2].
RNA-seq counts are influenced by a number of decisions that must be made to generate expression data from total RNA. As a result, outlier read counts may arise from one of many
stages, including biological harvesting of RNA, design implementation, and data processing
techniques. For example, in studies of animals with a relatively small body size, multiple needle punctures might be necessary to collect enough tissue to yield sufficient RNA for sequencing.
In this case, the chance of collecting a sample
from non-target tissues increases, which could affect read counts in a subset of features. On the data preprocessing side, the selected mapping pipeline and library construction
also affect read counts. Because outliers may have biological or technical origins, accurately detecting them may help a researcher pinpoint their source and ensure data quality.
Normalization is often the initial step to correct for artifacts in measured expression data
(see [3] for an overview of different normalization methods). Typically, a scaling normalization
method is implemented when the downstream analysis requires count-based statistical analysis. The primary goal of scaling factor normalization is to minimize between-sample variability
for invariant genes by adjusting the sequencing depth of each replicate sample. edgeR [4] computes a scaling factor for each sample using the trimmed mean of M-values (log ratio of counts
in a sample to counts in a reference sample) [5]. Alternatively, DESeq2 [6] uses the median,
taken over features, of the ratios of each count in a sample to the geometric mean of that feature's counts over all samples [7]. Despite
its advantages, normalization cannot adjust for all sources of unknown variation, as evidenced by the fact that both edgeR and DESeq2 incorporate outlier detection methods to improve the robustness of differential analysis.
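The median-of-ratios scaling factor described above can be illustrated with a minimal sketch. This is not the DESeq2 implementation: the function name median_of_ratios_size_factors and the toy count matrix are hypothetical, and the sketch assumes that features with a zero count in any sample are excluded so that the geometric mean is positive.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Return one scaling factor per sample for a features x samples count matrix.

    Illustrative sketch only; the function name and the zero-handling rule
    are assumptions, not the DESeq2 source code.
    """
    counts = np.asarray(counts, dtype=float)
    # Keep features with a positive count in every sample so the logs are finite.
    expressed = np.all(counts > 0, axis=1)
    if not expressed.any():
        raise ValueError("no feature has a positive count in every sample")
    log_counts = np.log(counts[expressed])
    # Log of the geometric mean of each feature over all samples.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Per sample: median over features of count / geometric mean.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy data: samples 2 and 3 are sequenced 2x and 4x as deeply as sample 1.
counts = np.array([
    [10, 20, 40],
    [100, 200, 400],
    [5, 10, 20],
    [0, 3, 7],      # contains a zero, so it is excluded from the factor estimate
])
size_factors = median_of_ratios_size_factors(counts)
normalized = counts / size_factors  # divide each sample (column) by its factor
print(size_factors)
```

In this toy example the first three features differ between samples only by sequencing depth, so dividing by the scaling factors equalizes their counts across samples. An extreme count in a single feature barely moves the median, which is one reason an outlier can survive normalization.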
To date, only a handful of existing R packages identify count-based outliers in RNA-seq
data analysis. Zhou et al. introduced a robust method of down-weighting extreme values that
could be used within existing testing frameworks. In their work, an observation with a large
Pearson residual from a fitted negative binomial generalized linear model is attributed a smaller
Huber weight [8]. The resulting method, denoted herein as edgeR-robust, can be implemented
in edgeR. DESeq2 employs Cook’s distance [9] to measure the degree of influence of a single observation on the fitted coefficients of a linear model. Although Huber’s estimate presents a robust
approach to down-weight deviant expressions, its sensitivity to extreme outliers can potentially
hinder accurate outlier detection in skewed data. Cook’s distance uses a regression-based method to identify influential observations, and thus also may not be optimal for sparse or
skewed data.
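As a rough illustration of the down-weighting idea, the sketch below computes Pearson residuals under a negative binomial variance function and converts them to Huber weights. It is not the edgeR-robust code. The function names, the fixed mean and dispersion (which would be estimated from a fitted model in practice), and the tuning constant k = 1.345 (a conventional choice for Huber weights) are all assumptions made for the example.

```python
import numpy as np

def nb_pearson_residuals(y, mu, dispersion):
    """Pearson residuals under a negative binomial model with
    variance mu + dispersion * mu**2 (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    variance = mu + dispersion * mu ** 2
    return (y - mu) / np.sqrt(variance)

def huber_weights(residuals, k=1.345):
    """Huber weights: 1 when |r| <= k, and k / |r| otherwise.
    The value of k is an assumed tuning constant."""
    abs_r = np.abs(residuals)
    # np.maximum guards against division by zero for residuals inside [-k, k].
    return np.where(abs_r <= k, 1.0, k / np.maximum(abs_r, k))

# Toy feature: three counts near the assumed mean and one extreme count.
y = np.array([48, 52, 50, 400])
mu = 50.0          # assumed fitted mean, fixed here for illustration
dispersion = 0.1   # assumed dispersion, fixed here for illustration

residuals = nb_pearson_residuals(y, mu, dispersion)
weights = huber_weights(residuals)
print(residuals)
print(weights)
```

Here the three counts near the mean keep full weight, while the extreme count is strongly down-weighted. In a real analysis the mean would be fitted to data that include the extreme count, which pulls the fitted mean upward and shrinks the residual; this is one way an extreme outlier can weaken its own detection.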
edgeR-robust and Cook’s distance are both carried out within their respective testing methodologies. Thus, we consider isolating read count outliers outside of a test-based strategy. In this
work, outlier detection is implemented via a univariate algorithm that is built on two concepts:
probabilities associated with the assumed null (...truncated)