EBSeq-HMM: a Bayesian approach for identifying gene-expression changes in ordered RNA-seq experiments
Bioinformatics, 31(16), 2015, 2614–2622
doi: 10.1093/bioinformatics/btv193
Advance Access Publication Date: 5 April 2015
Original Paper
Sequence analysis
EBSeq-HMM: a Bayesian approach for
identifying gene-expression changes in
ordered RNA-seq experiments
Ning Leng1,2, Yuan Li1, Brian E. McIntosh2, Bao Kim Nguyen2,
Bret Duffin2, Shulan Tian2, James A. Thomson2,3,4, Colin N. Dewey5,
Ron Stewart2 and Christina Kendziorski5,*
1
Department of Statistics, University of Wisconsin, Madison, WI, USA, 2Regenerative Biology, Morgridge Institute
for Research, Madison, WI, USA, 3Department of Cell and Regenerative Biology, University of Wisconsin School of
Medicine and Public Health, Madison, WI, USA, 4Department of Molecular, Cellular, and Developmental Biology,
University of California, Santa Barbara, CA, USA and 5Department of Biostatistics and Medical Informatics,
University of Wisconsin, Madison, WI, USA
*To whom correspondence should be addressed.
Associate Editor: Inanc Birol
Received on October 14, 2014; revised on February 23, 2015; accepted on March 30, 2015
Abstract
Motivation: With improvements in next-generation sequencing technologies and reductions in
price, ordered RNA-seq experiments are becoming common. Of primary interest in these experiments is identifying genes that are changing over time or space, for example, and then characterizing the specific expression changes. A number of robust statistical methods are available to
identify genes showing differential expression among multiple conditions, but most assume conditions are exchangeable and thereby sacrifice power and precision when applied to ordered data.
Results: We propose an empirical Bayes mixture modeling approach called EBSeq-HMM. In
EBSeq-HMM, an auto-regressive hidden Markov model is implemented to accommodate dependence in gene expression across ordered conditions. As demonstrated in simulation and case
studies, the output proves useful in identifying differentially expressed genes and in specifying
gene-specific expression paths. EBSeq-HMM may also be used for inference regarding isoform
expression.
Availability and implementation: An R package containing examples and sample datasets is available at Bioconductor.
Contact:
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
With improvements in next-generation sequencing technologies and
reductions in price, ordered RNA-seq experiments are becoming
common. Of primary interest in these experiments is characterizing
how genes are changing over some factor with ordered levels (for
example, ordered in time, in space, along a gradient, etc).
C The Author 2015. Published by Oxford University Press.
V
For simplicity, we refer to any ordered RNA-seq experiment as a
time-course experiment, noting that other similar designs may be
analyzed within this framework; and we restrict attention to timecourse data collected within a single biological condition.
In a time-course RNA-seq experiment, an investigator may be
interested in genes that are monotonically increasing or decreasing,
2614
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/),
which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact
EBSeq-HMM: Identify expression changes in ordered RNA-seq experiments
To address these considerations, we have developed an empirical
Bayes auto-regressive hidden Markov model (HMM) based approach called EBSeq-HMM. The model extends our previous work,
EBSeq, for identifying DE genes and isoforms across two or more
biological conditions (Leng et al., 2013). As detailed in Methods, an
auto-regressive process describes changes in expression over time,
and a hidden Markov component is used to accommodate dependence. EBSeq-HMM allows users to identify genes with non-constant
expression over multiple ordered conditions, and simultaneously
classify them into expression paths. Results from a simulation study,
detailed in Section 3.1, suggest that EBSeq-HMM has increased
power over competing approaches for identifying genes following
non-constant paths, especially for those genes showing subtle yet
consistent changes over time. EBSeq-HMM also provides improved
accuracy in classifying genes into expression paths. Similar results
are demonstrated in a case study of the adult mouse limb presented
in Section 3.2.
2 Methods
2.1 EBSeq-HMM: an empirical Bayes auto-regressive
Hidden Markov model
EBSeq-HMM requires estimates of gene or isoform expression collected over three or more ordered levels of a factor. The general
model is presented for gene-level analysis; the isoform-level model is
discussed in Section 2.3. To simplify the presentation, we refer to
ordered levels as time points denoted by t ¼ 1; 2; . . . ; T, noting that
the method directly accommodates other ordered data structures
(e.g. ordered in space, along a gradient, etc.).
Let Xt be a G Nt matrix of expression values for G genes in Nt
samples at time t. The full set of observed expression values is then
denoted by X ¼ ðX1 ; X2 ; . . . ; XT Þ. With a slight abuse of notation,
let Xg denote one row of this matrix containing data for gene g over
time; Xgtn denotes expression values for gene g at time t in sample n.
Of interest are changes in the latent mean expression levels for gene
g: lg1 ; lg2 ; . . . ; lgT . We allow for three possibilities, or states, to describe such changes: Up, Down, EE. If lt1 < lt , we define state SDt
as Up; if lt1 > lt ; SDt is Down and lt1 ¼ lt defines SDt as EE. The
main goals in an ordered RNA-seq experiment—identifying genes
that change over time, and specifying each genes’ expression path—
can be restated as questions about these underlying states. In short,
for each gene g and each transition between t1 and t, we would
like to estimate the probability of each state. A gene is said to follow
a non-constant path if at least one state is not EE. We would also
like to estimate the most likely expression path, which is given by
D3
DT
the configuration of expression states over time (SD2
g ; Sg ; . . . ; Sg ),
noting that the most likely configuration of states need not equal the
collection of states that define SDt
g marginally at each t (an example
is provided in Section 3.1).
To make inference regarding these states, we propose a
model for the set of expression measurements taken on a gene g.
We make the common and well-supported assumption that gene
expression in an RNA-seq experiment is well described by a NB distribution (Anders and Huber, 2010; Hardcastle and Kelly, 2010;
Love et al., 2014; Nueda et al., 2014; Robinson et al., 2010;
Trapnell et al., 2012). Were we to consider time t in isolation,
this implies Xgtn jrgt ; qgt NBðrgt ; qgt Þ where the NB distribution may be parameterized such th (...truncated)