EBSeq-HMM: a Bayesian approach for identifying gene-expression changes in ordered RNA-seq experiments (pdf)

Article PDF cannot be displayed. You can download it here:

https://bioinformatics.oxfordjournals.org/content/31/16/2614.full.pdf

EBSeq-HMM: a Bayesian approach for identifying gene-expression changes in ordered RNA-seq experiments

Bioinformatics, 31(16), 2015, 2614–2622 doi: 10.1093/bioinformatics/btv193 Advance Access Publication Date: 5 April 2015 Original Paper Sequence analysis EBSeq-HMM: a Bayesian approach for identifying gene-expression changes in ordered RNA-seq experiments Ning Leng1,2, Yuan Li1, Brian E. McIntosh2, Bao Kim Nguyen2, Bret Duffin2, Shulan Tian2, James A. Thomson2,3,4, Colin N. Dewey5, Ron Stewart2 and Christina Kendziorski5,* 1 Department of Statistics, University of Wisconsin, Madison, WI, USA, 2Regenerative Biology, Morgridge Institute for Research, Madison, WI, USA, 3Department of Cell and Regenerative Biology, University of Wisconsin School of Medicine and Public Health, Madison, WI, USA, 4Department of Molecular, Cellular, and Developmental Biology, University of California, Santa Barbara, CA, USA and 5Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA *To whom correspondence should be addressed. Associate Editor: Inanc Birol Received on October 14, 2014; revised on February 23, 2015; accepted on March 30, 2015 Abstract Motivation: With improvements in next-generation sequencing technologies and reductions in price, ordered RNA-seq experiments are becoming common. Of primary interest in these experiments is identifying genes that are changing over time or space, for example, and then characterizing the specific expression changes. A number of robust statistical methods are available to identify genes showing differential expression among multiple conditions, but most assume conditions are exchangeable and thereby sacrifice power and precision when applied to ordered data. Results: We propose an empirical Bayes mixture modeling approach called EBSeq-HMM. In EBSeq-HMM, an auto-regressive hidden Markov model is implemented to accommodate dependence in gene expression across ordered conditions. As demonstrated in simulation and case studies, the output proves useful in identifying differentially expressed genes and in specifying gene-specific expression paths. EBSeq-HMM may also be used for inference regarding isoform expression. Availability and implementation: An R package containing examples and sample datasets is available at Bioconductor. Contact: Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction With improvements in next-generation sequencing technologies and reductions in price, ordered RNA-seq experiments are becoming common. Of primary interest in these experiments is characterizing how genes are changing over some factor with ordered levels (for example, ordered in time, in space, along a gradient, etc). C The Author 2015. Published by Oxford University Press. V For simplicity, we refer to any ordered RNA-seq experiment as a time-course experiment, noting that other similar designs may be analyzed within this framework; and we restrict attention to timecourse data collected within a single biological condition. In a time-course RNA-seq experiment, an investigator may be interested in genes that are monotonically increasing or decreasing, 2614 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact EBSeq-HMM: Identify expression changes in ordered RNA-seq experiments To address these considerations, we have developed an empirical Bayes auto-regressive hidden Markov model (HMM) based approach called EBSeq-HMM. The model extends our previous work, EBSeq, for identifying DE genes and isoforms across two or more biological conditions (Leng et al., 2013). As detailed in Methods, an auto-regressive process describes changes in expression over time, and a hidden Markov component is used to accommodate dependence. EBSeq-HMM allows users to identify genes with non-constant expression over multiple ordered conditions, and simultaneously classify them into expression paths. Results from a simulation study, detailed in Section 3.1, suggest that EBSeq-HMM has increased power over competing approaches for identifying genes following non-constant paths, especially for those genes showing subtle yet consistent changes over time. EBSeq-HMM also provides improved accuracy in classifying genes into expression paths. Similar results are demonstrated in a case study of the adult mouse limb presented in Section 3.2. 2 Methods 2.1 EBSeq-HMM: an empirical Bayes auto-regressive Hidden Markov model EBSeq-HMM requires estimates of gene or isoform expression collected over three or more ordered levels of a factor. The general model is presented for gene-level analysis; the isoform-level model is discussed in Section 2.3. To simplify the presentation, we refer to ordered levels as time points denoted by t ¼ 1; 2; . . . ; T, noting that the method directly accommodates other ordered data structures (e.g. ordered in space, along a gradient, etc.). Let Xt be a G Nt matrix of expression values for G genes in Nt samples at time t. The full set of observed expression values is then denoted by X ¼ ðX1 ; X2 ; . . . ; XT Þ. With a slight abuse of notation, let Xg denote one row of this matrix containing data for gene g over time; Xgtn denotes expression values for gene g at time t in sample n. Of interest are changes in the latent mean expression levels for gene g: lg1 ; lg2 ; . . . ; lgT . We allow for three possibilities, or states, to describe such changes: Up, Down, EE. If lt1 < lt , we define state SDt as Up; if lt1 > lt ; SDt is Down and lt1 ¼ lt defines SDt as EE. The main goals in an ordered RNA-seq experiment—identifying genes that change over time, and specifying each genes’ expression path— can be restated as questions about these underlying states. In short, for each gene g and each transition between t1 and t, we would like to estimate the probability of each state. A gene is said to follow a non-constant path if at least one state is not EE. We would also like to estimate the most likely expression path, which is given by D3 DT the configuration of expression states over time (SD2 g ; Sg ; . . . ; Sg ), noting that the most likely configuration of states need not equal the collection of states that define SDt g marginally at each t (an example is provided in Section 3.1). To make inference regarding these states, we propose a model for the set of expression measurements taken on a gene g. We make the common and well-supported assumption that gene expression in an RNA-seq experiment is well described by a NB distribution (Anders and Huber, 2010; Hardcastle and Kelly, 2010; Love et al., 2014; Nueda et al., 2014; Robinson et al., 2010; Trapnell et al., 2012). Were we to consider time t in isolation, this implies Xgtn jrgt ; qgt NBðrgt ; qgt Þ where the NB distribution may be parameterized such th (...truncated)