Next maSigPro: updating maSigPro bioconductor package for RNA-seq time series
Mara Jose Nueda
2
Sonia Tarazona
0
1
Ana Conesa
1
Associate Editor: Ivo Hofacker
0
Applied Statistics, Operational Research and Quality Department, Polytechnic University of Valencia
, 46020 Valencia,
Spain
1
Genomics of Gene Expression Laboratory, Prince Felipe Research Centre
, 46012 Valencia,
Spain
2
Statistics and Operational Research Department, University of Alicante
, 03690, Alicante,
Spain
Motivation: The widespread adoption of RNA-seq to quantitatively measure gene expression has increased the scope of sequencing experimental designs to include time-course experiments. maSigPro is an R package specifically suited for the analysis of time-course gene expression data, which was developed originally for microarrays and hence was limited in its application to count data. Results: We have updated maSigPro to support RNA-seq time series analysis by introducing generalized linear models in the algorithm to support the modeling of count data while maintaining the traditional functionalities of the package. We show a good performance of the maSigPro-GLM method in several simulated time-course scenarios and in a real experimental dataset. Availability and implementation: The package is freely available under the LGPL license from the Bioconductor Web site (http:// bioconductor.org). Contact: or The Author 2014. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
-
Received on August 16, 2013; revised on May 7, 2014; accepted on
May 8, 2014
1 INTRODUCTION
The use of RNA-seq for transcriptome profiling as a replacement
for microarrays has triggered the development of statistical
methods to properly deal with the properties of these types of
count-based data. RNA-seq measurement of gene expression is
based on the number of reads mapped to transcripts, which
results in discrete quantities and left-skewed distributions. In
contrast, microarray signals are scanned fluorescence intensities, and
this translates into continuous and nearly normal expression
data. Although normality was typically assumed and linear
models (LMs) were applied to model microarray experiments,
other distributions such as Poisson and Negative Binomial
(NB) capture better the nature of count data. Hence, methods
such as edgeR (Robinson et al., 2010) and DEseq (Anders and
Huber, 2010) updated microarray analysis to RNA-seq by
incorporating appropriate statistical models, whereas other
methodologies were developed specifically for the new technology
(Roberts and Pachter, 2013; Tarazona et al., 2011; Trapnell
et al., 2012). Moreover, sequencing introduces specific biases to
*To whom correspondence should be addressed.
gene expression quantitation and, therefore, dedicated
normalization methods exist for RNA-seq to correct for sequencing
depth, transcript length (Mortazavi et al., 2008), GC content
(Risso et al., 2011) and non-uniform transcript distributions
(Bullard et al., 2010; Robinson and Oshlack, 2010).
The first RNA-seq experiments were still constrained by the
relatively high costs of sequencing in comparison with
microarrays, which restricted experimental designs to casecontrol
studies with low replication. As a consequence, the novel statistical
methods mostly addressed this analysis scenario. As the
technology became more affordable, other types of designs involving
more samples, such as time-course experiments, started to
appear. In a time-course study, the dynamics of gene expression
are evaluated at different time points after induction by a
particular treatment or in relation to development. Statistical
analysis of time-course data implies the identification of genes that
change their expression along time and/or follow a specific
expression pattern. maSigPro is an R package designed for the
analysis of transcriptomics time courses (Conesa et al., 2006).
maSigPro models gene expression by polynomial regression
and identifies expression changes along one or across several
time series by introducing dummy variables in the model. The
method progresses in two regression steps: the first one selects
genes with non-flat profiles and the second step creates best
regression models for each gene to identify specific time or
series-associated changes. The package includes several clustering
algorithms and visualization tools to group and display genes
with the same expression patterns. maSigPro has been applied
in many different biological settings, such as biomedicine
(Hoogerwerf et al., 2008), biotechnology (Levin et al., 2007)
and plant research (Terol et al., 2007) to cite some, and also
has been implemented in several web services (Medina et al.,
2010; Nueda et al., 2010) and used in combination with
multivariate statistics to analyze multifactorial designs (Nueda et al.,
2009) or as batch filtering technique (Nueda et al., 2012).
maSigPro was developed to treat continuous microarray
intensities and applies LMs to model gene expression. In this article,
we describe the update of maSigPro to deal with RNA-seq count
data by incorporating generalized linear models (GLMs;
Dobson, 2002; McCullagh and Nelder, 1989) into the package
and allowing a more flexible choice in the reference family
distribution. We demonstrate the appropriateness of this adaptation
using simulated and real data and compare the method with
edgeR that also accepts time-course designs.
METHODS
Model
Considering the case of a time-course experiment with T time points and
S experimental groups or series (e.g. different treatments, strains, tissues),
maSigPro uses polynomial regression to model the gene expression value
yi at condition i and time ti, and defines S 1 binary variables (zs) to
distinguish between each experimental group and a reference group
(Conesa et al., 2006). For the sake of simplicity and illustration of the
model, we consider here a quadratic regression and an experiment with
two series. The polynomial model of yi is
maSigPro originally supported only LMs, where the response variable is
modeled as a normal distribution. GLMs are a generalization of classical
LMs, which can accommodate a wider class of distributions named as
exponential family, providing great flexibility for modeling different types
of response variables. Normal, Poisson, Binomial, Gamma and NB are
examples of this family of distributions. These family classes have generic
definitions, which imply that a common maximum likelihood method for
estimating the parameters of the model can be applied to all of them.
Although explicit mathematical expressions can be found for estimators,
iterative numerical methods based on the NewtonRaphson are typically
used (Dobson, 2002; McCullagh and Nelder, 1989). In GLMs, hypothesis
testing and the goodness of fit of the model are based on the (...truncated)