Next maSigPro: updating maSigPro bioconductor package for RNA-seq time series (pdf)

Article PDF cannot be displayed. You can download it here:

https://bioinformatics.oxfordjournals.org/content/30/18/2598.full.pdf

Next maSigPro: updating maSigPro bioconductor package for RNA-seq time series

Mara Jose Nueda 2 Sonia Tarazona 0 1 Ana Conesa 1 Associate Editor: Ivo Hofacker 0 Applied Statistics, Operational Research and Quality Department, Polytechnic University of Valencia , 46020 Valencia, Spain 1 Genomics of Gene Expression Laboratory, Prince Felipe Research Centre , 46012 Valencia, Spain 2 Statistics and Operational Research Department, University of Alicante , 03690, Alicante, Spain Motivation: The widespread adoption of RNA-seq to quantitatively measure gene expression has increased the scope of sequencing experimental designs to include time-course experiments. maSigPro is an R package specifically suited for the analysis of time-course gene expression data, which was developed originally for microarrays and hence was limited in its application to count data. Results: We have updated maSigPro to support RNA-seq time series analysis by introducing generalized linear models in the algorithm to support the modeling of count data while maintaining the traditional functionalities of the package. We show a good performance of the maSigPro-GLM method in several simulated time-course scenarios and in a real experimental dataset. Availability and implementation: The package is freely available under the LGPL license from the Bioconductor Web site (http:// bioconductor.org). Contact: or The Author 2014. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. - Received on August 16, 2013; revised on May 7, 2014; accepted on May 8, 2014 1 INTRODUCTION The use of RNA-seq for transcriptome profiling as a replacement for microarrays has triggered the development of statistical methods to properly deal with the properties of these types of count-based data. RNA-seq measurement of gene expression is based on the number of reads mapped to transcripts, which results in discrete quantities and left-skewed distributions. In contrast, microarray signals are scanned fluorescence intensities, and this translates into continuous and nearly normal expression data. Although normality was typically assumed and linear models (LMs) were applied to model microarray experiments, other distributions such as Poisson and Negative Binomial (NB) capture better the nature of count data. Hence, methods such as edgeR (Robinson et al., 2010) and DEseq (Anders and Huber, 2010) updated microarray analysis to RNA-seq by incorporating appropriate statistical models, whereas other methodologies were developed specifically for the new technology (Roberts and Pachter, 2013; Tarazona et al., 2011; Trapnell et al., 2012). Moreover, sequencing introduces specific biases to *To whom correspondence should be addressed. gene expression quantitation and, therefore, dedicated normalization methods exist for RNA-seq to correct for sequencing depth, transcript length (Mortazavi et al., 2008), GC content (Risso et al., 2011) and non-uniform transcript distributions (Bullard et al., 2010; Robinson and Oshlack, 2010). The first RNA-seq experiments were still constrained by the relatively high costs of sequencing in comparison with microarrays, which restricted experimental designs to casecontrol studies with low replication. As a consequence, the novel statistical methods mostly addressed this analysis scenario. As the technology became more affordable, other types of designs involving more samples, such as time-course experiments, started to appear. In a time-course study, the dynamics of gene expression are evaluated at different time points after induction by a particular treatment or in relation to development. Statistical analysis of time-course data implies the identification of genes that change their expression along time and/or follow a specific expression pattern. maSigPro is an R package designed for the analysis of transcriptomics time courses (Conesa et al., 2006). maSigPro models gene expression by polynomial regression and identifies expression changes along one or across several time series by introducing dummy variables in the model. The method progresses in two regression steps: the first one selects genes with non-flat profiles and the second step creates best regression models for each gene to identify specific time or series-associated changes. The package includes several clustering algorithms and visualization tools to group and display genes with the same expression patterns. maSigPro has been applied in many different biological settings, such as biomedicine (Hoogerwerf et al., 2008), biotechnology (Levin et al., 2007) and plant research (Terol et al., 2007) to cite some, and also has been implemented in several web services (Medina et al., 2010; Nueda et al., 2010) and used in combination with multivariate statistics to analyze multifactorial designs (Nueda et al., 2009) or as batch filtering technique (Nueda et al., 2012). maSigPro was developed to treat continuous microarray intensities and applies LMs to model gene expression. In this article, we describe the update of maSigPro to deal with RNA-seq count data by incorporating generalized linear models (GLMs; Dobson, 2002; McCullagh and Nelder, 1989) into the package and allowing a more flexible choice in the reference family distribution. We demonstrate the appropriateness of this adaptation using simulated and real data and compare the method with edgeR that also accepts time-course designs. METHODS Model Considering the case of a time-course experiment with T time points and S experimental groups or series (e.g. different treatments, strains, tissues), maSigPro uses polynomial regression to model the gene expression value yi at condition i and time ti, and defines S 1 binary variables (zs) to distinguish between each experimental group and a reference group (Conesa et al., 2006). For the sake of simplicity and illustration of the model, we consider here a quadratic regression and an experiment with two series. The polynomial model of yi is maSigPro originally supported only LMs, where the response variable is modeled as a normal distribution. GLMs are a generalization of classical LMs, which can accommodate a wider class of distributions named as exponential family, providing great flexibility for modeling different types of response variables. Normal, Poisson, Binomial, Gamma and NB are examples of this family of distributions. These family classes have generic definitions, which imply that a common maximum likelihood method for estimating the parameters of the model can be applied to all of them. Although explicit mathematical expressions can be found for estimators, iterative numerical methods based on the NewtonRaphson are typically used (Dobson, 2002; McCullagh and Nelder, 1989). In GLMs, hypothesis testing and the goodness of fit of the model are based on the (...truncated)