maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments

Article PDF:

https://bioinformatics.oxfordjournals.org/content/22/9/1096.full.pdf

Ana Conesa (2), María José Nueda (1), Alberto Ferrer (0), Manuel Talón (2)

(0) Departamento de Estadística e Investigación Operativa Aplicadas y Calidad, Universidad Politécnica de Valencia, Apartado 46022, Valencia, Spain
(1) Departamento de Estadística e Investigación Operativa, Universidad de Alicante, Apartado 03080, Alicante, Spain
(2) Centro de Genómica, Instituto Valenciano de Investigaciones Agrarias, Apartado Oficial 46113, Moncada, Valencia, Spain

Motivation: Multi-series time-course microarray experiments are useful approaches for exploring biological processes. In this type of experiment, the researcher is frequently interested in studying gene expression changes over time and in evaluating trend differences between the various experimental groups. The large amount of data, the multiplicity of experimental conditions and the dynamic nature of the experiments pose great challenges to data analysis.

Results: In this work, we propose a statistical procedure to identify genes that show different expression profiles across analytical groups in time-course experiments. The method is a two-step regression approach in which the experimental groups are identified by dummy variables. In the first step, a global regression model with all the defined variables is fitted to identify differentially expressed genes. In the second step, a variable selection strategy is applied to study differences between groups and to find statistically significant differences between profiles. The methodology is illustrated on both a real and a simulated microarray dataset.

Availability: The method has been implemented in the statistical language R and is freely available from the Bioconductor contributed packages repository and from http://www.ivia.es/centrogenomica/bioinformatics.htm

© The Author 2006. Published by Oxford University Press. All rights reserved.
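The idea of identifying experimental groups by dummy variables inside a regression on time can be illustrated with a small sketch. This is not the maSigPro implementation (which is an R package); it is a minimal Python illustration under stated assumptions: a single gene, two groups (control and treated), a linear time trend, and made-up expression values. The helper names `design_row`, `solve_linear`, `ols_fit` and `r_squared` are hypothetical, and the significance testing and variable selection described in the abstract are omitted.

```python
# Illustrative sketch only: NOT the maSigPro implementation.
# Assumptions: one gene, two groups (control = 0, treated = 1), linear time trend.

def design_row(t, d):
    """Regressors: intercept, time, group dummy, time-by-group interaction."""
    return [1.0, float(t), float(d), float(t) * float(d)]

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols_fit(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear(xtx, xty)

def r_squared(rows, y, beta):
    """Proportion of variance in y explained by the fitted model."""
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Made-up expression values at time points 1..5 for each group.
times = [1, 2, 3, 4, 5]
control = [1.4, 2.1, 2.4, 3.1, 3.5]
treated = [5.1, 6.9, 9.0, 11.1, 12.9]

rows = [design_row(t, 0) for t in times] + [design_row(t, 1) for t in times]
y = control + treated
beta = ols_fit(rows, y)
print("intercept, time, group, time:group =", [round(b, 3) for b in beta])
print("R^2 =", round(r_squared(rows, y, beta), 3))
```

In this sketch the `group` coefficient measures the baseline shift of the treated series relative to control, and the `time:group` coefficient measures the difference in slope. A procedure of the kind described in the abstract would additionally test whether such coefficients are statistically significant for each gene.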
To whom correspondence should be addressed. The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

1 INTRODUCTION

A general approach in experimental life science research is to monitor the evolution of biological phenomena over a period of time in response to specific stimuli. From a functional genomics point of view, the genome-wide study of temporal variations in gene expression aims to understand the molecular basis that controls biological processes. Microarray technology allows the expression levels of thousands of genes to be monitored simultaneously [see Draghici (2003) for an overview] and is therefore a very useful methodology for analysing gene expression changes over time (microarray time course, MTC). The design of a typical time-course experiment often includes a number of experimental treatments that are monitored through a relatively small (<6) number of time points. The researcher is then interested in detecting biologically meaningful gene expression trends and in spotting differences between the various experimental groups.

Clustering methods, habitually used for the study of gene expression profiles, have been applied to the analysis of time-course data (Spellman et al., 1998; Lukashin et al., 2001). These methods cluster gene expression profiles on the basis of a distance metric and are valuable tools for visualizing these data and for identifying groups of co-regulated genes (Draghici, 2003; Speed, 2003). In some cases, a statistical assessment of cluster significance is provided along with the clustering approach (Kerr and Churchill, 2001; Herrero et al., 2001), but in general these techniques do not offer an adequate framework to assess statistically significant trend differences between conditions. Furthermore, when a large number of genes is present in the dataset, the interpretation of clustering results can be problematic.
Therefore, it seems more convenient to first apply a statistical procedure to identify those genes with significant expression changes and subsequently divide the selected genes into clusters to visualize the results. Traditional statistical methods (t-statistic tests, ANOVA, etc.) have been applied to microarray data to identify differentially expressed genes (Pan, 2002; Kerr et al., 2000; Wolfinger et al., 2001). Refinements of these methods that take into account particular properties of gene expression data are now available. Some popular examples are SAM (Significance Analysis of Microarrays, Tusher et al., 2001) and LIMMA (Linear Models for Microarray Data, Smyth, 2004). These methods, although powerful and easy to use, focus mainly on pairwise comparisons, and their application to microarray time courses, especially when multiple series are present, might be tedious and ineffective at capturing the dynamic nature of this type of data.

The statistical analysis of microarray time-course data has been reviewed by Bar-Joseph (2004). A large number of currently available methods are devoted to the identification and clustering of gene expression patterns, and to the deciphering of gene regulatory networks (references in Bar-Joseph, 2004; Peddada et al., 2003; Luan and Li, 2003; Liu et al., 2005; Ernst et al., 2005 and Beal et al., 2005). However, few methodologies address the problem of finding statistically significant profile differences between experimental groups. Bar-Joseph et al. (2003) obtained a selection of differentially expressed genes between two cell-cycle microarray datasets by computing a difference measure between the continuous representations of the two time series of expression data. This method can be successfully applied to the analysis of long time series (>10 time points), but its adequacy for shorter time-course experiments is not clear (Bar-Joseph, 2004). ANOVA-based models have also been proposed (Park et al., 2003).
ANOVA can easily model multilevel factors and their interactions. However, when analysing models containing quantitative variables or experiments with unbalanced designs, traditional ANOVA procedures are not appropriate and specific modifications have to be incorporated. Regression approaches appear to be a more straightforward and flexible solution for the analysis of this type of data. Regression methods treat time as a quantitative variable, so not only can differentially expressed genes be detected, but changes in trends can also be discovered and their magnitude studied by analysing the coefficients of the model. A regression model approach was used by Xu et al. (2002) to identify differential gene profiles in an inducible transgenic model. Their method introduced specific variables in the regression to capture particular properties of the data under study. This tailor-made approach can be very useful for evaluating specific gene expression behaviours, but it implies redefining the variables for other biological systems. In this work we propose a general regression-based approach for the analysis of single or multip (...truncated)