maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments

BIOINFORMATICS ORIGINAL PAPER
Vol. 22 no. 9 2006, pages 1096–1102
doi:10.1093/bioinformatics/btl056

Gene expression

Ana Conesa1,†,*, María José Nueda2,†, Alberto Ferrer3 and Manuel Talón1

Received on November 9, 2005; revised on February 1, 2006; accepted on February 10, 2006
Advance Access publication February 15, 2006
Associate Editor: David Rocke

ABSTRACT

Motivation: Multi-series time-course microarray experiments are useful approaches for exploring biological processes. In this type of experiment, the researcher is frequently interested in studying gene expression changes over time and in evaluating trend differences between the various experimental groups. The large amount of data, the multiplicity of experimental conditions and the dynamic nature of the experiments pose great challenges to data analysis.

Results: In this work, we propose a statistical procedure to identify genes that show different gene expression profiles across analytical groups in time-course experiments. The method is a two-step regression approach in which the experimental groups are identified by dummy variables. The procedure first fits a global regression model with all the defined variables to identify differentially expressed genes; second, a variable selection strategy is applied to study differences between groups and to find statistically significant different profiles. The methodology is illustrated on both a real and a simulated microarray dataset.

Availability: The method has been implemented in the statistical language R and is freely available from the Bioconductor contributed packages repository and from http://www.ivia.es/centrogenomica/bioinformatics.htm

1 INTRODUCTION

A general approach in experimental life science research is to monitor the evolution over a period of time of biological phenomena as a response to specific stimuli.
From a functional genomics point of view, the genome-wide study of temporal variations in gene expression aims to understand the molecular basis that controls biological processes. Microarray technology allows the expression levels of thousands of genes to be monitored simultaneously [see Draghici (2003) for an overview] and is therefore a very useful methodology to address the analysis of gene expression changes over time (microarray time course, MTC). The design of a typical time-course experiment often includes a number of experimental treatments that are monitored through a relatively small (<6) number of time points. The researcher is then interested in detecting biologically meaningful gene expression trends and in spotting differences between the various experimental groups.

*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Clustering methods, habitually used for the study of gene expression profiles, have been applied to the analysis of time-course data (Spellman et al., 1998; Lukashin et al., 2001). These methods cluster gene expression profiles on the basis of a distance metric and are valuable tools for the visualization of these data and for identifying groups of co-regulated genes (Draghici, 2003; Speed, 2003). In some cases, a statistical assessment of cluster significance is provided along with the clustering approach (Kerr and Churchill, 2001; Herrero et al., 2001), but in general these techniques do not offer an adequate framework to assess statistically significant trend differences between conditions. Furthermore, when a large number of genes is present in the dataset, the interpretation of clustering results can be problematic.
Therefore, it seems more convenient to first apply a statistical procedure to identify those genes with significant expression changes and subsequently divide the selected genes into clusters to visualize the results. Traditional statistical methods (t-statistic tests, ANOVA, etc.) have been applied to microarray data to identify differentially expressed genes (Pan, 2002; Kerr et al., 2000; Wolfinger et al., 2001). Refinements of these methods that take into account particular properties of gene expression data are now available. Some popular examples are SAM (Significance Analysis of Microarrays, Tusher et al., 2001) and LIMMA (Linear Models for Microarray Data, Smyth, 2004). These methods, although powerful and easy to use, are focused mainly on pairwise comparisons, and their application to microarray time courses, especially when multiple series are present, might be tedious and ineffective at capturing the dynamic nature of this type of data.

The statistical analysis of microarray time-course data has been reviewed by Bar-Joseph (2004). A large number of currently available methods are devoted to the identification and clustering of gene expression patterns, and to the deciphering of gene regulatory networks (references in Bar-Joseph, 2004; Peddada et al., 2003; Luan and Li, 2003; Liu et al., 2005; Ernst et al., 2005; Beal et al., 2005). However, few methodologies can be found that address the problem of finding statistical profile differences between

© The Author 2006. Published by Oxford University Press. All rights reserved.

1Centro de Genómica, Instituto Valenciano de Investigaciones Agrarias, Apartado Oficial 46113, Moncada, Valencia, Spain, 2Departamento de Estadística e Investigación Operativa, Universidad de Alicante, Apartado 03080, Alicante, Spain and 3Departamento de Estadística e Investigación Operativa Aplicadas y Calidad, Universidad Politécnica de Valencia, Apartado 46022, Valencia, Spain

2 METHODS

2.1 Definition of the model

In the problem we are considering there are normally two or more variables of interest. One of them is typically time, which is a quantitative variable (in the type of experiments considered for this approach, time is usually the independent variable; however, the methodology would also accept other continuous experimental variables, such as a quantified physiological parameter). The other variables are usually qualitative (e.g. different treatments, strains, tissues, etc.) and represent the experimental groups for which temporal gene expression differences are sought. For clarity of exposition, only one qualitative variable or factor will be considered here. Let there be $I$ experimental groups described by the qualitative variable evaluated at $J$ time points for each particular condition $ij$ ($i = 1, \ldots, I$ and $j = 1, \ldots, J$). Assume that gene expression is measured for $N$ genes in $R_{ij}$ replicated hybridizations. We define $I - 1$ dummy variables (binary variables) to distinguish between each group and a reference group (Table 1). Let $y_{ijr}$ denote the normalized and transformed expression value from each (...truncated)