maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments
Ana Conesa (2), María José Nueda (1), Alberto Ferrer (0), Manuel Talón (2)
(0) Departamento de Estadística e Investigación Operativa Aplicadas y Calidad, Universidad Politécnica de Valencia, Apartado 46022, Valencia, Spain
(1) Departamento de Estadística e Investigación Operativa, Universidad de Alicante, Apartado 03080, Alicante, Spain
(2) Centro de Genómica, Instituto Valenciano de Investigaciones Agrarias, Apartado Oficial 46113, Moncada, Valencia, Spain
Motivation: Multi-series time-course microarray experiments are useful approaches for exploring biological processes. In this type of experiment, the researcher is frequently interested in studying gene expression changes over time and in evaluating trend differences between the various experimental groups. The large amount of data, the multiplicity of experimental conditions and the dynamic nature of the experiments pose great challenges to data analysis.
Results: In this work, we propose a statistical procedure to identify genes that show different expression profiles across analytical groups in time-course experiments. The method is a two-step regression approach in which the experimental groups are identified by dummy variables. The procedure first fits a global regression model with all the defined variables to identify differentially expressed genes; second, a variable selection strategy is applied to study differences between groups and to find statistically significant profile differences. The methodology is illustrated on both a real and a simulated microarray dataset.
Availability: The method has been implemented in the statistical language R and is freely available from the Bioconductor contributed packages repository and from http://www.ivia.es/centrogenomica/bioinformatics.htm
© The Author 2006. Published by Oxford University Press. All rights reserved.
1 INTRODUCTION
A general approach in experimental life science research is to
monitor the evolution over a period of time of biological phenomena
as a response to specific stimuli. From a functional genomics
point of view, the genome-wide study of temporal variations in
gene expression aims to understand the molecular basis that
controls biological processes. Microarray technology makes it possible to
monitor the expression levels of thousands of genes simultaneously
[see Draghici (2003) for an overview] and is therefore a very
useful methodology to address the analysis of gene expression
changes over time (microarray time course, MTC). The design
of a typical time-course experiment often includes a number
of experimental treatments that are monitored through a
relatively small (<6) number of time points. The researcher is
then interested in detecting biologically meaningful gene expression
trends and in spotting differences between the various experimental
groups.
The authors wish it to be known that, in their opinion, the first two authors
should be regarded as joint First Authors.
Clustering methods, commonly used for the study of gene
expression profiles, have been applied to the analysis of time-course data
(Spellman et al., 1998; Lukashin et al., 2001). These methods
cluster gene expression profiles on the basis of a distance metric
and are valuable tools for the visualization of these data and
for identifying groups of co-regulated genes (Draghici, 2003;
Speed, 2003). In some cases, a statistical assessment of cluster
significance is provided along with the clustering approach (Kerr
and Churchill, 2001; Herrero et al., 2001), but in general these
techniques do not offer an adequate framework to assess statistically
significant trend differences between conditions. Furthermore,
when a large number of genes is present in the dataset, the
interpretation of clustering results can be problematic. It
therefore seems more convenient to first apply a statistical procedure to
identify those genes with significant expression changes and
subsequently divide the selected genes into clusters to visualize
the results.
Traditional statistical methods (t-statistic tests, ANOVA, etc.)
have been applied to microarray data to identify differentially
expressed genes (Pan, 2002; Kerr et al., 2000; Wolfinger et al.,
2001). Refinements of these methods that take into account
particular properties of gene expression data are now available. Some
popular examples are SAM (Significance Analysis of Microarrays,
Tusher et al., 2001) and LIMMA (Linear Models for Microarray
Data, Smyth, 2004). These methods, although powerful and easy
to use, focus mainly on pairwise comparisons, and their
application to microarray time courses, especially when multiple
series are present, can be tedious and ineffective at capturing
the dynamic nature of this type of data.
The statistical analysis of microarray time-course data has been
reviewed by Bar-Joseph (2004). A large number of currently
available methods are devoted to the identification and clustering of gene
expression patterns, and to the deciphering of gene regulatory
networks (references in Bar-Joseph, 2004; Peddada et al., 2003;
Luan and Li, 2003; Liu et al., 2005; Ernst et al., 2005; Beal
et al., 2005). However, few methodologies address the problem
of finding statistically significant profile differences between
experimental groups. Bar-Joseph et al. (2003) obtained a selection
of differentially expressed genes between two cell-cycle
microarray datasets by computing a difference measure between the
continuous representations of the two time series expression
data. This method can be successfully applied to the analysis
of long time series (>10 time points), but its suitability for
shorter time-course experiments is not clear (Bar-Joseph, 2004).
ANOVA-based models have also been proposed (Park et al.,
2003). ANOVA can easily model multilevel factors and their
interactions. However, when analysing models containing
quantitative variables or experiments with unbalanced designs, traditional
ANOVA procedures are not appropriate and specific
modifications have to be incorporated. Regression approaches appear to
be a more straightforward and flexible solution for the analysis
of this type of data. Regression methods treat time as a quantitative
variable, and therefore not only differentially expressed genes
can be detected, but also changes in trends can be discovered
and their magnitude can be studied by analysing the coefficients
of the model. A regression model approach was used by Xu
et al. (2002) to identify differential gene profiles in an inducible
transgenic model. Their method introduced specific variables in the
regression to capture particular properties of the data under study.
This tailor-made approach can be very useful for evaluating
specific gene expression behaviours, but it requires redefining
the variables for other biological systems. In this work we propose
a general regression-based approach for the analysis of single or
multip (...truncated)