maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments
BIOINFORMATICS
ORIGINAL PAPER
Vol. 22 no. 9 2006, pages 1096–1102
doi:10.1093/bioinformatics/btl056
Gene expression
Ana Conesa1,†, María José Nueda2,†, Alberto Ferrer3 and Manuel Talón1
Received on November 9, 2005; revised on February 1, 2006; accepted on February 10, 2006
Advance Access publication February 15, 2006
Associate Editor: David Rocke
ABSTRACT
Motivation: Multi-series time-course microarray experiments are
useful approaches for exploring biological processes. In this type of
experiment, the researcher is frequently interested in studying
gene expression changes over time and in evaluating trend
differences between the various experimental groups. The large
amount of data, the multiplicity of experimental conditions and the
dynamic nature of the experiments pose great challenges to data
analysis.
Results: In this work, we propose a statistical procedure to identify
genes that show different expression profiles across analytical
groups in time-course experiments. The method is a two-step regression
approach in which the experimental groups are identified by dummy
variables. The procedure first fits a global regression model with all
the defined variables to identify differentially expressed genes; second,
a variable selection strategy is applied to study differences
between groups and to find statistically significantly different profiles.
The methodology is illustrated on both a real and a simulated microarray
dataset.
Availability: The method has been implemented in the statistical
language R and is freely available from the Bioconductor contributed
packages repository and from
http://www.ivia.es/centrogenomica/bioinformatics.htm
Contact: ;
1 INTRODUCTION
A general approach in experimental life science research is to
monitor the evolution of biological phenomena over a period of time
as a response to specific stimuli. From a functional genomics
point of view, the genome-wide study of temporal variations in
gene expression aims to understand the molecular basis that
controls biological processes. Microarray technology allows the
expression levels of thousands of genes to be monitored simultaneously
[see Draghici (2003) for an overview] and is therefore a very
useful methodology to address the analysis of gene expression
changes over time (microarray time course, MTC). The design
of a typical time-course experiment often includes a number
of experimental treatments that are monitored through a
relatively small (<6) number of time points. The researcher is
then interested in detecting biologically meaningful gene expression
trends and in spotting differences between the various experimental
groups.
To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors
should be regarded as joint First Authors.
Clustering methods, commonly used for the study of gene expression
profiles, have been applied to the analysis of time-course data
(Spellman et al., 1998; Lukashin et al., 2001). These methods
cluster gene expression profiles on the basis of a distance metric
and are valuable tools for the visualization of these data and
for identifying groups of co-regulated genes (Draghici, 2003;
Speed, 2003). In some cases, a statistical assessment of cluster
significance is provided along with the clustering approach (Kerr
and Churchill, 2001; Herrero et al., 2001), but in general these
techniques do not offer an adequate framework to assess statistically
significant trend differences between conditions. Furthermore,
when a large number of genes is present in the dataset, the
interpretation of clustering results can be problematic. Therefore, it
seems more convenient to first apply a statistical procedure to
identify those genes with significant expression changes and
subsequently divide the gene selection into clusters to visualize
the results.
Traditional statistical methods (t-statistic tests, ANOVA, etc.)
have been applied to microarray data to identify differentially
expressed genes (Pan, 2002; Kerr et al., 2000; Wolfinger et al.,
2001). Refinements of these methods that take into account
particular properties of gene expression data are now available. Some
popular examples are SAM (Significance Analysis of Microarrays,
Tusher et al., 2001) and LIMMA (Linear Models for Microarray
Data, Smyth, 2004). These methods, although powerful and easy
to use, focus mainly on pairwise comparisons, and their
application to microarray time courses, especially when multiple
series are present, can be tedious and ineffective at capturing
the dynamic nature of this type of data.
The statistical analysis of microarray time-course data has been
reviewed by Bar-Joseph (2004). A large number of currently
available methods are devoted to the identification and clustering of gene
expression patterns, and to the deciphering of gene regulatory
networks (references in Bar-Joseph, 2004; Peddada et al., 2003;
Luan and Li, 2003; Liu et al., 2005; Ernst et al., 2005; Beal
et al., 2005). However, few methodologies can be found that
address the problem of finding statistical profile differences between
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email:
1Centro de Genómica, Instituto Valenciano de Investigaciones Agrarias, Apartado Oficial 46113, Moncada,
Valencia, Spain, 2Departamento de Estadística e Investigación Operativa, Universidad de Alicante,
Apartado 03080, Alicante, Spain and 3Departamento de Estadística e Investigación Operativa Aplicadas y Calidad,
Universidad Politécnica de Valencia, Apartado 46022, Valencia, Spain
Analysis of time-course microarray data
2 METHODS
2.1 Definition of the model
In the problem we are considering there are normally two or more variables
of interest. One of them is typically time, which is a quantitative variable
(in the type of experiment considered for this approach, time is usually
the independent variable; however, the methodology would equally accept
other continuous experimental variables, such as a quantified physiological
parameter). The other variables are usually qualitative (e.g. different
treatments, strains, tissues, etc.) and represent the experimental
groups for which temporal gene expression differences are sought. For
clarity of exposition, only one qualitative variable or factor will be
considered here.
Let there be I experimental groups described by the qualitative variable,
evaluated at J time points, for each particular condition ij (i = 1, ..., I and
j = 1, ..., J). Assume that gene expression is measured for N genes in
R_ij replicated hybridizations.
We define I − 1 dummy variables (binary variables) to distinguish
between each group and a reference group (Table 1).
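The dummy-variable coding can be illustrated with a minimal sketch, given here in Python for brevity rather than in the R implementation mentioned in the Abstract. The function name build_dummies, the group labels and the toy design (I = 3 groups, J = 2 time points, one replicate per condition) are assumptions made for illustration only and are not taken from the paper.

```python
def build_dummies(groups, reference):
    """Code a qualitative variable as I - 1 binary (dummy) variables.

    groups    -- group label of each hybridization (hypothetical labels)
    reference -- label of the reference group; its rows are all zeros
    """
    if reference not in groups:
        raise ValueError("reference group does not occur in the data")
    # Non-reference levels, in order of first appearance.
    levels = []
    for g in groups:
        if g != reference and g not in levels:
            levels.append(g)
    # One row per hybridization, one column per non-reference group.
    rows = [[1 if g == level else 0 for level in levels] for g in groups]
    return levels, rows


# Assumed toy design: 3 groups x 2 time points, 1 replicate each.
times = [0, 0, 0, 6, 6, 6]
groups = ["control", "treatA", "treatB"] * 2

levels, dummies = build_dummies(groups, reference="control")
for t, g, d in zip(times, groups, dummies):
    print(t, g, d)
```

Each non-reference group receives its own indicator column, so with I = 3 groups the sketch produces I − 1 = 2 columns, and hybridizations of the reference group are coded with zeros in both.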
Let y_ijr denote the normalized and transformed expression value from
each (...truncated)