Functional assessment of time course microarray data
María José Nueda (2), Patricia Sebastián (1), Sonia Tarazona (0), Francisco García-García (1), Joaquín Dopazo (1, 3, 4), Alberto Ferrer (0), Ana Conesa (1, 0)
0 Department of Applied Statistics and Operations Research, Universidad Politécnica de Valencia, Cno. Vera s/n, Edificio 7A, 46022 Valencia, Spain
1 Bioinformatics and Genomics Department, Centro de Investigaciones Príncipe Felipe, Avda. Autopista Saler 16, 46012 Valencia, Spain
2 Department of Statistics and Operation Research, University of Alicante, Ctra. San Vicente del Raspeig S/N, 03690 Alicante, Spain
3 CIBER de Enfermedades Raras (CIBERER), ISCIII, Spain
4 Functional Genomics Node (INB), Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
Motivation: Time-course microarray experiments study the progress of gene expression over time across one or several experimental conditions. Most available analysis methods focus on clustering or on differential expression analysis of genes and do not integrate functional information. Assessing the functional aspects of time-course transcriptomics data requires approaches that exploit the activation dynamics of the functional categories to which genes are annotated.
Methods: We present three novel methodologies for the functional assessment of time-course microarray data. (i) maSigFun derives from the maSigPro method, a regression-based strategy to model time-dependent expression patterns and identify genes with differences across series. maSigFun fits a regression model for groups of genes labeled by a functional class and selects those categories that have a significant model. (ii) PCA-maSigFun fits a PCA model of each functional class-defined expression matrix to extract orthogonal patterns of expression change, which are then assessed for their fit to a time-dependent regression model. (iii) ASCA-functional uses the ASCA model to rank genes according to their correlation with principal time expression patterns and assesses functional enrichment in a GSA fashion. We used simulated and experimental datasets to study these novel approaches and compared the results with alternative methodologies.
Results: Synthetic and experimental data showed that the different methods capture different, biologically meaningful aspects of the relationship between genes, functions and co-expression. The methods should not be regarded as competing; rather, they provide different insights into the molecular and functional dynamic events taking place within the biological system under study.
from European Molecular Biology Network (EMBnet) Conference 2008: 20th Anniversary Celebration
Martina Franca, Italy. 18-20 September 2008
Background
Microarray time-course experiments have gained
popularity in recent years to address the study of biological
phenomena where the dynamics of gene expression is of
relevance. In contrast to classical control-case studies,
where basically two conditions are compared, time series
experiments encompass investigations of diverse nature
and complexity. Studies may relate to developmental
processes with a large number of sampling points (e.g.
[1]), or to stimuli-response experiments where
transcriptome changes are assessed in a short time span and may
include multiple treatments (e.g. [2]), or may try to
capture cyclic variations of gene expression (e.g. [3]).
Moreover, samples might be destroyed by the sampling process
or be taken from the same individuals along the time
component. As a result, microarray time-course data are
classified as either short (up to 5-6 time points) or
long (from 6-7 time points) series, single (one treatment
or tissue) or multiple (more than one treatment or tissue)
series, and longitudinal or independent, depending on whether
samples are blocked by an individual effect or are unrelated.
As microarray time-course experiments have expanded, a
significant number of statistical methods have been
published to address the analysis of this type of
data. Many of the developed algorithms consider the
clustering of serial data. Proposed strategies include the use of
Gaussian mixed models [4], Bayesian models [5], Hidden
Markov Models [6], B-splines [7,8], and Fourier Series [9]
to model and cluster long series data, while more ad-hoc
algorithms have been developed for short series [10,11].
Another important block of methodologies are those that
pursue the identification of time-associated differentially
expressed genes (d.e.g.'s). In this category we again find the
B-spline approach [7,12], a multivariate adaptation of the
empirical Bayes test [13] designed specifically for longitudinal
data [14], and some ANOVA and regression-based models
for short series [15-18]. Finally, Conesa and co-workers
presented two methods well suited to independent,
multiple series data based either on step-wise regression or
singular component analysis [19,20].
In all of these approaches, statistical analysis focuses on
modeling gene expression and identifying those genes
with a relevant variation pattern. This orientation, though
valid and useful, addresses only one (frequently the first)
requirement for understanding transcriptomic changes in
any kind of microarray experiment. In most cases, the
analysis proceeds through the identification of cellular
processes and functions which are represented by the gene
selection, i.e. genes are identified by their functional role
and the question is then which functional modifications
can be derived from the observed gene regulation.
Functional information is normally incorporated into data
analysis through functional annotation databases that
define and assign function labels to
known genes. The most widely used functional
annotation scheme is the Gene Ontology (GO) [21], which
characterizes genes by their molecular functions (MF), cellular
components (CC) and the biological processes (BP) in which they are involved,
but others such as the KEGG metabolic pathways [22],
transcription factor targets [23] or Interpro functional
motifs [24] can also be employed for specific biological
questions. This functional assessment aspect is
traditionally handled in microarray data analysis via so-called
enrichment analysis: the list of significant genes is
interrogated for over- (and/or under-) abundance of the
considered functional categories, as compared to
the entire genome represented on the array. In time-course microarray
data, this strategy could be similarly followed for the set
of time-dependent differentially expressed genes (for
example, as provided in the time-course module of the
GEPAS suite [25]), or for the distinct clusters into which
this gene selection can be divided (as available in the STEM
package [26]). In fact, gene enrichment
analysis is very often used to validate the results of a gene
selection or a clustering strategy [27,28].
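The over-abundance test at the heart of enrichment analysis can be illustrated with a short sketch. It assumes the commonly used one-sided hypergeometric test; the tools cited above may use a different statistic, and the function names and gene identifiers here are hypothetical, chosen only for illustration.

```python
from math import comb


def enrichment_p_value(n_array, n_category, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_array    -- genes represented on the array (the reference set)
    n_category -- array genes annotated to the functional category
    n_selected -- genes in the significant list
    n_overlap  -- significant genes that are annotated to the category
    """
    max_overlap = min(n_category, n_selected)
    total = comb(n_array, n_selected)
    tail = sum(
        comb(n_category, k) * comb(n_array - n_category, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def category_enrichment(selected_genes, category_genes, array_genes):
    """Convenience wrapper: count the overlaps from gene identifier sets."""
    array_genes = set(array_genes)
    # Only genes present on the array can contribute to either list.
    selected = set(selected_genes) & array_genes
    category = set(category_genes) & array_genes
    return enrichment_p_value(
        len(array_genes), len(category), len(selected), len(selected & category)
    )


if __name__ == "__main__":
    # Hypothetical toy data: 20 genes on the array, 5 in the category,
    # 4 selected as significant, 3 of which belong to the category.
    array = [f"g{i}" for i in range(20)]
    category = array[:5]
    selected = ["g0", "g1", "g2", "g10"]
    print(category_enrichment(selected, category, array))
```

In practice this test is repeated for every functional category, so the resulting p-values would normally be corrected for multiple testing before any category is declared enriched.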
This strategy for the functional evaluation of differential
gene expression has a number of limitations [29]. Firstly,
the functional enrich (...truncated)