Identification of gene expression patterns using planned linear contrasts
Hao Li (2), Constance L Wood (2), Yushu Liu (2), Thomas V Getchell (1, 3), Marilyn L Getchell (0, 3), Arnold J Stromberg (2)
0 Department of Anatomy and Neurobiology, College of Medicine, Lexington, KY 40536-0298, USA
1 Department of Physiology, College of Medicine, Lexington, KY 40536-0298, USA
2 Department of Statistics, University of Kentucky, 817 Patterson Office Tower, Lexington, KY 40536-0027, USA
3 309 Sanders-Brown Center on Aging, University of Kentucky Medical Center, Lexington, KY 40536-0230, USA
Background: In gene networks, the timing of significant changes in the expression level of each gene may be the most critical information in time course expression profiles. With the same timing of the initial change, genes which share similar patterns of expression for any number of sampling intervals from the beginning should be considered co-expressed at certain level(s) in the gene networks. In addition, multiple testing problems are complicated in experiments with multi-level treatments when thousands of genes are involved.
Results: To address these issues, we first performed an ANOVA F test to identify significantly regulated genes. The Benjamini and Hochberg (BH) procedure for controlling the false discovery rate (FDR) at 5% was applied to the P values of the F test. We then categorized the genes with a significant F test into 4 classes based on the timing of their initial responses by sequentially testing a complete set of orthogonal contrasts, the reverse Helmert series. For genes within each class, specific sequences of contrasts were performed to characterize their general 'fluctuation' shapes of expression along the subsequent sampling time points. To be consistent with the BH procedure, each contrast was examined using a stepwise Studentized Maximum Modulus test to control the gene-based maximum family-wise error rate (MFWER) at the level α_new determined by the BH procedure. We demonstrated our method on the analysis of microarray data from murine olfactory sensory epithelia at five different time points after target ablation.
Conclusion: In this manuscript, we used planned linear contrasts to analyze time-course microarray experiments. This analysis allowed us to characterize gene expression patterns based on the temporal order in the data, the timing of a gene's initial response, and the general shapes of gene expression patterns along the subsequent sampling time points. Our method is particularly suitable for analysis of microarray experiments in which it is often difficult to take sufficiently frequent measurements and/or the sampling intervals are non-uniform.
Background
Recent advances in DNA microarray technologies have
made it possible to investigate the transcriptional portion
of gene networks in a variety of organisms. When
microarray experiments are performed to monitor gene
expression over time, researchers can address questions
concerning the detection of the cellular processes
underlying the observed regulatory effects, inference of regulatory
networks and, ultimately, assignment of functions to the
genes analyzed in the time courses.
There is a natural connection between gene function and
gene expression. Based on our understanding of cellular
processes, genes that are contained in a particular
pathway, or respond to a common internal or external
stimulus, should be co-regulated and, consequently, should
show similar patterns of expression. Therefore, identifying
patterns of gene expression and grouping genes into
expression classes may provide much greater insight into
their biological functions. A large group of statistical
methods, generally referred to as "cluster analysis", have
been developed to identify genes that behave similarly
across a range of experimental conditions, including time
courses. These statistical algorithms can be divided into
two classes, depending on whether they are based on
'similarity' measures or not. Methods based on 'similarity'
measures rely on defining a distance (or 'dissimilarity')
between gene expression vectors; Euclidean distance and
the Pearson correlation coefficient are the two most
commonly used measures. Examples of
similarity-based methods are hierarchical clustering
[1], k-means [2], self-organizing maps (SOM) [3,4], and
support vector machines (SVM) [5]. These methods do not
consider the temporal structure of the data when used to
analyze time-course experiments. In addition, some of these
methods can blur the clusters, because the actual
expression patterns of the genes themselves become less
relevant as clusters grow in size [6].
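The point that similarity measures ignore temporal structure can be illustrated with a minimal sketch. The two expression profiles, the permutation, and the helper names `euclidean` and `pearson` below are hypothetical and chosen only for illustration; they are not taken from the data analyzed in this manuscript. Applying the same reordering of the time points to both genes leaves both measures unchanged, so neither measure carries any information about the order in which the samples were taken.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical log-expression profiles of two genes at five time points.
gene_a = [0.0, 1.2, 2.5, 2.4, 1.0]
gene_b = [0.1, 0.2, 1.9, 2.6, 2.2]

# The same arbitrary reordering of the time points, applied to both genes.
perm = [3, 0, 4, 1, 2]
shuffled_a = [gene_a[i] for i in perm]
shuffled_b = [gene_b[i] for i in perm]

d_orig, d_perm = euclidean(gene_a, gene_b), euclidean(shuffled_a, shuffled_b)
r_orig, r_perm = pearson(gene_a, gene_b), pearson(shuffled_a, shuffled_b)

# Both pairs agree up to floating-point rounding.
print("Euclidean distance:", d_orig, d_perm)
print("Pearson correlation:", r_orig, r_perm)
```

Any clustering built solely on such pairwise measures would therefore return the same groups if the time points were shuffled, which is the limitation noted above.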
The clustering methods in the second class are based on
statistical models, without defining a 'similarity' measure.
Using statistical models to represent clusters changes the
question from how close two data points are to how likely
a given data point is under the model. Such clustering
methods are more commonly used to analyze time-course
microarray experiments. Examples of such methods ar (...truncated)