An information theoretic approach for analyzing temporal patterns of gene expression
Jyotsna Kasturi (1), Raj Acharya (1), Murali Ramanathan (0)
(0) Department of Pharmaceutical Sciences, State University of New York at Buffalo, Buffalo, NY 14260-1200, USA
(1) Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802
Motivation: Arrays allow measurements of the expression levels of thousands of mRNAs to be made simultaneously. The resulting data sets are information rich but require extensive mining to enhance their usefulness. Information theoretic methods are capable of assessing similarities and dissimilarities between data distributions and may be suited to the analysis of gene expression experiments. The purpose of this study was to investigate information theoretic data mining approaches to discover temporal patterns of gene expression from array-derived gene expression data.
Results: The Kullback-Leibler divergence, an information-theoretic distance that measures the relative dissimilarity between two data distribution profiles, was used in conjunction with an unsupervised self-organizing map algorithm. Two published, array-derived gene expression data sets were analyzed. The patterns obtained with the KL clustering method were found to be superior to those obtained with the hierarchical clustering algorithm using the Pearson correlation distance measure. The biological significance of the results was also examined.
Availability: Software code is available by request from the authors. All programs were written in ANSI C and Matlab (Mathworks Inc., Natick, MA).
Contact: ;
INTRODUCTION
DNA arrays measure expression levels of thousands of
mRNAs in a single experiment. Because there are
approximately 30 000 genes in the human genome, a limited
number of arrays could make comprehensive expression
profiling feasible in the near future. Although DNA array
experiments are information rich, they require extensive
data mining to identify the patterns that characterize the
underlying mechanisms of action.
The kinetics of gene expression are commonly
examined in many experimental designs to delineate the
temporal sequence of transcriptional events that occur in
response to a given stimulus. The identification of groups
of genes with similar temporal patterns of expression is
usually a critical step in the analysis of kinetic data
because it provides insights into the gene-gene interactions
and thereby facilitates the testing and development of
mechanistic models for the regulation of the underlying
biological processes. Array experiments in cellular
models suggest that certain genes with similar function exhibit
similar temporal patterns of co-regulation (Eisen et al.,
1998; Spellman et al., 1998).
Supervised and unsupervised cluster analysis
techniques with a variety of distance measures and
decision-generating algorithms have been extensively explored
for the analysis of gene array data (Eisen et al., 1998;
Brazma and Vilo, 2000; Sherlock, 2000). The expression
levels of various mRNAs can differ by several orders of
magnitude and the Pearson correlation is widely used as
a distance measure for analyzing the kinetics of gene
expression, since it is capable of identifying visually similar
expression patterns. A new jackknife procedure has been
proposed wherein each observation is sequentially deleted
and the minimum value from the set of correlation values
is used for cluster analysis (Heyer et al., 1999).
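The jackknife procedure described above can be sketched in a few lines. This is a minimal illustration that follows only the one-sentence description given here (delete each observation in turn, then take the minimum correlation); the function names `pearson` and `jackknife_correlation` and the example profiles are assumptions made for illustration and are not taken from Heyer et al. (1999).

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.dot(xc, yc) / np.sqrt(np.dot(xc, xc) * np.dot(yc, yc)))

def jackknife_correlation(x, y):
    """Minimum Pearson correlation over all single-observation deletions."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    values = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i  # drop observation i
        values.append(pearson(x[keep], y[keep]))
    return min(values)

# Hypothetical profiles that agree only because of one shared spike
# at the last time point.
x = [0, 1, 0, 1, 10]
y = [1, 0, 1, 0, 10]
full_r = pearson(x, y)
jack_r = jackknife_correlation(x, y)
print(full_r, jack_r)
```

In this example the ordinary correlation is high because of the single shared spike, whereas the jackknife value drops sharply once that observation is deleted, which is the kind of outlier sensitivity the procedure is meant to guard against.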
Here, we assess the performance of unsupervised
clustering with the Kullback-Leibler (KL) divergence, an
information theoretic measure of the relative dissimilarity
between the shapes of two profiles. We compare this approach
with the more commonly used hierarchical clustering (HC)
algorithm with the Pearson correlation measure. Although the
KL divergence has several interesting properties, it has not
been extensively explored for gene expression data analysis
applications.
SYSTEMS AND METHODS
Relative entropy
The relative entropy or Kullback-Leibler divergence
between two probability mass functions p(x) and q(x) over
the random variable X is defined as (Cover and Thomas,
1991):
$$D(p \| q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$
The KL divergence D(p||q) is a measure of the distance
between two distributions or equivalently, it is the
inefficiency of assuming that the distribution of X is q when
the true distribution is p. The KL divergence always takes
non-negative values, and is zero if and only if p = q. It is
not symmetric and does not satisfy the triangle inequality.
The KL divergence, however, has several important and
useful properties, namely: (i) convergence in the KL sense
implies convergence in the L1 norm sense but no proof is
known for the reverse; (ii) the χ² statistic is twice the first
term in the Taylor expansion of the KL divergence; and
(iii) D(p||q) is convex in the pair (p, q).
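A short numerical sketch may help make these properties concrete. It computes the relative entropy for two small, made-up distributions and checks non-negativity, asymmetry, and the relation to the χ² statistic. The function names, the use of the natural logarithm, and the example values are assumptions chosen for illustration; the sketch also assumes q(x) > 0 wherever p(x) > 0.

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy: sum over x of p(x) * log(p(x) / q(x)), natural log.

    Terms with p(x) = 0 contribute zero. Assumes q(x) > 0 where p(x) > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def chi_square(p, q):
    """Chi-square statistic: sum over x of (p(x) - q(x))^2 / q(x)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum((p - q) ** 2 / q))

# Hypothetical probability mass functions over three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
half_chi2 = 0.5 * chi_square(p, q)

print("D(p||q) =", d_pq)            # non-negative
print("D(q||p) =", d_qp)            # differs from D(p||q): not symmetric
print("chi2 / 2 =", half_chi2)      # close to D(p||q) when p is near q
```

For distributions that are close to each other, half the χ² statistic approximates the KL divergence well; the approximation degrades as p and q move apart because higher-order Taylor terms become significant.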
Kullback-Leibler (KL) clustering
The KL Clustering method is a two-step process where
the data is first normalized and is then classified using
a one-dimensional self-organizing map (SOM) with the
KL divergence as the dissimilarity measure used for
clustering.
Data normalization. The kinetic profile for each gene is
converted to its corresponding probability mass function
by calculating the fractional contribution of the expression
level at each time point to the total of the expression levels
for all time points for that gene. The result is an array
that is suitable for KL clustering because the normalized
expression values for each gene fall in the interval [0, 1]
and each row sum is 1 (unit total probability mass).
These normalized data are used as input to the KL-based
clustering algorithm.
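The normalization step can be sketched as follows, assuming the expression data are held in a genes-by-time-points array of non-negative values in which every gene has a positive total. The function name `normalize_profiles` and the example values are hypothetical.

```python
import numpy as np

def normalize_profiles(expression):
    """Convert each gene's kinetic profile (one row) to a probability mass function.

    Each value is divided by the row total, so entries lie in [0, 1]
    and every row sums to 1.
    """
    expression = np.asarray(expression, dtype=float)
    if np.any(expression < 0):
        raise ValueError("expression levels must be non-negative")
    totals = expression.sum(axis=1, keepdims=True)
    if np.any(totals <= 0):
        raise ValueError("each gene needs a positive total expression")
    return expression / totals

# Two hypothetical genes with the same profile shape but expression
# levels that differ by two orders of magnitude.
raw = [[10.0, 20.0, 30.0, 40.0],
       [1000.0, 2000.0, 3000.0, 4000.0]]
normalized = normalize_profiles(raw)
print(normalized)
```

Both rows normalize to the same distribution, which illustrates why this step lets genes with very different absolute expression levels be compared by the shape of their profiles.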
SOM clustering method. The self-organizing neural
network is an algorithm in which the input probability
distribution is eventually reproduced as closely as possible
from a sequence of inputs (Kohonen, 1989). In response
to the input, the neurons in the network iteratively adjust,
or self-organize the synaptic weights that connect them
to their neighbors to estimate the input distribution. The
SOM requires an initial set of weights, points in
T-dimensional space, where T is the number of time points.
The weights used here are random vectors sampled from
uniform distributions restricted at each time point to the
range of the data.
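The weight initialization described above might look like the following sketch. The function name `init_som_weights`, the `seed` parameter, and the example data are assumptions for illustration. The text does not state whether the initial weights are rescaled to unit sum, so the sketch leaves them as sampled.

```python
import numpy as np

def init_som_weights(data, n_clusters, seed=None):
    """Sample initial cluster centers for a SOM.

    data: array of shape (genes, T) holding normalized profiles.
    Each coordinate of each center is drawn from a uniform distribution
    bounded by the minimum and maximum of the data at that time point.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    low = data.min(axis=0)    # per-time-point lower bound
    high = data.max(axis=0)   # per-time-point upper bound
    return rng.uniform(low, high, size=(n_clusters, data.shape[1]))

# Hypothetical normalized profiles for three genes at four time points.
profiles = np.array([[0.10, 0.20, 0.30, 0.40],
                     [0.40, 0.30, 0.20, 0.10],
                     [0.25, 0.25, 0.25, 0.25]])
weights = init_som_weights(profiles, n_clusters=5, seed=0)
print(weights.shape)
```

Restricting each coordinate to the observed range keeps the starting centers strictly positive for strictly positive data, which avoids undefined logarithms when the KL divergence is later evaluated against them.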
SOM training. A SOM is trained using an iterative
process during which the distance between each gene
profile and the existing cluster centers is computed. We
used the Kohonen training rule (Kohonen, 1995). The
pseudocode below describes the algorithm used.
Initialize the N cluster centers: c1, c2, ..., cN
Repeat steps 1 through 4 until convergence is reached.
For iteration n:
1. Select a gene g from the normalized data.
2. Calculate the distance di from the gene to each cluster center ci:
di = D(g||ci), i = 1, 2, ..., N
3. Identify the cluster k closest to g:
k = arg min_i {di}
4. Update the weights of the kth cluster and its immediate
neighbors using the following learning rule:
a) ck (n + 1) = ck (n) + 1(n)(g c (...truncated)