An information theoretic approach for analyzing temporal patterns of gene expression

Jyotsna Kasturi (1), Raj Acharya (1), Murali Ramanathan (2)

(1) Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802
(2) Department of Pharmaceutical Sciences, State University of New York at Buffalo, Buffalo, NY 14260-1200, USA

Motivation: Arrays allow measurements of the expression levels of thousands of mRNAs to be made simultaneously. The resulting data sets are information rich but require extensive mining to enhance their usefulness. Information theoretic methods are capable of assessing similarities and dissimilarities between data distributions and may be suited to the analysis of gene expression experiments. The purpose of this study was to investigate information theoretic data mining approaches to discover temporal patterns of gene expression from array-derived gene expression data.

Results: The Kullback-Leibler divergence, an information-theoretic distance that measures the relative dissimilarity between two data distribution profiles, was used in conjunction with an unsupervised self-organizing map algorithm. Two published, array-derived gene expression data sets were analyzed. The patterns obtained with the KL clustering method were found to be superior to those obtained with the hierarchical clustering algorithm using the Pearson correlation distance measure. The biological significance of the results was also examined.

Availability: Software code is available by request from the authors. All programs were written in ANSI C and Matlab (Mathworks Inc., Natick, MA).

INTRODUCTION

DNA arrays measure expression levels of thousands of mRNAs in a single experiment. Because there are approximately 30 000 genes in the human genome, a limited number of arrays could make comprehensive expression profiling feasible in the near future. Although DNA array experiments are information rich, they require extensive data mining to identify the patterns that characterize the underlying mechanisms of action.
The kinetics of gene expression are commonly examined in many experimental designs to delineate the temporal sequence of transcriptional events that occur in response to a given stimulus. The identification of groups of genes with similar temporal patterns of expression is usually a critical step in the analysis of kinetic data because it provides insights into gene-gene interactions and thereby facilitates the testing and development of mechanistic models for the regulation of the underlying biological processes. Array experiments in cellular models suggest that certain genes with similar function exhibit similar temporal patterns of co-regulation (Eisen et al., 1998; Spellman et al., 1998).

Supervised and unsupervised cluster analysis techniques with a variety of distance measures and decision-generating algorithms have been extensively explored for the analysis of gene array data (Eisen et al., 1998; Brazma and Vilo, 2000; Sherlock, 2000). The expression levels of various mRNAs can differ by several orders of magnitude, and the Pearson correlation is widely used as a distance measure for analyzing the kinetics of gene expression because it is capable of identifying visually similar expression patterns. A new jackknife procedure has been proposed wherein each observation is sequentially deleted and the minimum value from the set of correlation values is used for cluster analysis (Heyer et al., 1999).

Here, we assess the performance of unsupervised clustering with the Kullback-Leibler (KL) divergence from information theory, which measures the relative dissimilarity of the shapes of the two profiles being compared, as an alternative to the more commonly used hierarchical clustering (HC) algorithm with the Pearson correlation measure. Although the KL divergence has several interesting properties, it has not been extensively explored for gene expression data analysis applications.
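For reference, the jackknife correlation mentioned in the introduction (Heyer et al., 1999) can be sketched as follows. This is a minimal illustration that follows only the one-sentence description given here: delete each time point in turn, recompute the Pearson correlation, and keep the minimum. The function names `pearson` and `jackknife_correlation` and the handling of flat profiles are assumptions made for the example, not details taken from the cited work.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Assumption for this sketch: a flat profile is treated as uncorrelated.
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def jackknife_correlation(x, y):
    """Minimum Pearson correlation over all single-time-point deletions."""
    x, y = list(x), list(y)
    return min(
        pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    )


# Two hypothetical profiles that agree only because of one shared spike.
spiky_a = [1, 2, 1, 2, 10]
spiky_b = [2, 1, 2, 1, 10]
print(pearson(spiky_a, spiky_b))
print(jackknife_correlation(spiky_a, spiky_b))
```

In this made-up example the full correlation is high (about 0.97) because of the single shared spike, while the jackknife value falls to -1 once that time point is deleted, which is the kind of outlier-driven similarity the procedure is meant to discount.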
SYSTEMS AND METHODS

Relative entropy

The relative entropy or Kullback-Leibler divergence between two probability mass functions p(x) and q(x) over the random variable X is defined as (Cover and Thomas, 1991):

$$D(p\|q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$

The KL divergence D(p||q) is a measure of the distance between two distributions or, equivalently, it is the inefficiency of assuming that the distribution of X is q when the true distribution is p. The KL divergence always takes non-negative values, and is zero if and only if p = q. It is not symmetric and does not satisfy the triangle inequality. The KL divergence, however, has several important and useful properties, namely: (i) convergence in the KL sense implies convergence in the L1 norm sense but no proof is known for the reverse; (ii) the χ² statistic is twice the first term in the Taylor expansion of the KL divergence; and (iii) D(p||q) is convex in the pair (p, q).

Kullback-Leibler (KL) clustering

The KL clustering method is a two-step process where the data is first normalized and is then classified using a one-dimensional self-organizing map (SOM) with the KL divergence as the dissimilarity measure used for clustering.

Data normalization

The kinetic profile for each gene is converted to its corresponding probability mass function by calculating the fractional contribution of the expression level at each time point to the total of the expression levels for all time points for that gene. The result is an array that is suitable for KL clustering because the normalized expression values for each gene fall in the interval [0, 1] and each row sum is 1 (unit total probability mass). These normalized data are used as input to the KL-based clustering algorithm.

SOM clustering method

The self-organizing neural network is an algorithm in which the input probability distribution is eventually reproduced as closely as possible from a sequence of inputs (Kohonen, 1989).
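Before the SOM details continue, the data normalization and the KL divergence defined earlier in this section can be sketched in a few lines. This is an illustrative sketch only, not the authors' ANSI C or Matlab code. The names `normalize_profile` and `kl_divergence`, the use of the natural logarithm, the small floor `eps` that guards against zero entries in q, and the requirement that expression levels be non-negative with a positive total are all assumptions made for the example.

```python
import math


def normalize_profile(profile):
    """Convert one gene's kinetic profile into a probability mass function.

    Each value becomes its fractional contribution to the row total, so the
    result lies in [0, 1] and sums to 1.
    """
    total = sum(profile)
    if total <= 0 or any(v < 0 for v in profile):
        # Assumption: expression levels are non-negative with a positive sum.
        raise ValueError("profile must be non-negative with a positive total")
    return [v / total for v in profile]


def kl_divergence(p, q, eps=1e-12):
    """D(p||q) = sum over x of p(x) * log(p(x) / q(x)), natural log.

    Terms with p(x) = 0 contribute zero. The floor eps on q(x) is an
    assumption of this sketch to avoid division by zero.
    """
    d = 0.0
    for p_x, q_x in zip(p, q):
        if p_x > 0:
            d += p_x * math.log(p_x / max(q_x, eps))
    return d


# Hypothetical expression levels for two genes at four time points.
gene_a = normalize_profile([10.0, 20.0, 30.0, 40.0])
gene_b = normalize_profile([400.0, 300.0, 200.0, 100.0])

print(kl_divergence(gene_a, gene_a))  # identical shapes
print(kl_divergence(gene_a, gene_b))  # rising versus falling profile
print(kl_divergence(gene_b, gene_a))  # in general not equal to the line above
```

Because each profile is divided by its own total, two genes whose expression levels differ by a constant factor map to the same probability mass function, so the divergence compares the shapes of the profiles and not their magnitudes.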
In response to the input, the neurons in the network iteratively adjust, or self-organize, the synaptic weights that connect them to their neighbors to estimate the input distribution. The SOM requires an initial set of weights, points in T-dimensional space, where T is the number of time points. The weights used here are random vectors sampled from uniform distributions restricted at each time point to the range of the data.

SOM training

A SOM is trained using an iterative process during which the distance between each gene profile and the existing cluster centers is computed. We used the Kohonen training rule (Kohonen, 1995). The pseudocode below describes the algorithm used.

Initialize the N cluster centers: c_1, c_2, ..., c_N
Repeat steps 1 through 4 until convergence is reached. For iteration n:
1. Select a gene g from the normalized data.
2. Calculate the distance d_i from the gene to each cluster center c_i:
   d_i = D(g||c_i), i = 1, 2, ..., N
3. Identify the cluster k closest to g:
   k = arg min_i {d_i}
4. Update the weights of the kth cluster and its immediate neighbors using the following learning rule:
   a) ck (n + 1) = ck (n) + 1(n)(g c (...truncated)