A mixture model-based approach to the clustering of microarray expression data (pdf)

Article PDF cannot be displayed. You can download it here:

https://bioinformatics.oxfordjournals.org/content/18/3/413.full.pdf

A mixture model-based approach to the clustering of microarray expression data

G. J. McLachlan 0 R. W. Bean 0 D. Peel 0 0 Department of Mathematics, University of Queensland , Brisbane, Queensland 4072 , Australia Motivation: This paper introduces the software EMMIXGENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets. Availability: EMMIX-GENE is available at http://www. maths.uq.edu.au/gjm/emmix-gene/ Contact: - 1 INTRODUCTION The analysis of gene expression microarray data using clustering techniques has an important role to play in the discovery, validation, and understanding of various classes and subclasses of cancer; see, for example, Eisen et al. (1998), Ben-Dor et al. (1999, 2000), Alon et al. (1999), Golub et al. (1999), Hastie et al. (2000), Moler et al. (2000), Nguyen and Rocke (2001), and Xing and Karp (2001), among others. The clustering algorithm we present here, called EMMIX-GENE, can be applied to the problem of clustering tissue samples on the basis of genes and to the problem of clustering genes on the basis of tissues. For the clustering of genes, the EMMIXGENE software makes use of existing options from the EMMIX program of McLachlan et al. (1999). The tissue space and the gene space are generally of quite different dimensionality (10102 tissues versus 103104 genes). The clustering of the genes on the basis of the tissues is therefore a standard cluster analysis problem that can be effected by using existing software to fit normal mixture models. But unless the genes are assumed to be uncorrelated within a cluster, the clustering of the tissue samples on the basis of all the genes is nonstandard since the dimension of each tissue sample (the number of genes) is so much greater than the number of tissues. This dimensionality problem is handled with the EMMIXGENE approach by fitting mixtures of factor analyzers, which allow for nonzero component-correlations between the genes. Given the very large number of genes in a typical tissue sample, EMMIX-GENE initially considers a reduction in the number of genes to be used in the clustering process. The EMMIX-GENE approach is to be illustrated in the clustering of two well-known data sets in the microarray literature, the colon data analyzed initially in Alon et al. (2000), and the leukaemia data first analyzed in Golub et al. (1999). Before we proceed to present the EMMIX-GENE approach, we shall briefly summarize the normal mixture model and the extensions to mixtures of t distributions and to mixtures of factor analyzers. Finite mixtures of distributions have provided a sound mathematical-based approach to the statistical modelling of a wide variety of random phenomena; see, for example, McLachlan and Peel (2000a). For multivariate data of a continuous nature, where (x; i , i ) denotes the p-variate normal density probability function with mean i and covariance matrix i (i = 1, . . . , g). Here the vector of unknown parameters consists of the mixing proportions i , the elements of the component means i , and the distinct elements of the componentcovariance matrices i (i = 1, . . . , g). Under the assumption that x1, . . . , xn are independent observations, the log likelihood function for the parameter vector can be formed by summing over the log mixture density at each point x j to give The maximum likelihood estimate of is obtained as an appropriate root of the likelihood equation Solutions of (3) corresponding to local maxima can be found iteratively by application of the Expectation Maximization (EM) algorithm of Dempster et al. (1977); see also McLachlan and Krishnan (1997). The EM algorithm is applied in the framework where each observation x j is conceptualized to have arisen from one of the components and the indicator variable denoting its component of origin is taken to be missing. The so-called complete-data log likelihood is formed on the basis of these indicator variables in addition to the observed data x1, . . . , xn . On the E-step, the complete-data log likelihood is averaged over the conditional distribution of the indicator variables given the observed data, using the current estimate of the parameter vector. Since the complete-data log likelihood is linear in these indicator variables, the E-step of the EM algorithm simply involves replacing them by the current values of their conditional expectations, which are the so-called posterior probabilities of component membership. The posterior probability that the j th data point belongs to the i th component of the mixture is written here as i (x j ; ) and is given by i (x j ; ) = i (x j ; i , i )/ f (x j ; ) attention has focused on the use of multivariate normal components because of their computational convenience. We let x1, . . . , xn denote n p-dimensional observations. With a normal mixture model-based approach to clustering of these data, it is assumed that each observation x j is from a mixture of an initially specified number g of multivariate normal densities in some unknown proportions 1, . . . , g. That is, x j is taken to be a realization of a random vector X having the mixture probability density function (p.d.f.) f (x; ) defined by, for i = 1, . . . , g and j = 1, . . . , n. On the M-step, the estimates of the component mixing proportions, means, and covariance matrices are updated by using the current values for the posterior probabilities in place of the indicator variables in the usual closed-form expressions for the sample proportions, means, and covariance matrices. The E- and M-steps are alternated repeatedly until conver (...truncated)