A mixture model-based approach to the clustering of microarray expression data
G. J. McLachlan
0
R. W. Bean
0
D. Peel
0
0
Department of Mathematics, University of Queensland
,
Brisbane, Queensland 4072
,
Australia
Motivation: This paper introduces the software EMMIXGENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets. Availability: EMMIX-GENE is available at http://www. maths.uq.edu.au/gjm/emmix-gene/ Contact:
-
1 INTRODUCTION
The analysis of gene expression microarray data using
clustering techniques has an important role to play in
the discovery, validation, and understanding of various
classes and subclasses of cancer; see, for example, Eisen
et al. (1998), Ben-Dor et al. (1999, 2000), Alon et al.
(1999), Golub et al. (1999), Hastie et al. (2000), Moler
et al. (2000), Nguyen and Rocke (2001), and Xing and
Karp (2001), among others. The clustering algorithm we
present here, called EMMIX-GENE, can be applied to
the problem of clustering tissue samples on the basis
of genes and to the problem of clustering genes on the
basis of tissues. For the clustering of genes, the
EMMIXGENE software makes use of existing options from the
EMMIX program of McLachlan et al. (1999). The tissue
space and the gene space are generally of quite different
dimensionality (10102 tissues versus 103104 genes).
The clustering of the genes on the basis of the tissues
is therefore a standard cluster analysis problem that can
be effected by using existing software to fit normal
mixture models. But unless the genes are assumed to be
uncorrelated within a cluster, the clustering of the tissue
samples on the basis of all the genes is nonstandard
since the dimension of each tissue sample (the number
of genes) is so much greater than the number of tissues.
This dimensionality problem is handled with the
EMMIXGENE approach by fitting mixtures of factor analyzers,
which allow for nonzero component-correlations between
the genes. Given the very large number of genes in a
typical tissue sample, EMMIX-GENE initially considers
a reduction in the number of genes to be used in the
clustering process.
The EMMIX-GENE approach is to be illustrated in the
clustering of two well-known data sets in the microarray
literature, the colon data analyzed initially in Alon et
al. (2000), and the leukaemia data first analyzed in
Golub et al. (1999).
Before we proceed to present the EMMIX-GENE
approach, we shall briefly summarize the normal mixture
model and the extensions to mixtures of t distributions
and to mixtures of factor analyzers. Finite mixtures of
distributions have provided a sound mathematical-based
approach to the statistical modelling of a wide variety of
random phenomena; see, for example, McLachlan and
Peel (2000a). For multivariate data of a continuous nature,
where (x; i , i ) denotes the p-variate normal density
probability function with mean i and covariance matrix
i (i = 1, . . . , g). Here the vector of unknown
parameters consists of the mixing proportions i , the elements of
the component means i , and the distinct elements of the
componentcovariance matrices i (i = 1, . . . , g).
Under the assumption that x1, . . . , xn are independent
observations, the log likelihood function for the parameter
vector can be formed by summing over the log mixture
density at each point x j to give
The maximum likelihood estimate of is obtained as an
appropriate root of the likelihood equation
Solutions of (3) corresponding to local maxima can
be found iteratively by application of the Expectation
Maximization (EM) algorithm of Dempster et al. (1977);
see also McLachlan and Krishnan (1997). The EM
algorithm is applied in the framework where each
observation x j is conceptualized to have arisen from one of
the components and the indicator variable denoting its
component of origin is taken to be missing. The so-called
complete-data log likelihood is formed on the basis of
these indicator variables in addition to the observed
data x1, . . . , xn . On the E-step, the complete-data log
likelihood is averaged over the conditional distribution
of the indicator variables given the observed data, using
the current estimate of the parameter vector. Since the
complete-data log likelihood is linear in these indicator
variables, the E-step of the EM algorithm simply involves
replacing them by the current values of their conditional
expectations, which are the so-called posterior
probabilities of component membership. The posterior probability
that the j th data point belongs to the i th component of the
mixture is written here as i (x j ; ) and is given by
i (x j ; ) = i (x j ; i , i )/ f (x j ; )
attention has focused on the use of multivariate normal
components because of their computational convenience.
We let x1, . . . , xn denote n p-dimensional observations.
With a normal mixture model-based approach to
clustering of these data, it is assumed that each observation x j is
from a mixture of an initially specified number g of
multivariate normal densities in some unknown proportions
1, . . . , g. That is, x j is taken to be a realization of a
random vector X having the mixture probability density
function (p.d.f.) f (x; ) defined by,
for i = 1, . . . , g and j = 1, . . . , n. On the M-step, the
estimates of the component mixing proportions, means, and
covariance matrices are updated by using the current
values for the posterior probabilities in place of the
indicator variables in the usual closed-form expressions for the
sample proportions, means, and covariance matrices. The
E- and M-steps are alternated repeatedly until conver (...truncated)