Class discovery and classification of tumor samples using mixture modeling of gene expression data—a unified approach
Roxana Alexandridis, Shili Lin, Mark Irwin
Department of Statistics, Ohio State University, 1958 Neil Avenue, Columbus, OH 43210, USA
Motivation: DNA microarray technology has been increasingly used in cancer research. In the literature, discovery of putative classes and classification to known classes based on gene expression data have been largely treated as separate problems. This paper offers a unified approach to class discovery and classification, which we believe is more appropriate, and has greater applicability, in practical situations.
Results: We model the gene expression profile of a tumor sample as coming from a finite mixture distribution, with each component characterizing the gene expression levels in a class. The proposed method was applied to a leukemia dataset, and good results were obtained. With appropriate choices of genes and preprocessing method, the number of leukemia types and subtypes is correctly inferred, and all the tumor samples are correctly classified into their respective type/subtype. Further evaluation of the method was carried out on other variants of the leukemia data and a colon dataset.
Contact:
Supplementary information: The program implementing the method and additional details and figures are at http://www.stat.ohio-state.edu/statgen/PAPERS/DNC-MIX.html.
INTRODUCTION
Accurate classification of tumor samples is an essential tool
for efficient cancer treatment. For many cancers, such as
acute adult leukemias or non-Hodgkin's lymphomas, different
subtypes show very different responses to therapy, although
they have very similar morphological and histopathological
appearance, reflecting the fact that they are molecularly
distinct entities (Golub et al., 1999). DNA microarray
technology has been increasingly used in cancer research;
it enables classification of tissue samples based only
on gene expression data, without prior and often subjective
biological knowledge (Golub et al., 1999; Dudoit et al., 2002).
A considerable amount of research involving microarray
data analysis is focused on the discovery of putative types
and subtypes of cancers using gene expression profiles of
disease samples. Unsupervised learning approaches, techniques
commonly used for this problem, have the advantage of being
impartial to currently accepted classes, but they may reveal a
structure that is not biologically significant. Most recent
publications on this issue use cluster analysis
to group tumor samples and/or genes, with methods such
as self-organizing maps (SOMs) (e.g. Golub et al., 1999) and
hierarchical clustering (e.g. Alon et al., 1999).
In addition to class discovery, an equally important
problem is to classify test samples into known classes, with the
help of a training set containing samples whose classes are
known. Numerous approaches based on gene expression data
have been proposed for classifying test samples into known
classes, without allowing them to belong to new classes. Some
of these are applicable only to binary classification, such as
the weighted voting scheme of Golub et al. (1999), whereas
others can handle multitype classification problems. These
approaches range from traditional methods, such as Fisher's
linear and quadratic discriminant analysis, to more modern
machine learning techniques, such as classification trees or
aggregation of classifiers by bagging or boosting (for a review
see Dudoit et al., 2002). Some approaches can also
identify test samples that do not belong to any
of the known classes by imposing thresholds on the
prediction strength (e.g. Golub et al., 1999; Lee and Lee, 2002).
However, they cannot place these samples into new
putative classes.
This paper proposes a unified approach to class discovery,
classification into known classes, and the joint analysis of
classification and class discovery. The method proposed is
an extension of Lin and Alexandridis (2003), and is based
on modeling the distribution of the gene expression profile
of a test sample as a finite mixture of an unknown number
of distributions, with each mixture component characterizing
the gene expression levels within a class. The distributional
assumptions made here are the same as those in diagonal
quadratic discriminant analysis (Dudoit et al., 2002), but both the
training samples (if they exist) and the test samples are used to
estimate the parameters of the model in our formulation. We
applied the method proposed to the leukemia data of Golub
et al. (1999) and a number of resampled datasets based on it.
Further evaluation of the method was carried out on the colon
cancer data of Alon et al. (1999). We use several measures
for gene selection, and we explore the sensitivity of the class
discovery and class prediction results to the number of genes
in a classifier.
METHODS
Mixture modeling of test samples
Let $K$ be the number of known classes, which is zero in
the absence of training samples. Let $y_{ki} = (y_{1ki}, \dots, y_{Gki})$
denote the $i$-th training sample from class $k$, $k = 1, \dots, K$,
$i = 1, \dots, N_k$. The length of the vector, $G$, is the
number of genes used for class discovery and classification, and
is referred to as the classifier size. Hence, the class labels
of all training samples are known. Furthermore, we use
$x_i = (x_{1i}, \dots, x_{Gi})$, $i = 1, \dots, T$, to denote the $i$-th test
sample, and we assume that the test samples can come from
the $K$ known classes as well as from $U$ putative classes.
However, there may not be any test samples belonging to some (or
all) of the known classes. Consequently, the distribution of
the test samples is modeled as a mixture of distributions of
the $M = K + U$ components as follows:
$$\sum_{m=1}^{M} \pi_m f_m(x_i \mid \mu_m, \sigma_m^2), \quad i = 1, \dots, T,$$
where $f_m$ is the probability density function of the $m$-th
component of the mixture and is assumed to be normal. The
parameter set $\Theta_M = (\pi_1, \dots, \pi_M, \mu_1, \dots, \mu_M, \sigma_1^2, \dots, \sigma_M^2)$
then contains the mixture coefficients $\pi_m$ ($\sum_{m=1}^{M} \pi_m = 1$),
the mean vectors $\mu_m$, and the vectors $\sigma_m^2$ of the parameters of
the variance-covariance matrices, $m = 1, \dots, M$. Note that
$M \geq \max\{1, K\}$ such that, if training samples do not exist,
there is still at least one putative class for the test samples. We
further assume a diagonal variance-covariance matrix for each
component density; therefore, $\sigma_m^2$ is a vector of the variances
on the diagonal.
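As an illustration only, the mixture density with diagonal normal components can be evaluated as in the following minimal sketch. With a diagonal variance-covariance matrix, each component density factorizes into a product of univariate normal densities over the G genes, so its logarithm is a sum over genes. The function names, the array names (`weights`, `means`, `variances`), and the toy numbers are assumptions made for this example and do not come from the authors' program.

```python
import numpy as np


def diag_normal_logpdf(x, mean, var):
    """Log density of a normal with diagonal covariance, for each row of x.

    x: (T, G) array of test samples; mean, var: (G,) arrays (hypothetical names).
    """
    x = np.atleast_2d(x)
    # Diagonal covariance: the log density is a sum of univariate terms over genes.
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var, axis=1)


def mixture_logpdf(x, weights, means, variances):
    """Log of the mixture density, sum over m of weight_m * f_m(x_i), per sample."""
    # comp[i, m] = log(weight_m) + log f_m(x_i)
    comp = np.stack(
        [np.log(w) + diag_normal_logpdf(x, mu, s2)
         for w, mu, s2 in zip(weights, means, variances)],
        axis=1,
    )
    # Log-sum-exp over components for numerical stability when G is large.
    top = comp.max(axis=1, keepdims=True)
    return (top + np.log(np.exp(comp - top).sum(axis=1, keepdims=True))).ravel()


# Toy example (made-up numbers): M = 2 components, G = 3 genes, T = 2 test samples.
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
variances = np.array([[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]])
x = np.array([[0.1, -0.2, 0.3], [2.1, 1.9, 2.2]])
log_density = mixture_logpdf(x, weights, means, variances)
```

Working on the log scale is a design choice of this sketch: with many genes, the product of univariate densities can underflow in floating point, whereas the sum of log terms does not.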
EM estimation of parameters
The maximum-likelihood estimates (MLEs) of the
parameters, $\Theta_M$, are obtained using the Expectation-Maximization
(EM) algorithm (McLachlan and Peel, 2000). Let $Z_i$ denote
the unknown class label, the missing component, of the $i$-th
test sample, which takes values in the set $\{1, \dots, M\}$. Thus,
the complete data are $\{(x_i, Z_i), i = 1, \dots, T\} \cup \{(y_{ki}, k), k = 1, \dots, K, i = 1, \dots, N_k\}$, and the corresponding complete
data likelihood is
$$\prod_{k=1}^{K} \prod_{i=1}^{N_k} f_k(y_{ki} \mid \mu_k, \sigma_k^2) \times \prod_{i=1}^{T} \prod_{m=1}^{M} \left[\pi_m f_m(x_i \mid \mu_m, \sigma_m^2)\right]^{I(Z_i = m)},$$
where $I(Z_i = m)$ is an indicator function taking the value
of 1 if Zi = (...truncated)