Class discovery and classification of tumor samples using mixture modeling of gene expression data—a unified approach

Bioinformatics, Nov 2004

Motivation: The DNA microarray technology has been increasingly used in cancer research. In the literature, discovery of putative classes and classification to known classes based on gene expression data have been largely treated as separate problems. This paper offers a unified approach to class discovery and classification, which we believe is more appropriate, and has greater applicability, in practical situations.

Results: We model the gene expression profile of a tumor sample as coming from a finite mixture distribution, with each component characterizing the gene expression levels in a class. The proposed method was applied to a leukemia dataset, and good results were obtained. With appropriate choices of genes and preprocessing method, the number of leukemia types and subtypes is correctly inferred, and all the tumor samples are correctly classified into their respective type/subtype. Further evaluation of the method was carried out on other variants of the leukemia data and a colon dataset.

Supplementary information: The program implementing the method and additional details and figures are at http://www.stat.ohio-state.edu/~statgen/PAPERS/DNC-MIX.html.

Roxana Alexandridis, Shili Lin, Mark Irwin

Department of Statistics, Ohio State University, 1958 Neil Avenue, Columbus, OH 43210, USA

INTRODUCTION

Accurate classification of tumor samples is an essential tool for efficient cancer treatment. For many cancers, such as acute adult leukemias or non-Hodgkin's lymphomas, different subtypes show very different responses to therapy, although they have very similar morphological and histopathological appearance, reflecting the fact that they are molecularly distinct entities (Golub et al., 1999). DNA microarray technology, which has been increasingly used in cancer research, enables classification of tissue samples based only on gene expression data, without prior and often subjective biological knowledge (Golub et al., 1999; Dudoit et al., 2002).
A considerable amount of research involving microarray data analysis is focused on the discovery of putative types and subtypes of cancers using gene expression profiles of disease samples. Unsupervised learning approaches, which are commonly used for this problem, have the advantage of being impartial to currently accepted classes, but they may reveal a structure that is not biologically significant. Most of the recent publications on this issue use cluster analysis to group tumor samples and/or genes, with techniques such as self-organizing maps (SOMs) (e.g. Golub et al., 1999) and hierarchical clustering (e.g. Alon et al., 1999).

In addition to class discovery, an equally important problem is to classify test samples into known classes, with the help of a training set containing samples whose classes are known. Numerous approaches based on gene expression data have been proposed for classifying test samples into known classes, without allowing them to belong to new classes. Some of these are applicable only to binary classification, such as the weighted voting scheme of Golub et al. (1999), whereas others can handle multitype classification problems. These approaches range from traditional methods, such as Fisher's linear and quadratic discriminant analysis, to more modern machine learning techniques, such as classification trees or aggregation of classifiers by bagging or boosting (for a review see Dudoit et al., 2002). There are also approaches that can identify test samples that do not belong to any of the known classes by imposing thresholds on the prediction strength (e.g. Golub et al., 1999; Lee and Lee, 2002). However, these approaches cannot place such samples into new putative classes.

This paper proposes a unified approach to class discovery, classification into known classes, and the joint analysis of classification and class discovery.
The proposed method is an extension of Lin and Alexandridis (2003), and is based on modeling the distribution of the gene expression profile of a test sample as a finite mixture of an unknown number of distributions, with each mixture component characterizing the gene expression levels within a class. The distributional assumptions made here are the same as those in diagonal quadratic discriminant analysis (Dudoit et al., 2002), but both the training samples (if they exist) and the test samples are used to estimate the parameters of the model in our formulation. We applied the proposed method to the leukemia data of Golub et al. (1999) and a number of resampled datasets based on it. Further evaluation of the method was carried out on the colon cancer data of Alon et al. (1999). We use several measures for gene selection, and we explore the sensitivity of the class discovery and class prediction results to the number of genes in a classifier.

METHODS

Mixture modeling of test samples

Let $K$ be the number of known classes, which is zero in the absence of training samples. Let $y_{ki} = (y_{1ki}, \ldots, y_{Gki})$ denote the $i$-th training sample from class $k$, $k = 1, \ldots, K$, $i = 1, \ldots, N_k$. The length of the vector, $G$, is the number of genes used for class discovery and classification, and is referred to as the classifier size. The class labels of all training samples are therefore known. Furthermore, we use $x_i = (x_{1i}, \ldots, x_{Gi})$, $i = 1, \ldots, T$, to denote the $i$-th test sample, and we assume that the test samples can come from the $K$ known classes as well as from $U$ putative classes. However, there may not be any test samples belonging to some (or all) of the known classes. Consequently, the distribution of the test samples is modeled as a mixture of distributions of the $M = K + U$ components as follows:

$$\sum_{m=1}^{M} \pi_m f_m(x_i \mid \mu_m, \sigma^2_m), \quad i = 1, \ldots, T,$$

where $f_m$ is the probability density function of the $m$-th component of the mixture and is assumed to be normal.
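To make the mixture density concrete, the following is a minimal sketch of how it could be evaluated for one test sample. It assumes normal components with a diagonal covariance matrix, in line with the diagonal quadratic discriminant analysis assumptions mentioned above. The function names, the toy numbers, and the use of the log-sum-exp trick are illustrative choices, not part of the original method description.

```python
import math


def log_normal_diag(x, mean, var):
    """Log density of a normal with diagonal covariance, evaluated at x.

    x, mean and var are sequences of length G (one entry per gene).
    """
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v)
        for xi, mu, v in zip(x, mean, var)
    )


def mixture_log_density(x, weights, means, variances):
    """Log of sum_m weights[m] * f_m(x | means[m], variances[m]).

    Computed with log-sum-exp so that it stays finite when G is large
    and the individual component densities underflow.
    """
    terms = [
        math.log(w) + log_normal_diag(x, mu, var)
        for w, mu, var in zip(weights, means, variances)
        if w > 0.0
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Toy example (made-up numbers): M = 2 components, G = 2 genes.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
log_density = mixture_log_density([0.0, 0.0], weights, means, variances)
```

Working on the log scale is a practical choice here: with many genes, a product of per-gene densities can underflow to zero in floating point even when the log density is well defined.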
The parameter set $\Theta_M = (\pi_1, \ldots, \pi_M, \mu_1, \ldots, \mu_M, \sigma^2_1, \ldots, \sigma^2_M)$ then contains the mixture coefficients $\pi_m$ ($\sum_{m=1}^{M} \pi_m = 1$), the mean vectors $\mu_m$, and the vectors $\sigma^2_m$ of the parameters of the variance-covariance matrices, $m = 1, \ldots, M$. Note that $M \ge \max\{1, K\}$, so that, if training samples do not exist, there is still at least one putative class for the test samples. We further assume a diagonal variance-covariance matrix for each component density; therefore, $\sigma^2_m$ is a vector of the variances on the diagonal.

EM estimation of parameters

The maximum-likelihood estimates (MLEs) of the parameters, $\Theta_M$, are obtained using the Expectation Maximization (EM) algorithm (McLachlan and Peel, 2000). Let $Z_i$ denote the unknown class label, the missing component, of the $i$-th test sample, which takes values in the set $\{1, \ldots, M\}$. Thus, the complete data are $\{(x_i, Z_i), i = 1, \ldots, T\} \cup \{(y_{ki}, k), k = 1, \ldots, K, i = 1, \ldots, N_k\}$, and the corresponding complete data likelihood is

$$\prod_{i=1}^{T} \prod_{m=1}^{M} \left[\pi_m f_m(x_i \mid \mu_m, \sigma^2_m)\right]^{I(Z_i = m)},$$

where $I(Z_i = m)$ is an indicator function taking the value of 1 if $Z_i =$ (...truncated)
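The general shape of an EM iteration for this kind of model can be sketched as follows. This is a hedged illustration only: it uses the textbook EM updates for a normal mixture with diagonal covariance, and the paper's own update formulas are not available in the text above. The sketch assumes that training samples enter the mean and variance updates of their known class with weight 1 and do not affect the mixture coefficients, that the number of components M is fixed in advance (the paper treats it as unknown), and that known classes occupy the first K component indices (0-based). The function names, the `var_floor` safeguard, the initialization, and the toy data are all hypothetical.

```python
import math


def log_normal_diag(x, mean, var):
    """Log density of a diagonal-covariance normal at x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
               for a, m, v in zip(x, mean, var))


def e_step(test, weights, means, variances):
    """Posterior probability tau[i][m] that test sample i is in component m."""
    tau = []
    for x in test:
        logs = [math.log(max(w, 1e-300)) + log_normal_diag(x, mu, var)
                for w, mu, var in zip(weights, means, variances)]
        top = max(logs)
        probs = [math.exp(value - top) for value in logs]
        total = sum(probs)
        tau.append([p / total for p in probs])
    return tau


def m_step(test, tau, train, n_components, var_floor=1e-6):
    """Update weights, means and variances.

    train maps a known class index (0-based) to its list of training samples.
    Assumption: training samples count with weight 1 in their own class.
    """
    n_genes = len(test[0])
    weights, means, variances = [], [], []
    for m in range(n_components):
        points = [(x, t[m]) for x, t in zip(test, tau)]
        points += [(y, 1.0) for y in train.get(m, [])]
        total = max(sum(w for _, w in points), 1e-12)
        mean = [sum(w * p[g] for p, w in points) / total
                for g in range(n_genes)]
        var = [max(sum(w * (p[g] - mean[g]) ** 2 for p, w in points) / total,
                   var_floor)
               for g in range(n_genes)]
        # Mixture coefficients are estimated from the test samples only.
        weights.append(sum(t[m] for t in tau) / len(test))
        means.append(mean)
        variances.append(var)
    return weights, means, variances


def fit(test, train, init_means, n_iter=50):
    """Run a fixed number of EM iterations from the given initial means."""
    n_components = len(init_means)
    n_genes = len(test[0])
    weights = [1.0 / n_components] * n_components
    means = [list(mu) for mu in init_means]
    variances = [[1.0] * n_genes for _ in range(n_components)]
    for _ in range(n_iter):
        tau = e_step(test, weights, means, variances)
        weights, means, variances = m_step(test, tau, train, n_components)
    tau = e_step(test, weights, means, variances)
    return weights, means, variances, tau


# Toy data (made up): K = 1 known class near 0, U = 1 putative class near 5.
train = {0: [[0.1, -0.2], [-0.1, 0.2], [0.0, 0.1]]}
test = [[0.2, 0.0], [-0.2, 0.1], [5.1, 4.9], [4.8, 5.2], [5.0, 5.0]]
weights, means, variances, tau = fit(test, train,
                                     init_means=[[0.0, 0.0], [4.0, 4.0]])
# Assign each test sample to the component with the largest posterior.
labels = [max(range(len(row)), key=lambda m: row[m]) for row in tau]
```

In this sketch a test sample assigned to component 0 is classified into the known class, while one assigned to component 1 is placed in the putative class, which is how a single fitted mixture can serve both classification and class discovery.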


This is a preview of a remote PDF: https://bioinformatics.oxfordjournals.org/content/20/16/2545.full.pdf
Article home page: http://bioinformatics.oxfordjournals.org/content/20/16/2545.abstract

Roxana Alexandridis, Shili Lin, Mark Irwin. Class discovery and classification of tumor samples using mixture modeling of gene expression data—a unified approach, Bioinformatics, 2004, pp. 2545-2552, 20/16, DOI: 10.1093/bioinformatics/bth281