On the classification of microarray gene-expression data

https://academic.oup.com/bib/article-pdf/14/4/402/478239/bbs056.pdf

BRIEFINGS IN BIOINFORMATICS. VOL 14. NO 4. 402–410
Advance Access published on 17 September 2012
doi:10.1093/bib/bbs056

Kaye E. Basford, Geoffrey J. McLachlan and Suren I. Rathnayake

Submitted: 16th May 2012; Received (in revised form): 31st July 2012

Abstract

We consider the classification of microarray gene-expression data. First, attention is given to the supervised case, where the tissue samples are classified with respect to a number of predefined classes and the intent is to assign a new unclassified tissue to one of these classes. The problems of forming a classifier and estimating its error rate are addressed in the context of there being a relatively small number of observations (tissue samples) compared to the number of variables (that is, the genes, which can number in the tens of thousands). We then proceed to the unsupervised case and consider the clustering of the tissue samples and also the clustering of the gene profiles. Both problems can be viewed as being non-standard ones in statistics and we address some of the key issues involved. The focus is on the use of mixture models to effect the clustering for both problems.

Keywords: supervised classification; selection bias; unsupervised classification; mixture models; factor models; time-course data

Corresponding author. Geoffrey J. McLachlan, Department of Mathematics, University of Queensland, St Lucia, QLD 4072, Australia. Tel.: +61-7-3365-2150; Fax: +61-7-3365-1477.

Kaye E. Basford is the President of the Academic Board of the University of Queensland, Australia, and oversees the Queensland node of the Australian Centre for Plant Functional Genomics. She holds a PhD degree in statistics from the University of Queensland and is a fellow of the Australian Academy of Technological Sciences and Engineering.

Geoffrey J. McLachlan is Professor of Statistics at the University of Queensland, Australia. He holds a PhD degree in statistics from the University of Queensland and was also awarded a Doctor of Science.

Suren I. Rathnayake is a post-doctoral fellow at the School of Mathematics and Physics, University of Queensland, Australia. He holds a PhD degree in biomedical engineering from the University of Queensland.

© The Author 2012. Published by Oxford University Press.

INTRODUCTION

DNA microarray technology, first described in the mid-1990s, is a method to perform experiments on thousands of gene fragments in parallel. Its widespread use has led to a huge growth in the amount of expression data available. A variety of multivariate analysis methods such as cluster and discriminant analyses have been used to explore gene-expression data for relationships among the genes and the tissue samples. The utility of these methods has been demonstrated, for example, in the elucidation of unknown gene function, the validation of gene discoveries, the interpretation of biological processes, and the diagnosis and prediction of disease or treatment outcomes [1–3].

A common goal of microarray analyses of many diseases, in particular, cancer, is to identify as yet unclassified cancer subtypes for subsequent validation and prediction, and ultimately to develop individualized prognosis and therapy. Limiting factors include the difficulties of tissue acquisition and the expense of microarray experiments. Thus, often microarray studies attempt to perform an analysis of a small number of tumor samples on the basis of a large number of genes and can result in gene-to-sample ratios of 100-fold. This is known as the ‘big p, small n’ problem in statistics, where p denotes the number of variables and n the number of observations. It is not the standard situation in statistics, where many procedures have been developed for the case where the variable-to-observation number ratio (corresponding to the gene-to-sample ratio) is relatively small.

Although biological experiments vary considerably in their design, the data generated by microarray experiments can be viewed as a matrix of expression levels. For M microarray experiments (corresponding to M tissue samples), where we measure the expression levels of N genes in each experiment, the results can be represented by a N × M matrix. For each tissue, we can consider the expression levels of the ...

[Figure 1 placeholder: matrix with rows labeled Gene 1, Gene 2, ..., Gene N and columns labeled Sample 1, Sample 2, ..., Sample M; annotations: Expression Profile, Expression Signature.]

Figure 1: Gene expression data from M microarray experiments represented as a matrix of expression levels with the N rows corresponding to the N genes and the M columns to the M tissue samples.

... arrange genes in some natural order, that is, to organize genes into clusters with similar behavior across relevant tissue samples (or cell lines). Although a cluster does not automatically correspond to a pathway, it is a reasonable approximation that genes in the same cluster have something to do with each other, such as being involved in the same pathway and/or having the same gene functions.

It can be seen that there are two distinct but related clustering problems with microarray data. One problem concerns the clustering of the tissues on the basis of the genes (the gene signatures) and the other concerns the clustering of the genes on the basis of the tissues (the gene profiles). This duality in cluster analysis is quite common. In the present context of microarray data, one may be interested in grouping tissues (patients) with similar expression values or in grouping genes on patients with similar types of tumors or similar survival rates.
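
The two views of the expression matrix described above can be made concrete with a small sketch. It is only an illustration, not code from the article: the function names, the matrix sizes (N = 2000 genes, M = 20 samples, chosen to give a gene-to-sample ratio of 100) and the simulated Gaussian values are all assumptions. A row of the N × M matrix is read as a gene's expression profile and a column as a tissue's expression signature.

```python
import random


def make_expression_matrix(n_genes, n_samples, seed=0):
    """Toy N x M matrix (hypothetical data): N rows (genes), M columns (samples)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
            for _ in range(n_genes)]


def expression_profile(matrix, gene):
    """Row `gene`: one gene's expression levels across all M tissue samples."""
    return list(matrix[gene])


def expression_signature(matrix, sample):
    """Column `sample`: the expression levels of all N genes in one tissue."""
    return [row[sample] for row in matrix]


# Hypothetical sizes: many genes (variables), few tissues (observations).
N, M = 2000, 20
X = make_expression_matrix(N, M)

profile = expression_profile(X, 0)      # length M: the unit when clustering genes
signature = expression_signature(X, 0)  # length N: the unit when clustering tissues
ratio = N / M                           # gene-to-sample ratio

print(len(profile), len(signature), ratio)  # prints: 20 2000 100.0
```

Clustering the tissues treats the M signatures (each of dimension N) as the observations, whereas clustering the genes treats the N profiles (each of dimension M) as the observations. The same matrix therefore gives a 'big p, small n' problem in the first case and a far more conventional one in the second.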

SUPERVISED CLASSIFICATION

In this section, we consider the supervised classification of tissue samples. The intent is to form a classifier for assigning an unclassified sample with gene expression signature y_0 to one of g classes. The classes may represent, for example, different types of a tumor, different types of treatment to be administered to a patient or the different outcomes (absence or presence of distant metastases within 5 years) for a patient undergoing a particular treatment such as chemotherapy. In this case, the M available tissue samples represented in Figure 1 are of known classification with respect to the g classes; that is, we know the class label z_j corresponding to the jth gene signature vector y_j (j = 1, ..., n), where z_j = i if the jth tissue is from the ith class (i = 1, ..., g; j = 1, ..., n).

Because the dimension p of each gene expression signature y_j is very large, being equal to the number of genes N, some form of dimension or variable (gene) selection is usually undertaken first before an attempt is made to form a classifier from the training data. One common approach is to replace the N genes by a smaller number k of linear combinations of the genes; that is, the N genes in each gene signature vector y_j are replaced by the k linear combinations (...truncated)