On the classification of microarray gene-expression data
BRIEFINGS IN BIOINFORMATICS. VOL 14. NO 4. 402–410
Advance Access published on 17 September 2012
doi:10.1093/bib/bbs056
Kaye E. Basford, Geoffrey J. McLachlan and Suren I. Rathnayake
Submitted: 16th May 2012; Received (in revised form): 31st July 2012
Abstract
Keywords: supervised classification; selection bias; unsupervised classification; mixture models; factor models;
time-course data
INTRODUCTION
DNA microarray technology, first described in the
mid-1990s, is a method to perform experiments on
thousands of gene fragments in parallel. Its widespread use has led to a huge growth in the amount
of expression data available. A variety of multivariate
analysis methods such as cluster and discriminant
analyses have been used to explore gene-expression
data for relationships among the genes and the tissue
samples. The utility of these methods has been
demonstrated, for example, in the elucidation of
unknown gene function, the validation of gene discoveries, the interpretation of biological processes,
and the diagnosis and prediction of disease or treatment outcomes [1–3]. A common goal of microarray
analyses of many diseases, in particular, cancer, is to
identify as yet unclassified cancer subtypes for
subsequent validation and prediction, and ultimately
to develop individualized prognosis and therapy.
Limiting factors include the difficulties of tissue
acquisition and the expense of microarray experiments. Thus, microarray studies often attempt to
analyze a small number of tumor
samples on the basis of a large number of genes,
which can result in gene-to-sample ratios of around 100-fold.
This is known as the ‘big p, small n’ problem
in statistics, where p denotes the number of variables
and n the number of observations. It is not the
standard situation in statistics, where many procedures have been developed for the case where the
variable-to-observation number ratio (corresponding
to the gene-to-sample ratio) is relatively small.
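One way to see why the 'big p, small n' setting is non-standard: the sample covariance matrix of n observations on p variables has rank at most n - 1, so it is singular whenever p is at least n, and procedures that need its inverse cannot be applied directly. The following is a minimal sketch of that fact on a made-up toy data set (3 samples, 5 genes; the numbers are purely illustrative and the function names are hypothetical). It uses exact rational arithmetic from the standard library so that the rank is not affected by rounding.

```python
from fractions import Fraction

def sample_covariance(rows):
    """Sample covariance (p x p) of n observations, each a list of p values."""
    n, p = len(rows), len(rows[0])
    means = [sum(Fraction(r[k]) for r in rows) / n for k in range(p)]
    centered = [[Fraction(r[k]) - means[k] for k in range(p)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
            for a in range(p)]

def matrix_rank(matrix):
    """Rank by Gaussian elimination over exact fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank

# Hypothetical toy data: n = 3 tissue samples (rows), p = 5 genes (columns).
samples = [
    [2, 7, 1, 8, 3],
    [4, 1, 5, 9, 2],
    [6, 5, 3, 5, 9],
]
cov = sample_covariance(samples)
n, p = len(samples), len(samples[0])
print(f"p = {p}, n = {n}, rank of covariance = {matrix_rank(cov)}")
```

Here the 5 x 5 covariance matrix has rank 2 = n - 1, well short of p = 5. With real data one would more likely rely on a numerical routine such as `numpy.linalg.matrix_rank`; the exact-arithmetic version is used only to keep the sketch dependency-free.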
Although biological experiments vary considerably in their design, the data generated by microarray
experiments can be viewed as a matrix of expression
levels. For M microarray experiments (corresponding
to M tissue samples), where we measure the expression levels of N genes in each experiment, the results
can be represented by an N × M matrix. For each
tissue, we can consider the expression levels of the
Corresponding author. Geoffrey J. McLachlan, Department of Mathematics, University of Queensland, St Lucia, QLD 4072, Australia.
Tel.: +61-7-3365-2150; Fax: +61-7-3365-1477; E-mail:
Kaye E. Basford is the President of the Academic Board of the University of Queensland, Australia, and oversees the Queensland
node of the Australian Centre for Plant Functional Genomics. She holds a PhD degree in statistics from the University of Queensland
and is a fellow of the Australian Academy of Technological Sciences and Engineering.
Geoffrey J. McLachlan is Professor of Statistics at the University of Queensland, Australia. He holds a PhD degree in statistics from
the University of Queensland and was also awarded a Doctor of Science.
Suren I. Rathnayake is a post-doctoral fellow at the School of Mathematics and Physics, University of Queensland, Australia. He
holds a PhD degree in biomedical engineering from the University of Queensland.
© The Author 2012. Published by Oxford University Press. For Permissions, please email:
We consider the classification of microarray gene-expression data. First, attention is given to the supervised case,
where the tissue samples are classified with respect to a number of predefined classes and the intent is to assign a
new unclassified tissue to one of these classes. The problems of forming a classifier and estimating its error rate
are addressed in the context of there being a relatively small number of observations (tissue samples) compared to
the number of variables (that is, the genes, which can number in the tens of thousands). We then proceed to the
unsupervised case and consider the clustering of the tissue samples and also the clustering of the gene profiles.
Both problems can be viewed as being non-standard ones in statistics and we address some of the key issues
involved. The focus is on the use of mixture models to effect the clustering for both problems.
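[Image: schematic of the expression matrix. Rows are Gene 1, Gene 2, ..., Gene N (each row is an expression profile); columns are Sample 1, Sample 2, ..., Sample M (each column is an expression signature).]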
Figure 1: Gene expression data from M microarray
experiments represented as a matrix of expression
levels with the N rows corresponding to the N genes
and the M columns to the M tissue samples.
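The layout in Figure 1 can be made concrete with a small sketch. The gene names, sample names, expression values, and helper functions below are invented for illustration; only the orientation (genes as rows, tissue samples as columns) follows the text.

```python
# Hypothetical expression matrix: N = 4 genes (rows) by M = 3 tissue samples (columns).
gene_names = ["gene_1", "gene_2", "gene_3", "gene_4"]
sample_names = ["sample_1", "sample_2", "sample_3"]
expression = [
    [5.1, 4.8, 9.7],   # gene_1
    [2.0, 2.2, 1.9],   # gene_2
    [7.3, 7.1, 3.4],   # gene_3
    [0.5, 0.4, 6.2],   # gene_4
]

def expression_profile(matrix, gene_index):
    """Row of the matrix: one gene measured across all M tissue samples."""
    return list(matrix[gene_index])

def expression_signature(matrix, sample_index):
    """Column of the matrix: all N genes measured in one tissue sample."""
    return [row[sample_index] for row in matrix]

N, M = len(expression), len(expression[0])
print(f"{N} genes x {M} samples")
print("profile of gene_2:", expression_profile(expression, 1))
print("signature of sample_3:", expression_signature(expression, 2))
```

A profile therefore has length M and a signature has length N, which is why the two clustering problems discussed in the text operate on vectors of very different dimension.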
arrange genes in some natural order, that is, to
organize genes into clusters with similar behavior
across relevant tissue samples (or cell lines).
Although a cluster does not automatically correspond
to a pathway, it is a reasonable approximation that
genes in the same cluster have something to do with
each other, such as being involved in the same pathway and/or having the same gene functions.
It can be seen that there are two distinct but
related clustering problems with microarray data.
One problem concerns the clustering of the tissues
on the basis of the genes (the gene signatures) and the
other concerns the clustering of the genes on the
basis of the tissues (the gene profiles). This duality
in cluster analysis is quite common. In the present
context of microarray data, one may be interested in
grouping tissues (patients) with similar expression
values or in grouping genes on the basis of patients with similar
types of tumors or similar survival rates.
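The duality can be illustrated by running one clustering routine on the expression matrix and then on its transpose. The sketch below uses a bare-bones k-means as a stand-in; that choice is an assumption made for brevity and is not the mixture-model approach this paper focuses on. The data and function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(points):
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(points, k, n_iter=20):
    """Plain k-means; centers start at the first k points (a simplification)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda i: squared_distance(p, centers[i]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster empties
                centers[i] = mean_vector(members)
    return labels

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

# Hypothetical data: 4 genes (rows) by 4 tissue samples (columns).
expression = [
    [9.0, 1.2, 8.8, 1.0],   # gene_1: high in samples 1 and 3
    [1.1, 9.1, 0.9, 9.3],   # gene_2: high in samples 2 and 4
    [8.7, 0.8, 9.2, 1.1],   # gene_3: behaves like gene_1
    [1.0, 8.9, 1.2, 9.0],   # gene_4: behaves like gene_2
]

gene_clusters = k_means(expression, k=2)               # rows: gene profiles
tissue_clusters = k_means(transpose(expression), k=2)  # columns: gene signatures
print("gene clusters:  ", gene_clusters)
print("tissue clusters:", tissue_clusters)
```

The same routine serves both problems; what changes is whether the rows or the columns of the matrix are treated as the objects to be grouped. In real microarray data the two cases differ sharply in scale, since clustering tissues means grouping a few objects of very high dimension, whereas clustering genes means grouping many objects of low dimension.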
SUPERVISED CLASSIFICATION
In this section, we consider the supervised classification of tissue samples. The intent is to form a classifier for assigning an unclassified sample with gene
expression signature y0 to one of g classes. The classes
may represent, for example, different types of a
tumor, different types of treatment to be administered to a patient or the different outcomes (absence
or presence of distant metastases within 5 years) for a
patient undergoing a particular treatment such as
chemotherapy. In this case, the M available tissue
samples represented in Figure 1 are of known classification with respect to the g classes; that is, we know
the class label zj corresponding to the jth gene signature vector yj (j = 1, ..., n, with n = M), where zj = i if the jth
tissue is from the ith class (i = 1, ..., g; j = 1, ..., n).
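As a minimal illustration of this setup, the sketch below forms a nearest-centroid rule from labeled signatures yj with class labels zj and assigns a new signature y0 to one of g = 2 classes. The nearest-centroid rule is chosen here only because it is short; it is not put forward as the classifier used in this paper, and the data and function names are invented.

```python
def class_centroids(signatures, labels):
    """Mean signature of each class, from training pairs (y_j, z_j)."""
    centroids = {}
    for label in sorted(set(labels)):
        members = [y for y, z in zip(signatures, labels) if z == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def classify(y0, centroids):
    """Assign y0 to the class whose centroid is closest in squared Euclidean distance."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(y0, centroids[label]))
    return min(centroids, key=dist)

# Hypothetical training data: n = 4 tissue samples, p = 3 genes, g = 2 classes.
signatures = [
    [8.0, 1.0, 5.0],
    [9.0, 2.0, 5.0],
    [1.0, 8.0, 5.0],
    [2.0, 9.0, 5.0],
]
labels = [1, 1, 2, 2]

centroids = class_centroids(signatures, labels)
y0 = [7.5, 2.5, 5.0]  # new, unclassified signature
print("assigned class:", classify(y0, centroids))
```

In this toy example p is smaller than n, which keeps the arithmetic easy to follow. In actual microarray studies p is far larger than n, which is what motivates the reduction of the number of genes discussed next.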
Because the dimension p of each gene expression
signature yj is very large, being equal to the number
of genes N, some form of dimension or variable
(gene) selection is usually undertaken first before an
attempt is made to form a classifier from the training
data. One common approach is to replace the N
genes by a smaller number k of linear combinations
of the genes; that is, the N genes in each gene signature vector yj are replaced by the k linear combinations (...truncated)