Knowledge-based gene expression classification via matrix factorization
BIOINFORMATICS
ORIGINAL PAPER
Vol. 24 no. 15 2008, pages 1688–1697
doi:10.1093/bioinformatics/btn245
Gene expression
R. Schachtner1, D. Lutter1,2,3, P. Knollmüller1, A. M. Tomé4, F. J. Theis1,2, G. Schmitz3,
M. Stetter5, P. Gómez Vilda6 and E. W. Lang1,∗
1 CIML/Biophysics, University of Regensburg, D-93040 Regensburg, 2 CMB/IBI, GSF Munich, 3 Clinical Chemistry,
University Hospital Regensburg, D-93042 Regensburg, Germany, 4 IEETA/DETI, Universidade de Aveiro, 3810-193
Aveiro, Portugal, 5 Siemens Corporate Technology, Siemens AG, Munich, Germany and 6 DATSI/FI, Universidad
Politécnica de Madrid, E-18500 Madrid, Spain
Received on September 24, 2007; revised on May 14, 2008; accepted on May 23, 2008
Advance Access publication June 5, 2008
ABSTRACT
Motivation: Modern machine learning methods based on matrix
decomposition techniques, like independent component analysis
(ICA) or non-negative matrix factorization (NMF), provide new and
efficient analysis tools that are currently being explored for analyzing
gene expression profiles. These exploratory feature extraction
techniques yield expression modes (ICA) or metagenes (NMF), which
are considered indicative of underlying regulatory
processes. They can also be applied to the classification of gene
expression datasets, either by grouping samples into different categories
for diagnostic purposes or by grouping genes into functional categories
for further investigation of related metabolic pathways and regulatory
networks.
Results: In this study we focus on unsupervised matrix factorization
techniques and apply ICA and sparse NMF to microarray datasets.
The latter monitor the gene expression levels of human peripheral
blood cells during differentiation from monocytes to macrophages.
We show that these tools are able to identify relevant signatures in the
deduced component matrices and extract informative sets of marker
genes from these gene expression profiles. The methods rely on the
joint discriminative power of a set of marker genes rather than on
single marker genes. With these sets of marker genes, corroborated
by leave-one-out or random forest cross-validation, the datasets
could easily be classified into related diagnostic categories, which
correspond to either monocytes versus macrophages or healthy versus
Niemann-Pick C disease patients.
Supplementary information: Supplementary data are available at
Bioinformatics online.
Contact:
1 INTRODUCTION
Modern signal processing and machine learning techniques
provide appropriate tools to analyze high-throughput datasets like
microarrays. Although many problems remain to
be solved (Dougherty and Datta, 2005; Dougherty et al., 2005;
Quackenbush, 2001), a consensus is slowly being reached as to how
data should be analyzed properly (Allison et al., 2006).
∗ To whom correspondence should be addressed.
Raw gene expression level measurements need sophisticated
preprocessing (Wu and Irizarry, 2007) encompassing background
correction, summarization, normalization (Baldi and Hatfield, 2002;
Hochreiter et al., 2006) and missing value imputation (Troyanskaya
et al., 2001); these steps are often performed using software available
from the chip manufacturer (Affymetrix, 2002).
After preprocessing, normalized gene expression levels can be
analyzed using feature extraction (Guyon and Elisseeff, 2003) and
classification (Dudoit et al., 2002) methods. Any statistical analysis
of gene expression probe level data, however, has to face the ‘large
N, small M’ problem setting, where N denotes the number of genes
(= features, variables, parameters) and M denotes the number of
samples (= experiments, environments, tissues). Overfitting must also
be avoided in order to construct a classifier with good generalization
ability (Spang et al., 2002). A robust classifier needs a
sample-per-feature (SpF) ratio of roughly 5 to 10, whereas for typical
microarray probe level measurements the SpF is only about 1/50 to 1/200.
Hence a substantial reduction of the feature space dimensionality
via gene or feature selection is often the only way out of this SpF
dilemma.
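To make the SpF dilemma concrete, the following minimal Python sketch computes the ratio for a hypothetical experiment and the largest number of features compatible with a target SpF. The function names and the example sizes (20 samples, 2000 probe sets) are illustrative assumptions and are not taken from the dataset studied here.

```python
def samples_per_feature(n_samples, n_features):
    """Return the sample-per-feature (SpF) ratio M / N."""
    return n_samples / n_features


def max_features(n_samples, target_spf):
    """Largest number of features that still reaches the target SpF."""
    return int(n_samples // target_spf)


# Hypothetical experiment: M = 20 samples, N = 2000 probe sets.
M, N = 20, 2000
spf = samples_per_feature(M, N)      # 1/100, inside the 1/50 to 1/200 range
few = max_features(M, target_spf=5)  # features allowed for an SpF of 5
```

Under these assumed sizes only a handful of selected genes would satisfy an SpF of 5, which is why a drastic reduction of the feature space is needed before classification.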
Traditionally, two strategies exist to analyze such sets of gene
expression signatures: supervised approaches and unsupervised
approaches. Supervised approaches require prior knowledge such
as class labels, clinical outcomes, prior densities, etc. and a
truly representative set of training data. They are generally
used for classification of malignancies within a discriminant
analysis. Unsupervised approaches explore correlations in the
high-dimensional data space and find appropriate transformations to
identify relevant subspaces and group observations accordingly.
Such approaches often need additional constraints to yield
unique answers, but they allow for the detection of new, as yet unknown
classes (Saidi et al., 2004). For a detailed account of the relevant
literature see the extended ‘Introduction’ in the accompanying
Supplementary Material.
Associate Editor: Olga Troyanskaya
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/)
which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
There has been recent interest in applying exploratory matrix
factorization (MF) techniques, like principal component analysis
(PCA), independent component analysis (ICA) or non-negative
matrix factorization (NMF), to gene expression level measurements
with microarrays (Liebermeister, 2002). In this study we propose
to include diagnostic knowledge and explore the potential of
matrix decomposition techniques to identify and extract marker
genes from microarray datasets and classify these datasets
according to the diagnostic classes they represent. Note that the
feature extraction process via exploratory matrix decomposition
techniques is unsupervised, but the identification of the most relevant
features follows the supervision of diagnostic information available.
Preliminary work along these lines was recently presented at
a conference (Schachtner et al., 2007a). Corresponding supervised
feature extraction and classification techniques like support vector
machines (SVM) have been applied to the same dataset and are
also discussed briefly. For a more detailed discussion of these
supervised techniques, though applied to different datasets, see
Schachtner et al. (2007b).
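The two-stage idea described above, unsupervised factorization followed by label-guided selection of the relevant component, can be illustrated with a small Python sketch. It is only a toy under stated assumptions: it uses plain NMF with the standard multiplicative updates for the squared error, not necessarily the sparse NMF variant applied in this study; the data matrix is assumed to be arranged as genes × samples; and the toy data, the correlation-based selection rule and all function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Squared reconstruction error ||X - WH||^2."""
    wh = matmul(w, h)
    return sum((p - q) ** 2 for xr, wr in zip(x, wh) for p, q in zip(xr, wr))


def nmf(x, k, n_iter=200, seed=0, eps=1e-9):
    """Plain NMF, X ~ WH, with multiplicative updates (unsupervised step)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def pearson(u, v):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


# Toy data (hypothetical): 6 genes x 6 samples, two diagnostic classes.
x = [[5.0, 5.0, 5.0, 1.0, 1.0, 1.0] for _ in range(3)] + \
    [[1.0, 1.0, 1.0, 5.0, 5.0, 5.0] for _ in range(3)]
labels = [0, 0, 0, 1, 1, 1]

w, h = nmf(x, k=2)

# Knowledge-based step: keep the component whose sample weights follow the labels.
scores = [abs(pearson(row, labels)) for row in h]
best = max(range(len(h)), key=lambda i: scores[i])

# Candidate marker genes: the genes with the largest weight in that component.
ranked = sorted(range(len(x)), key=lambda g: w[g][best], reverse=True)
markers = ranked[:3]
```

The factorization never sees the labels; they enter only when a component is chosen, which mirrors the distinction between unsupervised feature extraction and supervised feature identification drawn in the text.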
THE MONOCYTE–MACROPHAGE DATASET
For our analysis we combined the gene-chip results from
three different experimental settings into the monocyte–macrophage
(MoMa) dataset (Lutter et al., 2008). In each experiment human
peripheral blood monocytes were isolated from healthy donors
(Experiment 1 and 2) and from do (...truncated)