Knowledge-based gene expression classification via matrix factorization
R. Schachtner (1), D. Lutter (0, 1), P. Knollmüller (1), A. M. Tomé (4), F. J. Theis (1), G. Schmitz (0), M. Stetter (3), P. Gómez Vilda (2), E. W. Lang (1)
Associate Editor: Olga Troyanskaya
(0) Clinical Chemistry, University Hospital Regensburg, D-93042 Regensburg, Germany
(1) CIML/Biophysics, University of Regensburg, D-93040 Regensburg
(2) DATSI/FI, Universidad Politécnica de Madrid, E-18500 Madrid, Spain
(3) Siemens Corporate Technology, Siemens AG, Munich, Germany
(4) IEETA/DETI, Universidade de Aveiro, 3810-193 Aveiro, Portugal
Motivation: Modern machine learning methods based on matrix
decomposition techniques, like independent component analysis
(ICA) or non-negative matrix factorization (NMF), provide new and
efficient analysis tools that are currently being explored for analyzing
gene expression profiles. These exploratory feature extraction
techniques yield expression modes (ICA) or metagenes (NMF), which
are considered indicative of underlying regulatory
processes. They can also be applied to the classification of gene
expression datasets, either by grouping samples into different categories
for diagnostic purposes or by grouping genes into functional categories
for further investigation of related metabolic pathways and regulatory
networks.
Results: In this study we focus on unsupervised matrix factorization
techniques and apply ICA and sparse NMF to microarray datasets.
The latter monitor the gene expression levels of human peripheral
blood cells during differentiation from monocytes to macrophages.
We show that these tools are able to identify relevant signatures in the
deduced component matrices and extract informative sets of marker
genes from these gene expression profiles. The methods rely on the
joint discriminative power of a set of marker genes rather than on
single marker genes. With these sets of marker genes, corroborated
by leave-one-out or random forest cross-validation, the datasets
could easily be classified into related diagnostic categories. The latter
correspond to either monocytes versus macrophages or healthy versus
Niemann-Pick C disease patients.
Supplementary information: Supplementary data are available at
Bioinformatics online.
Contact:
1 INTRODUCTION
Modern signal processing and machine learning techniques
provide appropriate tools to analyze high-throughput datasets like
microarrays. Although many problems remain to
be solved (Dougherty and Datta, 2005; Dougherty et al., 2005;
Quackenbush, 2001), a consensus is slowly emerging as to how
data should be analyzed properly (Allison et al., 2006).
Raw gene expression level measurements need sophisticated
preprocessing (Wu and Irizarry, 2007) encompassing background
correction, summarization, normalization (Baldi and Hatfield, 2002;
Hochreiter et al., 2006) and missing value imputation (Troyanskaya
et al., 2001), which is often done using software available from the
chip producer (Affymetrix, 2002).
After preprocessing, normalized gene expression levels can be
analyzed using feature extraction (Guyon and Elisseeff, 2003) and
classification (Dudoit et al., 2002) methods. Any statistical analysis
of gene expression probe level data, however, has to face the large
N, small M problem setting, where N denotes the number of genes
(= features, variables, parameters) and M denotes the number of
samples (= experiments, environments, tissues). Overfitting must
also be avoided in order to construct a classifier with good generalization
ability (Spang et al., 2002). A robust classifier needs a
sample-per-feature (SpF) ratio of 5 to 10, whereas for typical microarray
probe level measurements the SpF amounts to roughly 1/50–1/200.
Hence a substantial reduction of the feature space dimensionality
via gene or feature selection is often the only way out of this SpF
dilemma.
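To make the SpF arithmetic concrete, the following minimal Python sketch computes the ratio for a given study size and the largest number of features compatible with a target SpF. The study size (M = 100 samples, N = 10 000 probe sets) and the function names are hypothetical and chosen for illustration only.

```python
import math


def samples_per_feature(n_samples: int, n_features: int) -> float:
    """SpF ratio: number of samples M divided by number of features N."""
    return n_samples / n_features


def max_features(n_samples: int, target_spf: float) -> int:
    """Largest feature count that still reaches the target SpF ratio."""
    return math.floor(n_samples / target_spf)


# Hypothetical study size (an assumption, not taken from the paper).
M, N = 100, 10_000

print(samples_per_feature(M, N))  # 0.01, i.e. an SpF of 1/100
print(max_features(M, 5))         # 20 features at most for SpF = 5
print(max_features(M, 10))        # 10 features at most for SpF = 10
```

For a study with only 14 samples, the same calculation leaves room for at most two features at an SpF of 5, which illustrates why drastic feature selection is needed.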
Traditionally two strategies exist to analyze such sets of gene
expression signatures: supervised approaches and unsupervised
approaches. Supervised approaches require prior knowledge such
as class labels, clinical outcomes, prior densities, etc. and a
truly representative set of training data. They are generally
used for classification of malignancies within a discriminant
analysis. Unsupervised approaches explore correlations in the
high-dimensional data space and find appropriate transformations to
identify relevant subspaces and group observations accordingly.
However, such approaches often need additional constraints to yield
unique answers but they allow for the detection of new, yet unknown
classes (Saidi et al., 2004). For a detailed account of the relevant
literature see the extended Introduction in the accompanying
Supplementary Material.
There has been recent interest in applying exploratory matrix
factorization (MF) techniques, like principal component analysis
(PCA), independent component analysis (ICA) or non-negative
matrix factorization (NMF), to gene expression level measurements
with microarrays (Liebermeister, 2002). In this study we propose
to include diagnostic knowledge and explore the potential of
matrix decomposition techniques to identify and extract marker
genes from microarray datasets and classify these datasets
according to the diagnostic classes they represent. Note that the
feature extraction process via exploratory matrix decomposition
techniques is unsupervised, but the identification of the most relevant
features is guided by the available diagnostic information.
Preliminary work along these lines has been presented recently at
a conference (Schachtner et al., 2007a). Corresponding supervised
feature extraction and classification techniques like support vector
machines (SVM) have been applied to the same dataset and are
discussed briefly as well. For a more detailed discussion of these
supervised techniques, though applied to different datasets, see
Schachtner et al. (2007b).
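The idea of combining an unsupervised factorization with a label-guided selection of features can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this study: it uses plain multiplicative-update NMF (Frobenius loss) rather than sparse NMF or ICA, it assumes a genes x samples data matrix X approximated by W H, and the selection rule (choose the metagene whose sample weights increase most in class 1, then rank genes by their weight in it) as well as all function names and the toy data are hypothetical.

```python
import numpy as np


def nmf(X, k, n_iter=500, seed=0, eps=1e-9):
    """Plain multiplicative-update NMF (Frobenius loss): X ~ W @ H."""
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    W = rng.random((n_genes, k)) + 0.1    # columns: metagenes
    H = rng.random((k, n_samples)) + 0.1  # rows: metagene weights per sample
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    # Fix the scaling ambiguity: let every metagene sum to 1.
    scale = W.sum(axis=0)
    return W / scale, H * scale[:, None]


def label_guided_markers(W, H, labels, n_top=5):
    """Select the metagene that rises most in class 1, then its top genes."""
    labels = np.asarray(labels)
    score = H[:, labels == 1].mean(axis=1) - H[:, labels == 0].mean(axis=1)
    best = int(np.argmax(score))
    top_genes = np.argsort(W[:, best])[::-1][:n_top]
    return best, top_genes


# Toy data (hypothetical): 50 genes x 8 samples, genes 0-4 are up in class 1.
rng = np.random.default_rng(1)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
baseline = rng.uniform(1.0, 2.0, size=50)
marker = np.zeros(50)
marker[:5] = 5.0
X = np.outer(baseline, np.ones(8)) + np.outer(marker, labels)
X += 0.05 * rng.random(X.shape)  # small non-negative noise

W, H = nmf(X, k=2)
best, markers = label_guided_markers(W, H, labels)
print("selected metagene:", best)
print("candidate marker genes:", sorted(markers.tolist()))
```

The factorization itself never sees the labels. Only the choice of the informative metagene uses them, which mirrors the distinction between unsupervised extraction and knowledge-based selection drawn above.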
THE MONOCYTE–MACROPHAGE DATASET
For our analysis we combined the gene-chip results from
three different experimental settings into the monocyte–macrophage
(MoMa) dataset (Lutter et al., 2008). In each experiment human
peripheral blood monocytes were isolated from healthy donors
(Experiments 1 and 2) or from donors with Niemann-Pick
type C disease (Experiment 3). Monocytes were differentiated to
macrophages for 4 days in the presence of M-CSF (50 ng/ml,
R&D Systems). Differentiation was confirmed by phase contrast
microscopy. Gene expression profiles were determined using
Affymetrix HG-U133A (Experiments 1 and 2) and HG-U133plus2.0
(Experiment 3) Gene Chips covering 22 215 probe sets and
about 18 400 transcripts (HG-U133A). Probe sets covered only
by the HG-U133plus2.0 array were excluded from further analysis.
In Experiment 1 pooled RNA was used for hybridization, while
in Experiments 2 and 3 RNA from single donors was used. The
final dataset consisted of seven monocyte and seven macrophage
expression profiles and contained 22 215 probe sets. After filtering
out probe sets which had at least one absent call, 5969 probe sets
remaine (...truncated)