Knowledge-based gene expression classification via matrix factorization

BIOINFORMATICS ORIGINAL PAPER
Vol. 24 no. 15 2008, pages 1688–1697
doi:10.1093/bioinformatics/btn245

Gene expression

R. Schachtner¹, D. Lutter¹,²,³, P. Knollmüller¹, A. M. Tomé⁴, F. J. Theis¹,², G. Schmitz³, M. Stetter⁵, P. Gómez Vilda⁶ and E. W. Lang¹,∗

¹ CIML/Biophysics, University of Regensburg, D-93040 Regensburg, ² CMB/IBI, GSF Munich, ³ Clinical Chemistry, University Hospital Regensburg, D-93042 Regensburg, Germany, ⁴ IEETA/DETI, Universidade de Aveiro, 3810-193 Aveiro, Portugal, ⁵ Siemens Corporate Technology, Siemens AG, Munich, Germany and ⁶ DATSI/FI, Universidad Politécnica de Madrid, E-18500 Madrid, Spain

Received on September 24, 2007; revised on May 14, 2008; accepted on May 23, 2008
Advance Access publication June 5, 2008

ABSTRACT

Motivation: Modern machine learning methods based on matrix decomposition techniques, like independent component analysis (ICA) or non-negative matrix factorization (NMF), provide new and efficient analysis tools that are currently being explored for the analysis of gene expression profiles. These exploratory feature extraction techniques yield expression modes (ICA) or metagenes (NMF). The extracted features are considered indicative of underlying regulatory processes. They can also be applied to the classification of gene expression datasets, either by grouping samples into different categories for diagnostic purposes or by grouping genes into functional categories for further investigation of related metabolic pathways and regulatory networks.

Results: In this study we focus on unsupervised matrix factorization techniques and apply ICA and sparse NMF to microarray datasets. These datasets monitor the gene expression levels of human peripheral blood cells during differentiation from monocytes to macrophages.
We show that these tools are able to identify relevant signatures in the deduced component matrices and extract informative sets of marker genes from these gene expression profiles. The methods rely on the joint discriminative power of a set of marker genes rather than on single marker genes. With these sets of marker genes, corroborated by leave-one-out or random forest cross-validation, the datasets could easily be classified into related diagnostic categories. These categories correspond to either monocytes versus macrophages or healthy subjects versus Niemann-Pick C disease patients.

Supplementary information: Supplementary data are available at Bioinformatics online.

Contact:

∗ To whom correspondence should be addressed.

1 INTRODUCTION

Modern signal processing and machine learning techniques provide appropriate tools to analyze high-throughput datasets like microarrays. Although many problems remain to be solved (Dougherty and Datta, 2005; Dougherty et al., 2005; Quackenbush, 2001), a consensus is slowly being reached as to how data should be analyzed properly (Allison et al., 2006). Raw gene expression level measurements need sophisticated preprocessing (Wu and Irizarry, 2007) encompassing background correction, summarization, normalization (Baldi and Hatfield, 2002; Hochreiter et al., 2006) and missing value imputation (Troyanskaya et al., 2001), which is often done using software available from the chip manufacturer (Affymetrix, 2002). After preprocessing, normalized gene expression levels can be analyzed using feature extraction (Guyon and Elisseeff, 2003) and classification (Dudoit et al., 2002) methods. Any statistical analysis of gene expression probe level data, however, has to face the 'large N, small M' problem setting, where N denotes the number of genes (= features, variables, parameters) and M denotes the number of samples (= experiments, environments, tissues).
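The practical consequence of the 'large N, small M' setting can be illustrated with a small numerical sketch. The example below is not taken from the paper: the sizes M = 20 and N = 2000, the pure-noise data and the plain least-squares classifier are all assumptions chosen only for illustration. It suggests that when genes far outnumber samples, a linear rule can reproduce arbitrary labels on the training samples even though the data contain no signal at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): M samples, N genes, with N >> M.
M, N = 20, 2000
X = rng.standard_normal((M, N))        # pure-noise "expression" matrix, samples x genes
y = rng.choice([-1.0, 1.0], size=M)    # random class labels, unrelated to X

# With N >> M the rows of X are almost surely linearly independent, so the
# minimum-norm least-squares solution reproduces the training labels.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_acc = np.mean(np.sign(X @ w) == y)

# Fresh noise samples with fresh random labels: there is nothing to generalize to.
X_new = rng.standard_normal((1000, N))
y_new = rng.choice([-1.0, 1.0], size=1000)
test_acc = np.mean(np.sign(X_new @ w) == y_new)

print(f"samples per feature: {M / N:.3f}")
print(f"rank of X: {np.linalg.matrix_rank(X)}")
print(f"training accuracy: {train_acc:.2f}")
print(f"accuracy on new noise: {test_acc:.2f}")
```

In this toy setting the fit on the training samples says nothing about performance on new samples, which is one way to see why the number of features usually has to be reduced before a classifier is trained.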
Overfitting must also be avoided in order to construct a classifier that generalizes well (Spang et al., 2002). A robust classifier needs a sample-per-feature (SpF) ratio of roughly 5 to 10, whereas for typical microarray probe level measurements the SpF amounts to roughly 1/50 to 1/200. Hence a substantial reduction of the feature space dimensionality via gene or feature selection is often the only way out of this SpF dilemma.

Traditionally, two strategies exist to analyze such sets of gene expression signatures: supervised approaches and unsupervised approaches. Supervised approaches require prior knowledge such as class labels, clinical outcomes, prior densities, etc. and a truly representative set of training data. They are generally used for classification of malignancies within a discriminant analysis. Unsupervised approaches explore correlations in the high-dimensional data space and find appropriate transformations to identify relevant subspaces and group observations accordingly. Such approaches often need additional constraints to yield unique answers, but they allow for the detection of new, as yet unknown classes (Saidi et al., 2004). For a detailed account of the relevant literature see the extended 'Introduction' in the accompanying Supplementary Material.

There has been recent interest in applying exploratory matrix factorization (MF) techniques, like principal component analysis (PCA), independent component analysis (ICA) or non-negative matrix factorization (NMF), to gene expression level measurements with microarrays (Liebermeister, 2002). In this study we propose to include diagnostic knowledge and explore the potential of matrix decomposition techniques to identify and extract marker genes from microarray datasets and classify these datasets according to the diagnostic classes they represent. Note that the feature extraction process via exploratory matrix decomposition techniques is unsupervised, but the identification of the most relevant features is supervised by the available diagnostic information. Preliminary work along these lines has been presented recently at a conference (Schachtner et al., 2007a). Corresponding supervised feature extraction and classification techniques like support vector machines (SVM) have been applied to the same dataset and are also discussed briefly. For a more detailed discussion of these supervised techniques, though applied to different datasets, see Schachtner et al. (2007b).

Associate Editor: Olga Troyanskaya

© 2008 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

THE MONOCYTE–MACROPHAGE DATASET

For our analysis we combined the gene-chip results from three different experimental settings into the monocyte–macrophage (MoMa) dataset (Lutter et al., 2008). In each experiment human peripheral blood monocytes were isolated from healthy donors (Experiment 1 and 2) and from do (...truncated)