Knowledge-based gene expression classification via matrix factorization

https://bioinformatics.oxfordjournals.org/content/24/15/1688.full.pdf

R. Schachtner [1], D. Lutter [0, 1], P. Knollmüller [1], A. M. Tomé [4], F. J. Theis [1], G. Schmitz [0], M. Stetter [3], P. Gómez Vilda [2], E. W. Lang [1]

Associate Editor: Olga Troyanskaya

[0] Clinical Chemistry, University Hospital Regensburg, D-93042 Regensburg, Germany
[1] CIML/Biophysics, University of Regensburg, D-93040 Regensburg
[2] DATSI/FI, Universidad Politécnica de Madrid, E-18500 Madrid, Spain
[3] Siemens Corporate Technology, Siemens AG, Munich, Germany
[4] IEETA/DETI, Universidade de Aveiro, 3810-193 Aveiro, Portugal

Motivation: Modern machine learning methods based on matrix decomposition techniques, like independent component analysis (ICA) or non-negative matrix factorization (NMF), provide new and efficient tools that are currently being explored for the analysis of gene expression profiles. These exploratory feature extraction techniques yield expression modes (ICA) or metagenes (NMF). The extracted features are considered indicative of underlying regulatory processes. They can also be applied to the classification of gene expression datasets, either by grouping samples into different categories for diagnostic purposes or by grouping genes into functional categories for further investigation of related metabolic pathways and regulatory networks.

Results: In this study we focus on unsupervised matrix factorization techniques and apply ICA and sparse NMF to microarray datasets. The latter monitor the gene expression levels of human peripheral blood cells during differentiation from monocytes to macrophages. We show that these tools are able to identify relevant signatures in the deduced component matrices and extract informative sets of marker genes from these gene expression profiles. The methods rely on the joint discriminative power of a set of marker genes rather than on single marker genes. With these sets of marker genes, corroborated by leave-one-out or random forest cross-validation, the datasets could easily be classified into related diagnostic categories.
The latter correspond to either monocytes versus macrophages or healthy versus Niemann Pick C disease patients.

Supplementary information: Supplementary data are available at Bioinformatics online.

Contact:

1 INTRODUCTION

Modern signal processing and machine learning techniques provide appropriate tools to analyze high-throughput datasets like microarrays. Although many problems remain to be solved (Dougherty and Datta, 2005; Dougherty et al., 2005; Quackenbush, 2001), some consensus is slowly being reached as to how data should be analyzed properly (Allison et al., 2006). Raw gene expression level measurements need sophisticated preprocessing (Wu and Irizarry, 2007) encompassing background correction, summarization, normalization (Baldi and Hatfield, 2002; Hochreiter et al., 2006) and missing value imputation (Troyanskaya et al., 2001), which is often done using software available from the chip producer (Affymetrix, 2002). After preprocessing, normalized gene expression levels can be analyzed using feature extraction (Guyon and Elisseeff, 2003) and classification (Dudoit et al., 2002) methods.

Any statistical analysis of gene expression probe level data, however, has to face the large N, small M problem setting, where N denotes the number of genes (= features, variables, parameters) and M denotes the number of samples (= experiments, environments, tissues). Overfitting must also be avoided in order to construct a classifier with good generalization ability (Spang et al., 2002). Any robust classifier needs a sample-per-feature (SpF) ratio of 5- to 10-fold, whereas with usual microarray probe level measurements the SpF amounts to roughly 1/50 to 1/200. Hence a substantial reduction of the feature space dimensionality via gene or feature selection is often the only way out of this SpF dilemma.

Traditionally, two strategies exist to analyze such sets of gene expression signatures: supervised approaches and unsupervised approaches.
Supervised approaches require prior knowledge such as class labels, clinical outcomes, prior densities, etc., and a truly representative set of training data. They are generally used for the classification of malignancies within a discriminant analysis. Unsupervised approaches explore correlations in the high-dimensional data space and find appropriate transformations to identify relevant subspaces and group observations accordingly. Such approaches often need additional constraints to yield unique answers, but they allow for the detection of new, as yet unknown classes (Saidi et al., 2004). For a detailed account of the relevant literature see the extended Introduction in the accompanying Supplementary Material.

There has been recent interest in applying exploratory matrix factorization (MF) techniques, like principal component analysis (PCA), independent component analysis (ICA) or non-negative matrix factorization (NMF), to gene expression level measurements with microarrays (Liebermeister, 2002). In this study we propose to include diagnostic knowledge and explore the potential of matrix decomposition techniques to identify and extract marker genes from microarray datasets and to classify these datasets according to the diagnostic classes they represent. Note that the feature extraction process via exploratory matrix decomposition techniques is unsupervised, whereas the identification of the most relevant features is supervised by the available diagnostic information. Preliminary work along these lines has been presented recently at a conference (Schachtner et al., 2007a).

Corresponding supervised feature extraction and classification techniques like support vector machines (SVM) have been applied to the same dataset and are briefly discussed as well. For a more detailed discussion of these supervised techniques, though applied to different datasets, see Schachtner et al. (2007b).
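The combination described above, unsupervised factorization followed by label-guided selection of the most relevant component, can be illustrated with a minimal sketch. It is not the sparse NMF algorithm used in this study: it assumes plain multiplicative-update NMF, a samples-by-genes orientation of the data matrix, and purely synthetic data. The function names nmf and best_component, the number of components and the planted marker genes are all hypothetical choices made for illustration.

```python
import numpy as np

def nmf(X, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate non-negative X (samples x genes) as W @ H.

    Plain multiplicative updates for the squared error; the sparseness
    constraint of sparse NMF is NOT included in this sketch.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    W = rng.random((n_samples, k)) + eps   # sample weights per component
    H = rng.random((k, n_genes)) + eps     # rows play the role of metagenes
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

def best_component(W, labels):
    """Return the component whose sample weights correlate best with the labels."""
    labels = np.asarray(labels, dtype=float)
    corr = np.array([np.corrcoef(W[:, j], labels)[0, 1]
                     for j in range(W.shape[1])])
    corr = np.nan_to_num(corr, nan=-1.0)   # a constant column carries no class signal
    return int(np.argmax(corr)), corr

# Synthetic stand-in for a small expression matrix (assumption, not real data):
# 14 samples, 200 genes, 20 planted marker genes raised in the second class.
rng = np.random.default_rng(1)
labels = np.array([0] * 7 + [1] * 7)
X = rng.random((14, 200))
X[labels == 1, :20] += 5.0

W, H = nmf(X, k=2)
best, corr = best_component(W, labels)       # supervised step: uses the labels
markers = np.argsort(H[best])[::-1][:20]     # genes with the largest loadings
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)

print("selected component:", best)
print("correlation with labels:", corr[best])
print("candidate marker genes:", np.sort(markers))
print("relative reconstruction error:", rel_err)
```

The factorization itself never sees the class labels. They enter only when choosing which component to inspect, mirroring the split between unsupervised feature extraction and knowledge-based feature selection described above.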
THE MONOCYTE-MACROPHAGE DATASET

For our analysis we combined the gene-chip results from three different experimental settings into the monocyte-macrophage (MoMa) dataset (Lutter et al., 2008). In each experiment, human peripheral blood monocytes were isolated, from healthy donors in Experiments 1 and 2 and from donors with Niemann Pick type C disease in Experiment 3. Monocytes were differentiated into macrophages for 4 days in the presence of M-CSF (50 ng/ml, R&D Systems). Differentiation was confirmed by phase contrast microscopy. Gene expression profiles were determined using Affymetrix HG-U133A (Experiments 1 and 2) and HG-U133plus2.0 (Experiment 3) Gene Chips covering 22 215 probe sets and about 18 400 transcripts (HG-U133A). Probe sets covered only by the HG-U133plus2.0 array were excluded from further analysis. In Experiment 1 pooled RNA was used for hybridization, while in Experiments 2 and 3 RNA from single donors was used. The final dataset consisted of seven monocyte and seven macrophage expression profiles and contained 22 215 probe sets. After filtering out probe sets which had at least one absent call, 5969 probe sets remaine (...truncated)