Simplivariate Models: Ideas and First Examples

Article PDF:

http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0003259&type=printable

Citation: Hageman JA, Hendriks MMWB, Westerhuis JA, van der Werf MJ, Berger R, et al. Simplivariate Models: Ideas and First Examples.

Jos A. Hageman, Margriet M. W. B. Hendriks, Johan A. Westerhuis, Mariët J. van der Werf, Ruud Berger, Age K. Smilde

1 Biosystems Data Analysis, Universiteit van Amsterdam, Amsterdam, The Netherlands
2 ABC Metabolomics Centre, Lab. Metabolic and Endocrine Diseases, Wilhelmina Children's Hospital, Utrecht, The Netherlands
3 TNO Quality of Life, Zeist, The Netherlands

Editor: Mark Isalan, Center for Genomic Regulation, Spain

One of the new expanding areas in functional genomics is metabolomics: measuring the metabolome of an organism. The data generated in metabolomics studies are very diverse in nature, depending on the design underlying the experiment. Traditionally, variation in measurements is conceptually broken down into systematic variation and noise, where the latter contains, e.g., technical variation. There is increasing evidence that this distinction does not hold (or is too simple) for metabolomics data. A more useful distinction is in terms of informative and non-informative variation, where informative relates to the problem being studied. In most common methods for analyzing metabolomics (or any other high-dimensional x-omics) data this distinction is ignored, thereby severely hampering the results of the analysis. This leads to poorly interpretable models and may even obscure the relevant biological information. We developed a framework from first data analysis principles by explicitly formulating the problem of analyzing metabolomics data in terms of informative and non-informative parts. This framework allows for flexible interactions with the biologists involved in formulating prior knowledge of underlying structures. The basic idea is that the informative parts of the complex metabolomics data are approximated by simple components with a biological meaning, e.g.
in terms of metabolic pathways or their regulation. Hence, we termed the framework 'simplivariate models', which constitutes a new way of looking at metabolomics data. The framework is given in its full generality and exemplified with two methods, IDR analysis and plaid modeling, that fit into the framework. Using this strategy of 'divide and conquer', we show that meaningful simplivariate models can be obtained using a real-life microbial metabolomics data set. For instance, one of the simple components contained all the measured intermediates of the Krebs cycle of E. coli. Moreover, these simplivariate models were able to uncover regulatory mechanisms present in the phenylalanine biosynthesis route of E. coli.

Funding: This work was part of the BioRange programme of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI). The manuscript was reviewed by all members of the SP2.2.3 project group of NBIC.

Competing Interests: The authors have declared that no competing interests exist.

Modern instrumental methods have brought significant advances in biological research. Especially in the field of functional genomics, transcriptomics and proteomics measurements have provided fundamental insight into many biological processes. The missing link between these measurements and the phenotype is called metabolomics [1]. This new field concerns the measurement of small biomolecules in body fluids, cells, tissues, etc. The type of data generated in metabolomics studies is characterized by a very broad acquisition of semi-quantitative data on a large number of metabolites [1-4]. This results in data sets of a very complex nature. Not only are these data sets high-dimensional, they also exhibit mixtures of types of variation introduced by the specific experimental setup [5].
Traditionally, a set of measurements is analyzed by postulating a model describing systematic variation and treating the leftovers (residuals) as random. Due to the complexity of metabolomics data, this concept breaks down. Many sources of variation in the data are non-informative for the underlying biological question. An example of this type of variation is metabolites that are not under tight regulatory control and are thus allowed to vary almost independently across the experiments [6]. Such non-informative variation affects the data in a structured way and infiltrates the systematic or modeled part of the data. This results in poor interpretability and a failure to unearth subtle informative variation.

In this paper, we propose a new conceptual framework for analyzing metabolomics data based on the idea of separating informative from non-informative variation. The informative variation should describe the systematic biological variation in relevant metabolites induced by underlying biological phenomena. What we are ultimately aiming for is to discover these biological phenomena. Our assumption is that the studied biological phenomena are not represented by all measured metabolites, but that simple structures (subsets of related metabolites) exist in (parts of) the data, each simple structure or component describing an underlying biological phenomenon. In the development of our discovery tool we are aiming for a method that fulfills the following requirements: i) being able to identify simple structures, in which just a limited number of metabolites are represented by the structure; ii) representing each simple structure by a model, the type of model depending on the data collected and driven by a priori biological knowledge; iii) assuming that a (large) part of the data will most probably not be informative. The last assumption is reasonable [...]

[...] a data matrix X (I objects (e.g. experiments) × J variables (e.g. metabolites)) in components containing subsets of related variables (e.g. metabolites):

$$x_{ij} = \sum_{k} Q_{ijk} + e_{ij} \qquad (1)$$

in which every element $x_{ij}$ of matrix X can be written as a sum of contributions from different components. These components $Q_{ijk}$ describe the informative parts of the data and can be very diverse in nature. The variation of $x_{ij}$ that is not included in the factors $Q_{ijk}$ (the non-informative variation) is indicated by $e_{ij}$. Although the symbol $e_{ij}$ is commonly used to indicate random variation, it has a very different meaning here. The non-informative part is certainly non-random in the strict sense of randomness. To introduce the concept of simplicity, not all variables are included in the factors $Q_{ijk}$:

$$x_{ij} = \sum_{k} c_{ik} \, d_{jk} \, Q_{ijk} + e_{ij} \qquad (2)$$

Here $d_{jk}$ indicates the presence of variable j in component k and $c_{ik}$ indicates the presence of object i in component k ($d_{jk} = 1$ if variable j is present in group k, 0 otherwise, and $c_{ik} = 1$ if object i is present in group k, 0 otherwise). For simplicity we have used the same symbol $Q_{ijk}$ in equations (1) and (2), but their difference is clear from those equations. When decomposing X into simple components, the idea is that interpretation will be easier, since not all original variables are included in those components. Only variables (...truncated)
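
To make the role of the indicators concrete, the following is a minimal NumPy sketch of the decomposition described above, assuming equation (2) has the form x_ij = sum_k c_ik d_jk Q_ijk + e_ij. The function names, the toy dimensions, and the constant-valued components are illustrative assumptions, not part of the original method. The random term only stands in for the non-informative part, which the text stresses is not random in general.

```python
import numpy as np

def informative_part(C, D, Q):
    """Return sum_k c_ik * d_jk * Q_ijk as an (I, J) matrix.

    C: (I, K) 0/1 indicators, c_ik = 1 if object i is in component k.
    D: (J, K) 0/1 indicators, d_jk = 1 if variable j is in component k.
    Q: (I, J, K) contribution of each component to each element.
    """
    C = np.asarray(C, dtype=float)
    D = np.asarray(D, dtype=float)
    Q = np.asarray(Q, dtype=float)
    return np.einsum("ik,jk,ijk->ij", C, D, Q)

def noninformative_part(X, C, D, Q):
    """Return e_ij = x_ij minus the informative part."""
    return np.asarray(X, dtype=float) - informative_part(C, D, Q)

# Toy data (hypothetical): 4 objects, 5 variables, 2 simple components.
I, J, K = 4, 5, 2
C = np.zeros((I, K))
D = np.zeros((J, K))
C[[0, 1], 0] = 1      # component 0 covers objects 0 and 1
D[[0, 1, 2], 0] = 1   # ... and variables 0, 1 and 2
C[[2, 3], 1] = 1      # component 1 covers objects 2 and 3
D[[3, 4], 1] = 1      # ... and variables 3 and 4

# Constant-valued components, chosen only to keep the example readable.
Q = np.empty((I, J, K))
Q[:, :, 0] = 2.0
Q[:, :, 1] = -1.0

# Placeholder for the non-informative part (random here for convenience).
rng = np.random.default_rng(0)
X = informative_part(C, D, Q) + 0.1 * rng.standard_normal((I, J))
E = noninformative_part(X, C, D, Q)
```

Elements outside every component (for example object 0, variable 3) receive no informative contribution, so their whole value ends up in `E`. This is the sense in which each component stays simple: it describes only a subset of objects and variables.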