Simplivariate Models: Ideas and First Examples
Citation: Hageman JA, Hendriks MMWB, Westerhuis JA, van der Werf MJ, Berger R, et al. (
Jos A. Hageman
Margriet M. W. B. Hendriks
Johan A. Westerhuis
Mariët J. van der Werf
Ruud Berger
Age K. Smilde
Editor: Mark Isalan, Center for Genomic Regulation, Spain
1 Biosystems Data Analysis, Universiteit van Amsterdam, Amsterdam, The Netherlands, 2 ABC Metabolomics Centre, Lab. Metabolic and Endocrine Diseases, Wilhelmina Children's Hospital, Utrecht, The Netherlands, 3 TNO Quality of Life, Zeist, The Netherlands
One of the new expanding areas in functional genomics is metabolomics: measuring the metabolome of an organism. Data generated in metabolomics studies are very diverse in nature, depending on the design underlying the experiment. Traditionally, variation in measurements is conceptually broken down into systematic variation and noise, where the latter contains, e.g., technical variation. There is increasing evidence that this distinction does not hold (or is too simple) for metabolomics data. A more useful distinction is between informative and non-informative variation, where informative relates to the problem being studied. Most common methods for analyzing metabolomics (or any other high-dimensional x-omics) data ignore this distinction, which severely hampers the results of the analysis. This leads to poorly interpretable models and may even obscure the relevant biological information. We developed a framework from first data analysis principles by explicitly formulating the problem of analyzing metabolomics data in terms of informative and non-informative parts. This framework allows for flexible interactions with the biologists involved in formulating prior knowledge of underlying structures. The basic idea is that the informative parts of the complex metabolomics data are approximated by simple components with a biological meaning, e.g. in terms of metabolic pathways or their regulation. Hence, we termed the framework 'simplivariate models', which constitutes a new way of looking at metabolomics data. The framework is given in its full generality and exemplified with two methods, IDR analysis and plaid modeling, that fit into the framework. Using this strategy of 'divide and conquer', we show that meaningful simplivariate models can be obtained using a real-life microbial metabolomics data set. For instance, one of the simple components contained all the measured intermediates of the Krebs cycle of E. coli. Moreover, these simplivariate models were able to uncover regulatory mechanisms present in the phenylalanine biosynthesis route of E. coli.
Funding: This work was part of the BioRange programme of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the
Netherlands Genomics Initiative (NGI). The manuscript was reviewed by all members of the SP2.2.3 project group of NBIC.
Competing Interests: The authors have declared that no competing interests exist.
Modern instrumental methods have brought about significant
advances in biological research. Especially in the field of
functional genomics, transcriptomics and proteomics
measurements have provided fundamental insight into many biological
processes. The missing link between these measurements and the
phenotype is called metabolomics [1]. This new field concerns the
measurement of small biomolecules in body fluids, cells, tissues,
etc. The type of data generated in metabolomics studies is
characterized by a very broad acquisition of semi-quantitative data
on a large number of metabolites [1–4]. This results in data sets of
a very complex nature. Not only are these data sets
high-dimensional, they also exhibit mixtures of types of variation
introduced by the specific experimental setup [5].
Traditionally, a set of measurements is analyzed by postulating a
model describing systematic variation and assuming the left-overs
(residuals) to be random. Due to the complexity of metabolomics
data, this concept breaks down. There are many sources of variation
in the data that are non-informative for the underlying biological
question. An example of this type of variation is the variation in
metabolites which are not under tight regulatory control and are thus
allowed to vary almost independently across the experiments [6]. Such
non-informative variation affects the data in a structured way and
infiltrates the systematic or modeled part of the data. This results in
poor interpretability and a failure to unearth subtle informative variation.
In this paper, we propose a new conceptual framework for analyzing
metabolomics data based on the idea of separating informative from
non-informative variation. The informative variation should
describe the systematic biological variation in relevant metabolites
induced by underlying biological phenomena. What we are
ultimately aiming for is to discover these biological phenomena.
Our assumption is that the studied biological phenomena are
not represented by all measured metabolites, but that simple
structures (subsets of related metabolites) in (parts of) the data exist,
each simple structure or component describing an underlying
biological phenomenon. In the development of our discovery tool
we are aiming for a method that fulfills the following requirements:
i) being able to identify simple structures, in which just a limited
number of metabolites are represented by the structure; ii)
representing each simple structure by a model, the type of model
depending on the data collected and driven by a priori biological
knowledge; iii) assuming that a (large) part of the data will most
probably not be informative. The last assumption is reasonable [...]
experiments) × J variables (e.g. metabolites)) in components
containing subsets of related variables (e.g. metabolites):

$$x_{ij} = \sum_{k} Q_{ijk} + e_{ij} \qquad (1)$$

in which every element $x_{ij}$ of matrix X can be written as a sum of
contributions from different components. These components $Q_{ijk}$
describe the informative parts of the data and can be very diverse in
nature. The variation of $x_{ij}$ that is not included in the factors
$Q_{ijk}$, the non-informative variation, is indicated by $e_{ij}$.
Although the symbol $e_{ij}$ is commonly used to indicate random
variation, it has a very different meaning here. The non-informative
part is certainly non-random in the strict sense of randomness. To
introduce the concept of simplicity, not all variables are included in
the factors $Q_{ijk}$:

$$x_{ij} = \sum_{k} c_{ik} d_{jk} Q_{ijk} + e_{ij} \qquad (2)$$

Here $d_{jk}$ indicates the presence of variable j in component k and
$c_{ik}$ indicates the presence of object i in component k ($d_{jk} = 1$
if variable j is present in group k, 0 otherwise, and $c_{ik} = 1$ if
object i is present in group k, 0 otherwise).
For simplicity we have used the same symbol $Q_{ijk}$ in equations (1)
and (2), but their difference is clear from those equations.
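
As a rough illustration of how the presence indicators restrict each component to a subset of objects and variables, the following minimal Python sketch computes x_ij = sum over k of c_ik * d_jk * Q_ijk + e_ij for a toy data set. The function name `simplivariate_reconstruct`, the list-of-lists layout, and the constant-valued components are assumptions made only for this example; the text leaves the form of the components open, and this is not the authors' fitting procedure.

```python
def simplivariate_reconstruct(C, D, Q, E=None):
    """Build X with X[i][j] = sum_k C[i][k] * D[j][k] * Q[k][i][j] + E[i][j].

    C[i][k] is 1 if object i belongs to component k, else 0.
    D[j][k] is 1 if variable j belongs to component k, else 0.
    Q[k] is an I x J matrix of values for component k (hypothetical layout).
    E is an optional I x J matrix for the non-informative part.
    """
    n_obj, n_var, n_comp = len(C), len(D), len(C[0])
    X = [[0.0] * n_var for _ in range(n_obj)]
    for i in range(n_obj):
        for j in range(n_var):
            informative = sum(
                C[i][k] * D[j][k] * Q[k][i][j] for k in range(n_comp)
            )
            rest = E[i][j] if E is not None else 0.0
            X[i][j] = informative + rest
    return X


# Toy example: 4 objects, 5 variables, 2 components.
C = [[1, 0], [1, 0], [0, 1], [0, 1]]            # objects 0-1 in k=0, 2-3 in k=1
D = [[1, 0], [1, 0], [0, 0], [0, 1], [0, 1]]    # variable 2 is in no component
Q = [
    [[2.0] * 5 for _ in range(4)],    # component 0: constant value (assumed)
    [[-1.0] * 5 for _ in range(4)],   # component 1: constant value (assumed)
]
X = simplivariate_reconstruct(C, D, Q)
for row in X:
    print(row)
```

In this toy case each component fills one rectangular block of X, and variable 2 is left entirely to the non-informative part, which is zero here because no E was supplied.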
When decomposing X into simple components, the idea is that
interpretation will be easier, since not all original variables are
included in those components. Only variables (...truncated)