MSClust: a tool for unsupervised mass spectra extraction of chromatography-mass spectrometry ion-wise aligned data
Y. M. Tikunov, S. Laptenok, R. D. Hall, A. Bovy, R. C. H. de Vos

Y. M. Tikunov: Plant Breeding, Wageningen University, 6708 PB Wageningen, The Netherlands
Y. M. Tikunov (corresponding author), R. D. Hall, A. Bovy, R. C. H. de Vos: Plant Research International, 6700 AA Wageningen, The Netherlands
Y. M. Tikunov, R. D. Hall, A. Bovy, R. C. H. de Vos: Centre for BioSystems Genomics, 6700 AB Wageningen, The Netherlands
R. D. Hall, R. C. H. de Vos: Netherlands Metabolomics Centre, Einsteinweg 55, 2333 CC Leiden, The Netherlands
S. Laptenok: Laboratory of Biophysics, Wageningen University, Dreijenlaan 3, 6703 HA Wageningen, The Netherlands
Mass peak alignment (ion-wise alignment) has recently become a popular method for unsupervised data analysis in untargeted metabolic profiling. Here we present MSClust, a software tool for the analysis of GC-MS and LC-MS datasets derived from untargeted profiling. MSClust performs data reduction using unsupervised clustering and extraction of putative metabolite mass spectra from ion-wise chromatographic alignment data. The algorithm is based on the subtractive fuzzy clustering method, which allows unsupervised determination of the number of metabolites in a data set and can deal with uncertain memberships of mass peaks in overlapping mass spectra. This approach is based purely on the actual information present in the data and does not require any prior metabolite knowledge. MSClust can be applied to both GC-MS and LC-MS alignment data sets.
Availability and implementation: MSClust is freely available for non-commercial users at http://www.metalign.nl.
1 Introduction
In both GC-MS- and LC-MS-based metabolomics
platforms, untargeted data analysis using unbiased mass peak
acquisition followed by chromatographic alignment of the peaks,
i.e. ion-wise alignment, has become a popular approach for
comparative metabolomics. Software tools that can
implement this approach, such as MetAlign (Bamba and
Fukusaki 2006; Boccard et al. 2010; De Vos et al. 2007;
Ducruix et al. 2008; Keurentjes et al. 2006; Lommen 2009;
Lommen et al. 2007; Mal et al. 2009; Peters et al. 2009;
Rijk et al. 2009; Tikunov et al. 2005; Tikunov et al. 2010;
Vorst et al. 2005), MZMine (Katajamaa et al. 2006), or
XCMS (Kind et al. 2007; Nordstrom et al. 2006; Smith
et al. 2006; Wikoff et al. 2007), are nowadays widely used
in metabolomics studies. They are used for primary
processing of raw GC-MS or LC-MS chromatograms (Fig. 1)
and enable a comprehensive comparative analysis
of complex metabolic mixtures by aligning the quantitative
values of individual mass peaks across the samples analyzed.
Resulting data matrices can be directly subjected to
comparative analysis using various statistical tools. However,
this approach has a few drawbacks. Firstly, the resulting
mass peak alignment matrices are often extremely large
with a disproportionate variable-to-sample ratio, as the
number of variables (i.e. detected mass peaks) may reach
tens of thousands. Up to 90% of the variables may be
redundant, since each metabolite is represented by a
number of different mass peaks, including molecular
fragments, adducts and isotopes
thereof. Moreover, this redundancy may vary between
profiling platforms and metabolites, depending upon their
concentration, ionization efficiency and specific chemical
nature. This leads to an unequal representation of
metabolites in the dataset and complicates subsequent
multivariate or statistical analyses. Secondly, a direct interpretation
of the experimental results is hardly possible, because
the structural information of a metabolite, such as a mass
spectrum in the case of GC-MS or in-source fragments in
the case of LC-MS, is not provided directly as a result of the
alignment.

Fig. 1 A general workflow of comparative metabolomics data
analysis based on the mass peak alignment approach. MSClust
receives a mass peak alignment data matrix of size M × S, where M is
the number of mass peaks (often tens of thousands) aligned across the
S samples profiled. As a result it produces a reduced data matrix of
size C × S, where C is the number of putative compounds (normally a
few hundred), each represented by a single mass peak aligned across the
same S samples. In addition, it extracts a mass spectrum for each
of the C compounds, which in the case of GC-MS data is compatible with the
NIST MSSearch compound identification software.
Previously, we reported a mass signal correlation
analysis approach that can reduce the metabolite signal
redundancy in untargeted ion-wise aligned GC-MS
datasets and extract mass spectra of individual metabolites
without using mass spectral libraries or other structural
sources (Tikunov et al. 2005). Here we present a
computational implementation of this approach, MSClust. In an
untargeted metabolomics data analysis workflow it can be
placed between the mass peak alignment step and
metabolite identification followed by data interpretation.
MSClust clusters the aligned mass peaks into reconstructed
metabolites, thereby (i) reducing the signal redundancy per
metabolite to a single representative variable, and (ii)
reconstructing the original mass spectra, thus providing
structural information on the metabolites. MSClust
can be applied to both GC-MS- and
LC-MS-derived datasets, and to both nominal mass and accurate
mass data. The MSClust tool aligns with the Metabolomics
Standards Initiative for data processing.
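
The idea of collapsing redundant mass peaks into one representative variable per putative metabolite can be illustrated with a minimal sketch. It is not the MSClust algorithm: it replaces the clustering step with a simple greedy grouping, and it assumes that peaks belonging to one metabolite elute within a small retention time window and correlate strongly across samples. The function names, the dictionary layout of a peak, the thresholds `rt_window` and `min_corr`, and the toy data are all hypothetical and chosen for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def group_peaks(peaks, rt_window=0.05, min_corr=0.9):
    """Greedy grouping (illustrative stand-in, NOT subtractive fuzzy clustering).

    The most intense unassigned peak seeds a group; every unassigned peak that
    elutes within rt_window of the seed and correlates with it across samples
    at min_corr or higher joins that group.
    """
    order = sorted(range(len(peaks)), key=lambda i: -sum(peaks[i]["intensities"]))
    assigned, groups = set(), []
    for seed in order:
        if seed in assigned:
            continue
        members = [seed]
        assigned.add(seed)
        for j in order:
            if j in assigned:
                continue
            close = abs(peaks[j]["rt"] - peaks[seed]["rt"]) <= rt_window
            similar = pearson(peaks[j]["intensities"],
                              peaks[seed]["intensities"]) >= min_corr
            if close and similar:
                members.append(j)
                assigned.add(j)
        groups.append(members)
    return groups


def reduce_peaks(peaks, groups):
    """Return a C x S matrix (seed peak of each group) and one spectrum per group."""
    reduced = [list(peaks[g[0]]["intensities"]) for g in groups]
    spectra = [
        sorted((peaks[i]["mz"],
                sum(peaks[i]["intensities"]) / len(peaks[i]["intensities"]))
               for i in g)
        for g in groups
    ]
    return reduced, spectra


# Hypothetical toy alignment table: M = 6 mass peaks across S = 4 samples.
peaks = [
    {"mz": 73,  "rt": 10.00, "intensities": [100, 200, 300, 400]},
    {"mz": 147, "rt": 10.01, "intensities": [50, 100, 150, 200]},
    {"mz": 217, "rt": 10.02, "intensities": [10, 21, 29, 40]},
    {"mz": 91,  "rt": 10.01, "intensities": [800, 600, 400, 200]},  # co-eluting compound
    {"mz": 65,  "rt": 10.02, "intensities": [40, 30, 20, 10]},
    {"mz": 204, "rt": 12.50, "intensities": [90, 180, 270, 360]},   # elutes much later
]

groups = group_peaks(peaks)
reduced, spectra = reduce_peaks(peaks, groups)
print(len(peaks), "mass peaks reduced to", len(reduced), "putative compounds")
```

In this toy table two compounds co-elute around retention time 10.0 but show opposite abundance patterns across the samples, so the correlation criterion separates them, while the retention time criterion keeps the late-eluting peak apart even though its pattern resembles that of the first compound. A hard assignment of this kind cannot express uncertain membership of a peak shared by overlapping spectra.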
2 Method and implementation
The MSClust algorithm aims to remove metabolite signal
redundancy in aligned mass peak tables and to retrieve mass
spectral information of metabolites using mass peak
clustering. Many clustering methods, e.g. k-means or c-means
clustering and self-organizing maps, require prior knowledge
of the number of clusters in the data. Therefore, these
methods cannot be used for clustering chromatography-mass
spectrometry data, as the number of metabolites is
unknown and may vary from tens to hundreds from
experiment to experiment. The subtractive fuzzy clustering method (Chiu
1994) implemented in the MSClust algorithm allows
unsupervised determination of the number of clusters and
simultaneous clustering of mass peaks in the mass peak alignment
data. The MSClust algorithm clusters the
ion fragments in the dataset that originate from a single
metabolite, based on two properties: (i) similarity of
chromatography, i.e. the retention time span covered by the
chromatographic peak of a metabolite, and (ii) quantitative
similarity of ion-fragment patterns across the samples
analyzed. The algorithm performs the following tasks:
The number of mass peak clusters (putative
metabolites) present in an ion-wise alignment data matrix
and the cluster centers (centrotype mass peaks) are
determined in an unsupervised manner using the
potential density (PD) method (Chiu 1994) (Fig. 2A,
B) (for a detailed explanation of the algorithm see the User
Manual, Supplemental Data).
All mass peaks ar (...truncated)