An integrated workflow for robust alignment and simplified quantitative analysis of NMR spectrometry data
Vu et al. BMC Bioinformatics 2011, 12:405
http://www.biomedcentral.com/1471-2105/12/405
METHODOLOGY ARTICLE
Open Access
Trung N Vu1,4*, Dirk Valkenborg2,5, Koen Smets1, Kim A Verwaest3, Roger Dommisse3, Filip Lemière3,
Alain Verschoren1, Bart Goethals1 and Kris Laukens1,4
Abstract
Background: Nuclear magnetic resonance spectroscopy (NMR) is a powerful technique to reveal and compare
quantitative metabolic profiles of biological tissues. However, chemical and physical sample variations make the
analysis of the data challenging, and typically require the application of a number of preprocessing steps prior to
data interpretation. For example, noise reduction, normalization, baseline correction, peak picking, spectrum
alignment and statistical analysis are indispensable components in any NMR analysis pipeline.
Results: We introduce a novel suite of informatics tools for the quantitative analysis of NMR metabolomic profile data.
The core of the processing cascade is a novel peak alignment algorithm, called hierarchical Cluster-based Peak
Alignment (CluPA). The algorithm aligns a target spectrum to the reference spectrum in a top-down fashion by
building a hierarchical cluster tree from peak lists of reference and target spectra and then dividing the spectra into
smaller segments based on the most distant clusters of the tree. To reduce the computational time to estimate the
spectral misalignment, the method makes use of Fast Fourier Transformation (FFT) cross-correlation. Since the method
returns a high-quality alignment, we can propose a simple methodology to study the variability of the NMR spectra. For
each aligned NMR data point the ratio of the between-group and within-group sum of squares (BW-ratio) is calculated
to quantify the difference in variability between and within predefined groups of NMR spectra. This differential analysis
is related to the calculation of the F-statistic or a one-way ANOVA, but without distributional assumptions. Statistical
inference based on the BW-ratio is achieved by bootstrapping the null distribution from the experimental data.
Conclusions: The workflow performance was evaluated using a previously published dataset. Correlation maps,
spectral and grey scale plots show clear improvements in comparison to other methods, and the down-to-earth
quantitative analysis works well for the CluPA-aligned spectra. The whole workflow is embedded into a modular
and statistically sound framework that is implemented as an R package called “speaq” (“spectrum alignment and
quantitation”), which is freely available from http://code.google.com/p/speaq/.
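The BW-ratio described in the Results can be illustrated with a minimal sketch for a single aligned data point. The function names below are hypothetical and are not taken from the speaq package. The null distribution is approximated here by shuffling the group labels, a permutation-style resampling chosen for the sketch; the exact bootstrap scheme used in the workflow may differ.

```python
import random


def bw_ratio(values, labels):
    """Between-group over within-group sum of squares for one data point.

    values: intensities of one aligned data point, one per spectrum.
    labels: group label of each spectrum, in the same order.
    """
    overall_mean = sum(values) / len(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    between = 0.0
    within = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)
        between += len(members) * (group_mean - overall_mean) ** 2
        within += sum((v - group_mean) ** 2 for v in members)
    # A zero within-group spread gives an unbounded ratio.
    return between / within if within > 0 else float("inf")


def null_bw_ratios(values, labels, n_resamples=1000, seed=0):
    """Approximate the null distribution by shuffling the group labels.

    Assumption of this sketch: label shuffling stands in for the
    bootstrap procedure mentioned in the text.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_resamples):
        rng.shuffle(shuffled)
        null.append(bw_ratio(values, shuffled))
    return null


def empirical_p_value(observed, null):
    """Fraction of null ratios at least as large as the observed one."""
    exceed = sum(1 for b in null if b >= observed)
    return (exceed + 1) / (len(null) + 1)


# Toy example: six spectra in two groups, one data point.
intensities = [1.0, 1.2, 0.9, 3.0, 3.1, 2.8]
groups = ["control"] * 3 + ["treated"] * 3
observed = bw_ratio(intensities, groups)
p_value = empirical_p_value(observed, null_bw_ratios(intensities, groups))
```

In a full analysis the same calculation would be repeated for every aligned data point, so that regions with a large BW-ratio can be flagged as differing between the groups.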
Background
Nuclear magnetic resonance spectroscopy (NMR) is a
powerful and widely applied analytical high-throughput
technique to reveal and compare the quantitative metabolic
profile of a given tissue in relation to various environmental
and clinical parameters. A typical NMR spectrum consists of an x-axis, which indicates the resonance frequencies of the observed molecule, and a y-axis, which
denotes the corresponding intensities, i.e., abundance. To
analyse experimental NMR datasets, multivariate methods
such as principal component analysis (PCA) or univariate
techniques like Student's t-test are commonly applied. However, chemical and physical sample variations due to,
among others, differences in pH, temperature, ion content
and the concentration of metabolites, make the analysis of
the data challenging. To address these challenges, several
preprocessing steps are commonly applied, including noise
reduction, normalization, baseline correction, peak picking
and spectrum alignment, prior to statistical analysis.
* Correspondence:
1 Department of Mathematics and Computer Science, University of Antwerp,
Antwerp, Belgium
Full list of author information is available at the end of the article
© 2011 Vu et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
A crucial and often underappreciated aspect of this process is
peak alignment, which aims to compensate for small variations in the position of corresponding peaks between spectra. A number of spectral alignment approaches have
previously been proposed. However, most of them come
with particular disadvantages. For example, some methods
use dynamic programming, like Correlation Optimized
Warping (COW) and Dynamic Time Warping (DTW)
[1,2]. Due to their computational complexity an alignment
task based on these techniques may take hours. Several
authors have worked towards speeding up this alignment process. The authors of [3] used a Fast Fourier Transformation
(FFT) cross-correlation engine to improve the alignment
speed (PAFFT). They also introduced an advanced extension, called recursive peak alignment by FFT (RAFFT),
which recursively divides the spectrum into meaningful
segments and aligns them until a certain degree of goodness is obtained. Some advanced peak picking approaches
are Recursive Segment-wise Peak Alignment (RSPA) [4]
and Generalized Fuzzy Hough Transform (GFHT) [5].
Other authors applied search algorithms to peak alignment, such as genetic algorithms in PAGA [6] and beam
searching in PABS [7]. Recently, the interval-correlation-shifting (Icoshift) algorithm was introduced [8], which aligns
spectra by maximizing the cross-correlation between user-defined intervals.
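To make the FFT cross-correlation idea concrete, the following minimal sketch estimates the shift between a reference and a target segment. It is not the PAFFT, RAFFT or Icoshift implementation. The function names, the zero-padding strategy and the `max_shift` parameter are assumptions made for illustration, and a small radix-2 FFT is written out only to keep the example self-contained.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def estimate_shift(reference, target, max_shift):
    """Lag s with |s| <= max_shift maximising sum_i reference[i] * target[i + s].

    A positive lag means the target features lie s points to the right
    of the corresponding reference features.
    """
    n = len(reference)
    if len(target) != n:
        raise ValueError("reference and target must have equal length")
    if not 0 <= max_shift < n:
        raise ValueError("max_shift must satisfy 0 <= max_shift < length")

    # Zero-pad to a power of two >= 2n so circular correlation equals
    # linear correlation for all lags of interest.
    size = 1
    while size < 2 * n:
        size *= 2
    ref = list(reference) + [0.0] * (size - n)
    tgt = list(target) + [0.0] * (size - n)

    # Cross-correlation theorem: corr = IFFT(conj(FFT(ref)) * FFT(tgt)).
    spectrum = [r.conjugate() * t for r, t in zip(fft(ref), fft(tgt))]
    corr = [v.real / size for v in fft(spectrum, inverse=True)]

    # Negative lags are stored at the end of the circular result.
    return max(range(-max_shift, max_shift + 1), key=lambda s: corr[s % size])


# Toy example: one triangular peak, moved three points to the right.
reference = [0.0] * 32
target = [0.0] * 32
reference[9:12] = [0.5, 1.0, 0.5]
target[12:15] = [0.5, 1.0, 0.5]
lag = estimate_shift(reference, target, max_shift=8)
```

Moving the target by the negative of the returned lag brings the segment back onto the reference. The FFT route costs O(n log n) per segment, which is the source of the speed advantage over dynamic programming mentioned above.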
Another approach that is commonly employed for the
peak alignment of mass spectral data is based on hierarchical clustering and could also be applied to NMR
spectral data [9-14]. Most of these methods apply hierarchical clustering to the entire collection of all peaks
from the individual spectra and “cut off” the resulting
dendrogram at a suitable height to produce a number of
clusters used for alignment. This approach works well on
NMR data that is already calibrated to some extent.
However, in some datasets, the peak positions of chemical resonances are significantly shifted between the samples. Such strong shifts can make it unclear which peaks in different spectra correspond to each other,
which may lead to the wrong clustering, i.e. alignment, of peaks. The effect of strongly shifted
spectra also challenges the methods based on spectral
binning, like COW and Icoshift, because peaks could
mistakenly be assigned to the wrong bins.
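The pooled-peak clustering approach, and its failure on strongly shifted spectra, can be sketched as follows. This is a naive illustration rather than any of the cited methods [9-14]: the complete-linkage criterion, the function names and the toy peak positions are all assumptions. Merging stops once the closest pair of clusters is further apart than `cut_height`, which corresponds to cutting the dendrogram at that height.

```python
def complete_linkage(cluster_a, cluster_b):
    """Largest pairwise distance between peak positions of two clusters."""
    return max(abs(a[1] - b[1]) for a in cluster_a for b in cluster_b)


def cluster_peaks(peaks, cut_height):
    """Agglomerative clustering of pooled peaks, cut at a given height.

    peaks: list of (spectrum_id, position) tuples pooled from all spectra.
    Returns a list of clusters, each a list of (spectrum_id, position).
    """
    clusters = [[p] for p in sorted(peaks, key=lambda p: p[1])]
    while len(clusters) > 1:
        # Find the closest pair of clusters (naive O(n^2) search per merge).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > cut_height:
            break  # equivalent to cutting the dendrogram here
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Well-calibrated spectra: corresponding peaks differ by a few hundredths.
calibrated = [
    ("A", 1.00), ("A", 2.50), ("A", 4.00),
    ("B", 1.02), ("B", 2.47), ("B", 4.03),
    ("C", 0.99), ("C", 2.52),
]
good = cluster_peaks(calibrated, cut_height=0.1)

# Hypothetical spectrum D: same peaks as A, but shifted by +1.5 units,
# so its first peak lands on the second peak of the other spectra.
shifted = calibrated + [("D", 2.50), ("D", 4.00)]
bad = cluster_peaks(shifted, cut_height=0.1)
```

With the calibrated spectra, each cluster collects one peak per spectrum. After adding spectrum D, its first peak joins the cluster around 2.5 even though it corresponds to the peaks near 1.0, which is the kind of misassignment described above.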
To address the problems with misaligned spectra, we
first focus on the development of a robust and highly
reliable alignment algorithm. The method is based on a
peak-picking approach for NMR spectra, called hierarchical Cluster-based Peak Alignment (CluPA). The alignment is embedded in a workflow (called speaq:
“spectrum alignment and quanti (...truncated)