SRV: an open-source toolbox to accelerate the recovery of metabolic biomarkers and correlations from metabolic phenotyping datasets
Clément Pontoizeau
Benjamin J. Blaise
Associate Editor: Martin Bishop
Plateforme de Recherche de Rovaltain en Toxicologie Environnementale et Ecotoxicologie
, 1 avenue de la Gare, BP 15173, 26958 Valence Cedex 9,
Centre de RMN à Très Hauts Champs, Institut des sciences analytiques
, CNRS/ENS Lyon/UCB Lyon 1,
Université de Lyon
, 5 rue de la Doua, 69100 Villeurbanne,
Motivation: Supervised multivariate statistical analyses are often required to analyze the high-density spectral information in metabolic datasets acquired from complex mixtures in metabolic phenotyping studies. Here we present an implementation of the SRV (Statistical Recoupling of Variables) algorithm as an open-source MATLAB and GNU Octave toolbox. SRV identifies similarities between consecutive variables resulting from high-resolution bucketing. Similar variables are gathered to restore the spectral dependency within the datasets and identify metabolic NMR signals. The correlation and significance of these new NMR variables for a given effect under study can then be measured and represented on a loading plot to allow a visual and efficient identification of candidate biomarkers. Correlations between these candidate biomarkers can then be visualized on a two-dimensional pseudospectrum, representing a correlation map, which helps to understand the modifications of the underlying metabolic network.
Availability: The SRV toolbox is written in MATLAB R2008a (MathWorks, Natick, MA) and in GNU Octave. It is available free of charge, with a tutorial, at http://www.prabi.fr/redmine/projects/srv/repository.
© The Author 2013. Published by Oxford University Press. All rights reserved.
Given the complexity of metabolic samples used in metabonomics
to understand the metabolic response of an organism to
pathophysiological stimuli (Nicholson et al., 1999), the development of
efficient approaches to fully analyze datasets is a growing field
(Lavine and Workman, 2010). An NMR spectrum typically
contains a few hundred resonances over a 10-ppm-wide chemical
shift range. High-resolution bucketing, using 0.001-ppm-wide
buckets, samples the signal efficiently, matching the
high resolution of the NMR spectra, and results in a few thousand
variables to handle in further statistical analyses. Multivariate
statistical analyses, such as orthogonal partial least squares
(O-PLS) regressions (Trygg and Wold, 2002), are thus efficient
tools to explore these high-density information datasets and
combine collective modifications of metabolite concentrations to allow
a discrimination of the samples with respect to the effect under study.
However, the interpretation of the latent variables, which are
the results of multivariate statistical analyses, is far from trivial.
The most common approach is to combine the representation of
the latent variable with the Pearson correlation coefficients of
each variable with the information matrix, encoding the different
classes (Cloarec et al., 2005). The difficulty with this approach is
that multiple buckets represent a single NMR peak. The buckets of
one peak can thus show different levels of correlation, even though
they represent a single metabolic signal. A second difficulty is the
definition of a correlation threshold above which a signal can be
considered as valuable to designate a candidate biomarker.
Statistical Recoupling of Variables (SRV) is an algorithm
designed to overcome these two difficulties (Blaise et al., 2009).
The guiding idea is to restore the spectral dependency that
was lost by high-resolution bucketing. The statistical
relationships between consecutive variables allow aggregating them
into clusters following the highest direction of
covariance/correlation ratio, thus defining NMR peaks. Neighbouring clusters can
then be merged into superclusters to recover NMR multiplets.
These superclusters correspond to NMR variables of interest.
SRV thus acts as an automated variable-size bucketing
procedure coupled with an efficient noise-removing filter.
We then use a significance-testing filter using multiple
hypothesis testing corrections. The Benjamini-Yekutieli measurement
of the false discovery rate seems well suited to NMR-based
metabonomics (Benjamini and Yekutieli, 2001), but other less strict
corrections could be considered. A typical threshold of 0.05 can
then be defined, and a simple identification of statistically
significant signals allows the recovery of candidate biomarkers.
Based on the statistical total correlation spectroscopy (Cloarec
et al., 2005), it is possible to establish a two-dimensional (2D)
pseudospectrum defined as a correlation map between the
superclusters. These correlations can then be interpreted on the
global metabolic network to extract the perturbed metabolic
network associated with a major or minor perturbation (Blaise
et al., 2010, 2011). Here we present an open-source
implementation of the SRV algorithm as MATLAB/GNU Octave functions
leading to the visualization of the latent variables after O-PLS
analysis for the discrimination between two groups, with the
correlation and significance testing representation, and the 2D
pseudospectrum allowing the identification of coordinated
DESCRIPTION OF THE SRV ALGORITHM AND
The SRV algorithm is divided into five steps, as schematically
described in the following text (Blaise et al., 2009). The first three
steps correspond to the statistical mining of metabolite
biomarker signals (i.e. the SRV clusters) from a set of NMR spectra.
Step 1: Definition of a spectral dependency landscape (L) as
the covariance/correlation ratio between neighbouring variables
along the chemical shift axis:
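$$L_i = \frac{\mathrm{covariance}(i,\, i+1)}{\mathrm{correlation}(i,\, i+1)} = \sqrt{\mathrm{variance}(i)\,\mathrm{variance}(i+1)} = \sqrt{\left(\frac{1}{N}\sum_{n=1}^{N}\left(x_{n,i}-\bar{x}_{i}\right)^{2}\right)\left(\frac{1}{N}\sum_{n=1}^{N}\left(x_{n,i+1}-\bar{x}_{i+1}\right)^{2}\right)}$$
where $x_{n,i}$ is the intensity of variable $i$ in spectrum $n$ and $\bar{x}_{i}$ is its mean over the $N$ spectra.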
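Step 1 can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the toolbox's MATLAB/Octave code; the function name `spectral_dependency_landscape` and the row-per-spectrum list layout are assumptions. It relies on the identity that the covariance/correlation ratio of two variables equals the square root of the product of their variances.

```python
from math import sqrt
from statistics import fmean


def spectral_dependency_landscape(spectra):
    """Return L_i = sqrt(var_i * var_{i+1}) for each pair of consecutive buckets.

    `spectra` is a list of rows: one spectrum per row, one bucket per column.
    (Hypothetical helper, not part of the SRV toolbox.)
    """
    n_spectra = len(spectra)
    n_buckets = len(spectra[0])
    # Variance of each bucket across spectra, with 1/N normalisation.
    variances = []
    for i in range(n_buckets):
        column = [row[i] for row in spectra]
        mean = fmean(column)
        variances.append(sum((x - mean) ** 2 for x in column) / n_spectra)
    # One landscape value per pair of neighbouring buckets (i, i+1).
    return [sqrt(variances[i] * variances[i + 1]) for i in range(n_buckets - 1)]
```

For a dataset with B buckets the landscape has B - 1 values, one per pair of neighbouring buckets.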
Step 2: Identification of spectral SRV clusters.
(i) The first variable of the dataset starts the first cluster.
(ii) The spectral dependency landscape is scanned to identify
local minima of covariance/correlation ratio that represent
the borders between two clusters.
(iii) Clusters representing NMR signals are defined by a
minimum number of variables, which depends on the
resolution of the NMR spectra.
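The scan described in Step 2 can be sketched as follows, again in Python for illustration. The function name `find_clusters`, the use of strict local minima, and the convention that a minimum at pair (i, i+1) closes a cluster at bucket i are assumptions, not details taken from the toolbox.

```python
def find_clusters(landscape, min_size):
    """Split buckets into (begin, end) index pairs, both inclusive.

    landscape[i] describes the pair of buckets (i, i+1). A strict local
    minimum at position i is treated as a border: bucket i closes one
    cluster and bucket i+1 opens the next (assumed convention).
    """
    n_buckets = len(landscape) + 1
    borders = [
        i for i in range(1, len(landscape) - 1)
        if landscape[i] < landscape[i - 1] and landscape[i] < landscape[i + 1]
    ]
    clusters = []
    begin = 0  # the first variable of the dataset starts the first cluster
    for i in borders:
        clusters.append((begin, i))
        begin = i + 1
    clusters.append((begin, n_buckets - 1))
    # Keep only clusters wide enough to represent an NMR signal.
    return [(b, e) for b, e in clusters if e - b + 1 >= min_size]
```

The `min_size` argument plays the role of the minimum number of variables per cluster, which the text ties to the resolution of the spectra.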
Step 3: Identification of NMR variables.
(iv) Superclusters are based on the aggregation of clusters
depending on their correlation with their neighbouring clusters.
(v) The intensity of the supercluster is the mean of the
intensities of the NMR signal in the buckets assigned to the supercluster.
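A possible reading of Step 3 is sketched below in Python. The merge rule (adjacent clusters are merged when their mean intensities correlate above a threshold), the threshold semantics and all function names are assumptions made for illustration; the toolbox may aggregate clusters differently.

```python
from math import sqrt
from statistics import fmean


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / norm if norm else 0.0


def cluster_intensity(spectra, begin, end):
    """Mean intensity over buckets begin..end (inclusive), one value per spectrum."""
    return [fmean(row[begin:end + 1]) for row in spectra]


def merge_superclusters(spectra, clusters, threshold):
    """Merge a cluster into its left neighbour when the two are adjacent and
    their intensities correlate above `threshold` (assumed rule)."""
    merged = [clusters[0]]
    for begin, end in clusters[1:]:
        prev_begin, prev_end = merged[-1]
        adjacent = begin == prev_end + 1
        r = pearson(cluster_intensity(spectra, prev_begin, prev_end),
                    cluster_intensity(spectra, begin, end))
        if adjacent and r > threshold:
            merged[-1] = (prev_begin, end)
        else:
            merged.append((begin, end))
    return merged
```

Applying `cluster_intensity` to each merged pair gives one intensity column per supercluster, which is the kind of table the text calls Xclusterf.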
Step 4: Evaluation of P-values and multiple dependent tests
correction using the Benjamini-Yekutieli false discovery rate
(Benjamini and Yekutieli, 2001). An adjusted P-value threshold
is estimated by the identification of the highest rank verifying the
equation below, where N is the number of variables used in the
model. We can then reject all null hypotheses corresponding to
ranks 1 to k.
$$k = \max\left\{ i \in 1:N,\; p_i \le \frac{i}{N}\cdot\frac{0.05}{\sum_{j=1}^{N}\frac{1}{j}} \right\}$$
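The rank k of Step 4 can be computed with a short sketch, shown here in Python for illustration. The function name `benjamini_yekutieli_rank` is hypothetical; the P-values are sorted in increasing order before ranking, as the procedure requires, and 0 is returned when no hypothesis can be rejected.

```python
def benjamini_yekutieli_rank(p_values, alpha=0.05):
    """Return k, the highest rank i such that
    p_(i) <= (i / N) * alpha / sum_{j=1..N} 1/j,
    where p_(i) is the i-th smallest P-value. Returns 0 if no rank qualifies."""
    n = len(p_values)
    harmonic = sum(1.0 / j for j in range(1, n + 1))
    k = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank / n * alpha / harmonic:
            k = rank
    return k


def significant_mask(p_values, alpha=0.05):
    """Flag the variables whose P-value falls within ranks 1..k."""
    k = benjamini_yekutieli_rank(p_values, alpha)
    if k == 0:
        return [False] * len(p_values)
    cutoff = sorted(p_values)[k - 1]
    return [p <= cutoff for p in p_values]
```

With four P-values, the harmonic sum is about 2.083, so the threshold for rank 1 is roughly 0.006 instead of the uncorrected 0.05.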
The command line is executed as follows: [Data, Xclusterf,
Ibegin, Iend, number of clusters] = SRV(X matrix, Y matrix,
typical singlet peak base width, bucketing resolution, correlation
threshold, significance threshold, ppm, number of factors).
We commonly use the analysis of variance for the evaluation of
P-values and the Benjamini-Yekutieli correction for the
measurement of the false discovery rate in our NMR metabolic
phenotyping datasets. However, other evaluation procedures can be used
on SRV clusters based on the properties of the variables under
study. SRV clusters are available in the output Xclusterf of the
SRV function. Data is a four-row table containing the ppm line,
the loading value, the correlation value and the significance of
each initial NMR variable. Xclusterf is a matrix containing the
intensity of the signal in each cluster for the different spectra of the
dataset. Ibegin and Iend are tables containing the limits of each
cluster, and S is a table allowing the identification of the initial
NMR signal contained in the SRV clusters and the amount of
signal lost. X matrix is the dataset matrix with one spectrum per row.
Columns of zeros must represent the excluded residual water
signal area. SRV cannot handle multiple exclusion
areas (for instance, NMR buffer signals). In such cases, additional
exclusion areas should simply be removed from the dataset before
running SRV. One can then choose to reintroduce these
areas in the output matrix of SRV. For the corresponding
chemical shifts, O-PLS coefficients and correlation values should be set
to 0 and P-values to 1. Y matrix is the column vector encoding the
membership of each sample to the groups under study. ppm is a
row vector containing the ppm value of each bucket. Number of
factors is the total number of components for the O-PLS analysis,
including the orthogonal ones. The main output of the function is
the loading plot of the O-PLS analysis (Cloarec et al., 2005) with
a colour code of correlation. Variables that are not statistically
significant are coloured in grey.
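Reintroducing an additional exclusion area into the four-row output table can be sketched as follows, in Python for illustration. The function name `reinsert_excluded`, the nested-list layout (rows: ppm, loading, correlation, P-value) and the `insert_at` column index are assumptions; only the fill values (0 for O-PLS coefficients and correlations, 1 for P-values) come from the text.

```python
def reinsert_excluded(data, excluded_ppm, insert_at):
    """Return a copy of the four-row table `data` with the excluded buckets
    inserted at column `insert_at`.

    Rows of `data`: ppm, loading, correlation, P-value (assumed layout).
    Excluded buckets get loading 0, correlation 0 and P-value 1.
    """
    n = len(excluded_ppm)
    filler = [list(excluded_ppm), [0.0] * n, [0.0] * n, [1.0] * n]
    return [row[:insert_at] + extra + row[insert_at:]
            for row, extra in zip(data, filler)]
```

Because the P-values are set to 1, the reinserted buckets can never pass the significance filter and would appear in grey on the loading plot.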
Step 5: Computation and visualization of the SRV 2D
pseudospectrum. Finally, after a deflation of the data matrix by
an O-PLS analysis based on the class information matrix (Trygg
and Wold, 2002), we compute the autocorrelation matrix between
SRV clusters identified within the dataset (Cloarec et al., 2005).
[ORStocsy, Correlation table] = orstocsy(Xclusterf, Y matrix,
number of factors, correlation threshold, Ibegin, Iend, ppm).
ORStocsy is an N × N matrix containing the correlation
between SRV clusters (Blaise et al., 2010, 2011). Correlation table
is a seven-row table containing the identification number of the
correlated clusters, the ppm starting and ending values of the
correlated clusters and the level of correlation. The other
parameters are identical to those described previously.
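The correlation map itself can be illustrated with a minimal Python sketch. It covers only the thresholded cluster-to-cluster correlation matrix and deliberately omits the O-PLS deflation that precedes it in the toolbox; the function names and the choice to zero out entries below the threshold in absolute value are assumptions.

```python
from math import sqrt
from statistics import fmean


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / norm if norm else 0.0


def correlation_map(cluster_table, threshold):
    """Return the N x N correlation matrix between clusters.

    `cluster_table` has one row per spectrum and one column per cluster.
    Entries with |r| below `threshold` are set to 0 (assumed convention).
    No O-PLS deflation is applied in this sketch.
    """
    n_clusters = len(cluster_table[0])
    columns = [[row[j] for row in cluster_table] for j in range(n_clusters)]
    matrix = []
    for a in columns:
        line = []
        for b in columns:
            r = pearson(a, b)
            line.append(r if abs(r) >= threshold else 0.0)
        matrix.append(line)
    return matrix
```

Each nonzero off-diagonal entry corresponds to one pair of correlated clusters, that is, to one entry of the kind listed in the correlation table.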
Funding: The French government.
Conflict of Interest: none declared.