SIMILE enables alignment of tandem mass spectra with statistical significance
ARTICLE
https://doi.org/10.1038/s41467-022-30118-9
OPEN
Daniel G. C. Treen 1, Mingxun Wang 2, Shipei Xing 3, Katherine B. Louie 1, Tao Huan 3, Pieter C. Dorrestein 2,
Trent R. Northen 1 & Benjamin P. Bowen 1 ✉
Interrelating small molecules according to their aligned fragmentation spectra is central to
tandem mass spectrometry-based untargeted metabolomics. Current alignment algorithms
do not provide statistical significance, and compounds that have multiple delocalized structural differences often fail to have their fragment ions aligned. Here we align
fragmentation spectra with both statistical significance and allowance for multiple chemical
differences using Significant Interrelation of MS/MS Ions via Laplacian Embedding (SIMILE).
SIMILE yields spectral-alignment-inferred structural connections in molecular networks that
are not found with cosine-based scoring algorithms. In addition, it is now possible to rank
spectral alignments based on p-values in the exploration of structural relationships between
compounds and enhance the chemical connectivity that can be obtained with molecular
networking.
1 Environmental Genomics and Systems Biology Division & The Joint Genome Institute Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley,
CA 94720, United States. 2 Collaborative Mass Spectrometry Innovation Center, Skaggs School of Pharmacy and Pharmaceutical Sciences, Departments of
Pharmacology and Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, United States. 3 Department of Chemistry, Faculty
of Science, University of British Columbia, Vancouver Campus, 2036 Main Mall, Vancouver V6T 1Z1 BC, Canada. ✉email:
NATURE COMMUNICATIONS | (2022)13:2510 | https://doi.org/10.1038/s41467-022-30118-9 | www.nature.com/naturecommunications
Tandem mass spectrometry is widely used in metabolomics
experiments to hypothesize chemical structures. This is
done by aligning fragment ions that share the same mass-to-charge ratio (m/z) and calculating the cosine similarity of their
intensities1. Such compound identification often requires determining if an experimental fragmentation spectrum matches an
authentic standard with the annotated data.
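The cosine scoring described above can be sketched as follows. This is a minimal illustration, not the implementation used by any particular tool: the function name `cosine_score`, the greedy one-to-one peak matching, the 0.01 m/z tolerance, and the example peak lists are all assumptions made for the example.

```python
import math

def cosine_score(spec_a, spec_b, tol=0.01):
    """Cosine similarity of two spectra given as lists of (mz, intensity).

    Peaks are paired greedily, largest intensity product first, when their
    m/z values agree within `tol` (a hypothetical tolerance).
    """
    # Collect every candidate pair of peaks that share the same m/z.
    candidates = []
    for i, (mz_a, int_a) in enumerate(spec_a):
        for j, (mz_b, int_b) in enumerate(spec_b):
            if abs(mz_a - mz_b) <= tol:
                candidates.append((int_a * int_b, i, j))

    # Use each peak at most once, preferring the strongest pairs.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    dot = 0.0
    for product, i, j in candidates:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        dot += product

    norm_a = math.sqrt(sum(inten ** 2 for _, inten in spec_a))
    norm_b = math.sqrt(sum(inten ** 2 for _, inten in spec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up peak lists: a reference spectrum and a query with one extra peak.
reference = [(100.0, 1.0), (150.0, 2.0)]
query = [(100.005, 1.0), (150.0, 2.0), (175.0, 0.5)]
score = cosine_score(reference, query)
```

A score near 1 indicates that the matched peaks carry almost all of the intensity in both spectra; the unmatched peak at m/z 175 lowers the score slightly.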
Recently, alignment approaches have been developed that aim
to yield scores that are a proxy for compound similarity rather
than identity. For instance, GNPS-based molecular networking
and NIST Hybrid Search both implement an alignment approach
that is sensitive to compounds that differ by a single/localized
structural difference(s)2–4. The general logic for these two
approaches is as follows: when a pair of related molecules are
fragmented, their fragmentation data are likely similar. Under the
assumption that the difference in masses stems from a single/
localized structural difference and does not alter the fragmentation process of the molecule, the structural difference can either
be attached to charged fragments or localized modifications that
are reflected in neutral mass additions in the fragment ions (e.g., a
lipid may have an additional mass of 14, 26, or 28 Da representing
CH2, CH=CH, or CH2–CH2 additions). The charged fragments
are directly observed as m/z’s in the fragmentation spectrum,
while the neutral fragments can be indirectly observed as neutral
losses by subtracting (or adding) the fragment m/z’s from their
precursor m/z. Therefore, when the assumptions hold as is often
the case, two fragments from different molecules can be aligned if
they share the same m/z or the same m/z difference with respect
to their precursor m/z. More recently, a concept of hypothetical
neutral loss was proposed to further align neutral losses from pairs
of fragment ions, showing significantly improved correlation
between spectral and structural similarities5. Alignment approaches on fragmentation data have also proven useful for mass
spectrometry-based proteomics by identifying pairs of peptides
that differ by multiple modifications6–8.
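The alignment rule in this paragraph (two fragments align if they share the same m/z or the same m/z difference with respect to their precursor) can be sketched as below. The function name `align_peaks`, the tolerance, and the nominal-mass example values are hypothetical; this illustrates the general logic and is not the GNPS or NIST Hybrid Search code.

```python
def align_peaks(mzs_a, precursor_a, mzs_b, precursor_b, tol=0.01):
    """List fragment pairs that share an m/z or a precursor neutral loss.

    Two fragments have the same neutral loss when
    precursor_a - mz_a == precursor_b - mz_b, which is equivalent to
    mz_a - mz_b == precursor_a - precursor_b.
    """
    shift = precursor_a - precursor_b
    pairs = []
    for i, mz_a in enumerate(mzs_a):
        for j, mz_b in enumerate(mzs_b):
            delta = mz_a - mz_b
            if abs(delta) <= tol:
                pairs.append((i, j, "same m/z"))
            elif abs(delta - shift) <= tol:
                pairs.append((i, j, "same neutral loss"))
    return pairs

# Made-up example: molecule B carries one extra CH2 (nominal mass 14 Da)
# on the part of the structure retained in the second fragment.
mzs_a, precursor_a = [100.0, 200.0], 300.0
mzs_b, precursor_b = [100.0, 214.0], 314.0
pairs = align_peaks(mzs_a, precursor_a, mzs_b, precursor_b)
# pairs == [(0, 0, "same m/z"), (1, 1, "same neutral loss")]
```

Because the rule relies on a single precursor mass shift, it covers a single localized modification. Fragments affected by several independent modifications would match neither condition.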
Machine-learning approaches such as SIRIUS, CANOPUS,
MS2LDA, and Spec2Vec also incorporate precursor ion neutral
losses as a feature in their implementations9–12. Recent implementations combining machine learning with in silico structural
database searching yield high-confidence identifications for exploring biochemistry outside of known chemical
databases13. Other tools estimate false-discovery rates for tandem mass spectra
database searches to separate correct from incorrect hits
(analogous to decoy database searching in proteomics). While
there are methods for estimating statistical significance for
compound identification, to our knowledge, no method for calculating the significance of fragmentation-spectra alignments
from a pair of spectra has been described14,15.
Protein-sequence alignment algorithms like Needleman–Wunsch,
Smith–Waterman, and BLAST yield alignments with statistical significance that are robust to multiple substitutions, insertions, and
deletions16,17. These methods are fundamentally different from
fragmentation-spectra-based cosine similarity in that they rely on
substitution matrices describing the log odds of amino acids sharing
common ancestry relative to random chance such as the PAM and
BLOSUM matrices18,19. These approaches have not been widely
applied to fragmentation data for two reasons: first, unlike protein-substitution matrices that are generally of size 20 by 20 (amino
acids), a global substitution matrix for fragment ions would be
infinite due to the infinite number of possible m/z values; and second, m/z values are only partially tied to chemical structure due to
the one-to-many correspondence between m/z values and chemical
structures. However, if restricted to a single pair of fragmentation
spectra, a spectral graph-theoretic framework parameterized by their
all-by-all m/z difference counts can generate finite, context-sensitive,
and mathematically consistent fragment ion similarity matrices
based on average commute times20.
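To make the idea of a commute-time-based similarity concrete, the sketch below builds a small weighted graph from all-by-all m/z differences and computes average commute times from the pseudoinverse of its graph Laplacian. The edge weighting (how many peak pairs share a given m/z difference), the tolerance, and the name `commute_time_matrix` are assumptions made for illustration. The text above does not specify these details, so this should not be read as the SIMILE implementation.

```python
import numpy as np

def commute_time_matrix(mzs, tol=0.01):
    """Average commute times on a graph whose nodes are fragment m/z values.

    Assumed weighting: the edge between peaks i and j is weighted by the
    number of peak pairs (including i, j itself) whose m/z difference
    matches |mz_i - mz_j| within `tol`.
    """
    mzs = np.asarray(mzs, dtype=float)
    n = len(mzs)
    diffs = np.abs(mzs[:, None] - mzs[None, :])

    # Count how often each pairwise difference recurs among all pairs.
    upper = np.triu_indices(n, k=1)
    pair_diffs = diffs[upper]
    counts = (np.abs(pair_diffs[:, None] - pair_diffs[None, :]) <= tol).sum(axis=1)

    # Symmetric weight matrix and graph Laplacian L = D - W.
    weights = np.zeros((n, n))
    weights[upper] = counts
    weights = weights + weights.T
    degree = weights.sum(axis=1)
    laplacian = np.diag(degree) - weights

    # Commute time: C_ij = vol(G) * (L+_ii + L+_jj - 2 * L+_ij).
    l_pinv = np.linalg.pinv(laplacian)
    diag = np.diag(l_pinv)
    return degree.sum() * (diag[:, None] + diag[None, :] - 2.0 * l_pinv)

# Made-up peaks: the 14 Da difference occurs twice (100-114 and 114-128).
commute = commute_time_matrix([100.0, 114.0, 128.0, 200.0])
```

In this toy example the peaks at m/z 100 and 114 are joined by a recurring difference, so their commute time is shorter than that between m/z 100 and 200. A shorter commute time can then be read as higher similarity between two fragment ions.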
Here, we introduce Significant Interrelation of MS/MS Ions via
Laplacian Embedding (SIMILE), an approach that leverages
methods used for protein-sequence alignment to enable robust
pairwise alignment of fragmentation spectra with p-value estimation (Fig. 1). Rather than requiring identical m/z values or
precursor ion neutral losses for alignment of fragmentation
spectra, SIMILE uses all m/z differences among a pair of fragmentation spectra to generate a pair-specific fragment ion similarity matrix. This matrix is then used as the input to a dynamic
programming alignment algorithm for alignment and scoring.
The significance of an alignment is calculated via a Monte Carlo
permutation test with alignment score as the test statistic under
the null hypothesis that m/z values are exchangea (...truncated)