SIMILE enables alignment of tandem mass spectra with statistical significance

Nature Communications, Jun 2022

Interrelating small molecules according to their aligned fragmentation spectra is central to tandem mass spectrometry-based untargeted metabolomics. Current alignment algorithms do not provide statistical significance, and compounds that have multiple delocalized structural differences often fail to have their fragment ions aligned. Here we align fragmentation spectra with both statistical significance and allowance for multiple chemical differences using Significant Interrelation of MS/MS Ions via Laplacian Embedding (SIMILE). SIMILE yields structural connections, inferred from spectral alignment, in molecular networks that are not found with cosine-based scoring algorithms. In addition, it is now possible to rank spectral alignments by p-value when exploring structural relationships between compounds, which enhances the chemical connectivity that can be obtained with molecular networking.

ARTICLE | OPEN | https://doi.org/10.1038/s41467-022-30118-9

Daniel G. C. Treen1, Mingxun Wang2, Shipei Xing3, Katherine B. Louie1, Tao Huan3, Pieter C. Dorrestein2, Trent R. Northen1 & Benjamin P. Bowen1 ✉

1 Environmental Genomics and Systems Biology Division & The Joint Genome Institute, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, United States.
2 Collaborative Mass Spectrometry Innovation Center, Skaggs School of Pharmacy and Pharmaceutical Sciences, Departments of Pharmacology and Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, United States.
3 Department of Chemistry, Faculty of Science, University of British Columbia, Vancouver Campus, 2036 Main Mall, Vancouver V6T 1Z1 BC, Canada.
NATURE COMMUNICATIONS | (2022)13:2510 | https://doi.org/10.1038/s41467-022-30118-9 | www.nature.com/naturecommunications

Tandem mass spectrometry is widely used in metabolomics experiments to hypothesize chemical structures. This is done by aligning fragment ions that share the same mass-to-charge ratio (m/z) and calculating the cosine similarity of their intensities1. Such compound identification often requires determining whether an experimental fragmentation spectrum matches that of an annotated authentic standard. Recently, alignment approaches have been developed that aim to yield scores that are a proxy for compound similarity rather than identity. For instance, GNPS-based molecular networking and NIST Hybrid Search both implement an alignment approach that is sensitive to compounds that differ by a single/localized structural difference(s)2–4. The general logic for these two approaches is as follows: when a pair of related molecules are fragmented, their fragmentation data are likely similar. Under the assumption that the difference in masses stems from a single/localized structural difference and does not alter the fragmentation process of the molecule, the structural difference is either attached to charged fragments or is a localized modification reflected in neutral mass additions in the fragment ions (e.g., a lipid may have an additional mass of 14, 26, or 28 Da representing CH2, CH=CH, or CH2–CH2 additions). The charged fragments are directly observed as m/z values in the fragmentation spectrum, while the neutral fragments can be indirectly observed as neutral losses by subtracting (or adding) the fragment m/z values from their precursor m/z. Therefore, when the assumptions hold, as is often the case, two fragments from different molecules can be aligned if they share the same m/z or the same m/z difference with respect to their precursor m/z.
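The matching rule described above (same m/z, or same m/z difference with respect to the precursor) can be illustrated with a minimal sketch. This is not the GNPS or NIST implementation: the function name `shifted_cosine`, the greedy one-to-one peak assignment, the use of raw intensities, the 0.01 tolerance, and the toy spectra are all assumptions made for illustration.

```python
import math

def shifted_cosine(spec_a, prec_a, spec_b, prec_b, tol=0.01, allow_shift=True):
    """Cosine-style score between two spectra given as lists of (mz, intensity).

    A pair of peaks may match if their m/z values agree within tol, or
    (when allow_shift is True) if their neutral losses agree, i.e. if
    mz_a - mz_b equals prec_a - prec_b within tol.
    """
    shift = prec_a - prec_b
    candidates = []
    for i, (mz_a, int_a) in enumerate(spec_a):
        for j, (mz_b, int_b) in enumerate(spec_b):
            delta = mz_a - mz_b
            same_mz = abs(delta) <= tol
            same_loss = allow_shift and abs(delta - shift) <= tol
            if same_mz or same_loss:
                candidates.append((int_a * int_b, i, j))

    # Greedy assignment (an assumption): take the largest intensity
    # products first and use each peak at most once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    total = 0.0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += product

    norm_a = math.sqrt(sum(inten ** 2 for _, inten in spec_a))
    norm_b = math.sqrt(sum(inten ** 2 for _, inten in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return total / (norm_a * norm_b)

# Toy data: molecule B carries a +14 modification on its two larger fragments.
spec_a = [(100.0, 1.0), (150.0, 2.0), (200.0, 3.0)]
spec_b = [(100.0, 1.0), (164.0, 2.0), (214.0, 3.0)]
with_shift = shifted_cosine(spec_a, 250.0, spec_b, 264.0)
without_shift = shifted_cosine(spec_a, 250.0, spec_b, 264.0, allow_shift=False)
print(with_shift, without_shift)
```

In this toy case only the 100.0 peak matches by m/z alone, so the score without the precursor shift is low, whereas allowing the shared neutral loss aligns all three peaks. A second, independent modification elsewhere in the molecule would break the single-shift assumption, which is the limitation the text goes on to address.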
More recently, the concept of hypothetical neutral loss was proposed to further align neutral losses from pairs of fragment ions, showing significantly improved correlation between spectral and structural similarities5. Alignment approaches on fragmentation data have also proven useful for mass spectrometry-based proteomics by identifying pairs of peptides that differ by multiple modifications6–8. Machine-learning approaches such as SIRIUS, CANOPUS, MS2LDA, and Spec2Vec also incorporate precursor ion neutral losses as a feature in their implementations9–12. Recent implementations combining machine learning with in silico structural database searching allow high-confidence identifications that explore biochemistry outside of known chemical databases13. Other tools assign false-discovery rates to tandem mass spectral database searches to separate correct from incorrect hits (analogous to decoy database searching in proteomics). While there are methods for estimating statistical significance for compound identification, to our knowledge, no method for calculating the significance of fragmentation-spectra alignments from a pair of spectra has been described14,15. Protein-sequence alignment algorithms like Needleman–Wunsch, Smith–Waterman, and BLAST yield alignments with statistical significance that are robust to multiple substitutions, insertions, and deletions16,17. These methods are fundamentally different from fragmentation-spectra-based cosine similarity in that they rely on substitution matrices, such as the PAM and BLOSUM matrices, that describe the log odds of amino acids sharing common ancestry relative to random chance18,19.
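To make the sequence-alignment side of this comparison concrete, the following is a minimal sketch of a log-odds score and a Needleman–Wunsch global alignment score. The toy match/mismatch scoring function, the gap penalty of -2, and the example probabilities are assumptions chosen for illustration; real alignments would use a PAM or BLOSUM matrix.

```python
import math

def log_odds(p_pair, p_a, p_b):
    """Log-odds (in bits) of observing a residue pair because of
    relatedness (p_pair) versus by chance (p_a * p_b)."""
    return math.log2(p_pair / (p_a * p_b))

def needleman_wunsch(seq_a, seq_b, score, gap=-2):
    """Return the optimal global alignment score of two sequences.

    score(x, y) plays the role of a substitution-matrix lookup;
    gap is a linear penalty per inserted or deleted position.
    """
    n, m = len(seq_a), len(seq_b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),  # align the pair
                dp[i - 1][j] + gap,  # gap in seq_b
                dp[i][j - 1] + gap,  # gap in seq_a
            )
    return dp[n][m]

def toy_score(x, y):
    """Hypothetical stand-in for a substitution matrix."""
    return 1 if x == y else -1

identical = needleman_wunsch("GATTACA", "GATTACA", toy_score)
one_deletion = needleman_wunsch("GATTACA", "GATACA", toy_score)
print(identical, one_deletion, log_odds(0.25, 0.25, 0.25))
```

Because the score is a sum over aligned positions plus gap penalties, a single deletion costs one penalty and leaves the remaining matches intact, which is the robustness to substitutions, insertions, and deletions that cosine scoring lacks.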
These approaches have not been widely applied to fragmentation data for two reasons: first, unlike protein substitution matrices that are generally of size 20 by 20 (amino acids), a global substitution matrix for fragment ions would be infinite due to the infinite number of possible m/z values; and second, m/z values are only partially tied to chemical structure due to the one-to-many correspondence between m/z values and chemical structures. However, if restricted to a single pair of fragmentation spectra, a spectral graph-theoretic framework parameterized by their all-by-all m/z difference counts can generate finite, context-sensitive, and mathematically consistent fragment ion similarity matrices based on average commute times20. Here, we introduce Significant Interrelation of MS/MS Ions via Laplacian Embedding (SIMILE), an approach that leverages methods used for protein-sequence alignment to enable robust pairwise alignment of fragmentation spectra with p-value estimation (Fig. 1). Rather than requiring identical m/z values or precursor ion neutral losses for alignment of fragmentation spectra, SIMILE uses all m/z differences among a pair of fragmentation spectra to generate a pair-specific fragment ion similarity matrix. This matrix is then used as the input to a dynamic programming alignment algorithm for alignment and scoring. The significance of an alignment is calculated via a Monte Carlo permutation test with alignment score as the test statistic under the null hypothesis that m/z values are exchangea (...truncated)
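The general shape of a Monte Carlo permutation test with an alignment score as the test statistic can be sketched as follows. This is not the SIMILE procedure: the description of the null hypothesis is cut off in this preview, so shuffling the entries of the similarity matrix is a stand-in assumption, as are the function names, the zero-cost gaps, and the toy matrix.

```python
import random

def alignment_score(sim):
    """Best order-preserving alignment score over a similarity matrix.

    Dynamic programming over rows (peaks of spectrum A) and columns
    (peaks of spectrum B); skipping a peak costs nothing (an assumption).
    """
    n, m = len(sim), len(sim[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + sim[i - 1][j - 1],  # align peak i with peak j
                dp[i - 1][j],                          # skip peak i
                dp[i][j - 1],                          # skip peak j
            )
    return dp[n][m]

def permutation_p_value(sim, n_perm=999, seed=0):
    """Estimate a p-value for the observed alignment score.

    Null model (an assumption for this sketch): the matrix entries are
    exchangeable, so each permutation shuffles them and rescores.
    """
    rng = random.Random(seed)
    observed = alignment_score(sim)
    n_cols = len(sim[0])
    flat = [value for row in sim for value in row]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(flat)
        shuffled = [flat[k:k + n_cols] for k in range(0, len(flat), n_cols)]
        if alignment_score(shuffled) >= observed:
            hits += 1
    # Add-one estimator so the reported p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Toy similarity matrix with strong agreement along the diagonal.
sim = [
    [5, -1, -1, -1],
    [-1, 5, -1, -1],
    [-1, -1, 5, -1],
    [-1, -1, -1, 5],
]
observed_score = alignment_score(sim)
p_value = permutation_p_value(sim)
print(observed_score, p_value)
```

The smallest p-value such a test can report is 1 / (n_perm + 1), so the number of permutations sets the resolution available for ranking alignments.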


This is a preview of a remote PDF: https://www.nature.com/articles/s41467-022-30118-9.pdf
Article home page: https://www.nature.com/articles/s41467-022-30118-9

Treen, Daniel G. C., Wang, Mingxun, Xing, Shipei, Louie, Katherine B., Huan, Tao, Dorrestein, Pieter C., Northen, Trent R., Bowen, Benjamin P. SIMILE enables alignment of tandem mass spectra with statistical significance, Nature Communications, DOI: 10.1038/s41467-022-30118-9