Accelerating in silico saturation mutagenesis using compressed sensing

Bioinformatics, 38(14), 2022, 3557–3564
https://doi.org/10.1093/bioinformatics/btac385
Advance Access Publication Date: 9 June 2022
Original Paper

Sequence analysis

Jacob Schreiber 1,*, Surag Nair 2, Akshay Balsubramani 2 and Anshul Kundaje 1,2,*

1 Department of Genetics, Stanford University, Stanford, CA 94305, USA and 2 Department of Computer Science, Stanford University, Stanford, CA 94305, USA

*To whom correspondence should be addressed.
Associate Editor: Pier Luigi Martelli
Received on March 8, 2022; revised on May 10, 2022; editorial decision on May 21, 2022; accepted on June 6, 2022

Abstract

Motivation: In silico saturation mutagenesis (ISM) is a popular approach in computational genomics for calculating feature attributions on biological sequences that proceeds by systematically perturbing each position in a sequence and recording the difference in model output. However, this method can be slow because systematically perturbing each position requires performing a number of forward passes proportional to the length of the sequence being examined.

Results: In this work, we propose a modification of ISM that leverages the principles of compressed sensing to require only a constant number of forward passes, regardless of sequence length, when applied to models that contain operations with a limited receptive field, such as convolutions. Our method, named Yuzu, can reduce the time that ISM spends in convolution operations by several orders of magnitude and, consequently, Yuzu can speed up ISM on several commonly used architectures in genomics by over an order of magnitude. Notably, we found that Yuzu provides speedups that increase with the complexity of the convolution operation and the length of the sequence being analyzed, suggesting that Yuzu provides large benefits in realistic settings.
Availability and implementation: We have made this tool available at https://github.com/kundajelab/yuzu.

1 Introduction

A challenge with using modern machine learning methods in practice is that, frequently, their learned logic for transforming input features into output predictions is opaque and difficult for humans to understand. Accordingly, principled approaches for explaining trained machine learning models have been proposed that, for a given example, assign a numerical value to each feature according to some notion of importance in the resulting prediction. A large number of these feature attribution methods have been proposed, and they fall into three main classes: gradient-based (Shrikumar et al., 2017; Simonyan et al., 2014; Springenberg et al., 2015; Zeiler and Fergus, 2014), path-based (Jha et al., 2020; Sundararajan et al., 2017) and counterfactual- or perturbation-based (Lundberg and Lee, 2017; Nair et al., 2022; Ribeiro et al., 2016). These approaches have trade-offs, both in terms of theoretical guarantees and in terms of speed in practice. For example, gradient-based methods generally require one backward pass to explain each output from the model, whereas perturbation-based methods generally require one forward pass to explain each input. However, all three classes share the goal of assigning to each feature in an example a value that corresponds to the relevance of that feature to the output from the model. When applied in a genomics setting, feature attribution methods are a straightforward approach for identifying the nucleotides, amino acids, and motifs thereof that form the core of biochemical mechanisms or interactions (Muiños et al., 2021; Öhlknecht et al., 2021; Ponzoni et al., 2020; Schreiber and Singh, 2021).
A simple perturbation-based method, in silico saturation mutagenesis (ISM), proceeds on biological sequences by constructing mutant sequences that each contain one mutation relative to the reference sequence that attributions are being calculated for. These mutants, along with the reference sequence, are then all run through a model. Unlike gradient-based methods, this model does not need to be continuous or differentiable. The attribution values are then calculated as the difference in output between the mutant sequences and the reference sequence. When the input features are categories, such as nucleotide or amino acid identity, the method has the straightforward interpretation of performing a saturated mutagenesis experiment computationally (hence, the name) (Patwardhan et al., 2009). A strength of this method is that the number of forward passes does not depend on the number of output tasks.

© The Author(s) 2022. Published by Oxford University Press. All rights reserved.

In parallel with developments in feature attribution methods, progress has also been made in the field of compressed (or compressive) sensing (Boche et al., 2015b; Candès, 2008; Kutyniok, 2013). This field concerns the replacement of a large number of sparse measurements with a smaller number of dense probes where each probe measures a linear combination of the original, sparse, measurements. For example, rather than performing millions of diagnostic tests, each on one person, one would pool together results such that each pool is made up of multiple individuals and each individual contributes to multiple pools. Through principled pool design, one can achieve perfect recovery of the results that each individual test would have given by only measuring the pools and deconvolving the results given the known pool design, effectively increasing the number of individuals that can be tested with the same resources.
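The pooled-testing idea can be illustrated with a toy sketch. This is not the probe design used by Yuzu; it is a deliberately simple scheme that assumes at most one individual has a nonzero result, and the function names (`design_matrix`, `run_pools`, `decode_single`) are hypothetical. Each individual is assigned to pools according to the binary digits of a 1-based identifier, so the pattern of nonzero pools spells out the identifier of the single nonzero individual.

```python
import numpy as np

def design_matrix(n_individuals: int) -> np.ndarray:
    """Pool design: individual with 1-based id i joins pool b iff bit b of i is set."""
    n_pools = int(n_individuals).bit_length()  # bits needed to write the largest id
    ids = np.arange(1, n_individuals + 1)
    bits = np.arange(n_pools)
    # Shape (n_pools, n_individuals), entries in {0, 1}
    return (ids[None, :] >> bits[:, None]) & 1

def run_pools(design: np.ndarray, status: np.ndarray) -> np.ndarray:
    """Each pool measures a linear combination (here, a sum) of its members."""
    return design @ status

def decode_single(design: np.ndarray, pooled: np.ndarray) -> np.ndarray:
    """Recover per-individual results, ASSUMING at most one nonzero entry."""
    n_pools, n_individuals = design.shape
    recovered = np.zeros(n_individuals, dtype=pooled.dtype)
    positive = pooled != 0
    if positive.any():
        # The set of nonzero pools is the binary encoding of the 1-based id
        idx = int(np.sum(positive.astype(int) << np.arange(n_pools)))
        recovered[idx - 1] = pooled[positive][0]
    return recovered

# Toy usage: 1000 individuals, exactly one with a nonzero result
status = np.zeros(1000, dtype=int)
status[412] = 1
design = design_matrix(1000)
pooled = run_pools(design, status)
recovered = decode_single(design, pooled)
print(design.shape[0], "pooled measurements instead of", status.size)
print("exact recovery:", np.array_equal(recovered, status))
```

Under the stated sparsity assumption, the number of pooled measurements grows only logarithmically with the number of individuals. With more than one nonzero result this particular decoder fails, which is why practical designs rely on more careful pool construction and recovery.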
Compressed sensing has been used to speed up several algorithms and data collection tools that involve sparse values (Boche et al., 2015a; Li and Durbin, 2009; Zhu et al., 2012).

An interesting property emerges when ISM is applied to neural networks that contain convolution operations: the difference in the convolution output between the reference sequence and each of the mutants is sparse. This sparsity arises because the convolution operation has a limited receptive field and changes to a single input feature cannot influence the output past that field. Consequently, naive ISM wastes a significant amount of computational time recalculating layer outputs for each mutant sequence that, by definition, must be identical to the layer outputs for the reference sequence. A previous approach, fastISM (Nair et al., 2022), leveraged this property to justify only recalculating layer outputs that are within the receptive field of the mutation. However, fastISM involves running the same number of convolution operations, albeit restricted to subsets of the sequence the size of the receptive field. Because the number of forward passes remains unchanged, in practice, fastISM requires a large batch size to achieve speedups and, due to implementation details, is usu (...truncated)