Accelerating in silico saturation mutagenesis using compressed sensing.
Bioinformatics, 38(14), 2022, 3557–3564
https://doi.org/10.1093/bioinformatics/btac385
Advance Access Publication Date: 9 June 2022
Original Paper
Sequence analysis
Jacob Schreiber1,*, Surag Nair2, Akshay Balsubramani2 and Anshul Kundaje1,2,*
1Department of Genetics, Stanford University, Stanford, CA 94305, USA and 2Department of Computer Science, Stanford University, Stanford, CA 94305, USA
*To whom correspondence should be addressed.
Associate Editor: Pier Luigi Martelli
Received on March 8, 2022; revised on May 10, 2022; editorial decision on May 21, 2022; accepted on June 6, 2022
Abstract
Motivation: In silico saturation mutagenesis (ISM) is a popular approach in computational genomics for calculating
feature attributions on biological sequences that proceeds by systematically perturbing each position in a sequence
and recording the difference in model output. However, this method can be slow because systematically perturbing
each position requires performing a number of forward passes proportional to the length of the sequence being
examined.
Results: In this work, we propose a modification of ISM that leverages the principles of compressed sensing to
require only a constant number of forward passes, regardless of sequence length, when applied to models that
contain operations with a limited receptive field, such as convolutions. Our method, named Yuzu, can reduce the
time that ISM spends in convolution operations by several orders of magnitude and, consequently, Yuzu can speed
up ISM on several commonly used architectures in genomics by over an order of magnitude. Notably, we found that
Yuzu provides speedups that increase with the complexity of the convolution operation and the length of the
sequence being analyzed, suggesting that Yuzu provides large benefits in realistic settings.
Availability and implementation: We have made this tool available at https://github.com/kundajelab/yuzu.
Contact: or
1 Introduction
A challenge with using modern machine learning methods in practice
is that, frequently, their learned logic for transforming input features
into output predictions is opaque and difficult for humans to understand. Accordingly, principled approaches for explaining trained machine learning models have been proposed that, for a given example,
assign a numerical value to each feature according to some notion of
importance in the resulting prediction. Many such feature attribution
methods have been proposed, and they fall broadly into
three main classes: gradient-based
(Shrikumar et al., 2017; Simonyan et al., 2014; Springenberg et al.,
2015; Zeiler and Fergus, 2014), path-based (Jha et al., 2020;
Sundararajan et al., 2017) and counterfactual- or perturbation-based
(Lundberg and Lee, 2017; Nair et al., 2022; Ribeiro et al., 2016).
These approaches have trade-offs, both in terms of theoretical guarantees and in terms of speed in practice. For example, gradient-based
methods generally require one backward pass to explain each output
from the model, whereas perturbation-based methods generally require one forward pass to explain each input. However, all three
classes have the common goal of assigning to each feature in an
example a value that corresponds to the relevance of that feature
to the output from the model. When applied in a genomics setting,
feature attribution methods are a straightforward approach for
identifying the nucleotides, amino acids and motifs thereof that
form the core of biochemical mechanisms or interactions (Muiños
et al., 2021; Öhlknecht et al., 2021; Ponzoni et al., 2020; Schreiber
and Singh, 2021).
A simple perturbation-based method, in silico saturation mutagenesis (ISM), proceeds on biological sequences by constructing mutant sequences that each contain one mutation relative to the
reference sequence that attributions are being calculated for. These
mutants, along with the reference sequence, are then all run through
a model. Unlike with gradient-based methods, this model does not need
to be continuous or differentiable. The attribution values are then
calculated as the difference in output between the mutant sequences
and the reference sequence. When the input features are categories,
such as nucleotide or amino acid identity, the method has the
straightforward interpretation of performing a saturated mutagenesis experiment computationally (hence, the name) (Patwardhan
et al., 2009). A strength of this method is that the number of forward passes does not depend on the number of output tasks.
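The procedure described above can be illustrated with a minimal sketch. Here, `toy_model` is a hypothetical stand-in for a trained model (it simply counts occurrences of the motif GATA), and `naive_ism` is an illustrative name rather than a function from any existing library. The sketch makes explicit that naive ISM performs one forward pass for the reference plus one per single-position mutant, so the cost grows linearly with sequence length.

```python
ALPHABET = "ACGT"


def toy_model(seq):
    """Hypothetical stand-in for a trained model: counts 'GATA' occurrences."""
    return float(sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3)))


def naive_ism(model, seq, alphabet=ALPHABET):
    """Return {(position, mutant_char): model(mutant) - model(reference)}."""
    ref_out = model(seq)  # one forward pass for the reference sequence
    attributions = {}
    for pos, ref_char in enumerate(seq):
        for char in alphabet:
            if char == ref_char:
                continue  # skip the reference character itself
            # Build a mutant that differs from the reference at one position.
            mutant = seq[:pos] + char + seq[pos + 1:]
            # One forward pass per mutant: len(seq) * (len(alphabet) - 1) in total.
            attributions[(pos, char)] = model(mutant) - ref_out
    return attributions


seq = "ACGATAGC"
attr = naive_ism(toy_model, seq)
```

In this toy case, only mutations inside the GATA motif (positions 2 to 5, zero-indexed) change the output, so those are the only positions that receive non-zero attributions.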
© The Author(s) 2022. Published by Oxford University Press. All rights reserved. For permissions, please e-mail:
In parallel with developments in feature attribution methods,
progress has also been made in the field of compressed (or compressive) sensing (Boche et al., 2015b; Candès, 2008; Kutyniok, 2013).
This field concerns the replacement of a large number of sparse
measurements with a smaller number of dense probes where each
probe measures a linear combination of the original, sparse, measurements. For example, rather than performing millions of diagnostic tests, each on one person, one would pool together results such
that each pool is made up of multiple individuals and each individual contributes to multiple pools. Through principled pool design,
one can achieve perfect recovery of the results that each individual
test would have given by only measuring the pools and deconvolving
the results given the known pool design, effectively increasing the
number of individuals that can be tested with the same resources.
Compressed sensing has been used to speed up several algorithms
and data collection tools that involve sparse values (Boche et al.,
2015a; Li and Durbin, 2009; Zhu et al., 2012).
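The pooling idea can be sketched in its simplest form under a strong assumption: exactly one individual is positive (a 1-sparse signal). The binary pool design below is a textbook illustration chosen for brevity and is not claimed to be the design used by Yuzu; the function names are hypothetical. Each pool measures the sum of its members' results, and the set of non-zero pools spells out the index of the positive individual in binary.

```python
def design_pools(n_individuals):
    """Pool k contains every individual whose index has bit k set."""
    n_pools = max(1, (n_individuals - 1).bit_length())
    return [[i for i in range(n_individuals) if (i >> k) & 1]
            for k in range(n_pools)]


def measure(pools, status):
    """Each pool reports the sum of its members' results (a linear combination)."""
    return [sum(status[i] for i in pool) for pool in pools]


def decode_single_positive(measurements):
    """Recover the positive index, ASSUMING exactly one individual is positive."""
    return sum(1 << k for k, m in enumerate(measurements) if m > 0)


n = 1000
status = [0] * n
status[613] = 1  # the single positive individual (assumed 1-sparse signal)

pools = design_pools(n)            # 10 pools instead of 1000 individual tests
readout = measure(pools, status)   # dense probes of the sparse signal
recovered = decode_single_positive(readout)
```

With this design, 10 pooled measurements stand in for 1000 individual tests. Recovering signals with more than one non-zero entry requires more elaborate designs and decoders, which this sketch does not attempt.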
An interesting property emerges when ISM is applied to neural
networks that contain convolution operations: the difference in the
convolution output between the reference sequence and each of the
mutants is sparse. This sparsity arises because the convolution operation has a limited receptive field and changes to a single input feature
cannot influence the output past that field. Consequently, naive ISM
wastes a significant amount of computational time recalculating layer
outputs for each mutant sequence that, by definition, must be identical
to the layer outputs for the reference sequence. A previous approach,
fastISM (Nair et al., 2022), leveraged this property to justify only
recalculating layer outputs that are within the receptive field of the
mutation. However, fastISM involves running the same number of
convolution operations, albeit restricted to subsets of the sequence the
size of the receptive field. Because the number of forward passes
remains unchanged, in practice, fastISM requires a large batch size to
achieve speedups and, due to implementation details, is usu (...truncated)