Deconvolution of autoencoders to learn biological regulatory modules from single cell mRNA sequencing data
Kinalis et al. BMC Bioinformatics
(2019) 20:379
https://doi.org/10.1186/s12859-019-2952-9
METHODOLOGY ARTICLE
Open Access
Savvas Kinalis1, Finn Cilius Nielsen1, Ole Winther1,2,3 and Frederik Otzen Bagger1,4,5*
Abstract
Background: Unsupervised machine learning methods (deep learning) have shown their usefulness with noisy
single cell mRNA-sequencing data (scRNA-seq), where the models generalize well, despite the zero-inflation of the
data. A class of neural networks, namely autoencoders, has been useful for denoising of single cell data, imputation
of missing values and dimensionality reduction.
Results: Here, we present a striking feature with the potential to greatly increase the usability of autoencoders:
with specialized training, the autoencoder is not only able to generalize over the data, but also to tease apart
biologically meaningful modules, which we found encoded in the representation layer of the network. From scRNA-seq
data, our model can delineate the biologically meaningful modules that govern a dataset and indicate
which modules are active in each single cell. Importantly, most of these modules can be explained by known
biological functions, as provided by the Hallmark gene sets.
Conclusions: We discover that tailored training of an autoencoder makes it possible to deconvolute biological
modules inherent in the data, without any assumptions. By comparisons with gene signatures of canonical pathways
we see that the modules are directly interpretable. The scope of this discovery has important implications, as it makes
it possible to outline the drivers behind a given effect of a cell. Compared with other dimensionality reduction
methods, or with supervised models for classification, our approach has two benefits: it handles the zero-inflated
nature of scRNA-seq well, and it allows validating that the model captures relevant information, by establishing a link between input
and decoded data. In perspective, our model in combination with clustering methods is able to provide information
about which subtype a given single cell belongs to, as well as which biological functions determine that membership.
Keywords: Interpretable machine learning, Deep learning, Neural networks, Manifold learning, Expression profiles,
Single-cell RNA-sequencing, Gene set enrichment analysis, Functional analysis, Biological pathway analysis
Background
The recent upsurge of data generated by mRNA sequencing
at the single cell level (scRNA-seq) has helped to address a number of scientific questions and has also revealed new challenges. It allows researchers to look into
the gene expression levels of a specific cell, rather than the
aggregated levels that came with “bulk” RNA sequencing, and to create fine molecular profiles of tissues, which
are particularly important for insights into the dynamics
and function of more heterogeneous tissues, such as
cancer tissues.
* Correspondence:
1 Centre for Genomic Medicine Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
4 University Children’s Hospital Basel and Department of Biomedicine, University of Basel, Basel, Switzerland
Full list of author information is available at the end of the article
Using scRNA-seq, it has been possible to delineate
cellular populations in an unbiased manner from several healthy [1–4] and diseased tissues [5, 6], and a
large number of new methods have addressed the
new computational and analytical challenges of this
data type [7–9].
Modeling of scRNA-seq data is challenging because the relevant and often categorical biological signal is
usually intertwined with dynamic biological processes
(e.g. cell cycle, maturation, differentiation or metabolic
activity) as well as technical sources of variation (e.g.
PCR amplification, “dropout” events, sequencing or library preparation variation, tissue dissociation and many
parameters related to laboratory protocol).
© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Recently, there have been several excellent attempts to
model scRNA-seq data using prior knowledge on specific sources of variation [10, 11]. In this study, however,
our aim is to extract biological information from a class
of more general, non-linear models that can assimilate
the information of the manifold shaped by the single-cell
expression profiles.
Artificial neural networks (NN) have proven flexible
and have demonstrated representational power and state-of-the-art
results in many applications (e.g. skin cancer classification [12], retinal disease diagnosis [13], protein
folding [14, 15]). In addition, recent advancements in
the development of software frameworks that efficiently
exploit computing resources, mostly by parallel processing on GPU, render the definition, implementation and
training of an NN quite straightforward.
We hypothesise that simple NN layouts and stringent
training will make deconvolution possible and tease
apart biological signal from heterogeneous cellular populations. We believe that the distributed nature of NN
models bears the potential of encapsulating, rather than
smoothing over or regressing out, sources of variation,
both biological and technical.
In this study we applied autoencoder neural networks
[16], an unsupervised machine learning method, to
scRNA-seq expression counts. This class of models is
used as a manifold learning technique and is able to efficiently capture the underlying signal even when the input is perturbed or zeroed out [17], which is particularly
appealing for an application to scRNA-seq data. Variants
of autoencoders have been successfully applied to
scRNA-seq data before, for dimensionality reduction,
denoising and imputation of missing values (see [18–26]
for a complete list of studies).
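Robustness to perturbed or zeroed-out input is commonly encouraged by corrupting the input during training while scoring the reconstruction against the uncorrupted counts. The following is a minimal sketch of that corruption step only, assuming simple masking noise. The function names, the drop fraction and the toy count vector are illustrative assumptions and are not taken from this study.

```python
import random


def mask_counts(counts, drop_fraction, rng):
    """Return a copy of one cell's count vector with a random subset of entries set to zero."""
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be in [0, 1]")
    n_drop = round(drop_fraction * len(counts))
    dropped = set(rng.sample(range(len(counts)), n_drop))
    return [0 if i in dropped else c for i, c in enumerate(counts)]


def mean_squared_error(target, reconstruction):
    """Reconstruction error, measured against the clean (uncorrupted) target."""
    return sum((t - r) ** 2 for t, r in zip(target, reconstruction)) / len(target)


rng = random.Random(0)                  # fixed seed so the sketch is repeatable
cell = [5, 0, 12, 3, 0, 7, 1, 9]        # hypothetical counts for eight genes
corrupted = mask_counts(cell, 0.25, rng)

# During training the network would see `corrupted`, but its output would be
# compared with `cell`; here the corrupted vector stands in for a reconstruction.
error = mean_squared_error(cell, corrupted)
print(corrupted, error)
```

The original vector is left unchanged, so the same cell can be corrupted differently in every training pass.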
Here, we will make use of a simple autoencoder architecture and apply methods from the computer graphics community, known as saliency maps [27], aiming to
deconvolute what the latent representation of the model
captures, and to interpret it in terms of biological pathways.
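To illustrate the idea behind a saliency map, the sketch below computes the gradient of one hidden unit with respect to each input gene for a single sigmoid encoding layer, and ranks genes by the magnitude of that gradient. It is a simplified illustration under stated assumptions, not the exact procedure used in this study: the activation function, the weights, the gene names and the helper functions are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(x, weights, biases):
    """Hidden activations h_j = sigmoid(sum_i W[j][i] * x[i] + b[j])."""
    return [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def saliency(x, weights, biases, unit):
    """Gradient of one hidden unit's activation with respect to each input gene."""
    h = encode(x, weights, biases)[unit]
    slope = h * (1.0 - h)               # derivative of the sigmoid at this input
    return [slope * w for w in weights[unit]]


def top_genes(gene_names, scores, k):
    """Names of the k genes with the largest absolute saliency."""
    ranked = sorted(zip(gene_names, scores), key=lambda p: abs(p[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]   # hypothetical gene names
W = [[0.8, -0.1, 0.0, 0.3],                        # hypothetical encoder weights,
     [0.0, 0.9, -0.7, 0.1]]                        # one row per hidden unit
b = [0.0, -0.2]
x = [1.0, 0.0, 2.0, 0.5]                           # one cell's (scaled) expression

scores = saliency(x, W, b, unit=0)
print(top_genes(genes, scores, 2))                 # genes driving hidden unit 0
```

A ranked gene list of this kind is the sort of input that can then be compared against known pathway gene sets.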
Results
A simple autoencoder with three layers (an input layer, a
hidden or representation layer and an output layer) is
shown in Fig. 1b. Each layer consists of a number of
units, corresponding to its dimensionality. Briefly, an
autoencoder is trained to learn how to recre (...truncated)