Deconvolution of autoencoders to learn biological regulatory modules from single cell mRNA sequencing data

Kinalis et al. BMC Bioinformatics (2019) 20:379
https://doi.org/10.1186/s12859-019-2952-9

METHODOLOGY ARTICLE Open Access

Savvas Kinalis1, Finn Cilius Nielsen1, Ole Winther1,2,3 and Frederik Otzen Bagger1,4,5*

Abstract

Background: Unsupervised machine learning methods (deep learning) have shown their usefulness with noisy single cell mRNA-sequencing data (scRNA-seq), where the models generalize well, despite the zero-inflation of the data. A class of neural networks, namely autoencoders, has been useful for denoising of single cell data, imputation of missing values and dimensionality reduction.

Results: Here, we present a striking feature with the potential to greatly increase the usability of autoencoders: with specialized training, the autoencoder is not only able to generalize over the data, but also to tease apart biologically meaningful modules, which we found encoded in the representation layer of the network. Our model can, from scRNA-seq data, delineate biologically meaningful modules that govern a dataset, as well as give information as to which modules are active in each single cell. Importantly, most of these modules can be explained by known biological functions, as provided by the Hallmark gene sets.

Conclusions: We discover that tailored training of an autoencoder makes it possible to deconvolute biological modules inherent in the data, without any assumptions. By comparison with gene signatures of canonical pathways we see that the modules are directly interpretable. The scope of this discovery has important implications, as it makes it possible to outline the drivers behind a given effect of a cell.
In comparison with other dimensionality reduction methods, or supervised models for classification, our approach has the benefit of both handling the zero-inflated nature of scRNA-seq well and validating that the model captures relevant information, by establishing a link between input and decoded data. In perspective, our model in combination with clustering methods is able to provide information about which subtype a given single cell belongs to, as well as which biological functions determine that membership.

Keywords: Interpretable machine learning, Deep learning, Neural networks, Manifold learning, Expression profiles, Single-cell RNA-sequencing, Gene set enrichment analysis, Functional analysis, Biological pathway analysis

* Correspondence:
1 Centre for Genomic Medicine, Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
4 University Children’s Hospital Basel and Department of Biomedicine, University of Basel, Basel, Switzerland
Full list of author information is available at the end of the article

Background

The recent upsurge of data generated by mRNA sequencing at the single cell level (scRNA-seq) has helped to address a number of scientific questions and has also revealed new challenges. It allows researchers to look into gene expression levels of a specific cell, rather than the aggregated levels that came with “bulk” RNA sequencing, and to create fine molecular profiles of tissues, which are particularly important for insights into the dynamics and function of more heterogeneous tissues, such as cancer tissues. Using scRNA-seq it has been possible to delineate cellular populations in an unbiased manner from several healthy [1–4] and diseased tissues [5, 6], and a large number of new methods have addressed the new computational and analytical challenges with this data type [7–9].

Modeling of the scRNA-seq data is challenging because relevant and often categorical biological signal is usually intertwined with dynamical biological processes (e.g.
cell cycle, maturation, differentiation or metabolic activity) as well as technical sources of variation (e.g. PCR amplification, “dropout” events, sequencing or library preparation variation, tissue dissociation and many parameters related to laboratory protocol). Recently, there have been several excellent attempts to model scRNA-seq data using prior knowledge on specific sources of variation [10, 11]. In this study, however, our aim is to extract biological information from a class of more general, non-linear models that can assimilate the information of the manifold shaped by the single-cell expression profiles.

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Artificial neural networks (NN) have proven flexible and demonstrated representational power and state of the art results in many applications (e.g. skin cancer classification [12], retinal disease diagnosis [13], protein folding [14, 15]). In addition, recent advancements in the development of software frameworks that efficiently exploit computing resources, mostly by parallel processing on GPU, render the definition, implementation and training of a NN quite straightforward. We hypothesise that simple NN layouts and stringent training will make deconvolution possible and tease apart biological signal from heterogeneous cellular populations.
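
To make the idea of a "simple NN layout" concrete, the following is a minimal sketch of a three-layer autoencoder (input, small hidden layer, output) trained by plain stochastic gradient descent on a toy zero-inflated expression matrix. It is an illustration only and not the authors' implementation: the toy data generator, the tanh hidden activation, the squared-error loss, the learning rate and all function names (make_toy_counts, forward, train, mean_squared_error) are assumptions made for this example.

```python
import math
import random


def make_toy_counts(n_cells=60, n_genes=8, dropout=0.3, seed=0):
    """Hypothetical toy data: two gene modules, one active per cell, with random zeros."""
    rng = random.Random(seed)
    cells = []
    for i in range(n_cells):
        first_module_active = i % 2 == 0
        row = []
        for g in range(n_genes):
            in_first_module = g < n_genes // 2
            level = 2.0 if in_first_module == first_module_active else 0.2
            value = max(level + rng.gauss(0.0, 0.1), 0.0)
            if rng.random() < dropout:  # simulate a "dropout" event (zero-inflation)
                value = 0.0
            row.append(value)
        cells.append(row)
    return cells


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def forward(x, w_enc, w_dec):
    """Encode x into the hidden (representation) layer, then decode it again."""
    hidden = [math.tanh(a) for a in matvec(w_enc, x)]
    recon = matvec(w_dec, hidden)
    return hidden, recon


def mean_squared_error(data, w_enc, w_dec):
    total = 0.0
    for x in data:
        _, recon = forward(x, w_enc, w_dec)
        total += sum((r - xi) ** 2 for r, xi in zip(recon, x))
    return total / (len(data) * len(data[0]))


def train(data, n_hidden=2, epochs=100, lr=0.01, seed=1):
    """Fit encoder and decoder weights by per-cell gradient descent on squared error."""
    rng = random.Random(seed)
    n_genes = len(data[0])
    w_enc = [[rng.uniform(-0.1, 0.1) for _ in range(n_genes)] for _ in range(n_hidden)]
    w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_genes)]
    for _ in range(epochs):
        for x in data:
            hidden, recon = forward(x, w_enc, w_dec)
            err = [r - xi for r, xi in zip(recon, x)]
            # Backpropagate through the decoder and tanh before changing any weights.
            grad_hidden = [
                sum(err[g] * w_dec[g][j] for g in range(n_genes)) * (1.0 - hidden[j] ** 2)
                for j in range(n_hidden)
            ]
            for g in range(n_genes):
                for j in range(n_hidden):
                    w_dec[g][j] -= lr * err[g] * hidden[j]
            for j in range(n_hidden):
                for g in range(n_genes):
                    w_enc[j][g] -= lr * grad_hidden[j] * x[g]
    return w_enc, w_dec


if __name__ == "__main__":
    data = make_toy_counts()
    untrained = train(data, epochs=0)  # epochs=0 returns the initial weights
    trained = train(data)
    print("reconstruction error before training:", mean_squared_error(data, *untrained))
    print("reconstruction error after training: ", mean_squared_error(data, *trained))
```

The hidden layer is deliberately narrower than the input, so the network has to compress each cell into a few units. In this toy setting those units can be inspected per cell through the first value returned by forward.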
We believe that the distributed nature of NN models bears the potential of encapsulating, rather than smoothing over or regressing out, sources of variation, both biological and technical. In this study we applied autoencoder neural networks [16], unsupervised machine learning methods, to scRNA-seq expression counts. This class of models is used as a manifold learning technique and is able to efficiently capture the underlying signal even when the input is perturbed or zeroed out [17], which is particularly appealing for an application to scRNA-seq data. Variants of autoencoders have been successfully applied to scRNA-seq data before, for dimensionality reduction, denoising and imputation of missing values (see [18–26] for a complete list of studies). Here, we will make use of a simple autoencoder architecture and apply methods from the computer graphics community, known as saliency maps [27], aiming to deconvolute what the latent representation of the model captures, and to interpret it in terms of biological pathways.

Results

A simple autoencoder with three layers (an input layer, a hidden or representation layer and an output layer) can be seen in Fig. 1b. Each layer consists of a number of units, corresponding to its dimensionality. Briefly, an autoencoder is trained to learn how to recre (...truncated)