Bayesian nonparametric discovery of isoforms and individual specific quantification

Nature Communications, Apr 2018

Most human protein-coding genes can be transcribed into multiple distinct mRNA isoforms. These alternative splicing patterns encourage molecular diversity, and dysregulation of isoform expression plays an important role in disease etiology. However, isoforms are difficult to characterize from short-read RNA-seq data because they share identical subsequences and occur in different frequencies across tissues and samples. Here, we develop biisq, a Bayesian nonparametric model for isoform discovery and individual specific quantification from short-read RNA-seq data. biisq does not require isoform reference sequences but instead estimates an isoform catalog shared across samples. We use stochastic variational inference for efficient posterior estimates and demonstrate superior precision and recall for simulations compared to state-of-the-art isoform reconstruction methods. biisq shows the most gains for low abundance isoforms, with 36% more isoforms correctly inferred at low coverage versus a multi-sample method and 170% more versus single-sample methods. We estimate isoforms in the GEUVADIS RNA-seq data and validate inferred isoforms by associating genetic variants with isoform ratios.
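The claim that isoforms are hard to tell apart because they share identical subsequences can be illustrated with a toy example. The sketch below is not part of biisq. The exon sequences, the two isoforms, and the 8-nt read length are all hypothetical values chosen for illustration. It enumerates every possible short read from two isoforms of one gene and records which isoforms could have produced each read.

```python
# Toy illustration only: exon sequences, isoforms, and read length are
# hypothetical and do not come from the paper or from biisq.
EXONS = {
    "E1": "ATGGCGTACGTTAGC",
    "E2": "CCGATTGACCTGAAG",
    "E3": "GGTCATCGTAGCTAA",
}

# Two hypothetical isoforms of one gene; isoform B skips exon E2.
ISOFORMS = {
    "A": ["E1", "E2", "E3"],
    "B": ["E1", "E3"],
}

READ_LENGTH = 8  # assumed; real short reads are much longer


def transcript(exon_names):
    """Concatenate exon sequences into a spliced transcript."""
    return "".join(EXONS[name] for name in exon_names)


def reads_from(sequence, length):
    """Return the set of all substrings of the given length."""
    return {sequence[i:i + length] for i in range(len(sequence) - length + 1)}


def classify_reads(isoforms, length):
    """Map each distinct read to the set of isoforms that contain it."""
    origin = {}
    for name, exon_names in isoforms.items():
        for read in reads_from(transcript(exon_names), length):
            origin.setdefault(read, set()).add(name)
    return origin


origin = classify_reads(ISOFORMS, READ_LENGTH)
ambiguous = [read for read, names in origin.items() if len(names) > 1]
unique_to_b = [read for read, names in origin.items() if names == {"B"}]

print("distinct reads:", len(origin))
print("compatible with both isoforms:", len(ambiguous))
print("identify isoform B only:", len(unique_to_b))
```

In this toy gene, any read that falls entirely inside exon E1 or E3 is compatible with both isoforms. Only reads spanning the E1-E3 junction can identify isoform B, so the informative reads are a small minority.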

https://www.nature.com/articles/s41467-018-03402-w.pdf


Introduction

Alternative splicing is the process by which a single gene produces distinct mRNA isoforms, which vary in usage of component exons [1]. Isoforms can differ by alternative transcription initiation sites, alternative usage of splice sites (either 5′ donor or 3′ acceptor sites), alternative polyadenylation sites, or variable inclusion of entire exons or introns (Fig. 1). Altogether, alternative splicing enables the large diversity of mRNA expression levels and proteome composition observed in eukaryotic cells, which is particularly important for regulating the context-specific needs of the cell [2].

Fig. 1 Alternative splicing mechanisms. A single gene may be transcribed into several distinct mRNA variants called isoforms through alternative splicing mechanisms. This figure shows six common types of splicing events (top to bottom): simple transcript; alternative transcription start site; alternative 5′ splice site; alternative 3′ splice site; skipped exon; and alternative polyadenylation.

It is estimated that 95% of human protein-coding genes can be alternatively spliced [1]. These splicing decisions are important drivers of many biological processes, with considerable variation in splicing patterns across human tissues [3]. For example, mutations in splicing regulatory elements may lead to disease pathogenesis and progression [1, 4, 5, 6, 7, 8], and mutations in protein domains of specific splicing factors occur at a high rate in tumor cells, resulting in increased cellular proliferation [9]. Furthermore, proteins resulting from splicing variants often have distinct molecular functions. For instance, the two variants of survivin have opposite functions: one with pro-apoptotic and the other with anti-apoptotic properties [10].

Although there is increasing evidence of the biological importance of splicing processes, the precise role of alternative isoforms in regulating complex phenotypes is still largely uncharacterized. This gap in understanding is due, in part, to the difficulty of identifying and quantifying isoforms with high accuracy from short-read RNA-seq data [11]. Transcript reconstruction is essential to elucidate the role of gene expression in biological processes because gene-level quantification is convoluted by the multiple transcribed isoforms for each gene. The difficulties in isoform quantification stem from the tissue- and sample-specific composition and expression patterns of isoforms, the lack of a complete reference for isoform composition, and low abundance levels of many isoforms [2]. Further, RNA-seq reads that overlap informative splice junctions are rare, often noisy [12], and difficult to map to a reference genome [13]. Improvements in reconstructing and quantifying tissue- and sample-specific isoforms would enable substantial improvements in understanding the role of alternative splicing in complex disease.

While many tools exist for isoform reconstruction using RNA-seq data, these methods have a number of drawbacks. First, many quantification methods assume that a high-resolution isoform sequence reference is available for each gene in the genome [14, 15, 16]; in practice these references are often not available or incomplete for non-model organisms and rare tissue or disease samples [11]. Second, while a few methods process multiple samples simultaneously [17, 18, 19], most methods consider a single sample in isolation, which fails to exploit the sharing of isoforms across samples to gain power for identification of rare or low abundance isoforms [20, 21, 22]. Third, many methods make technology-dependent assumptions by controlling for specific biases (e.g., non-uniform sampling of reads [23]) that do not generalize to mixtures of existing technologies or new technologies with different biases. Our metho (...truncated)



Derek Aguiar, Li-Fang Cheng, Bianca Dumitrascu, Fantine Mordelet, Athma A. Pai, Barbara E. Engelhardt. Bayesian nonparametric discovery of isoforms and individual specific quantification, Nature Communications, 2018, Volume: 9, DOI: 10.1038/s41467-018-03402-w