TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference (pdf)

Article PDF cannot be displayed. You can download it here:

https://bioinformatics.oxfordjournals.org/content/29/18/2292.full.pdf

TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference

BIOINFORMATICS ORIGINAL PAPER Gene expression Vol. 29 no. 18 2013, pages 2292–2299 doi:10.1093/bioinformatics/btt381 Advance Access publication July 2, 2013 TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference Naoki Nariai1,*, Osamu Hirose2, Kaname Kojima1 and Masao Nagasaki1 1 Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan and 2Faculty of Electrical and Computer Engineering, Institute of Science and Engineering, Kanazawa University, Kakuma, Kanazawa, Ishikawa, 920-1192, Japan Associate Editor: Martin Bishop ABSTRACT Motivation: Many human genes express multiple transcript isoforms through alternative splicing, which greatly increases diversity of protein function. Although RNA sequencing (RNA-Seq) technologies have been widely used in measuring amounts of transcribed mRNA, accurate estimation of transcript isoform abundances from RNA-Seq data is challenging because reads often map to more than one transcript isoforms or paralogs whose sequences are similar to each other. Results: We propose a statistical method to estimate transcript isoform abundances from RNA-Seq data. Our method can handle gapped alignments of reads against reference sequences so that it allows insertion or deletion errors within reads. The proposed method optimizes the number of transcript isoforms by variational Bayesian inference through an iterative procedure, and its convergence is guaranteed under a stopping criterion. On simulated datasets, our method outperformed the comparable quantification methods in inferring transcript isoform abundances, and at the same time its rate of convergence was faster than that of the expectation maximization algorithm. We also applied our method to RNA-Seq data of human cell line samples, and showed that our prediction result was more consistent among technical replicates than those of other methods. Availability: An implementation of our method is available at http:// github.com/nariai/tigar Contact: Supplementary Information: Supplementary data are available at Bioinformatics online. Received on February 14, 2013; revised on June 11, 2013; accepted on June 27, 2013 1 INTRODUCTION Alternative splicing is a biological process in which an exon can be either included or excluded, or there can be several splice sites so that it allows a gene to have multiple forms of proteins. A recent study suggested that 490% of human genes undergo alternative splicing, and that many alternative splicing events vary among tissues (Wang et al., 2008). It has also been reported that some aberrant splicing results in human diseases, such as cystic fibrosis and spinal muscular atrophy (Garcia-Blanco et al., 2004). Hence, it is important to identify transcript isoforms *To whom correspondence should be addressed. 2292 that are expressed in particular tissues under some conditions, and not in others. To quantify transcription levels, RNA-Seq that uses highthroughput sequencing of cDNA has been widely used. Advantages of RNA-Seq over other conventional methods, such as DNA microarrays, include precise measurements of levels of transcripts and splice patterns because the sequencebased approach directly determines mRNA sequences (Wang et al., 2009). Probabilistic approaches have been proposed to compute the relative frequencies of alternative splice isoforms for each gene, but these methods do not handle reads that map to multiple gene loci, and hence do not estimate global transcript abundances for each isoform (Jiang and Wong, 2009; Katz et al., 2010). In most cases, reads are often shorter than a full transcript, and sequences of alternatively spliced transcript isoforms and paralogs are often similar to each other (Mortazavi et al., 2008). Hence, short reads generated by RNA-Seq do not always map uniquely to reference cDNAs, which lead to inaccurate estimation of transcript isoform abundances. Several approaches have been proposed to resolve ambiguous reads that map to multiple transcript isoforms (multi-map reads). One ad hoc approach is to allocate fractions of multimapped reads to target transcript isoforms equally, which is implemented as the default option in Cufflinks (Trapnell et al., 2010). Another approach is to allocate fractions of the reads in proportion to the coverage of uniquely mapped reads divided by the length of the transcript isoforms (‘rescue’ method) (Mortazavi et al., 2008), which is implemented as the ‘-u’ option in the latest version of Cufflinks. Statistical methods that use a generative model of RNA-Seq data have been proposed, in which the transcript isoform abundance is estimated as a latent random variable by the expectation maximization (EM) algorithm (Li and Dewey, 2011; Li et al., 2010; Nicolae et al., 2011). Although Li et al. showed that the statistical methods performed better than the rescue method (Mortazavi et al., 2008), their methods cannot handle the gapped alignment of reads, i.e. an alignment with insertions into reference sequences or deletions from reference sequences. When reads generated from high-throughput sequencers become longer with some insertion or deletion errors, it will be more difficult to map the reads to reference sequences without allowing gapped alignment. Hence, the limitation of methods that are unable to handle gapped alignments of reads is a huge drawback when more sophisticated mapping tools such as Bowtie2 ß The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: TIGAR by variational Bayesian inference (Trapnell et al., 2012) or Novoalign (www.novocraft.com) are considered. Also, the EM algorithm used in these statistical methods tends to overfit the model parameters to the RNA-Seq data, i.e. the methods try to estimate as many transcript isoforms as possible, as long as the likelihood function increases. It is problematic, for example, when there are many transcript isoforms whose sequences are similar to the one that is actually expressed, and the read sequences contain base substitution, insertion and deletion errors. As a result of the maximum likelihood estimation, the method might predict false positives (‘spurious’ transcript isoforms), and hence, the transcript abundances estimated for true isoforms are affected. One strategy to avoid such overfitting is to consider a model selection framework based on the marginal likelihood (Cooper and Herskovits, 1992), where a tradeoff of matching of sequence reads and the number of isoforms can be considered. However, this involves integrating over all latent variables as well as model parameters, which is usually difficult and intractable to compute. In this article, we propose a statistical method named TIGAR, a transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian (VB) (...truncated)