TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference
BIOINFORMATICS
ORIGINAL PAPER
Gene expression
Vol. 29 no. 18 2013, pages 2292–2299
doi:10.1093/bioinformatics/btt381
Advance Access publication July 2, 2013
TIGAR: transcript isoform abundance estimation method
with gapped alignment of RNA-Seq data by variational
Bayesian inference
Naoki Nariai1,*, Osamu Hirose2, Kaname Kojima1 and Masao Nagasaki1
1
Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, Seiryo-machi,
Aoba-ku, Sendai, Miyagi, 980-8575, Japan and 2Faculty of Electrical and Computer Engineering, Institute of Science
and Engineering, Kanazawa University, Kakuma, Kanazawa, Ishikawa, 920-1192, Japan
Associate Editor: Martin Bishop
ABSTRACT
Motivation: Many human genes express multiple transcript isoforms
through alternative splicing, which greatly increases diversity of protein
function. Although RNA sequencing (RNA-Seq) technologies have
been widely used in measuring amounts of transcribed mRNA, accurate estimation of transcript isoform abundances from RNA-Seq data is
challenging because reads often map to more than one transcript
isoforms or paralogs whose sequences are similar to each other.
Results: We propose a statistical method to estimate transcript isoform abundances from RNA-Seq data. Our method can handle
gapped alignments of reads against reference sequences so that it
allows insertion or deletion errors within reads. The proposed method
optimizes the number of transcript isoforms by variational Bayesian
inference through an iterative procedure, and its convergence is guaranteed under a stopping criterion. On simulated datasets, our method
outperformed the comparable quantification methods in inferring transcript isoform abundances, and at the same time its rate of convergence was faster than that of the expectation maximization algorithm.
We also applied our method to RNA-Seq data of human cell line samples, and showed that our prediction result was more consistent
among technical replicates than those of other methods.
Availability: An implementation of our method is available at http://
github.com/nariai/tigar
Contact:
Supplementary Information: Supplementary data are available at
Bioinformatics online.
Received on February 14, 2013; revised on June 11, 2013; accepted
on June 27, 2013
1
INTRODUCTION
Alternative splicing is a biological process in which an exon can
be either included or excluded, or there can be several splice sites
so that it allows a gene to have multiple forms of proteins. A
recent study suggested that 490% of human genes undergo alternative splicing, and that many alternative splicing events vary
among tissues (Wang et al., 2008). It has also been reported that
some aberrant splicing results in human diseases, such as cystic
fibrosis and spinal muscular atrophy (Garcia-Blanco et al.,
2004). Hence, it is important to identify transcript isoforms
*To whom correspondence should be addressed.
2292
that are expressed in particular tissues under some conditions,
and not in others.
To quantify transcription levels, RNA-Seq that uses highthroughput sequencing of cDNA has been widely used.
Advantages of RNA-Seq over other conventional methods,
such as DNA microarrays, include precise measurements of
levels of transcripts and splice patterns because the sequencebased approach directly determines mRNA sequences (Wang
et al., 2009). Probabilistic approaches have been proposed to
compute the relative frequencies of alternative splice isoforms
for each gene, but these methods do not handle reads that map
to multiple gene loci, and hence do not estimate global transcript
abundances for each isoform (Jiang and Wong, 2009; Katz et al.,
2010). In most cases, reads are often shorter than a full transcript, and sequences of alternatively spliced transcript isoforms
and paralogs are often similar to each other (Mortazavi et al.,
2008). Hence, short reads generated by RNA-Seq do not always
map uniquely to reference cDNAs, which lead to inaccurate
estimation of transcript isoform abundances.
Several approaches have been proposed to resolve ambiguous
reads that map to multiple transcript isoforms (multi-map reads).
One ad hoc approach is to allocate fractions of multimapped
reads to target transcript isoforms equally, which is implemented
as the default option in Cufflinks (Trapnell et al., 2010). Another
approach is to allocate fractions of the reads in proportion to the
coverage of uniquely mapped reads divided by the length of the
transcript isoforms (‘rescue’ method) (Mortazavi et al., 2008),
which is implemented as the ‘-u’ option in the latest version of
Cufflinks. Statistical methods that use a generative model of
RNA-Seq data have been proposed, in which the transcript isoform abundance is estimated as a latent random variable by the
expectation maximization (EM) algorithm (Li and Dewey, 2011;
Li et al., 2010; Nicolae et al., 2011). Although Li et al. showed
that the statistical methods performed better than the rescue
method (Mortazavi et al., 2008), their methods cannot handle
the gapped alignment of reads, i.e. an alignment with insertions
into reference sequences or deletions from reference sequences.
When reads generated from high-throughput sequencers become
longer with some insertion or deletion errors, it will be more
difficult to map the reads to reference sequences without allowing gapped alignment. Hence, the limitation of methods that are
unable to handle gapped alignments of reads is a huge drawback
when more sophisticated mapping tools such as Bowtie2
ß The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail:
TIGAR by variational Bayesian inference
(Trapnell et al., 2012) or Novoalign (www.novocraft.com) are
considered.
Also, the EM algorithm used in these statistical methods tends
to overfit the model parameters to the RNA-Seq data, i.e. the
methods try to estimate as many transcript isoforms as possible,
as long as the likelihood function increases. It is problematic, for
example, when there are many transcript isoforms whose sequences are similar to the one that is actually expressed, and
the read sequences contain base substitution, insertion and deletion errors. As a result of the maximum likelihood estimation,
the method might predict false positives (‘spurious’ transcript
isoforms), and hence, the transcript abundances estimated for
true isoforms are affected. One strategy to avoid such overfitting
is to consider a model selection framework based on the marginal likelihood (Cooper and Herskovits, 1992), where a tradeoff of matching of sequence reads and the number of isoforms
can be considered. However, this involves integrating over all
latent variables as well as model parameters, which is usually
difficult and intractable to compute.
In this article, we propose a statistical method named TIGAR,
a transcript isoform abundance estimation method with gapped
alignment of RNA-Seq data by variational Bayesian (VB) (...truncated)