Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data
Bioinformatics, 31(24), 2015, 3938–3945
doi: 10.1093/bioinformatics/btv488
Advance Access Publication Date: 3 September 2015
Original Paper
Gene expression
Benchmark analysis of algorithms for
determining and quantifying full-length mRNA
splice forms from RNA-seq data
Katharina E. Hayer1,*, Angel Pizarro2, Nicholas F. Lahens3,
John B. Hogenesch3 and Gregory R. Grant1,4
1
University of Pennsylvania, Institute for Translational Medicine and Therapeutics, Philadelphia, PA 19104,
Scientific Computing at Amazon Web Services, Seattle, WA 98108, 3Department of Pharmacology and
4
Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA
2
*To whom correspondence should be addressed.
Associate Editor: Alfonso Valencia
Received on August 6, 2014; revised on July 28, 2015; accepted on August 17, 2015
Abstract
Motivation: Because of the advantages of RNA sequencing (RNA-Seq) over microarrays, it is gaining widespread popularity for highly parallel gene expression analysis. For example, RNA-Seq is
expected to be able to provide accurate identification and quantification of full-length splice forms.
A number of informatics packages have been developed for this purpose, but short reads make it a
difficult problem in principle. Sequencing error and polymorphisms add further complications. It
has become necessary to perform studies to determine which algorithms perform best and which
if any algorithms perform adequately. However, there is a dearth of independent and unbiased
benchmarking studies. Here we take an approach using both simulated and experimental benchmark data to evaluate their accuracy.
Results: We conclude that most methods are inaccurate even using idealized data, and that no
method is highly accurate once multiple splice forms, polymorphisms, intron signal, sequencing
errors, alignment errors, annotation errors and other complicating factors are present. These results point to the pressing need for further algorithm development.
Availability and implementation: Simulated datasets and other supporting information can be
found at http://bioinf.itmat.upenn.edu/BEERS/bp2
Supplementary information: Supplementary data are available at Bioinformatics online.
Contact:
1 Introduction
We first fix some terminology. For our purposes a gene is a collection of transcripts, also called splice forms. A transcript is a collection of exons. An exon is a contiguous span of genomic coordinates.
Two splice forms of the same gene can, and usually do, have some
of the same exons. Exons which overlap but have different start and/
or end location are also possible. Two different splice forms may
share all of their exons, as long as at least one of them differs by
their start/end coordinates. Typically for any given gene in any given
cell, some of its splice forms are expressed and others are absent.
C The Author 2015. Published by Oxford University Press.
V
One of the primary goals of high throughput RNA-Sequencing
(RNA-Seq) is to accurately identify the full-length structure of the
transcripts that are present, and their relative abundances, so that researchers can focus on the most relevant splice forms in their system
of interest. This is a very difficult problem however. Most human
and mouse genes have many exons (Fig. 1, left) and are annotated
with multiple splice forms. We observe that 35% of genes may express at least two forms under normal conditions (see Fig. 1, right).
As methods improve and the number of tissues that are deep
sequenced increases, the annotation will only get more complex.
3938
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Benchmark analysis of algorithms
Fig. 1. Left: Shows number of mouse mm9 ENSEMBL transcripts as a function of the number of exons. 90% of transcripts have multiple exons. 65%
have >5 and 35% have >10. Right: Distribution of the minimum number of
splice forms necessary to explain the RNA-Seq junctions in 300 M read pairs
of mouse Liver (Zhang et al., 2014). This is based on the first 200 RefSeq
genes annotated on Chromosome 1
Moreover, for species without genome sequence available, de novo
RNA-Seq and transcript assembly should provide an efficient means
of gene discovery.
By visually inspecting depth-of-coverage plots and representations of spliced reads in a genome browser, it is easy to see that a
majority of genes are tractable and the expressed splice forms can be
determined (Fig. 2A). Here the data can be explained completely by
the one annotated splice form. Blue junctions indicate the existence
of reads which mapped cleanly across an annotated splice junction
(‘clean’ here means uniquely and with at least eight bases on each
side). Note that the blue directional spans above the coverage plot
indicate single reads (not read-pairs) spliced across intron-sized
gaps. In contrast, Figure 2B shows a region with a significant number of reads spliced from one gene to another, as well as entirely
unannotated genes. Green junctions indicate the existence of reads
which mapped cleanly across unannotated junctions. A number of
Fig. 3. This shows the depth of coverage of a full-length cDNA clone, which
has been transcribed and subjected to the Ribo-Zero (red) and PolyA selection (orange) protocols for removal of ribosomal RNA. Both protocols result
in extreme local bias (Lahens et al., 2014). PolyA causes 30 bias (note this
gene is oriented on the reverse strand)
such genes and regions as shown in Figure 2B typically occur in
every RNA-Seq experiment and present the challenging cases for
transcript level analysis. The nature of current RNA-Seq, resulting
in short error prone reads, alignment artifacts, bias introduced by
the library construction process etc., introduces noise into this already difficult problem. For example the ribosomal depletion protocols invariably cause a severe deviation of signal from uniform
across a single transcript. Figure 3 shows the result of sequencing a
single full length cDNA clone after reverse transcription followed by
Ribo-Zero (top) and polyA selection (bottom) (Lahens et al. 2014).
In both cases the deviation from uniform is marked. At the present
time there is no known protocol that doesn’t suffer from this kind of
issue. Most algorithms assume at least reasonably uniform coverage
across each expressed transcript.
Despite these challenges, many algorithms have been developed
to infer full-length transcripts de novo from short read RNA-Seq
data, of which Cufflinks (Trapnell et al., 2010) is the most widely
used. Other methods are StringTie (Pertea et al. 2015), Scripture
(Guttman et al., 2010), Trinity (Grabherr et al., 2011), Oases
(Velvet) (Schultz et al., 2012), Soap-denovo-trans (Xie et al., 2014),
iReckon (Mezlini et al., 2013), CLASS (Song et a (...truncated)