Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data

Bioinformatics, Dec 2015

Motivation: Because of the advantages of RNA sequencing (RNA-Seq) over microarrays, it is gaining widespread popularity for highly parallel gene expression analysis. For example, RNA-Seq is expected to be able to provide accurate identification and quantification of full-length splice forms. A number of informatics packages have been developed for this purpose, but short reads make it a difficult problem in principle. Sequencing error and polymorphisms add further complications. It has become necessary to perform studies to determine which algorithms perform best and which if any algorithms perform adequately. However, there is a dearth of independent and unbiased benchmarking studies. Here we take an approach using both simulated and experimental benchmark data to evaluate their accuracy. Results: We conclude that most methods are inaccurate even using idealized data, and that no method is highly accurate once multiple splice forms, polymorphisms, intron signal, sequencing errors, alignment errors, annotation errors and other complicating factors are present. These results point to the pressing need for further algorithm development. Availability and implementation: Simulated datasets and other supporting information can be found at http://bioinf.itmat.upenn.edu/BEERS/bp2 Supplementary information: Supplementary data are available at Bioinformatics online. Contact: hayer{at}upenn.edu

Article PDF cannot be displayed. You can download it here:

https://bioinformatics.oxfordjournals.org/content/31/24/3938.full.pdf

Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data

Bioinformatics, 31(24), 2015, 3938–3945 doi: 10.1093/bioinformatics/btv488 Advance Access Publication Date: 3 September 2015 Original Paper Gene expression Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data Katharina E. Hayer1,*, Angel Pizarro2, Nicholas F. Lahens3, John B. Hogenesch3 and Gregory R. Grant1,4 1 University of Pennsylvania, Institute for Translational Medicine and Therapeutics, Philadelphia, PA 19104, Scientific Computing at Amazon Web Services, Seattle, WA 98108, 3Department of Pharmacology and 4 Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA 2 *To whom correspondence should be addressed. Associate Editor: Alfonso Valencia Received on August 6, 2014; revised on July 28, 2015; accepted on August 17, 2015 Abstract Motivation: Because of the advantages of RNA sequencing (RNA-Seq) over microarrays, it is gaining widespread popularity for highly parallel gene expression analysis. For example, RNA-Seq is expected to be able to provide accurate identification and quantification of full-length splice forms. A number of informatics packages have been developed for this purpose, but short reads make it a difficult problem in principle. Sequencing error and polymorphisms add further complications. It has become necessary to perform studies to determine which algorithms perform best and which if any algorithms perform adequately. However, there is a dearth of independent and unbiased benchmarking studies. Here we take an approach using both simulated and experimental benchmark data to evaluate their accuracy. Results: We conclude that most methods are inaccurate even using idealized data, and that no method is highly accurate once multiple splice forms, polymorphisms, intron signal, sequencing errors, alignment errors, annotation errors and other complicating factors are present. These results point to the pressing need for further algorithm development. Availability and implementation: Simulated datasets and other supporting information can be found at http://bioinf.itmat.upenn.edu/BEERS/bp2 Supplementary information: Supplementary data are available at Bioinformatics online. Contact: 1 Introduction We first fix some terminology. For our purposes a gene is a collection of transcripts, also called splice forms. A transcript is a collection of exons. An exon is a contiguous span of genomic coordinates. Two splice forms of the same gene can, and usually do, have some of the same exons. Exons which overlap but have different start and/ or end location are also possible. Two different splice forms may share all of their exons, as long as at least one of them differs by their start/end coordinates. Typically for any given gene in any given cell, some of its splice forms are expressed and others are absent. C The Author 2015. Published by Oxford University Press. V One of the primary goals of high throughput RNA-Sequencing (RNA-Seq) is to accurately identify the full-length structure of the transcripts that are present, and their relative abundances, so that researchers can focus on the most relevant splice forms in their system of interest. This is a very difficult problem however. Most human and mouse genes have many exons (Fig. 1, left) and are annotated with multiple splice forms. We observe that 35% of genes may express at least two forms under normal conditions (see Fig. 1, right). As methods improve and the number of tissues that are deep sequenced increases, the annotation will only get more complex. 3938 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Benchmark analysis of algorithms Fig. 1. Left: Shows number of mouse mm9 ENSEMBL transcripts as a function of the number of exons. 90% of transcripts have multiple exons. 65% have >5 and 35% have >10. Right: Distribution of the minimum number of splice forms necessary to explain the RNA-Seq junctions in 300 M read pairs of mouse Liver (Zhang et al., 2014). This is based on the first 200 RefSeq genes annotated on Chromosome 1 Moreover, for species without genome sequence available, de novo RNA-Seq and transcript assembly should provide an efficient means of gene discovery. By visually inspecting depth-of-coverage plots and representations of spliced reads in a genome browser, it is easy to see that a majority of genes are tractable and the expressed splice forms can be determined (Fig. 2A). Here the data can be explained completely by the one annotated splice form. Blue junctions indicate the existence of reads which mapped cleanly across an annotated splice junction (‘clean’ here means uniquely and with at least eight bases on each side). Note that the blue directional spans above the coverage plot indicate single reads (not read-pairs) spliced across intron-sized gaps. In contrast, Figure 2B shows a region with a significant number of reads spliced from one gene to another, as well as entirely unannotated genes. Green junctions indicate the existence of reads which mapped cleanly across unannotated junctions. A number of Fig. 3. This shows the depth of coverage of a full-length cDNA clone, which has been transcribed and subjected to the Ribo-Zero (red) and PolyA selection (orange) protocols for removal of ribosomal RNA. Both protocols result in extreme local bias (Lahens et al., 2014). PolyA causes 30 bias (note this gene is oriented on the reverse strand) such genes and regions as shown in Figure 2B typically occur in every RNA-Seq experiment and present the challenging cases for transcript level analysis. The nature of current RNA-Seq, resulting in short error prone reads, alignment artifacts, bias introduced by the library construction process etc., introduces noise into this already difficult problem. For example the ribosomal depletion protocols invariably cause a severe deviation of signal from uniform across a single transcript. Figure 3 shows the result of sequencing a single full length cDNA clone after reverse transcription followed by Ribo-Zero (top) and polyA selection (bottom) (Lahens et al. 2014). In both cases the deviation from uniform is marked. At the present time there is no known protocol that doesn’t suffer from this kind of issue. Most algorithms assume at least reasonably uniform coverage across each expressed transcript. Despite these challenges, many algorithms have been developed to infer full-length transcripts de novo from short read RNA-Seq data, of which Cufflinks (Trapnell et al., 2010) is the most widely used. Other methods are StringTie (Pertea et al. 2015), Scripture (Guttman et al., 2010), Trinity (Grabherr et al., 2011), Oases (Velvet) (Schultz et al., 2012), Soap-denovo-trans (Xie et al., 2014), iReckon (Mezlini et al., 2013), CLASS (Song et a (...truncated)


This is a preview of a remote PDF: https://bioinformatics.oxfordjournals.org/content/31/24/3938.full.pdf
Article home page: http://bioinformatics.oxfordjournals.org/content/31/24/3938.abstract

Katharina E. Hayer, Angel Pizarro, Nicholas F. Lahens, John B. Hogenesch, Gregory R. Grant. Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data, Bioinformatics, 2015, pp. 3938-3945, 31/24, DOI: 10.1093/bioinformatics/btv488