Comprehensive evaluation of RNA-seq quantification methods for linearity
The Author(s) BMC Bioinformatics 2017, 18(Suppl 4):117
DOI 10.1186/s12859-017-1526-y
R ES EA R CH
Open Access
Comprehensive evaluation of RNA-seq
quantification methods for linearity
Haijing Jin1 , Ying-Wooi Wan2 and Zhandong Liu3*
From Third International Workshop on Computational Network Biology: Modeling, Analysis, and Control (CNB-MAC 2016)
Seattle, WA, USA. 02-Oct-16
Abstract
Background: Deconvolution is a mathematical process of resolving an observed function into its constituent
elements. In the field of biomedical research, deconvolution analysis is applied to obtain single cell-type or tissue
specific signatures from a mixed signal and most of them follow the linearity assumption. Although recent
development of next generation sequencing technology suggests RNA-seq as a fast and accurate method for
obtaining transcriptomic profiles, few studies have been conducted to investigate best RNA-seq quantification
methods that yield the optimum linear space for deconvolution analysis.
Results: Using a benchmark RNA-seq dataset, we investigated the linearity of abundance estimated from seven most
popular RNA-seq quantification methods both at the gene and isoform levels. Linearity is evaluated through
parameter estimation, concordance analysis and residual analysis based on a multiple linear regression model. Results
show that count data gives poor parameter estimations, large intercepts and high inter-sample variability; while TPM
value from Kallisto and Salmon shows high linearity in all analyses.
Conclusions: Salmon and Kallisto TPM data gives the best fit to the linear model studied. This suggests that TPM
values estimated from Salmon and Kallisto are the ideal RNA-seq measurements for deconvolution studies.
Keywords: RNA-seq, Deconvolution, Linearity
Background
Next-generation sequencing based technology for RNA
profiling (RNA-seq) has become the predominant method
to quantify the transcript abundance in cells. Compared to
microarray technology, RNA-seq offers broader quantification range and enables the detection of novel transcripts
[1]. However, due to the fragmentation of sequencing
material, there is greater complexity in quantification and
analysis of RNA-seq data [2]. Current state-of-the-art
quantification tools for RNA-seq data can be divided into
two major categories [3]: alignment-based and alignmentfree. Alignment-based quantification methods will first
map each sequenced reads to a reference genome or transcriptome and then estimate the abundance of transcripts
*Correspondence:
Department of Pediatrics-Neurology, Jan and Dan Duncan Neurological
Research Institute, Baylor College of Medicine, 1250 Moursund St., Suite 1325,
77030 Houston, TX, USA
Full list of author information is available at the end of the article
3
based on the alignment. Alignment-free quantification
methods rely on light-weight pseudo-alignment in k-mer
space to quantify the transcript abundance. An analytic
challenge raised from these quantification methods is that
different method generates abundance measurements in
different units, including counts, FPKM (Fragments Per
Kilobase of transcript per Million mapped reads), RPKM
(Reads Per Kilobase of transcript per Million mapped
reads), and TPM (Transcripts Per Million) [4]. Furthermore, various transformation strategies can be applied to
quantification values in purpose of specific downstream
analysis like differential gene expression analyses [5] or
novel splicing site detection [6]. Although several studies
have provided assessment of analysis tools for RNA-seq
data, little consensus on the optimal analysis pipeline is
obtained [4, 6–9].
© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
The Author(s) BMC Bioinformatics 2017, 18(Suppl 4):117
Deconvolution is a mathematical process used to extract
constituent elements from a mixture of multiple signals
[10]. In the field of biomedical research, deconvolution is
widely applied to retrieve cell-type or tissue specific gene
expression profiles from heterogeneous tissue samples.
Most deconvolution algorithms in the literature assume
a linear model [10–17], in which the expression signal of
the mixture is a weighted sum of the expression for its
constitutive cell types. Previous analysis has shown the
necessity of using anti-log expression microarray data to
avoid unwanted bias introduced by non-linear transformation [18]. However, no study has assessed the linearity
of transcript abundance in RNA-seq data. Therefore, in
this study, we conducted a comprehensive comparison
of seven RNA-seq quantification methods on the linearity of the estimated abundance using a deep sequencing
dataset where RNA samples were mixed at known proportions. Our results will provide a good recommendation to
researchers considering deconvolution on RNA-seq data.
Results
Data
We employed the benchmark dataset used to assess RNAseq measurement performance in different application
sites and platforms from the Sequencing Quality Control
(SEQC) project [19]. In order to have minimal intersample variability in the linearity evaluation analyses, we
included samples from the same platform (Illumina HiSeq
2000) and same sequencing center (NVS). Specifically, raw
sequenced reads for four biological replicates of four types
of samples (A, B, C, D) were obtained; where sample A
is derived from universal human reference RNA, sample
B is derived from human brain reference RNA, sample C
is obtained by mixing A and B in ratio 3:1, and sample
D is obtained by mixing A and B in ratio 1:3. Out of 12
samples from A, B and C, nine samples have about eighty
million pairs of raw reads and three samples have double the depth. Overall, the mappability of all the samples
is around 70–80%. A brief summary about the samples is
given in Additional file 6: Table S1.
Quantification methods
We performed a literature survey and selected seven
prevalent quantification methods for comparison. To
increase the comparability of the estimated transcript
abundance, all the alignment-based quantification
methods were applied on mappings processed with
Tophat2 [20]. HTSeq-count [21] provides the number of
reads/fragments mapped unambiguously to a single feature, referred as count. Cufflinks [22] , which is also the
most popular quantification method, uses comparative
algorithm assembly to produce minimal set of transcript
supported by the transcript alignment. The resulting
transcript ab (...truncated)