Comprehensive evaluation of RNA-seq quantification methods for linearity (pdf)

Article PDF cannot be displayed. You can download it here:

https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-017-1526-y

Comprehensive evaluation of RNA-seq quantification methods for linearity

The Author(s) BMC Bioinformatics 2017, 18(Suppl 4):117 DOI 10.1186/s12859-017-1526-y R ES EA R CH Open Access Comprehensive evaluation of RNA-seq quantification methods for linearity Haijing Jin1 , Ying-Wooi Wan2 and Zhandong Liu3* From Third International Workshop on Computational Network Biology: Modeling, Analysis, and Control (CNB-MAC 2016) Seattle, WA, USA. 02-Oct-16 Abstract Background: Deconvolution is a mathematical process of resolving an observed function into its constituent elements. In the field of biomedical research, deconvolution analysis is applied to obtain single cell-type or tissue specific signatures from a mixed signal and most of them follow the linearity assumption. Although recent development of next generation sequencing technology suggests RNA-seq as a fast and accurate method for obtaining transcriptomic profiles, few studies have been conducted to investigate best RNA-seq quantification methods that yield the optimum linear space for deconvolution analysis. Results: Using a benchmark RNA-seq dataset, we investigated the linearity of abundance estimated from seven most popular RNA-seq quantification methods both at the gene and isoform levels. Linearity is evaluated through parameter estimation, concordance analysis and residual analysis based on a multiple linear regression model. Results show that count data gives poor parameter estimations, large intercepts and high inter-sample variability; while TPM value from Kallisto and Salmon shows high linearity in all analyses. Conclusions: Salmon and Kallisto TPM data gives the best fit to the linear model studied. This suggests that TPM values estimated from Salmon and Kallisto are the ideal RNA-seq measurements for deconvolution studies. Keywords: RNA-seq, Deconvolution, Linearity Background Next-generation sequencing based technology for RNA profiling (RNA-seq) has become the predominant method to quantify the transcript abundance in cells. Compared to microarray technology, RNA-seq offers broader quantification range and enables the detection of novel transcripts [1]. However, due to the fragmentation of sequencing material, there is greater complexity in quantification and analysis of RNA-seq data [2]. Current state-of-the-art quantification tools for RNA-seq data can be divided into two major categories [3]: alignment-based and alignmentfree. Alignment-based quantification methods will first map each sequenced reads to a reference genome or transcriptome and then estimate the abundance of transcripts *Correspondence: Department of Pediatrics-Neurology, Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine, 1250 Moursund St., Suite 1325, 77030 Houston, TX, USA Full list of author information is available at the end of the article 3 based on the alignment. Alignment-free quantification methods rely on light-weight pseudo-alignment in k-mer space to quantify the transcript abundance. An analytic challenge raised from these quantification methods is that different method generates abundance measurements in different units, including counts, FPKM (Fragments Per Kilobase of transcript per Million mapped reads), RPKM (Reads Per Kilobase of transcript per Million mapped reads), and TPM (Transcripts Per Million) [4]. Furthermore, various transformation strategies can be applied to quantification values in purpose of specific downstream analysis like differential gene expression analyses [5] or novel splicing site detection [6]. Although several studies have provided assessment of analysis tools for RNA-seq data, little consensus on the optimal analysis pipeline is obtained [4, 6–9]. © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. The Author(s) BMC Bioinformatics 2017, 18(Suppl 4):117 Deconvolution is a mathematical process used to extract constituent elements from a mixture of multiple signals [10]. In the field of biomedical research, deconvolution is widely applied to retrieve cell-type or tissue specific gene expression profiles from heterogeneous tissue samples. Most deconvolution algorithms in the literature assume a linear model [10–17], in which the expression signal of the mixture is a weighted sum of the expression for its constitutive cell types. Previous analysis has shown the necessity of using anti-log expression microarray data to avoid unwanted bias introduced by non-linear transformation [18]. However, no study has assessed the linearity of transcript abundance in RNA-seq data. Therefore, in this study, we conducted a comprehensive comparison of seven RNA-seq quantification methods on the linearity of the estimated abundance using a deep sequencing dataset where RNA samples were mixed at known proportions. Our results will provide a good recommendation to researchers considering deconvolution on RNA-seq data. Results Data We employed the benchmark dataset used to assess RNAseq measurement performance in different application sites and platforms from the Sequencing Quality Control (SEQC) project [19]. In order to have minimal intersample variability in the linearity evaluation analyses, we included samples from the same platform (Illumina HiSeq 2000) and same sequencing center (NVS). Specifically, raw sequenced reads for four biological replicates of four types of samples (A, B, C, D) were obtained; where sample A is derived from universal human reference RNA, sample B is derived from human brain reference RNA, sample C is obtained by mixing A and B in ratio 3:1, and sample D is obtained by mixing A and B in ratio 1:3. Out of 12 samples from A, B and C, nine samples have about eighty million pairs of raw reads and three samples have double the depth. Overall, the mappability of all the samples is around 70–80%. A brief summary about the samples is given in Additional file 6: Table S1. Quantification methods We performed a literature survey and selected seven prevalent quantification methods for comparison. To increase the comparability of the estimated transcript abundance, all the alignment-based quantification methods were applied on mappings processed with Tophat2 [20]. HTSeq-count [21] provides the number of reads/fragments mapped unambiguously to a single feature, referred as count. Cufflinks [22] , which is also the most popular quantification method, uses comparative algorithm assembly to produce minimal set of transcript supported by the transcript alignment. The resulting transcript ab (...truncated)