GPrank: an R package for detecting dynamic elements from genome-wide time series (pdf)

Article PDF cannot be displayed. You can download it here:

https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-018-2370-4

GPrank: an R package for detecting dynamic elements from genome-wide time series

Topa and Honkela BMC Bioinformatics (2018) 19:367 https://doi.org/10.1186/s12859-018-2370-4 S O FT W A R E Open Access GPrank: an R package for detecting dynamic elements from genome-wide time series Hande Topa1,2* and Antti Honkela3,4 Abstract Background: Genome-wide high-throughput sequencing (HTS) time series experiments are a powerful tool for monitoring various genomic elements over time. They can be used to monitor, for example, gene or transcript expression with RNA sequencing (RNA-seq), DNA methylation levels with bisulfite sequencing (BS-seq), or abundances of genetic variants in populations with pooled sequencing (Pool-seq). However, because of high experimental costs, the time series data sets often consist of a very limited number of time points with very few or no biological replicates, posing challenges in the data analysis. Results: Here we present the GPrank R package for modelling genome-wide time series by incorporating variance information obtained during pre-processing of the HTS data using probabilistic quantification methods or from a beta-binomial model using sequencing depth. GPrank is well-suited for analysing both short and irregularly sampled time series. It is based on modelling each time series by two Gaussian process (GP) models, namely, time-dependent and time-independent GP models, and comparing the evidence provided by data under two models by computing their Bayes factor (BF). Genomic elements are then ranked by their BFs, and temporally most dynamic elements can be identified. Conclusions: Incorporating the variance information helps GPrank avoid false positives without compromising computational efficiency. Fitted models can be easily further explored in a browser. Detection and visualisation of temporally most active dynamic elements in the genome can provide a good starting point for further downstream analyses for increasing our understanding of the studied processes. Keywords: Gaussian process, High-throughput sequencing, Time series, Ranking, Bayes factor, Visualization, R Background Advances in high-throughput sequencing (HTS) technologies have facilitated carrying out genome-wide time series experiments which contain more information on the dynamics of biological processes than static experiments do. With these experiments, thousands or millions of genomic elements can be simultaneously measured at a number of time points, allowing us to study the changes in their abundances over time, and hence to model their *Correspondence: Institute for Molecular Medicine Finland FIMM, University of Helsinki, 00014 Helsinki, Finland 2 Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, 00076 Espoo, Finland Full list of author information is available at the end of the article 1 responses to various external stimuli such as a drug treatment or a change in environment. Furthermore, detection of temporally most active elements in the genomes, transcriptomes, or epigenomes of the organisms can lead to a subset of genetic elements which are potentially biologically more relevant to the studied process than those which stay unchanged. This subset of genetic elements can then form a basis for further downstream analyses to elucidate and validate their functions in the studied processes. On the other hand, despite the huge potential of HTS time series experiments, analysis of the currently available HTS time series data sets is complicated due to various factors depending on the experimental design and the properties of the HTS platforms used. First of all, these © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Topa and Honkela BMC Bioinformatics (2018) 19:367 time series often consist of small number of time points which are irregularly sampled, making the estimation of the underlying temporal function challenging, and they have too few biological replicates for accurate estimation of biological variance. Moreover, the properties of the HTS platforms such as short read lengths and varying sequencing depth levels lead to uncertain quantification of the genetic elements. Taking these characteristics of the data as well as the sources of uncertainty into account in the downstream analyses such as differential expression (DE) analyses is very important for avoiding large numbers of false positives or false negatives. This becomes especially important in large-scale studies like genome-wide experiments, as finding differentially expressed genes among tens of thousands of genes requires robust statistical methods which can differentiate true changes from changes occurring due to noise. Detection of differentially expressed genes from HTS time series is handled in different ways by different methods. For example, some methods treat time points as independent factors and apply statistical hypothesis testing to detect statistically significant changes in gene expression between different time points. For example, edgeR [1], DESeq2 [2], limma-voom [3], next maSigPro [4] are commonly used methods to detect DE between different time points by modelling RNA-seq read counts with generalized linear models which treat the time points as unordered factors. Recently, methods which take into account the temporal correlation between observations in RNA-seq experiments have been developed by using hidden Markov models (HMMs) [5], cubic spline regression [6], and Gaussian process (GP) regression [7–12]. Similarly, in population genetics, several methods taking into account the temporal correlations between allele frequencies in successive generations have been developed by using HMMs based on the Wright–Fisher model [13, 14], which usually assume a large population size and a long time span. Recently developed CLEAR method [15] improves the HMM models by making them applicable to data sets obtained from small populations such as Pool-seq time series in evolve and resequence (E&R) [16] studies. GPs provide a powerful technique for modelling sparse time series which are encountered frequently in genomic studies where the number of replication and the length of time series are limited by the experiment budget. However, most of the existing methods employing GPs for HTS time series modelling are either not available as software, or the existing software such as DyNB [10] has been implemented in Matlab, limiting the public accessibility of the software. In our e (...truncated)