GPrank: an R package for detecting dynamic elements from genome-wide time series
Topa and Honkela BMC Bioinformatics (2018) 19:367
https://doi.org/10.1186/s12859-018-2370-4
S O FT W A R E
Open Access
GPrank: an R package for detecting
dynamic elements from genome-wide time
series
Hande Topa1,2*
and Antti Honkela3,4
Abstract
Background: Genome-wide high-throughput sequencing (HTS) time series experiments are a powerful tool for
monitoring various genomic elements over time. They can be used to monitor, for example, gene or transcript expression
with RNA sequencing (RNA-seq), DNA methylation levels with bisulfite sequencing (BS-seq), or abundances of genetic
variants in populations with pooled sequencing (Pool-seq). However, because of high experimental costs, the time
series data sets often consist of a very limited number of time points with very few or no biological replicates, posing
challenges in the data analysis.
Results: Here we present the GPrank R package for modelling genome-wide time series by incorporating variance
information obtained during pre-processing of the HTS data using probabilistic quantification methods or from a
beta-binomial model using sequencing depth. GPrank is well-suited for analysing both short and irregularly sampled
time series. It is based on modelling each time series by two Gaussian process (GP) models, namely, time-dependent
and time-independent GP models, and comparing the evidence provided by data under two models by computing
their Bayes factor (BF). Genomic elements are then ranked by their BFs, and temporally most dynamic elements can
be identified.
Conclusions: Incorporating the variance information helps GPrank avoid false positives without compromising
computational efficiency. Fitted models can be easily further explored in a browser. Detection and visualisation of
temporally most active dynamic elements in the genome can provide a good starting point for further downstream
analyses for increasing our understanding of the studied processes.
Keywords: Gaussian process, High-throughput sequencing, Time series, Ranking, Bayes factor, Visualization, R
Background
Advances in high-throughput sequencing (HTS) technologies have facilitated carrying out genome-wide time
series experiments which contain more information on
the dynamics of biological processes than static experiments do. With these experiments, thousands or millions
of genomic elements can be simultaneously measured at
a number of time points, allowing us to study the changes
in their abundances over time, and hence to model their
*Correspondence:
Institute for Molecular Medicine Finland FIMM, University of Helsinki, 00014
Helsinki, Finland
2
Helsinki Institute for Information Technology HIIT, Department of Computer
Science, Aalto University, 00076 Espoo, Finland
Full list of author information is available at the end of the article
1
responses to various external stimuli such as a drug treatment or a change in environment. Furthermore, detection
of temporally most active elements in the genomes, transcriptomes, or epigenomes of the organisms can lead to
a subset of genetic elements which are potentially biologically more relevant to the studied process than those
which stay unchanged. This subset of genetic elements
can then form a basis for further downstream analyses
to elucidate and validate their functions in the studied
processes.
On the other hand, despite the huge potential of HTS
time series experiments, analysis of the currently available
HTS time series data sets is complicated due to various
factors depending on the experimental design and the
properties of the HTS platforms used. First of all, these
© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Topa and Honkela BMC Bioinformatics (2018) 19:367
time series often consist of small number of time points
which are irregularly sampled, making the estimation of
the underlying temporal function challenging, and they
have too few biological replicates for accurate estimation of biological variance. Moreover, the properties of
the HTS platforms such as short read lengths and varying
sequencing depth levels lead to uncertain quantification
of the genetic elements.
Taking these characteristics of the data as well as the
sources of uncertainty into account in the downstream
analyses such as differential expression (DE) analyses is
very important for avoiding large numbers of false positives or false negatives. This becomes especially important
in large-scale studies like genome-wide experiments, as
finding differentially expressed genes among tens of thousands of genes requires robust statistical methods which
can differentiate true changes from changes occurring due
to noise.
Detection of differentially expressed genes from HTS
time series is handled in different ways by different methods. For example, some methods treat time points as independent factors and apply statistical hypothesis testing
to detect statistically significant changes in gene expression between different time points. For example, edgeR
[1], DESeq2 [2], limma-voom [3], next maSigPro [4] are
commonly used methods to detect DE between different time points by modelling RNA-seq read counts with
generalized linear models which treat the time points as
unordered factors.
Recently, methods which take into account the temporal correlation between observations in RNA-seq experiments have been developed by using hidden Markov models (HMMs) [5], cubic spline regression [6], and Gaussian
process (GP) regression [7–12].
Similarly, in population genetics, several methods taking
into account the temporal correlations between allele frequencies in successive generations have been developed
by using HMMs based on the Wright–Fisher model [13,
14], which usually assume a large population size and a
long time span. Recently developed CLEAR method [15]
improves the HMM models by making them applicable to
data sets obtained from small populations such as Pool-seq
time series in evolve and resequence (E&R) [16] studies.
GPs provide a powerful technique for modelling sparse
time series which are encountered frequently in genomic
studies where the number of replication and the length of
time series are limited by the experiment budget. However, most of the existing methods employing GPs for
HTS time series modelling are either not available as software, or the existing software such as DyNB [10] has been
implemented in Matlab, limiting the public accessibility of
the software.
In our e (...truncated)