EPEPT: A web service for enhanced P-value estimation in permutation tests
BMC Bioinformatics
Theo A Knijnenburg 0 1
Jake Lin 1
Hector Rovira 1
John Boyle 1
Ilya Shmulevich 1
0 Current Address: Bioinformatics and Statistics, Division of Molecular Biology, Netherlands Cancer Institute, Amsterdam, The Netherlands
1 Institute for Systems Biology, Seattle, WA, USA
Open Access
Background: In computational biology, permutation tests have become a widely used tool to assess the statistical
significance of an event under investigation. However, the common way of computing the P-value, which
expresses the statistical significance, requires a very large number of permutations when small (and thus
interesting) P-values are to be accurately estimated. This is computationally expensive and often infeasible.
Recently, we proposed an alternative estimator, which requires far fewer permutations compared to the standard
empirical approach while still reliably estimating small P-values [1].
Results: The proposed P-value estimator has been enriched with additional functionalities and is made available to
the general community through a public website and web service, called EPEPT. This means that the EPEPT
routines can be accessed not only via a website, but also programmatically using any programming language that
can interact with the web. Examples of web service clients in multiple programming languages can be
downloaded. Additionally, EPEPT accepts data of various common experiment types used in computational
biology. For these experiment types EPEPT first computes the permutation values and then performs the P-value
estimation. Finally, the source code of EPEPT can be downloaded.
Conclusions: Different types of users, such as biologists, bioinformaticians and software engineers, can use the
method in an appropriate and simple way.
Availability: http://informatics.systemsbiology.net/EPEPT/
Background
The permutation test (also called randomization test) is a
nonparametric procedure for determining statistical
significance based on rearrangements of the labels of a dataset
[2]. Due to its non-parametric nature, this test is
commonly used in bioinformatics applications, where there is
often no solid evidence or sufficient data to assume a
particular model for the obtained measurements of the
biological events under investigation. For example, Significance
Analysis of Microarrays (SAM) [3] and Gene Set
Enrichment Analysis (GSEA) [4], which detect differentially
expressed genes and gene sets, respectively, are two
well-known techniques that use permutation tests to compute
statistical significance.
In a permutation test, a test statistic, which is computed
from the dataset, is compared with the distribution of
permutation values. These permutation values are computed
similarly to the test statistic, but under a random
rearrangement (permutation) of the labels of the dataset. The
P-value of a permutation test, which expresses its
statistical significance, is obtained by performing all possible
label permutations and computing the fraction of
permutation values that are at least as extreme as the test statistic
obtained from the unpermuted data. In practice, however,
performing all possible permutations is almost never
feasible. Thus, the P-value is typically approximated
by computing a limited number of permutations, say N,
and then computing the fraction of the N permutation
values that are at least as extreme as the test statistic. This
empirical estimator ties both the smallest obtainable P-value
and the resolution of the P-value directly to the number of
permutations: with N permutations, both equal 1/N.
Therefore, a very large number of permutations is required
when small P-values are to be accurately estimated. To
improve upon the empirical estimator, we have employed
a tail estimation procedure based on extreme value theory
to estimate the tail of the distribution of permutation
values and subsequently the P-value [1]. We showed using
both theoretical and practical examples that up to several
orders of magnitude fewer permutations are necessary to
compute small P-values with the same accuracy as with
the empirical approach. This yields a large gain
in computation time. For realistic datasets and
the standard number of 1000 permutations, the speed-up
reduces CPU time by anywhere from a couple of
minutes to several hours for more complex statistics
(such as GSEA's running sum statistic). The approach is
outlined in Figure 1 and described in detail in [1].
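The difference between the two estimators can be illustrated with a minimal sketch. It is not the EPEPT implementation: the function names are hypothetical, "at least as extreme" is taken to mean the upper tail, the default of 250 exceedances is an arbitrary illustrative choice, and the generalized Pareto distribution is fitted with a simple method-of-moments estimate that is swapped in here for brevity (see [1] for the actual fitting procedure). The tail estimate is assumed to take the form (N_exc / N) * (1 - F(x0 - t)), with t the exceedance threshold.

```python
import math


def empirical_p_value(perm_values, x0):
    """Fraction of permutation values at least as extreme as x0 (upper tail)."""
    return sum(1 for x in perm_values if x >= x0) / len(perm_values)


def gpd_survival(z, shape, scale):
    """1 - F(z) for a generalized Pareto distribution F, with z >= 0."""
    if abs(shape) < 1e-9:
        return math.exp(-z / scale)  # exponential limit of the GPD
    base = 1.0 + shape * z / scale
    if base <= 0.0:
        return 0.0  # z lies beyond the finite upper endpoint (shape < 0)
    return base ** (-1.0 / shape)


def gpd_tail_p_value(perm_values, x0, n_exc=250):
    """Tail-based estimate (N_exc / N) * (1 - F(x0 - t)).

    Assumptions of this sketch: n_exc < len(perm_values), the threshold t
    lies midway between the n_exc-th and (n_exc + 1)-th largest values,
    and the GPD parameters come from a method-of-moments fit.
    """
    ordered = sorted(perm_values, reverse=True)
    n = len(ordered)
    t = 0.5 * (ordered[n_exc - 1] + ordered[n_exc])
    if x0 < t:
        # x0 is not in the tail, so the empirical estimate is used as is.
        return empirical_p_value(perm_values, x0)
    excesses = [x - t for x in ordered[:n_exc]]
    mean = sum(excesses) / n_exc
    var = sum((z - mean) ** 2 for z in excesses) / (n_exc - 1)
    ratio = mean * mean / var
    shape = 0.5 * (1.0 - ratio)          # method-of-moments shape
    scale = 0.5 * mean * (1.0 + ratio)   # method-of-moments scale
    return (n_exc / n) * gpd_survival(x0 - t, shape, scale)


# Idealized stand-in for permutation values: exact quantiles of an
# exponential null distribution, so the true P-value is exp(-x0).
N = 1000
perm_values = [-math.log((i - 0.5) / N) for i in range(1, N + 1)]
x0 = 10.0
print("empirical:", empirical_p_value(perm_values, x0))  # no value reaches x0
print("GPD tail: ", gpd_tail_p_value(perm_values, x0))
print("true:     ", math.exp(-x0))
```

In this idealized example the largest of the 1000 values lies below x0, so the empirical estimator cannot return anything but zero, whereas the tail-based estimator extrapolates beyond the observed values and returns a nonzero estimate that can be compared with the true value.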
The aim of EPEPT is to make this approach available to
the computational biology community as a general and
easily accessible tool. EPEPT, which stands for Enhanced
P-value Estimator for Permutation Tests, is a RESTful
web API that offers dynamic programmatic access. Users
submit job requests over the web either using their
programming language of choice or using the website.
EPEPT returns a unique URI corresponding to the
submitted job. Using this URI the status of the submitted job
can be checked, and upon completion, the results, i.e. the
estimated P-values, can be retrieved.
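The submit, poll and retrieve cycle described above can be sketched as a small client. Everything protocol-specific in this sketch is a placeholder rather than the documented EPEPT interface: the JSON field names "uri", "status" and "results", the status strings "completed" and "failed", and the submission URL are all hypothetical, and the example clients available from the EPEPT website should be consulted for the real request format.

```python
import json
import time
import urllib.request


def http_json(url, payload=None):
    """GET the URL, or POST the payload as JSON, and decode a JSON reply."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    request = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run_job(submit_url, payload, fetch=http_json, poll_seconds=5.0, max_polls=120):
    """Submit a job, poll its URI until it finishes, and return the results.

    Hypothetical protocol: the submission reply carries the job URI in "uri";
    the job document carries "status" and, once completed, "results".
    """
    job_uri = fetch(submit_url, payload)["uri"]
    for _ in range(max_polls):
        job = fetch(job_uri)
        if job["status"] == "completed":
            return job["results"]
        if job["status"] == "failed":
            raise RuntimeError(f"job failed: {job_uri}")
        time.sleep(poll_seconds)  # wait before checking the status again
    raise TimeoutError(f"job did not finish after {max_polls} polls: {job_uri}")


# Usage against a placeholder URL (not a real EPEPT endpoint):
# p_values = run_job("https://example.org/jobs", {"permutation_values": [...]})
```

Passing the transport function as the fetch argument keeps the polling logic independent of the HTTP library, so it can be exercised without network access.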
EPEPT can be used in two different settings. In the first
and most general setting, the user submits permutation
values and EPEPT estimates the P-values, i.e. EPEPT
does not generate the permutation statistics. In the
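[Figure 1: plot of the P-value against the test statistic x0 for the three quantities defined below.]

\[
P_{\mathrm{perm}} = \frac{1}{N_{\mathrm{all}}} \sum_{n=1}^{N_{\mathrm{all}}} I(x_n \ge x_0), \qquad
P_{\mathrm{ecdf}} = \frac{1}{N} \sum_{n=1}^{N} I(x_n \ge x_0), \qquad
P_{\mathrm{gpd}} = \frac{N_{\mathrm{exc}}}{N} \left(1 - F(x_0 - t)\right)
\]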
Figure 1 The P-value of a permutation test as a function of test statistic x0. Pperm is the correct P-value of the permutation test based on
all possible label permutations (Nall = 10^5 in this example). The Nall permutation values are visualized as gray crosses on the x-axis. Pecdf is the
standard empirical estimator of the P-value based on a limited set of N permutation values (N = 10^3 in this example). These are visualized as
blue plus signs on the x-axis. Pgpd is the P-value estimator described in [1], which is also based on the N permutation values. It uses the
extreme permutation values, which exceed a particular threshold t. These Nexc permutation values are called the exceedances and are visualized
by the red circles added to the blue plus signs. In this example t = 5. The exceedances are used to estimate the tail of the distribution of
permutation values as a generalized Pareto distribution (GPD). The GPD is represented by function F in the Pgpd equation. From this figure it is
clear that Pecdf is a poor estimator of small P-values, the minimum obtainable P-value being 1/N. In general, Pecdf requires 10/P permutations for
a good estimate, P being the correct P-value. Pgpd, on the other hand, provides an accurate estimate of the correct P-value, even (...truncated)