EPEPT: A web service for enhanced P-value estimation in permutation tests

http://www.biomedcentral.com/content/pdf/1471-2105-12-411.pdf

BMC Bioinformatics, Open Access

Theo A Knijnenburg (0, 1), Jake Lin (1), Hector Rovira (1), John Boyle (1), Ilya Shmulevich (1)

0 Current Address: Bioinformatics and Statistics, Division of Molecular Biology, Netherlands Cancer Institute, Amsterdam, The Netherlands
1 Institute for Systems Biology, Seattle, WA, USA

Background: In computational biology, permutation tests have become a widely used tool to assess the statistical significance of an event under investigation. However, the common way of computing the P-value, which expresses the statistical significance, requires a very large number of permutations when small (and thus interesting) P-values are to be accurately estimated. This is computationally expensive and often infeasible. Recently, we proposed an alternative estimator, which requires far fewer permutations compared to the standard empirical approach while still reliably estimating small P-values [1].

Results: The proposed P-value estimator has been enriched with additional functionalities and is made available to the general community through a public website and web service, called EPEPT. This means that the EPEPT routines can be accessed not only via a website, but also programmatically using any programming language that can interact with the web. Examples of web service clients in multiple programming languages can be downloaded. Additionally, EPEPT accepts data of various common experiment types used in computational biology. For these experiment types EPEPT first computes the permutation values and then performs the P-value estimation. Finally, the source code of EPEPT can be downloaded.

Conclusions: Different types of users, such as biologists, bioinformaticians and software engineers, can use the method in an appropriate and simple way.
Availability: http://informatics.systemsbiology.net/EPEPT/

* Correspondence: 1 Institute for Systems Biology, Seattle, WA, USA. Full list of author information is available at the end of the article.

Background

The permutation test (also called randomization test) is a nonparametric procedure for determining statistical significance based on rearrangements of the labels of a dataset [2]. Due to its nonparametric nature, this test is commonly used in bioinformatics applications, where there is often no solid evidence or sufficient data to assume a particular model for the obtained measurements of the biological events under investigation. For example, Significance Analysis of Microarrays (SAM) [3] and Gene Set Enrichment Analysis (GSEA) [4], which detect differentially expressed genes and gene sets, respectively, are two well-known techniques that use permutation tests to compute statistical significance.

In a permutation test, a test statistic, which is computed from the dataset, is compared with the distribution of permutation values. These permutation values are computed similarly to the test statistic, but under a random rearrangement (permutation) of the labels of the dataset. The P-value of a permutation test, which expresses its statistical significance, is obtained by performing all possible label permutations and computing the fraction of permutation values that are at least as extreme as the test statistic obtained from the unpermuted data. However, in practical situations, performing all possible permutations is far from feasible. Thus, the P-value is typically approximated by computing a limited number of permutations, say N, and then computing the fraction of the N permutation values that are at least as extreme as the test statistic. This empirical approximation directly couples both the minimal obtainable P-value and the resolution of the P-value to the number of permutations.
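As a minimal sketch of the empirical approximation described above, assuming a two-group dataset and a difference-in-means test statistic, the P-value can be computed as the fraction of N permutation values that are at least as extreme as the test statistic. The function names `mean_difference` and `empirical_p_value` and the toy data are illustrative assumptions and are not part of EPEPT.

```python
import random
from statistics import mean


def mean_difference(values, labels):
    """Test statistic (assumed for illustration): mean of group 1 minus mean of group 0."""
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    return mean(group1) - mean(group0)


def empirical_p_value(values, labels, n_permutations, seed=0):
    """Fraction of N permutation values that are at least as extreme as x0."""
    rng = random.Random(seed)
    x0 = mean_difference(values, labels)  # test statistic on the unpermuted data
    shuffled = list(labels)
    n_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # random rearrangement of the labels
        if mean_difference(values, shuffled) >= x0:
            n_extreme += 1
    return n_extreme / n_permutations


# Toy data: five measurements per group (hypothetical values).
values = [5.1, 4.8, 6.0, 5.5, 5.9, 1.2, 0.9, 1.5, 1.1, 0.7]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p = empirical_p_value(values, labels, n_permutations=1000)
print(p)
```

Because the estimate is a count divided by N, it can only take values that are multiples of 1/N, which makes the coupling between the resolution of the P-value and the number of permutations concrete.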
Therefore, it requires a very large number of permutations when small P-values are to be accurately estimated. To improve upon the empirical estimator, we have employed a tail estimation procedure based on extreme value theory to estimate the tail of the distribution of permutation values and subsequently the P-value [1]. We showed, using both theoretical and practical examples, that up to several orders of magnitude fewer permutations are necessary to compute small P-values with the same accuracy as with the empirical approach. This results in an enormous gain in terms of computation time. For realistic datasets using the standard number of 1000 permutations, this speed-up will lead to a decrease in CPU time on the order of a couple of minutes to several hours for more complex statistics (like GSEA's running sum statistic). The approach is outlined in Figure 1 and described in detail in [1].

The aim of EPEPT is to make this approach available to the computational biology community as a general and easily accessible tool. EPEPT, which stands for Enhanced P-value Estimator for Permutation Tests, is a RESTful web API that offers dynamic programmatic access. Users submit job requests over the web either using their programming language of choice or using the website. EPEPT returns a unique URI corresponding to the submitted job. Using this URI, the status of the submitted job can be checked and, upon completion, the results, i.e. the estimated P-values, can be retrieved.

EPEPT can be used in two different settings. In the first and most general setting, the user submits permutation values and EPEPT estimates the P-values, i.e. EPEPT does not generate the permutation statistics. In the

$$P_{perm} = \frac{1}{N_{all}} \sum_{n=1}^{N_{all}} I(x_n \ge x_0)$$

$$P_{ecdf} = \frac{1}{N} \sum_{n=1}^{N} I(x_n \ge x_0)$$

$$P_{gpd} = \frac{N_{exc}}{N} \left(1 - F(x_0 - t)\right)$$

[Plot: P-value (y-axis) against test statistic x0 (x-axis); curves labeled Pperm and Pgpd.]

Figure 1 The P-value of a permutation test as a function of test statistic x0. Pperm is the correct P-value of the permutation test based on all possible label permutations (Nall = 10^5 in this example).
The Nall permutation values are visualized as gray crosses on the x-axis. Pecdf is the standard empirical estimator of the P-value based on a limited set of N permutation values (N = 10^3 in this example). These are visualized as blue plus signs on the x-axis. Pgpd is the P-value estimator described in [1], which is also based on the N permutation values. It uses the extreme permutation values, which exceed a particular threshold t. These Nexc permutation values are called the exceedances and are visualized by the red circles added to the blue plus signs. In this example t = 5. The exceedances are used to estimate the tail of the distribution of permutation values as a generalized Pareto distribution (GPD). The GPD is represented by function F in the Pgpd equation. From this figure it is clear that Pecdf is a poor estimator of small P-values, the minimum obtainable P-value being 1/N. In general, Pecdf requires 10/P permutations for a good estimate, P being the correct P-value. Pgpd, on the other hand, provides an accurate estimate of the correct P-value, even (...truncated)