Fewer permutations, more accurate P-values
Theo A. Knijnenburg
2
Lodewyk F. A. Wessels
1
Marcel J. T. Reinders
0
Ilya Shmulevich
2
0
Information and Communication Theory Group,
Delft University of Technology
, Delft,
The Netherlands
1
Bioinformatics and Statistics, The Netherlands Cancer Institute
,
Amsterdam
2
Institute for Systems Biology
,
Seattle, WA, USA
Motivation: Permutation tests have become a standard tool to assess the statistical significance of an event under investigation. The statistical significance, as expressed in a P-value, is calculated as the fraction of permutation values that are at least as extreme as the original statistic, which was derived from non-permuted data. This empirical method directly couples both the minimal obtainable P-value and the resolution of the P-value to the number of permutations. Thereby, it imposes upon itself the need for a very large number of permutations when small P-values are to be accurately estimated. This is computationally expensive and often infeasible. Results: A method of computing P-values based on tail approximation is presented. The tail of the distribution of permutation values is approximated by a generalized Pareto distribution. A good fit and thus accurate P-value estimates can be obtained with a drastically reduced number of permutations when compared with the standard empirical way of computing P-values. Availability: The Matlab code can be obtained from the corresponding author on request. Contact: Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Permutation tests (also called randomization tests) are
nonparametric procedures for determining statistical significance based
on rearrangements of the labels of a dataset (Edgington, 1980). A test
statistic, which is computed from the dataset, is compared with the
distribution of permutation values. These permutation values are
computed similarly to the test statistic, however, under a random
rearrangement (permutation) of the labels of the dataset.
Permutation tests have become a widely used technique in
bioinformatics. The non-parametric nature of these tests rationalizes
their usability and popularity, since in many bioinformatics
applications there is no solid evidence or sufficient data to assume
a particular model for the obtained measurements of the biological
events under investigation.
For example, Significance Analysis of Microarrays (SAM)
(Tusher et al., 2001) and Gene Set Enrichment Analysis (GSEA)
(Subramanian et al., 2005), which detect differentially expressed
genes and gene sets, respectively, are two well-known techniques
that use permutation tests to compute statistical significance. In these
approaches, the class labels of samples from which gene expression
measurements are taken, are randomly rearranged to obtain the
permutation values. Besides randomization over the set of samples,
permutation tests have also been performed by randomizing over
the set of genes (Breitling et al., 2004; Smyth, 2004). In this case,
the labels are binary indicator variables that indicate whether a gene
belongs to a particular gene set or not. Efron and Tibshirani (2007)
suggested to employ both permutation types to test the significance
of gene sets. Other examples of permutation tests in bioinformatics
include, but are not limited to: QTL detection (Doerge and Churchill,
1996), allelic association analysis (Zhao et al., 2000) and modeling
ChIP sequencing (Zhang et al., 2008). In the latter case, each
permutation corresponds to the simulation of a complete ChIP-seq
experiment.
As in all statistical hypothesis tests, the significance of a
permutation test is represented by its P-value. The P-value is the
probability of obtaining a result at least as extreme as the test statistic
given that the null hypothesis is true. In permutations tests, the null
hypothesis is defined as: the labels assigning samples to classes
are interchangeable (Edgington, 1980). Significantly, low P-values
indicate that the labels are not interchangeable and that the original
label configuration is relevant with respect to the data. The P-value
is assessed by performing all possible permutations and computing
the fraction of permutation values that are at least as extreme as the
test statistic obtained from the unpermuted data.
However in practical situations, it is (by far) not feasible to
perform all possible permutations. For example, class labels that
represent two classes with 50 samples each can be permuted in
15000 1029 different ways. Therefore, the P-value is approximated
by computing a limited number of permutations, say N , and then
computing the fraction of the N permutation values that are at least
as extreme as the test statistic. Usually, a pseudocount is added to
avoid P-values of zero, which occur when the test statistic is never
surpassed by the permutation values. Theoretically, a P-value of zero
is not possible in the context of permutation tests: the minimum is
1/Nall, where Nall is the number of all possible permutations. This
is because one of the permuted label configurations is identical to
the original one, under which the test statistic is computed.
This empirical approximation of computing P-values has two
direct consequences. First, the resolution of obtainable P-values is
1/N . Second and more important, the smallest achievable P-value
is 1/N . This means that a very large number of permutations is
required to accurately estimate a small P-value. In general, >N
permutations are required to reliably estimate a P-value of 1/N .
(This is shown more extensively later in the manuscript.) Especially
in bioinformatics, it is crucial to be able to accurately determine
small P-values. This is due to typically huge numbers of objects
[e.g. all genes, gene sets or single nucleotide polymorphism (SNPs)]
that are simultaneously tested, which requires large multiple testing
corrections to prevent large numbers of false positives. Clearly, other
significance scores that are based on the P-values, such as the false
discovery rate (FDR), will be meaningless when the P-values are
not estimated correctly.
In this work, we propose to estimate the small permutation
test P-values using extreme value theory (Gumbel, 1958). The
set of extreme (very large or very small) permutation values
that forms the tail of the distribution of permutation values is
modeled as a generalized Pareto distribution (GPD). Pickands
(1975) demonstrated that the GPD approximates the distribution
of the extreme values of a set of independent and identically
distributed (i.i.d). random variables, i.e. those values that exceed
a particular (high) threshold. Applications of the GPD to model
extreme values are traditionally found in climatology to model
extreme weather, such as floods, and in financial risk management
to model extreme losses and insurance claims. In our case, the GPD,
which is fitted on the extreme permutation values, is evaluated
at the value of the test statistic (...truncated)