Multivariate Welch t-test on distances
Bioinformatics
Alexander V. Alekseyenko
Departments of Public Health Sciences and Oral Health Sciences, Program for Human Microbiome Research, The Biomedical Informatics Center, Medical University of South Carolina, 135 Cannon Street, MSC 200, Charleston, SC 29466, USA
Motivation: Permutational non-Euclidean analysis of variance, PERMANOVA, is routinely used in exploratory analysis of multivariate datasets to draw conclusions about the significance of patterns visualized through dimension reduction. This method recognizes that the pairwise distance matrix between observations is sufficient to compute the within- and between-group sums of squares necessary to form the (pseudo) F statistic. Moreover, not only Euclidean but arbitrary distances can be used. This method, however, suffers from loss of power and type I error inflation in the presence of heteroscedasticity and sample size imbalances.
Results: We develop a solution in the form of a distance-based Welch t-test, $T_W^2$, for two-sample, potentially unbalanced and heteroscedastic data. We demonstrate empirically the desirable type I error and power characteristics of the new test. We compare the performance of PERMANOVA and $T_W^2$ in a reanalysis of two existing microbiome datasets, the field where the methodology originated.
Availability and Implementation: The source code for methods and analysis of this article is available at https://github.com/alekseyenko/Tw2. Further guidance on application of these methods can be obtained from the author.
1 Introduction
The PERMANOVA test (Anderson, 2001) has been proposed for
use in numerical ecology to test for location differences in
microbial communities. The relationships between these communities are
typically described by ecological distance metrics (e.g. Jaccard,
Chi-squared, Bray-Curtis) and visualized through dimension reduction
(also referred to as ordination in numerical ecology literature). The
PERMANOVA permutation test based on (pseudo) F statistic
computed directly from distances is a widely accepted means of
establishing statistical significance for observed patterns. This test and
the extension proposed in this paper are related to the multivariate
Behrens-Fisher problem (Krishnamoorthy and Yu, 2004) of testing the
difference in multivariate means of samples from several populations. The
underlying statistics for both distance-based tests are related to the
Hotelling $T^2$ statistic. PERMANOVA is more general in
allowing for more than two populations to be compared simultaneously.
The distance-based geometric approach, however, forgoes the need
to estimate the covariance matrices. The cost of these geometric
approaches is that they only provide omnibus tests, which are
unable to make inferences about individual components of the
multivariate random vectors tested.
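The observation that a distance matrix alone suffices to form the (pseudo) F statistic can be illustrated with a minimal Python sketch. It follows the sums-of-squares partition of Anderson (2001); the function names, the default of 999 permutations and the seed are illustrative assumptions, and this is not the implementation used for the analyses in this paper.

```python
import math
import random


def euclidean_distance_matrix(points):
    """Return the full symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(points[i], points[j])
    return d


def pseudo_f(d, labels):
    """Pseudo-F statistic computed from pairwise distances only."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(d[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: the same quantity per group,
    # divided by that group's size.
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        pairs = sum(d[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:])
        ss_within += pairs / len(idx)
    ss_between = ss_total - ss_within
    a = len(groups)
    return (ss_between / (a - 1)) / (ss_within / (n - a))


def permutation_p_value(d, labels, n_perm=999, seed=0):
    """Permutation p-value: shuffle group labels, keep distances fixed."""
    rng = random.Random(seed)  # seed is an illustrative assumption
    observed = pseudo_f(d, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(d, shuffled) >= observed:
            hits += 1
    # Count the observed labelling itself as one permutation.
    return (hits + 1) / (n_perm + 1)
```

With Euclidean distances the statistic coincides with the classical one-way ANOVA F: for the one-dimensional points 0 and 2 in one group and 10 and 12 in the other, `pseudo_f` gives 50, the same value obtained from the usual between- and within-group sums of squares (100 and 4).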
With the revived interest in numerical ecology fueled by the
availability of DNA sequencing-based high-throughput microbial
community profiling, i.e. microbiomics, the PERMANOVA test is enjoying a
new wave of popularity. Several cautionary articles have been
published noting the undesired behavior of the test in heteroscedastic
conditions (Warton et al., 2012). A definitive principled solution to this
issue is still lacking, however. The consensus is to ascertain the
presence of heteroscedasticity using an additional test (e.g. PERMDISP;
Anderson, 2006; Anderson et al., 2006) in case of positive
PERMANOVA results and to report both with a disclaimer that the
attribution of positive PERMANOVA test to location or dispersion
differences cannot be made whenever both tests yield positive results.
In reality, the exactly matching multivariate spread between factor
levels can rarely be assumed and the robustness of PERMANOVA to
violations of homoscedasticity has not been characterized empirically.
1.1 Performance of PERMANOVA in heteroscedastic data
We demonstrate the adverse behavior of PERMANOVA in the
unbalanced heteroscedastic case via a simulation. Let sample one consist of
observations from a 1000-dimensional uncorrelated multivariate
normal distribution, where each component is standard normal (mean 0
and SD 1). Sample two is likewise 1000-dimensional uncorrelated
multivariate normal with means equal to a $1/\sqrt{1000}$ fraction of the
desired effect size and standard deviation equal to 0.8. Thus sample two
has 20% less multivariate spread than sample one. We set the effect
size to 0, 2, 4 and 5. We compute the corresponding Euclidean
distances for use with PERMANOVA test, using its implementation in
the adonis() function of the R (R Core Team, 2015) package vegan
(Oksanen et al., 2015). We repeat the simulation 1,000 times for each
set of parameters and compute the average rejection rate at $\alpha = 0.05$.
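The data-generating step of this design can be sketched in Python as follows. This is a hedged illustration, not the code used for the paper's simulations: the function names, the sample sizes passed as `n1` and `n2`, and the seed are hypothetical, and the PERMANOVA test itself is left to a separate routine.

```python
import math
import random


def simulate_two_samples(n1, n2, effect_size, dim=1000, sd2=0.8, seed=None):
    """Draw the two samples of the heteroscedastic design described above.

    n1, n2 and seed are illustrative parameters, not values from the paper.
    """
    rng = random.Random(seed)
    # Each component of sample two is shifted by effect_size / sqrt(dim),
    # so the Euclidean length of the whole mean-shift vector is effect_size.
    shift = effect_size / math.sqrt(dim)
    sample1 = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n1)]
    sample2 = [[rng.gauss(shift, sd2) for _ in range(dim)] for _ in range(n2)]
    return sample1, sample2


def euclidean_distance_matrix(points):
    """Return the full symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(points[i], points[j])
    return d


def rejection_rate(p_values, alpha=0.05):
    """Fraction of simulated p-values at or below the significance level."""
    return sum(p <= alpha for p in p_values) / len(p_values)
```

One replicate would call `simulate_two_samples`, pass `euclidean_distance_matrix(sample1 + sample2)` together with the group labels to a PERMANOVA routine, and record the resulting p-value; `rejection_rate` then averages over replicates. Scaling the per-component shift by $1/\sqrt{1000}$ keeps the overall separation between the two mean vectors equal to the stated effect size regardless of dimension.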
Figure 1 summarizes the type I error and power characteristics for this
design with varying sample sizes. First, note that the type I error (leftmost
box, where the effect size equals 0) is on (...truncated)