Multivariate Welch t-test on distances

Bioinformatics, Dec 2016

Motivation: Permutational non-Euclidean analysis of variance, PERMANOVA, is routinely used in exploratory analysis of multivariate datasets to draw conclusions about the significance of patterns visualized through dimension reduction. This method recognizes that the pairwise distance matrix between observations is sufficient to compute the within-group and between-group sums of squares necessary to form the (pseudo) F statistic. Moreover, not only Euclidean but arbitrary distances can be used. This method, however, suffers from loss of power and type I error inflation in the presence of heteroscedasticity and sample size imbalances.

Results: We develop a solution in the form of a distance-based Welch t-test, $T_W^2$, for two-sample, potentially unbalanced and heteroscedastic data. We demonstrate empirically the desirable type I error and power characteristics of the new test. We compare the performance of PERMANOVA and $T_W^2$ in a reanalysis of two existing microbiome datasets, the field where the methodology originated.

Availability and Implementation: The source code for methods and analysis of this article is available at https://github.com/alekseyenko/Tw2. Further guidance on application of these methods can be obtained from the author.

Contact: alekseye{at}musc.edu

Alexander V. Alekseyenko

Departments of Public Health Sciences and Oral Health Sciences, Program for Human Microbiome Research, The Biomedical Informatics Center, Medical University of South Carolina, 135 Cannon Street, MSC 200, Charleston, SC 29466, USA

1 Introduction

The PERMANOVA test (Anderson, 2001) has been proposed for use in numerical ecology to test for location differences in microbial communities. The relationships between these communities are typically described by ecological distance metrics (e.g. Jaccard, chi-squared, Bray-Curtis) and visualized through dimension reduction (also referred to as ordination in the numerical ecology literature).
The PERMANOVA permutation test, based on a (pseudo) F statistic computed directly from distances, is a widely accepted means of establishing statistical significance for observed patterns. This test and the extension presented in this paper are related to the multivariate Behrens-Fisher problem (Krishnamoorthy and Yu, 2004) of testing the difference in multivariate means of samples from several populations. The underlying statistics for both distance-based tests are related to the Hotelling $T^2$ statistic. PERMANOVA is more general in allowing more than two populations to be compared simultaneously. The distance-based geometric approach, however, forgoes the need to estimate the covariance matrices. The cost of these geometric approaches is that they only provide omnibus tests, which are unable to make inferences about individual components of the multivariate random vectors tested.

With the revived interest in numerical ecology fueled by the availability of DNA sequencing-based high-throughput microbial community profiling, i.e. microbiomics, the PERMANOVA test is enjoying a new wave of popularity. Several cautionary articles have been published noting the undesired behavior of the test in heteroscedastic conditions (Warton et al., 2012). A definitive, principled solution to this issue is still lacking, however. The consensus is to check for heteroscedasticity with an additional test (e.g. PERMDISP; Anderson, 2006; Anderson et al., 2006) whenever PERMANOVA gives a positive result, and to report both with a disclaimer: when both tests are positive, the PERMANOVA result cannot be attributed specifically to location or to dispersion differences. In reality, exactly matching multivariate spread between factor levels can rarely be assumed, and the robustness of PERMANOVA to violations of homoscedasticity has not been characterized empirically.
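To make the idea of a pseudo F statistic "computed directly from distances" concrete, the following is a minimal sketch in Python using only the standard library. It follows the usual sums-of-squares identities from Anderson (2001); it is not the vegan implementation referenced later in the paper. The function names `pseudo_f`, `permutation_pvalue` and `euclidean_matrix` are hypothetical and chosen here for illustration only.

```python
import math
import random


def euclidean_matrix(points):
    """Square matrix of pairwise Euclidean distances (illustrative helper)."""
    return [[math.dist(p, q) for q in points] for p in points]


def pseudo_f(dist, labels):
    """Pseudo-F statistic from a square distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    a = len(groups)
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: the same quantity per group,
    # divided by that group's size.
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(dist[i][j] ** 2
                         for k, i in enumerate(idx)
                         for j in idx[k + 1:]) / len(idx)
    ss_between = ss_total - ss_within
    return (ss_between / (a - 1)) / (ss_within / (n - a))


def permutation_pvalue(dist, labels, n_perm=999, seed=0):
    """Permutation p-value: shuffle labels, keep the distance matrix fixed."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    # Count the observed labelling itself as one permutation.
    return observed, (hits + 1) / (n_perm + 1)


# Toy usage: two groups of one-dimensional points.
points = [(1.0,), (2.0,), (3.0,), (5.0,), (6.0,), (7.0,)]
labels = ["a", "a", "a", "b", "b", "b"]
f_stat, p_value = permutation_pvalue(euclidean_matrix(points), labels)
```

With Euclidean distances the pseudo F coincides with the classical one-way ANOVA F statistic (24 for the toy data above). Only the distance matrix enters the computation, so any other dissimilarity could be substituted for `euclidean_matrix`. The sketch does not guard against a zero within-group sum of squares.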
1.1 Performance of PERMANOVA in heteroscedastic data

We demonstrate the adverse behavior of PERMANOVA in the unbalanced heteroscedastic case via a simulation. Let sample one consist of observations from a 1000-dimensional uncorrelated multivariate normal distribution, where each component is standard normal (mean 0 and SD 1). Sample two is likewise 1000-dimensional uncorrelated multivariate normal, with each component mean equal to a $1/\sqrt{1000}$ fraction of the desired effect size, so that the Euclidean distance between the two mean vectors equals the effect size, and with standard deviation equal to 0.8. Thus sample two has 20% less multivariate spread than sample one. We set the effect size to 0, 2, 4 and 5. We compute the corresponding Euclidean distances for use with the PERMANOVA test, using its implementation in the adonis() function of the R (R Core Team, 2015) package vegan (Oksanen et al., 2015). We repeat the simulation 1,000 times for each set of parameters and compute the average rejection rate at $\alpha = 0.05$. Figure 1 summarizes the type I error and power characteristics for this design with varying sample sizes. First, note that the type I error (leftmost box, where the effect size equals 0) is on (...truncated)
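The data-generating step of the simulation described in Section 1.1 can be sketched as follows. This is a hedged Python illustration using only the standard library, not the author's R code. The function names and the sample sizes in the usage lines are assumptions, since the specific sample sizes are not stated in the text above.

```python
import math
import random

DIM = 1000  # dimensionality stated in the text


def simulate_samples(n1, n2, effect_size, sd2=0.8, dim=DIM, seed=0):
    """Draw the two samples of the heteroscedastic design.

    Sample one: standard normal components (mean 0, SD 1).
    Sample two: component mean effect_size / sqrt(dim), SD sd2.
    n1 and n2 are free parameters (hypothetical values in the usage below).
    """
    rng = random.Random(seed)
    shift = effect_size / math.sqrt(dim)
    sample_one = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(n1)]
    sample_two = [[rng.gauss(shift, sd2) for _ in range(dim)]
                  for _ in range(n2)]
    return sample_one, sample_two


def euclidean_matrix(points):
    """Square matrix of pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in points] for p in points]


def mean_distance_to_centroid(sample):
    """Average distance of observations to their sample centroid."""
    dim = len(sample[0])
    centroid = [sum(row[k] for row in sample) / len(sample)
                for k in range(dim)]
    return sum(math.dist(row, centroid) for row in sample) / len(sample)


# Usage with assumed, unbalanced sample sizes.
one, two = simulate_samples(n1=10, n2=5, effect_size=4.0)
dist = euclidean_matrix(one + two)
labels = [1] * len(one) + [2] * len(two)
```

Because every one of the 1000 component means is shifted by the effect size divided by $\sqrt{1000}$, the distance between the two population mean vectors equals the effect size itself. The matrix `dist` and the vector `labels` are the inputs a distance-based test would consume. Repeating the draw with different seeds and recording how often the test rejects would give the rejection rate described in the text.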


This is a preview of a remote PDF: https://bioinformatics.oxfordjournals.org/content/32/23/3552.full.pdf

Alexander V. Alekseyenko. Multivariate Welch t-test on distances, Bioinformatics, 2016, pp. 3552-3558, 32/23, DOI: 10.1093/bioinformatics/btw524