A novel piecewise-linear method for detecting associations between variables

PLOS ONE, Aug 2023

Detecting the association between two variables is necessary and meaningful in the era of big data. Many measures exist for this purpose: some detect linear association, e.g., the simple and fast Pearson correlation coefficient, and others detect nonlinear association, e.g., the computationally expensive and imprecise maximal information coefficient (MIC). In this study, we propose a novel maximal association coefficient (MAC) based on the idea that any nonlinear association can be regarded as composed of piecewise-linear ones; MAC detects linear or nonlinear association between two variables through the Pearson coefficient. Experiments on simulated data show that MAC has both generality and equitability. We also apply MAC to two real datasets, the major-league baseball dataset from Baseball Prospectus and the dataset of credit card clients’ default, to measure the association strength of pairs of variables in each. The experimental results show that MAC can detect the association between two variables and that it is computationally less expensive and more precise than MIC, which may be important for follow-up data analysis and for the conclusions drawn from it.
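
The abstract does not state how MAC chooses its segments or combines the per-segment coefficients, so the following Python sketch is only a toy illustration of the piecewise-linear intuition, not the authors' algorithm. The helper `piecewise_abs_pearson`, the equal-count segmentation, the size-weighted averaging, and the choice of four segments are all assumptions made for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # convention chosen here: an undefined r is reported as 0
    return sxy / math.sqrt(sxx * syy)

def piecewise_abs_pearson(xs, ys, n_segments=4):
    """Toy score: size-weighted mean of |r| over contiguous equal-count x-segments.

    Hypothetical helper for illustration only; NOT the authors' MAC definition.
    """
    pairs = sorted(zip(xs, ys))  # order points by x so each segment is contiguous
    n = len(pairs)
    total = 0.0
    for k in range(n_segments):
        seg = pairs[k * n // n_segments:(k + 1) * n // n_segments]
        if len(seg) < 3:
            continue  # too few points for a meaningful r; contributes zero
        sx, sy = zip(*seg)
        total += abs(pearson(sx, sy)) * len(seg)
    return total / n

# A noisy parabola: fully dependent on x, yet not linearly related over the whole range.
random.seed(0)
x = [-1 + 2 * i / 399 for i in range(400)]
y = [v * v + random.gauss(0, 0.01) for v in x]

global_r = pearson(x, y)
piecewise_score = piecewise_abs_pearson(x, y)
print(f"global r = {global_r:.3f}, piecewise score = {piecewise_score:.3f}")
```

On this example the global Pearson coefficient is close to zero while the per-segment coefficients are large in magnitude. The toy score is not calibrated: with many small segments, |r| is inflated even for independent data, which is one of the issues a real method has to address.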


PLOS ONE RESEARCH ARTICLE

Panru Wang, Junying Zhang*
School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China

OPEN ACCESS
Citation: Wang P, Zhang J (2023) A novel piecewise-linear method for detecting associations between variables. PLoS ONE 18(8): e0290280. https://doi.org/10.1371/journal.pone.0290280
Editor: Lu Peng, Wuhan University of Technology, CHINA
Received: November 28, 2022
Accepted: August 3, 2023
Published: August 24, 2023
Copyright: © 2023 Wang, Zhang.
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: The first real dataset underlying the results presented in the study is available from https://www.seanlahman.com/baseball-archive/statistics/. The second real dataset underlying the results presented in the study is available from https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients.
Funding: This work was supported by the Natural Science Basic Research Program of Shaanxi Province, P.R. China (Program No. 2021SF-184). The funder Junying Zhang, the correspondent of this work, had a role in study design, supervision, writing - review & editing, and decision to publish.
Competing interests: The authors have declared that no competing interests exist.

1 Introduction

There are various linear or nonlinear associations [1–7] between two variables in the big data era. Detecting the association strength between them is necessary and meaningful for future data analysis [8–11]. Linear association between two variables can be detected through existing methods; however, nonlinear association cannot be detected well by these methods. How to accurately detect the association between two variables is an urgent problem.

The key indicators used to detect the association between two variables are the Pearson coefficient, Spearman coefficient, Kendall coefficient, mutual information, and distance correlation coefficient. They can detect the association strength between two variables, but each has limitations. Galton [12] first proposed the concept of regression and used the letter “r” to express the degree of correlation; however, he did not recognize the concept of negative correlation. Subsequently, Pearson [13] proposed the Pearson linear coefficient, which is the quotient of the covariance of two variables and the product of their standard deviations. The Pearson coefficient can detect the association between two variables only when that association is statistically linear. Spearman [14] therefore proposed the Spearman coefficient on the basis of the Pearson coefficient; it can detect linear or nonlinear associations between two variables, but only monotonic ones. Over time, more methods have been proposed to detect the association between two variables. Kendall introduced the Kendall coefficient [15], also called the harmony coefficient, but it requires the data to be ranked. Later, Shannon [16] proposed mutual information [17, 18], which is difficult to calculate because it involves probability densities. In 2007, Székely [19] proposed a new statistical correlation method, the distance correlation coefficient, which addresses a shortcoming of the Pearson coefficient: if there is a nonlinear association between two variables, a Pearson coefficient of 0 does not allow us to conclude that there is no association between them, whereas a distance correlation coefficient of 0 lets us conclude directly, without further analysis, that there is none.

Broadly speaking, these indicators (Pearson coefficient, Spearman coefficient, Kendall coefficient, mutual information, and distance correlation) can all be used to detect the association between two variables. However, these measures have shortcomings: the Pearson coefficient detects only linear association, the Spearman coefficient has low precision, the Kendall coefficient requires ordered variables, mutual information is difficult to calculate, and the distance correlation coefficient is not necessarily 0 when the variables are independent.
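
Several of the behaviours described above can be checked on small synthetic examples. The following Python sketch computes the Pearson, Spearman, and Kendall coefficients and the sample distance correlation for a linear, a monotone nonlinear, and a non-monotone relationship. The helper names, the three test functions, and the use of Kendall's tau-a are illustrative choices and do not come from the paper; mutual information is omitted because estimating it requires a density estimate.

```python
import math

def pearson(xs, ys):
    """Pearson r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(xs, ys):
    """Spearman rho: Pearson r computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

def kendall(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs, O(n^2)."""
    n, s = len(xs), 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            s += (prod > 0) - (prod < 0)  # tied pairs contribute 0
    return s / (n * (n - 1) / 2)

def centered_distances(v):
    """Pairwise |v_i - v_j| matrix, double-centered by row, column, and grand mean."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row = [sum(r) / n for r in d]  # equals the column means, since d is symmetric
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def distance_correlation(xs, ys):
    """Sample distance correlation: sqrt(dCov^2 / sqrt(dVar^2(x) * dVar^2(y)))."""
    a, b = centered_distances(xs), centered_distances(ys)
    n = len(xs)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvarx = sum(v * v for r in a for v in r) / n**2
    dvary = sum(v * v for r in b for v in r) / n**2
    denom = math.sqrt(dvarx * dvary)
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

x = [i / 100 - 1 for i in range(201)]  # evenly spaced grid on [-1, 1]
cases = {
    "linear": [2 * v + 1 for v in x],           # y = 2x + 1
    "exponential": [math.exp(3 * v) for v in x],  # monotone, nonlinear
    "parabola": [v * v for v in x],             # non-monotone
}
results = {
    name: (pearson(x, y), spearman(x, y), kendall(x, y), distance_correlation(x, y))
    for name, y in cases.items()
}
for name, (r, rho, tau, dcor) in results.items():
    print(f"{name:12s} r={r:+.3f} rho={rho:+.3f} tau={tau:+.3f} dcor={dcor:.3f}")
```

On the exponential curve the rank-based coefficients reach 1 while the Pearson coefficient falls short of 1. On the parabola, Pearson, Spearman, and Kendall all come out near zero even though y is a deterministic function of x, whereas the sample distance correlation stays clearly positive.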
There are various associations between two variables, some of which may be complex nonlinear associations that cannot even be expressed by mathematical functions. In recent years, many measures have been proposed to detect them. In 2011, Wang et al. proposed a new measure, the R correlation coefficient, to detect linear or simple nonlinear relationships between two variables [20]. The R correlation coefficient is based on mathematical statistics, and only one simple example is used to demonstrate it, so it lacks experimental support. Meanwhile, Reshef et al. [21] proposed a widely used measure, the maximal information coefficient (MIC), which can detect a wide range of relationships, such as linear, exponential, and periodic ones, and even all functional relationships as well as relationships that are not well modeled by a function (e.g., a superposition of functions), but it has high computational complexity. Next, in 2016, Wijayatunga proposed a generalized Pearson coefficient [22] and argued that it can detect any nonlinear dependence if a suitable distance metric was u (...truncated)


This is a preview of a remote PDF: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0290280&type=printable
Article home page: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0290280

Panru Wang, Junying Zhang. A novel piecewise-linear method for detecting associations between variables, PLOS ONE, 2023, Volume 18, Issue 8, DOI: 10.1371/journal.pone.0290280