A novel piecewise-linear method for detecting associations between variables
PLOS ONE
RESEARCH ARTICLE
A novel piecewise-linear method for detecting
associations between variables
Panru Wang, Junying Zhang*
School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China
OPEN ACCESS
Citation: Wang P, Zhang J (2023) A novel
piecewise-linear method for detecting associations
between variables. PLoS ONE 18(8): e0290280.
https://doi.org/10.1371/journal.pone.0290280
Editor: Lu Peng, Wuhan University of Technology,
CHINA
Received: November 28, 2022
Accepted: August 3, 2023
Abstract
Detecting the association between two variables is necessary and meaningful in the era of
big data. Many measures exist for this purpose: some detect linear association, e.g., the
simple and fast Pearson correlation coefficient, while others detect nonlinear association,
e.g., the maximal information coefficient (MIC), which is computationally expensive and
imprecise. In our study, we propose a novel maximal association coefficient (MAC) based on
the idea that any nonlinear association can be considered to be composed of piecewise-linear
ones; MAC detects linear or nonlinear association between two variables through the Pearson
coefficient. We conduct experiments on simulated data, and the results show that MAC has both
generality and equitability. In addition, we apply MAC to two real datasets, the major-league
baseball dataset from Baseball Prospectus and a dataset of credit card clients’ defaults, to
detect the association strength of pairs of variables in each dataset. The experimental
results show that MAC can be used to detect the association between two variables and that it
is computationally less expensive and more precise than MIC, which may be important for
follow-up data analysis and the conclusions drawn from it.
Published: August 24, 2023
Copyright: © 2023 Wang, Zhang. This is an open
access article distributed under the terms of the
Creative Commons Attribution License, which
permits unrestricted use, distribution, and
reproduction in any medium, provided the original
author and source are credited.
Data Availability Statement: The first real data
underlying the results presented in the study are
available from https://www.seanlahman.com/baseball-archive/statistics/.
The second real data
underlying the results presented in the study are
available from https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients.
Funding: This work was supported by the Natural
Science Basic Research Program of Shaanxi
Province, P.R. China (Program No. 2021SF-184).
The funder Junying Zhang, the corresponding author of
this work, had a role in study design, supervision,
writing - review & editing, and decision to publish.
1 Introduction
There are various linear and nonlinear associations [1–7] between two variables in the big data
era. Detecting the association strength between them is necessary and meaningful for subsequent
data analysis [8–11]. Linear association between two variables can be detected by existing
methods; nonlinear association, however, is not detected well by them. How to accurately detect
the association between two variables is therefore an urgent problem.
The key indicators used to detect the association between two variables are the Pearson
coefficient, Spearman coefficient, Kendall coefficient, mutual information, and distance
correlation coefficient. Each can detect the association strength between two variables, but
each also has limitations. Galton [12] first proposed the concept of regression and used the
letter “r” to express the degree of correlation; however, he did not recognize the concept of
negative correlation. Subsequently, Pearson [13] proposed the Pearson linear coefficient, which
is the covariance of the two variables divided by the product of their standard deviations.
The Pearson coefficient can be used for
Competing interests: The authors have declared
that no competing interests exist.
detecting the association between two variables only when that association is statistically
linear. Spearman [14] therefore proposed the Spearman coefficient on the basis of the Pearson
coefficient; it can detect linear or nonlinear associations between two variables, but only
monotonic ones. Over time, more methods were proposed to detect the association between two
variables. Kendall proposed the Kendall coefficient [15], also called the harmony coefficient,
but it requires the data to be ranked. Later, Shannon [16] proposed mutual information [17, 18],
which is difficult to calculate because it involves probability densities. In 2007, Székely [19]
proposed a new statistical correlation measure, the distance correlation coefficient, which
addresses a shortcoming of the Pearson coefficient. If there is a nonlinear association between
two variables, a Pearson coefficient of 0 does not allow us to conclude that there is no
association between them; if the distance correlation coefficient is 0, however, we can conclude
directly that there is no association, without further analysis. Broadly speaking, all of these
indicators (Pearson coefficient, Spearman coefficient, Kendall coefficient, mutual information,
and distance correlation) can be used to detect the association between two variables. However,
each has shortcomings: the Pearson coefficient detects only linear association, the Spearman
coefficient has low precision, the Kendall coefficient requires ordered variables, mutual
information is difficult to calculate, and the distance correlation coefficient estimated from
a sample is not necessarily 0 when variables are independent.
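The limitations of the Pearson and Spearman coefficients described above can be illustrated with a small numerical sketch. It is not the authors' MAC algorithm: the helper names (`pearson`, `spearman_no_ties`), the test functions, and the hand-picked split point at x = 0 are all assumptions made for illustration. It only shows that a Pearson coefficient near 0 can hide a strong nonlinear association, and that the same association looks strongly linear on each piece once the data are split.

```python
import numpy as np


def pearson(x, y):
    """Pearson coefficient: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


def spearman_no_ties(x, y):
    """Spearman coefficient as Pearson on ranks (assumes no tied values)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)


x = np.linspace(-1.0, 1.0, 201)

# Monotonic nonlinear association: Spearman sees it fully, Pearson only partly.
y_exp = np.exp(3.0 * x)
r_exp = pearson(x, y_exp)
rho_exp = spearman_no_ties(x, y_exp)

# Non-monotonic association: y is fully determined by x, yet Pearson is near 0.
y_sq = x ** 2
r_sq = pearson(x, y_sq)

# Hand-picked split at x = 0 (an illustrative assumption, not a learned breakpoint):
# on each piece the association is close to linear.
left, right = x < 0, x >= 0
r_left = pearson(x[left], y_sq[left])
r_right = pearson(x[right], y_sq[right])

print(f"exp:      Pearson={r_exp:.3f}  Spearman={rho_exp:.3f}")
print(f"parabola: Pearson={r_sq:.3f}")
print(f"parabola pieces: left={r_left:.3f}  right={r_right:.3f}")
```

On the parabola, the global coefficient vanishes because the negative and positive halves cancel, while each half on its own is strongly correlated. This is the intuition behind treating a nonlinear association as a composition of piecewise-linear ones.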
The associations between two variables are varied; they may be complex nonlinear associations
and may not even be expressible as mathematical functions. In recent years, many measures have
been proposed to detect them. In 2011, Wang et al. proposed a new measure, the R correlation
coefficient, to detect linear or simple nonlinear relationships between two variables [20]. The
R correlation coefficient is based on mathematical statistics, and only one simple example is
used to support it, so it lacks experimental validation. Meanwhile, Reshef et al. [21] proposed
a widely used measure, the maximal information coefficient (MIC), which can detect a wide range
of relationships, including linear, exponential, and periodic ones, and even relationships that
are not well modeled by a function, such as a superposition of functions; however, it has high
computational complexity. Next, Wijayatunga
in 2016 proposed a generalized Pearson coefficient [22] and argued that it can detect any nonlinear dependence if a suitable distance metric was u (...truncated)