Count Regression and Machine Learning Techniques for Zero-Inflated Overdispersed Count Data: Application to Ecological Data
Annals of Data Science
https://doi.org/10.1007/s40745-023-00464-6
Bonelwa Sidumo1 · Energy Sonono1 · Isaac Takaidza1
Received: 22 October 2021 / Revised: 17 January 2023 / Accepted: 21 March 2023
© The Author(s) 2023
Abstract
The aim of this study is to investigate the overdispersion problem that is rampant in ecological count data. To explore this problem, we consider the most commonly
used count regression models: the Poisson, the negative binomial, the zero-inflated
Poisson and the zero-inflated negative binomial models. The performance of these
count regression models is compared with that of four proposed machine learning (ML)
regression techniques: random forests, support vector machines, k-nearest neighbors
and artificial neural networks. The mean absolute error was used to compare the performance of the count regression models and the ML regression models. The results suggest
that the ML regression models perform better than the count regression models. The
performance of the ML regression techniques motivates further research into
improved methods and applications in ecological studies.
Keywords Count data · Ecology · Machine learning · Overdispersion · Zero-inflation
Energy Sonono and Isaac Takaidza have contributed equally to this work.
1 School of Mathematical and Statistical Sciences, North-West University, Hendrick Van Eck Blvd,
Vanderbijlpark 1911, Gauteng, South Africa
1 Introduction
The aim of this article is to investigate the problem of overdispersion in ecological
count data. Overdispersion is a recurring problem that needs attention
when dealing with ecological count data. Ignoring overdispersion causes difficulties in the analysis and the decision-making procedures of ecological studies. We approach
the problem of overdispersion by using machine learning (ML) regression techniques.
To the best of our knowledge, an approach to overdispersion in ecological studies using
ML techniques has not been extensively researched thus far.
Bolker et al. [1] define overdispersion as the occurrence of more variance in the
data than predicted by a statistical model owing to missing observations. Missing
observations in ecological count data may be due to
structural errors (for instance, a bird or fish is not present because the habitat is not
suitable), observer error (species are present but cannot be detected) and design error
(poor experimental design or poor sampling surveys) [2]. The
literature discusses the source of zeros in ecological count data and defines them as
either ‘true zero counts’ or ‘false zero counts’ [3, 4]. False zero counts occur when
species are present at a site during the survey period but the observer fails to detect
them, whereas true zero counts occur when species do not occur at a site because of the
ecological process, that is, habitat unsuitability. This study focuses only on false zero
counts. In ecology, a zero count does not necessarily mean that the species is absent
from the site [5]; it may mean only that no individuals were recorded at
that particular sampling time (false zeros). Such recorded absences result in an excess
number of zeros, termed zero-inflation. The presence of zero-inflation in this study is
owing to observer error.
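The distinction between true and false zeros can be illustrated with a small simulation. The sketch below is not part of the study's analysis: it assumes, purely for illustration, that true abundance at each site is Poisson distributed and that the observer detects the species at a site with a fixed probability. The function name, the mean abundance of 3 and the detection probability of 0.6 are all hypothetical choices.

```python
import numpy as np


def simulate_false_zeros(n_sites=10_000, mean_abundance=3.0, detect_prob=0.6, seed=0):
    """Simulate site counts where the observer misses the species at some sites.

    All parameter values are illustrative assumptions, not estimates from the study.
    """
    rng = np.random.default_rng(seed)
    # True abundance per site (assumed Poisson for this sketch).
    true_counts = rng.poisson(mean_abundance, size=n_sites)
    # Observer error: at a site where detection fails, the recorded count is zero.
    detected = rng.random(n_sites) < detect_prob
    observed = np.where(detected, true_counts, 0)
    return true_counts, observed


true_counts, observed = simulate_false_zeros()

# Share of sites with a true zero (species genuinely absent).
true_zero_share = np.mean(true_counts == 0)
# Share of sites recorded as zero (true zeros plus false zeros).
observed_zero_share = np.mean(observed == 0)
# Share of sites with a false zero (species present but not detected).
false_zero_share = np.mean((observed == 0) & (true_counts > 0))

print(f"true zeros:     {true_zero_share:.3f}")
print(f"false zeros:    {false_zero_share:.3f}")
print(f"observed zeros: {observed_zero_share:.3f}")
```

Under these assumptions the recorded share of zeros is the sum of the true-zero and false-zero shares, so imperfect detection alone is enough to produce zero-inflation relative to the underlying abundance process.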
This study provides an overview of various count regression models: the Poisson, the negative binomial (NB), the zero-inflated Poisson (ZIP) and the zero-inflated
negative binomial (ZINB) models. The Poisson regression model has been widely used
to analyse count data under the assumption of equidispersion, that is, the mean of the
response variable is equal to the variance of the response variable [6, 7]. However,
although this is an inherent property of the Poisson regression model, it does not
always hold in real-life ecological count data, as counts may exhibit excess variability.
The fact that equidispersion is rarely found in real data has resulted in the development
of more general count models which do not assume equidispersion [8]. The NB regression model has been used as an alternative to the Poisson regression model (see
[9, 10]). NB regression models are more flexible than Poisson regression models even
though they do not provide exact predictions in certain situations [11]. The next alternative used for modeling count data with an excess number of zeros is the ZIP model,
which has been applied in many areas of research such as insurance claims [12, 13],
education [11, 14], healthcare [15–17], animal ecology [18] and transport [19]. ZIP
was found to be inappropriate for data that are both zero-inflated and overdispersed
[20]. Furthermore, Minami et al. [20] and Rose et al. [21] propose the ZINB model as
another alternative to handle overdispersion. ZINB has been shown to be appropriate for
some ecological situations even though the issue of overdispersion still remains [20].
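The equidispersion assumption discussed above can be checked informally by comparing the sample variance with the sample mean. The following sketch uses simulated data only, not the data analysed in this study. The helper names `dispersion_index` and `zero_excess` are hypothetical, and the zero-inflated negative binomial sample is generated under assumed parameters (a gamma-Poisson mixture with mean 4 and an extra-zero probability of 0.3).

```python
import numpy as np


def dispersion_index(counts):
    """Variance-to-mean ratio; close to 1 under equidispersion, above 1 if overdispersed."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()


def zero_excess(counts):
    """Observed share of zeros minus the share expected from a Poisson with the same mean."""
    counts = np.asarray(counts)
    return np.mean(counts == 0) - np.exp(-counts.mean())


rng = np.random.default_rng(1)
n = 20_000

# Equidispersed reference sample: Poisson counts with mean 4.
poisson_counts = rng.poisson(4.0, size=n)

# Negative binomial counts generated as a gamma-Poisson mixture (mean 4, assumed shape 2).
nb_rates = rng.gamma(shape=2.0, scale=2.0, size=n)
nb_counts = rng.poisson(nb_rates)

# Zero-inflation: each observation is replaced by a zero with assumed probability 0.3.
zinb_counts = np.where(rng.random(n) < 0.3, 0, nb_counts)

for name, sample in [("Poisson", poisson_counts), ("ZINB", zinb_counts)]:
    print(f"{name}: dispersion index = {dispersion_index(sample):.2f}, "
          f"zero excess = {zero_excess(sample):.3f}")
```

A dispersion index well above 1 together with a clearly positive zero excess is the pattern that motivates moving from the Poisson model to the NB, ZIP or ZINB alternatives. These two summaries are only descriptive diagnostics and do not replace a formal model comparison.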
From the empirical evidence provided above, the proposed methods still pose challenges in dealing with overdispersion. There are still numerous false zeros that are
being observed or not accounted for. In other words, there is still room to reduce
overdispersion in count data further, which this study proposes to do via a possible new
method. Many published studies in ecology fail to report on overdispersion
in respect of their best fitting models [22]. Failing to account for overdispersion can
lead to incorrect inferences [23]. There is also limited literature in ecological studies
about how overdispersion affects results, for example by leading researchers to identify
predictors as having biologically meaningful effects when, in fact, they do not [6].
As a result of the limitations in some statistical methods (for example, Poisson and
NB) and the diversity of data, new techniques of data science have been developed.
Data science has become an important and growing field as the Internet of Things
(IoT) expands worldwide [24]. Encompassing several techniques such as data mining
and machine learning, data science solves relevant problems and predicts results by
taking into account data quality [25]. Data science techniques have been applied in
various research fields such as healthcare [7] and education [26], where they
are combined to consolidate statistical analyses.
This study proposes machine learning (ML) regression techniques, namely random forests
(RF), support vector machines (SVM), k-nearest neighbors (kNN) and artificial neural
networks (ANN), to handle the problem of overdispersion in ecological count data.
Lately, ML methods have been cropping up in different areas of science. However,
to the best of our knowledge, there is limited empirical evidence showing the use of
ML (...truncated)