Count Regression and Machine Learning Techniques for Zero-Inflated Overdispersed Count Data: Application to Ecological Data

Annals of Data Science, Apr 2023

The aim of this study is to investigate the overdispersion problem that is rampant in ecological count data. In order to explore this problem, we consider the most commonly used count regression models: the Poisson, the negative binomial, the zero-inflated Poisson and the zero-inflated negative binomial models. The performance of these count regression models is compared with the four proposed machine learning (ML) regression techniques: random forests, support vector machines, k-nearest neighbors and artificial neural networks. The mean absolute error was used to compare the performance of count regression models and ML regression models. The results suggest that ML regression models perform better compared to count regression models. The performance shown by ML regression techniques is a motivation for further research in improving methods and applications in ecological studies.



https://doi.org/10.1007/s40745-023-00464-6
Bonelwa Sidumo¹ · Energy Sonono¹ · Isaac Takaidza¹
Received: 22 October 2021 / Revised: 17 January 2023 / Accepted: 21 March 2023
© The Author(s) 2023
Keywords: Count data · Ecology · Machine learning · Overdispersion · Zero-inflation
Energy Sonono and Isaac Takaidza have contributed equally to this work.
¹ School of Mathematical and Statistical Sciences, North-West University, Hendrick Van Eck Blvd, Vanderbijlpark 1911, Gauteng, South Africa

1 Introduction
The aim of this article is to investigate the problem of overdispersion in ecological count data. Overdispersion is an existing and recurring problem that needs attention when dealing with ecological count data. Ignoring overdispersion will cause difficulties in analysis and the decision-making procedures of ecological studies.
We approach the problem of overdispersion by using machine learning (ML) regression techniques. To the best of our knowledge, an approach to overdispersion in ecological studies using ML techniques has not been extensively researched thus far. Bolker et al. [1] define overdispersion as the occurrence of more variance in the data than predicted by a statistical model owing to missing observations. Missing observations in ecological count data may arise from structural errors (for instance, a bird or fish is not present because the habitat is not suitable), observer error (species are present but are not detected) and design error (poor experimental design or sampling surveys) [2]. The literature discusses the source of zeros in ecological count data and defines them as either ‘true zero counts’ or ‘false zero counts’ [3, 4]. False zero counts occur when species are present at a site during the survey period but the observer fails to detect them; true zero counts occur when species do not occur at a site because of the ecological process, that is, habitat unsuitability. This study focuses only on false zero counts. In ecology, zero counts do not necessarily mean that there are no species detected during the sampling survey [5]; rather, they mean that there were no species at that particular sampling time (false zeros). The absence of species results in an excess number of zeros, termed zero-inflation. The presence of zero-inflation in this study is owing to observer error.
This study will provide an overview of various count regression models: the Poisson, the negative binomial (NB), the zero-inflated Poisson (ZIP) and the zero-inflated negative binomial (ZINB) models. The Poisson regression model has been widely used to analyse count data under the assumption of equidispersion, that is, the mean of the response variable is equal to the variance of the response variable [6, 7].
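The equidispersion assumption can be checked informally before any model is fitted. The sketch below is illustrative only and is not the authors' procedure: the function name `dispersion_summary` and the example counts are hypothetical. It compares the sample variance with the sample mean and compares the observed share of zeros with the share a Poisson distribution with the same mean would produce, exp(-mean).

```python
import math
import statistics


def dispersion_summary(counts):
    """Summarise how far a list of counts departs from Poisson equidispersion."""
    mean = statistics.fmean(counts)
    if mean == 0:
        raise ValueError("all counts are zero; the dispersion index is undefined")
    variance = statistics.variance(counts)  # sample variance (n - 1 denominator)
    return {
        "mean": mean,
        "variance": variance,
        # Equals 1 under equidispersion; values well above 1 suggest overdispersion.
        "dispersion_index": variance / mean,
        "observed_zero_fraction": counts.count(0) / len(counts),
        # A Poisson variable with this mean has P(Y = 0) = exp(-mean).
        "poisson_zero_fraction": math.exp(-mean),
    }


# Hypothetical site counts, invented purely for illustration.
example_counts = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 7, 0, 0, 12, 3, 0]
summary = dispersion_summary(example_counts)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A dispersion index well above 1, together with an observed zero fraction well above the Poisson zero fraction, is the pattern described in the text as overdispersion with zero-inflation. This is a descriptive check, not a formal test.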
However, although equidispersion is a natural property of the Poisson regression model, it does not always hold in real-life ecological count data, as counts may exhibit excess variability. The fact that equidispersion is rarely found in real data has resulted in the development of more general count models which do not assume equidispersion [8]. The NB regression model has been used as an alternative model to the Poisson regression model (see [9, 10]). NB regression models are more flexible than Poisson regression models even though they do not provide exact predictions in certain situations [11]. The next alternative used for modeling count data with an excess number of zeros is the ZIP model, which has been applied in many areas of research such as insurance claims [12, 13], education [11, 14], healthcare [15–17], animal ecology [18] and transport [19]. ZIP was found to be inappropriate for data that are both zero-inflated and overdispersed [20]. Furthermore, Minami et al. [20] and Rose et al. [21] propose the ZINB model as another alternative to handle overdispersion. ZINB has been shown to be appropriate for some ecological situations even though the issue of overdispersion still remains [20]. From the empirical evidence provided above, the proposed methods still pose challenges in dealing with overdispersion. There are still numerous false zeros that are being observed or not accounted for. In other words, there is still room to improve the reduction of overdispersion in count data, which this study proposes to do via a possible new method. Many published studies in ecology fail to report on overdispersion in respect of their best fitting models [22]. Failing to account for overdispersion can lead to incorrect inferences [23]. There is also limited literature in ecological studies about how overdispersion affects results, as researchers would identify predictors as having biologically meaningful effects when, in fact, they do not [6].
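To make the zero-inflated Poisson model concrete, the sketch below uses the standard textbook mixture formulation, which is not necessarily the parameterisation fitted in this study: with probability `pi` an observation is an excess zero, and otherwise it is drawn from a Poisson distribution with rate `lam`. The names `pi`, `lam`, `zip_pmf` and `zip_mean_variance`, and the parameter values, are assumptions made for illustration.

```python
import math


def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def zip_pmf(k, pi, lam):
    """P(Y = k) for a zero-inflated Poisson mixture.

    With probability pi the outcome is an excess zero; with probability
    1 - pi it is drawn from Poisson(lam), which can also produce zeros.
    """
    poisson_part = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + poisson_part if k == 0 else poisson_part


def zip_mean_variance(pi, lam):
    """Closed-form mean and variance of the zero-inflated Poisson mixture."""
    mean = (1.0 - pi) * lam
    variance = (1.0 - pi) * lam * (1.0 + pi * lam)
    return mean, variance


# Hypothetical parameter values, chosen only to illustrate the behaviour.
pi, lam = 0.4, 3.0
mean, variance = zip_mean_variance(pi, lam)
total_probability = sum(zip_pmf(k, pi, lam) for k in range(60))

print(f"P(Y = 0) under ZIP:     {zip_pmf(0, pi, lam):.4f}")
print(f"P(Y = 0) under Poisson: {poisson_pmf(0, lam):.4f}")
print(f"mean = {mean:.3f}, variance = {variance:.3f}")
print(f"probabilities sum to about {total_probability:.6f}")
```

In this formulation the variance-to-mean ratio is 1 + pi * lam, so any positive share of excess zeros already makes the counts overdispersed relative to a plain Poisson model. The ZINB model replaces the Poisson component with a negative binomial one to allow additional variability in the non-zero counts.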
As a result of the limitations in some statistical methods (for example, Poisson and NB) and the diversity of data, new techniques of data science have been developed. Data science has become an important and growing field as the Internet of Things (IoT) expands worldwide [24]. Encompassing several techniques such as data mining and machine learning, data science solves relevant problems and predicts results by taking into account data quality [25]. Data science techniques have been applied in various research fields such as healthcare [7] and education [26], and these techniques are combined to consolidate statistical analyses. This study proposes machine learning (ML) regression techniques, namely random forests (RF), support vector machines (SVM), k-nearest neighbors (kNN) and artificial neural networks (ANN), to handle the problem of overdispersion in ecological count data. Lately, ML methods have been cropping up in different areas of science. However, to the best of our knowledge, there is limited empirical evidence showing the use of ML (...truncated)


This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1007/s40745-023-00464-6.pdf
Article home page: https://link.springer.com/article/10.1007/s40745-023-00464-6

Sidumo, Bonelwa, Sonono, Energy, Takaidza, Isaac. Count Regression and Machine Learning Techniques for Zero-Inflated Overdispersed Count Data: Application to Ecological Data, Annals of Data Science, 2023, pp. 1-15, DOI: 10.1007/s40745-023-00464-6