Regression Models for Count Data: Illustrations using Longitudinal Predictors of Childhood Injury
Bryan T. Karazsia, MA, and Manfred H. M. van Dulmen, PhD
Kent State University
Key words
count data; injury; regression.
Count data with a preponderance of zeros are frequently
analyzed by pediatric psychologists. Common examples of
such count data include number of patient hospitalizations
(Logan, Radcliffe, & Smith-Whitley, 2002), frequency
of adolescent alcohol use (Audrain-McGovern, Rodriguez,
Tercyak, Neuner, & Moss, 2006), and number of childhood
injuries (Morrongiello, Ondejko, & Littlejohn, 2004;
Schwebel, Brezausek, Ramey, & Ramey, 2004). Distributions of such data violate fundamental assumptions of
many commonly used multivariate statistical techniques
[e.g., ordinary least squares (OLS) regression], leading to
results that do not accurately reflect the observed data
(Hammer & Landau, 1981). Fairly recently, statistical techniques that overcome these problems have been developed
(Hall, 2000; Lambert, 1992). Even though these techniques
are better suited than, for example, OLS regression to count-valued dependent variables, few pediatric psychologists are familiar with them. The goal of the
present article is therefore to illustrate the use of these techniques by offering a practical demonstration using prospective data from the National Institute of Child Health and
Human Development (NICHD) Study of Early Child Care.
Understanding Count Data
A count refers to the number of specified events that occur
in a given interval of time. By definition, count data consist
of only nonnegative integers. The specified event can
include any behavior of interest, and counts are utilized
frequently in the field of pediatric psychology. For example, in a recent analysis of service use among adolescents
with sickle cell disease, Logan and colleagues (2002)
reported frequencies of hospitalizations over a one year
period. Data collected from medical chart reviews were
summed to create a single variable depicting the number
of hospitalizations. As is common with count variables, the
authors reported that >50% of participants had not been
hospitalized (Logan et al., 2002). In other words, because
such a large number of individuals had not experienced
this event, we would refer to this count variable as being
zero-inflated. Other recent examples within pediatric psychology of such zero-inflated count data include adolescent substance use (Audrain-McGovern et al., 2006),
number of sexual partners (Prinstein, Meade, & Cohen,
2003), and children’s history of injuries (Hagan &
Kuebli, 2007).
*Portions of this article were presented at the 2008 National Conference in Child Health Psychology, Miami, FL.
All correspondence concerning this article should be addressed to Bryan T. Karazsia, Department of Psychology, Kent State
University, Kent, OH 44242, USA. E-mail:
Journal of Pediatric Psychology 33(10) pp. 1076–1084, 2008
doi:10.1093/jpepsy/jsn055
Advance Access publication June 3, 2008
Journal of Pediatric Psychology vol. 33 no. 10 © The Author 2008. Published by Oxford University Press on behalf of the Society of Pediatric Psychology.
All rights reserved. For permissions, please e-mail:
Objective: To offer a practical demonstration of regression models recommended for count outcomes
using longitudinal predictors of children’s medically attended injuries.
Method: Participants included
708 children from the NICHD child care study. Measures of temperament, attention, parent–child
relationship, and safety of physical environment were used to predict medically attended
injuries.
Results: Statistical comparisons among five estimation methods revealed that a zero-inflated
Poisson (ZIP) model provided the best fit with observed data. ZIP models simultaneously model dichotomous
and continuous outcomes of count variables, and different constellations of predictors emerged for each
aspect of the estimated model.
Conclusions: This study offers a practical demonstration of techniques
designed to handle dependent count variables. The conceptual and statistical advantages of these methods
are emphasized, and Stata script is provided to facilitate adoption of these techniques.
Analysis of Count Data
Potential “Solutions”
Traditionally, researchers have used two solutions to deal
with zero-inflated count data. First, researchers have opted
to transform such data. A square root transformation has
been recommended for count data (Johnson & Wichern,
1998), though several problems with transformations of
count variables are documented (see Sturman, 1999, for a
review). Most notably, transformations do not address the
preponderance of zeros, and models fit to transformed data can predict meaningless values (e.g., negative values, even though counts cannot be
negative; Hammer & Landau, 1981; Harrison &
Hulin, 1989). In addition, transformed data are more difficult to interpret than nontransformed data (Tabachnick
& Fidell, 2007).
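The problem of meaningless predictions can be seen in a minimal sketch. The data below are hypothetical toy values invented for illustration (they are not from the NICHD study), and the helper name `ols_fit` is likewise an assumption. The sketch applies a square root transformation to a zero-heavy count outcome, fits a simple least-squares line, and lists the fitted values that fall below zero on the transformed scale.

```python
import math

# Hypothetical toy data (not from the study): x is a predictor,
# y is a count outcome in which most observations are zero.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 0, 0, 1, 0, 2, 4]

# Square root transformation of the counts.
y_sqrt = [math.sqrt(v) for v in y]


def ols_fit(xs, ys):
    """Return (intercept, slope) of a simple least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = ols_fit(x, y_sqrt)
fitted = [intercept + slope * a for a in x]

# Fitted values below zero have no sensible interpretation as the
# square root of a count.
negative = [round(f, 3) for f in fitted if f < 0]
print("negative fitted values:", negative)
```

In this toy example the transformation leaves the cluster of zeros intact, and the fitted line dips below zero for the lowest values of the predictor.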
Another commonly used approach is to dichotomize
data into groups: those who performed the behavior
(nonzero counts) and those who did not (zero counts).
For example, one may be interested in the factors that
predict whether or not adolescents are hospitalized.
This approach is problematic because dichotomization
discards meaningful variation among the nonzero counts; as such, occasions on
which dichotomization is justified are rare
(MacCallum, Zhang, Preacher, & Rucker, 2002).
Alternative Models
Fortunately, numerous models have been developed
specifically for count data (Long & Freese, 2006; Sano,
Jeong, Acock, & Zvonkovic, 2005). These models can
handle nonnormality on the dependent variable and do
not require the researcher to either dichotomize or transform the dependent variable. We focus on four of these
models (Atkins & Gallop, 2007; Long & Freese, 2006;
Sano et al., 2005): Poisson, negative binomial, zero-inflated
Poisson (ZIP), and zero-inflated negative binomial (ZINB).
¹ While OLS regression makes no explicit assumption about the distribution of the
dependent variable (Tabachnick & Fidell, 2007),
that distribution strongly influences the distribution of the residuals (Atkins
& Gallop, 2007).
Poisson
The Poisson distribution was developed to model discrete
counts, and because Poisson regression is similar to linear regression in
many respects, it is relatively easy to interpret.² The distribution becomes increasingly positively skewed as the
mean of the dependent variable decreases (Long &
Freese, 2006), reflecting a common property of count data.
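The relation between the mean and the skew can be checked numerically. The following is a minimal sketch that uses only the standard Poisson probability mass function, P(Y = k) = e^(-mu) mu^k / k!; the function names `poisson_pmf` and `moments`, the truncation point `k_max`, and the three example means are assumptions chosen for illustration.

```python
import math


def poisson_pmf(k, mu):
    """Probability of observing exactly k events when the mean is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def moments(mu, k_max=100):
    """Mean, variance, and skewness of a Poisson distribution,
    computed by summing over counts 0..k_max (the truncated tail is
    negligible for the small means used here)."""
    ks = range(k_max + 1)
    probs = [poisson_pmf(k, mu) for k in ks]
    mean = sum(k * p for k, p in zip(ks, probs))
    var = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))
    skew = sum((k - mean) ** 3 * p for k, p in zip(ks, probs)) / var ** 1.5
    return mean, var, skew


for mu in (0.5, 2.0, 8.0):
    mean, var, skew = moments(mu)
    print(f"mu={mu}: P(0)={poisson_pmf(0, mu):.3f} "
          f"mean={mean:.3f} var={var:.3f} skew={skew:.3f}")
```

The computed skewness equals 1 divided by the square root of the mean, so it grows as the mean shrinks, and the probability of a zero count, e^(-mu), grows with it. The computed variance matches the mean in every case, which is the property that the first assumption discussed next concerns.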
The apparent simplicity of Poisson comes with two
restrictive assumptions (Sturman, 1999). First, the variance and mean of the count variable are assumed to be
equal. In reality, however, the variance is usually much
greater than the mean (i.e., overdispersion; Cameron &
Trivedi, 1986) and therefore Poisson models—though
widely used to handle count data—may not be well
suited to handle some types of count outcomes. Another
restrictive assumption of Poisson models is that occurrenc (...truncated)