Improving confidence intervals for normed test scores: Include uncertainty due to sampling variability
Behavior Research Methods (2019) 51:826–839
https://doi.org/10.3758/s13428-018-1122-8
Lieke Voncken1 · Casper J. Albers1 · Marieke E. Timmerman1
Published online: 6 November 2018
© The Author(s) 2018
Abstract
Test publishers usually provide confidence intervals (CIs) for normed test scores that reflect the uncertainty due to the
unreliability of the tests. The uncertainty due to sampling variability in the norming phase is ignored. To express uncertainty
due to norming, we propose a flexible method that is applicable in continuous norming and allows for a variety of score
distributions, using Generalized Additive Models for Location, Scale, and Shape (GAMLSS; Rigby & Stasinopoulos, 2005).
We assessed the performance of this method in a simulation study, by examining the quality of the resulting CIs. We varied
the population model, procedure of estimating the CI, confidence level, sample size, value of the predictor, extremity of the
test score, and type of variance-covariance matrix. The results showed that good quality of the CIs could be achieved in most
conditions. The method is illustrated using normative data of the SON-R 6-40 test. We recommend that test developers use
this approach to arrive at CIs, and thus properly express the uncertainty due to norm sampling fluctuations, in the context
of continuous norming. Adopting this approach will help (e.g., clinical) practitioners to obtain a fair picture of the person
assessed.
Keywords Continuous norming · GAMLSS · Box-Cox power exponential distribution · Posterior simulation ·
Psychological tests
Introduction
Norms are needed to give an interpretation of someone’s test
score. A normed score can be expressed in different ways,
such as a percentile or a z score. It indicates the person’s standing
on the test relative to other people in the population. For
instance, the normed scores of intelligence tests are typically
expressed as normalized intelligence quotient (IQ) scores,
with a population mean of 100 and standard deviation of 15,
yielding an immediate interpretation of any observed IQ score.
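The relation between these expressions of a normed score can be made concrete with a minimal sketch. It assumes that the normed scores follow a normal distribution in the population, which holds by construction for normalized IQ scores; the function names are illustrative and do not stem from any test manual.

```python
from statistics import NormalDist

IQ_MEAN, IQ_SD = 100.0, 15.0  # IQ scale as described in the text


def z_to_iq(z: float) -> float:
    """Map a z score onto the IQ scale (mean 100, SD 15)."""
    return IQ_MEAN + IQ_SD * z


def z_to_percentile(z: float) -> float:
    """Percentage of a standard normal population scoring at or below z."""
    return 100.0 * NormalDist().cdf(z)


def percentile_to_z(percentile: float) -> float:
    """Inverse mapping: percentile (strictly between 0 and 100) to z score."""
    return NormalDist().inv_cdf(percentile / 100.0)


if __name__ == "__main__":
    for z in (-2.0, 0.0, 1.0):
        print(f"z = {z:+.1f} -> IQ {z_to_iq(z):.0f}, "
              f"percentile {z_to_percentile(z):.1f}")
```

Under this assumption, a z score of -2 corresponds to an IQ score of 70 and to roughly the 2.3rd percentile.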
Electronic supplementary material The online version of this article (https://doi.org/10.3758/s13428-018-1122-8) contains supplementary material, which is available to authorized users.
Lieke Voncken
1 Department Psychometrics & Statistics, Faculty of Behavioural and Social Sciences, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands
Normed tests are often applied as high-stakes tests,
meaning that they are used to make important decisions
about individuals. A clear example relates to the fact that
mentally retarded individuals are exempted from the death
penalty in 18 of the United States (Death Penalty Information Center, 2015). Some states, like Idaho and Florida,
use IQ scores to identify mental retardation, applying a
rigid cutoff (i.e., observed IQ score ≤ 70). Another instance
of the use of a rigid cutoff can be found in the Netherlands, where mental retardation indicated by an observed
IQ score of 85 or below qualifies for the long-term care act
(Zorginstituut Nederland, 2017), allowing the financing of
supervised living and debt repayment programs.
In using test scores for important individual decisions, it
is essential to acknowledge the uncertainty in observed test
scores. There is an increasing awareness of the importance
of reflecting this uncertainty. For instance, in the fifth
edition of the DSM (Diagnostic and Statistical Manual
of Mental Disorders; American Psychiatric Association,
2013), unlike earlier editions, a standard error of 5 IQ
points was explicitly included in defining the upper range
of intellectual disability. These expressions of uncertainty
in observed test scores reflect the notion that observed
scores may differ across assessments, even if the individual
assessed would remain exactly the same, or two individuals
would be exactly the same, on the characteristic measured.
In line with this increased awareness, the Dutch Committee on Testing (COTAN) recommends that test publishers
report information regarding the accuracy of the test (i.e.,
standard error of measurement, standard error of estimate,
or test information function/standard error) and the appropriate intervals (Evers et al., 2009). Nowadays, many test
publishers express this uncertainty related to test reliability, e.g., for the WISC-IV (Wechsler, 2003) and the Bayley-III
(Bayley, 2006).
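As an illustration of such a reliability-based interval, the following minimal sketch uses the classical test theory relation between the standard error of measurement (SEM) and the reliability, and a symmetric normal-theory interval around the observed score. The reliability of .90 and the function names are hypothetical choices for illustration; individual test manuals may use other procedures, such as intervals based on the standard error of estimate.

```python
import math
from statistics import NormalDist


def sem_from_reliability(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return sd * math.sqrt(1.0 - reliability)


def reliability_ci(observed: float, sem: float,
                   level: float = 0.95) -> tuple[float, float]:
    """Symmetric interval observed +/- z * SEM (measurement error only)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return observed - z * sem, observed + z * sem


if __name__ == "__main__":
    # Hypothetical test on the IQ scale (SD 15) with reliability .90.
    sem = sem_from_reliability(sd=15.0, reliability=0.90)
    lower, upper = reliability_ci(observed=70.0, sem=sem)
    print(f"SEM = {sem:.2f}; 95% CI = [{lower:.1f}, {upper:.1f}]")
```

The interval in this sketch reflects measurement error only: it treats the norms used to compute the observed score as if they were known without error.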
Nevertheless, this is insufficient for normed scores,
because it ignores another source of uncertainty, namely
the uncertainty due to the test norming itself. Test norming takes place
on the basis of a norming sample, rather than the full
population, implying that the norms themselves are subject to
sampling fluctuations. This source of uncertainty in normed
test scores has been acknowledged only recently, with the
proposal of two methods to estimate CIs for normed test
scores, under the assumption that the norming sample stems
from a single population.
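The effect of sampling fluctuations on norms can be illustrated with a small simulation. The sketch below is not one of the methods discussed in this article: it assumes a normally distributed raw score in the population (with hypothetical mean 50 and SD 10), repeatedly draws norming samples of size n, and records the z score that a fixed raw score would receive from each sample.

```python
import random
from statistics import mean, stdev


def normed_z(raw_score: float, norm_sample: list[float]) -> float:
    """z score of raw_score relative to the mean and SD of a norming sample."""
    return (raw_score - mean(norm_sample)) / stdev(norm_sample)


def simulate_normed_z(raw_score: float, n: int, replications: int = 1000,
                      pop_mean: float = 50.0, pop_sd: float = 10.0,
                      seed: int = 1) -> list[float]:
    """Normed z scores of one raw score across repeated norming samples."""
    rng = random.Random(seed)
    results = []
    for _ in range(replications):
        # Draw a fresh norming sample from the (hypothetical) population.
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        results.append(normed_z(raw_score, sample))
    return results


if __name__ == "__main__":
    # A raw score of 30 has a population z score of -2 in this setup.
    for n in (50, 500):
        zs = sorted(simulate_normed_z(raw_score=30.0, n=n))
        lower = zs[int(0.025 * len(zs))]
        upper = zs[int(0.975 * len(zs)) - 1]
        print(f"n = {n}: central 95% of normed z scores in "
              f"[{lower:.2f}, {upper:.2f}]")
```

The same raw score receives a different normed score depending on the norming sample that happened to be drawn, and the spread of these normed scores decreases as the norming sample grows.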
Crawford et al. (2011) proposed a method to obtain
CIs around percentile norms, under the assumption that
the scores in the norm population are normally distributed.
Recently, Oosterhuis et al. (2017) derived standard errors
for four different norm statistics (standard deviation,
percentile ranks, stanine boundaries, and z scores), under
the assumption that the scores in the norm population stem
from a multinomial distribution. As described by Oosterhuis
et al. (2016), this method can be applied to residuals of
raw test scores in the context of regression-based norming,
in which relevant personal characteristics (e.g., age) are
used to estimate the raw test score distribution. Even
though the method of Oosterhuis et al. (2017) has less
strict assumptions than the method of Crawford et al.
(2011), it still assumes normally distributed errors and
homoscedasticity of the error variances, which are often
unrealistic assumptions in practice. For instance, floor and
ceiling effects may introduce skewness.
We propose a method to derive CIs indicating uncertainty
in normed scores that does not rely on those strict
assumptions. To this end, we use the flexible Generalized
Additive Models for Location, Scale, and Shape (GAMLSS;
Rigby and Stasinopoulos, 2005), which has been advocated
as a useful approach to continuous norming (e.g., Bayley-III
(Bayley, 2006) and SON-R 2-8 (Tellegen & Laros, 2017)).
GAMLSS includes a broad range of distributions, yielding
a good chance of finding a well-fitting distribution for
empirical normative data. Interestingly, the ordinary linear
regression model described by Oosterhuis et al. (2016) is
a restricted, special case of a model within the GAMLSS
framework.
GAMLSS
Applying GAMLSS implies that the score distribution is
modelled conditional on predictor(s) of interest (e.g., age),
based on certain distributional parameters. For instan (...truncated)