The reliability of the twelve-item general health questionnaire (GHQ-12) under realistic assumptions
Matthew Hankins (0, 1, 2)
0 Brighton & Sussex University Hospitals NHS Trust, Royal Sussex County Hospital, Brighton, UK
1 Department of Primary Care & Public Health, Brighton & Sussex Medical School, Brighton, UK
2 King's College London, Department of Psychology (at Guy's), Institute of Psychiatry, London, UK
Background: The twelve-item General Health Questionnaire (GHQ-12) was developed to screen for non-specific psychiatric morbidity. It has been widely validated and found to be reliable. These validation studies have assumed that the GHQ-12 is one-dimensional and free of response bias, but recent evidence suggests that neither assumption may be correct, threatening its utility as a screening instrument. Further uncertainty arises from the multiplicity of scoring methods for the GHQ-12. This study set out to establish the best fitting model for the GHQ-12 under three scoring methods (Likert, GHQ and C-GHQ) and to calculate the degree of measurement error under these more realistic assumptions.
Methods: GHQ-12 data were obtained from the Health Survey for England 2004 cohort (n = 3705). Structural equation modelling was used to assess the fit of (1) the one-dimensional model, (2) the current 'best fit' three-dimensional model and (3) a one-dimensional model with response bias. Three different scoring methods were assessed for each model. The best fitting model was assessed for reliability, standard error of measurement and discrimination.
Results: The best fitting model was one-dimensional with response bias on the negatively phrased items, suggesting that previous GHQ-12 factor structures were artifacts of the analysis method. The reliability of this model was over-estimated by Cronbach's Alpha for all scoring methods: 0.90 (Likert method), 0.90 (GHQ method) and 0.75 (C-GHQ method). More realistic estimates of reliability were 0.73, 0.87 and 0.53, respectively. Discrimination (Delta) also varied according to scoring method: 0.94 (Likert method), 0.63 (GHQ method) and 0.97 (C-GHQ method).
Conclusion: Conventional psychometric assessments using factor analysis and reliability estimates have obscured substantial measurement error in the GHQ-12 due to response bias on the negative items, which limits its utility as a screening instrument for psychiatric morbidity.
Background
The twelve-item General Health Questionnaire (GHQ-12)
is intended to screen for general (non-psychotic)
psychiatric morbidity [1]. It has been widely used,
translated into many languages and extensively validated
in general and clinical populations worldwide [2]. The
validation process has been principally psychometric in
nature, focusing on the reliability and validity of the data
generated, with additional support coming from studies
of the sensitivity and specificity of the measurement [2,3].
Despite this, the utility of using self-report measures such
as the GHQ-12 has been questioned, with a recent review
concluding that clinicians may find the low positive
predictive value of this method unconvincing as a diagnostic
aid [4]. This raises the question of whether psychometric
validation alone is a sufficient basis for adopting the
GHQ-12 as a screening instrument in clinical practice. In
clinical practice, poor positive predictive value means that
many of those screening positive are not suffering from a
psychiatric disorder but may be deemed to warrant further
investigation; in a research context it means that many
participants will be misclassified, a form of measurement
error that will bias subsequent analyses [5].
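The dependence of positive predictive value on prevalence can be made concrete with a short calculation. The sketch below applies Bayes' theorem to a screening test; the sensitivity, specificity and prevalence figures are hypothetical values chosen for illustration and are not estimates for the GHQ-12.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Proportion of screen-positive respondents who truly have the disorder."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical screening scenario (illustrative values only, not GHQ-12 data):
# 80% sensitivity, 80% specificity, 10% prevalence of disorder.
ppv = positive_predictive_value(sensitivity=0.80, specificity=0.80, prevalence=0.10)
print(f"Positive predictive value: {ppv:.3f}")
```

Under these assumed figures fewer than one in three screen-positive respondents has the disorder, even though sensitivity and specificity both look respectable.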
In classical test theory, a test or questionnaire is assessed
for dimensionality, reliability and validity [6].
Dimensionality is assessed using factor analysis, a method based
on the pattern of correlations between the questionnaire
item scores. If all items share moderate to strong
correlations, this produces a single 'factor' and suggests that the
scale measures a single dimension. Several groups of such
items produce several factors, suggesting that several
dimensions are being measured. Since the method
depends on the inter-item correlations, anything that
produces correlated items will be interpreted as a factor, and
therefore caution should be exercised when interpreting
factor structures as substantive dimensions [6]. Reliability
is an estimate of the degree of measurement error entailed
in the measurement of a single dimension by several
items. If a questionnaire measures several dimensions,
then each requires an estimate of reliability. Several
methods are commonly used to estimate reliability (for
example, Cronbach's Alpha or test-retest correlations), but all
rely on the correlation between items (Alpha) or scale
scores (test-retest). In addition, the interpretation of the
resulting reliability coefficient depends on some strong
assumptions being met: most notably in the context of the
current study, there is the assumption that the
measurement error of each item is random (i.e. uncorrelated with
anything else). Finally, validity refers to the extent to
which the test or questionnaire measures what it is
supposed to measure. This is commonly assessed with
reference to some external criterion, but it should be clear that
a questionnaire intended to measure a single dimension
cannot be valid if it measures several dimensions, or if it
produces data with a high proportion of measurement
error. Hence, a defensible factor structure and adequate
reliability are necessary for a measure to be valid, but do not
guarantee it.
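The role of the random-error assumption can be illustrated with a small simulation. The sketch below is not the analysis used in this study: it assumes twelve continuous items that each measure one latent trait with unit-variance random error, and then adds a shared bias term (a hypothetical stand-in for response bias) to six of the items. The function names, sample size and variances are all assumptions made for illustration. Cronbach's Alpha is compared with the squared correlation between the sum score and the simulated trait, which is only available because the trait is known in a simulation.

```python
import numpy as np


def cronbach_alpha(items):
    """Cronbach's Alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - sum_item_var / total_var)


def simulate(n=20000, k=12, n_biased=6, bias_sd=0.0, seed=0):
    """One latent trait plus random error; optional shared bias on some items."""
    rng = np.random.default_rng(seed)
    trait = rng.normal(size=n)               # the single dimension being measured
    noise = rng.normal(size=(n, k))          # random, uncorrelated item error
    bias = bias_sd * rng.normal(size=n)      # error shared by the biased items
    items = trait[:, None] + noise
    items[:, :n_biased] += bias[:, None]     # correlated error on the first items
    return trait, items


def trait_share_of_sum_score(trait, items):
    """Squared correlation between the sum score and the simulated trait."""
    total = items.sum(axis=1)
    return np.corrcoef(trait, total)[0, 1] ** 2


for label, sd in [("random error only", 0.0), ("shared bias on six items", 1.0)]:
    trait, items = simulate(bias_sd=sd)
    print(label,
          "alpha =", round(cronbach_alpha(items), 2),
          "trait share =", round(trait_share_of_sum_score(trait, items), 2))
```

With random error only, the two quantities agree closely. With the shared bias, Alpha barely changes while the proportion of sum-score variance attributable to the trait falls markedly, because Alpha treats the correlated error as if it were true-score variance.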
While psychometric evaluation of the GHQ-12 suggests
that it is a valid measure of psychiatric morbidity (i.e. it
measures what it purports to measure), and also a reliable
measure (i.e. measurement error is low), examination of
the factor structure has repeatedly led to the conclusion
that the GHQ-12 measures psychiatric morbidity in more
than one domain [7]. These results have been interpreted
as evidence that the GHQ-12 measures more than one
dimension of psychiatric morbidity, although typically
each dimension has been found to be reliable and the
measurement error for each dimension acceptable.
Currently the consensus appears to be that the GHQ-12
measures psychiatric dysfunction in three domains: social
dysfunction, anxiety and loss of confidence [7-9]. However,
because these domains were derived solely from factor analysis,
both their utility and their clinical ontology remain
unclear [10].
Another interpretation of this factor analytic evidence is
that the apparent multidimensional nature of the
GHQ-12 is simply an artefact of the method of analysis, rather
than an aspect of the GHQ-12 itself [10]. The studies
reporting that the GHQ-12 is multidimensional used
either exploratory factor analysis (EFA) or confirmatory
factor analysis by structural equation modelling (SEM),
and it has long been known that these methods can
produce spurious dimensions even when the measure in
question is one-dimensional if the questionnaire
c (...truncated)