The reliability of the twelve-item general health questionnaire (GHQ-12) under realistic assumptions

BMC Public Health, Oct 2008

Background: The twelve-item General Health Questionnaire (GHQ-12) was developed to screen for non-specific psychiatric morbidity. It has been widely validated and found to be reliable. These validation studies have assumed that the GHQ-12 is one-dimensional and free of response bias, but recent evidence suggests that both assumptions may be wrong, threatening its utility as a screening instrument. Further uncertainty arises because the GHQ-12 can be scored in several ways. This study set out to establish the best fitting model for the GHQ-12 for three scoring methods (Likert, GHQ and C-GHQ) and to calculate the degree of measurement error under these more realistic assumptions.

Methods: GHQ-12 data were obtained from the Health Survey for England 2004 cohort (n = 3705). Structural equation modelling was used to assess the fit of (1) the one-dimensional model, (2) the current 'best fit' three-dimensional model and (3) a one-dimensional model with response bias. Each model was assessed under each of the three scoring methods. The best fitting model was assessed for reliability, standard error of measurement and discrimination.

Results: The best fitting model was one-dimensional with response bias on the negatively phrased items, suggesting that previous GHQ-12 factor structures were artifacts of the analysis method. Cronbach's Alpha over-estimated the reliability of this model for all scoring methods: 0.90 (Likert method), 0.90 (GHQ method) and 0.75 (C-GHQ method). More realistic estimates of reliability were 0.73, 0.87 and 0.53, respectively. Discrimination (Delta) also varied according to scoring method: 0.94 (Likert method), 0.63 (GHQ method) and 0.97 (C-GHQ method).

Conclusion: Conventional psychometric assessments using factor analysis and reliability estimates have obscured substantial measurement error in the GHQ-12 due to response bias on the negative items, which limits its utility as a screening instrument for psychiatric morbidity.
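The three scoring methods differ only in how each item's four response options are turned into numbers, and Delta depends on how the resulting totals spread across the possible range. The Python sketch below illustrates both under stated assumptions: responses are coded 0 (most favourable) to 3 (least favourable); the mappings are the conventional ones (Likert 0-1-2-3, GHQ 0-0-1-1, and C-GHQ, which keeps GHQ scoring for positive items but scores the negatively phrased items 0-1-1-1); the positions of the negative items assume the usual GHQ-12 item order; and Delta uses the common generalisation of Ferguson's coefficient, δ = (1 − Σp_i²)/(1 − 1/S), where p_i is the proportion of respondents with total score i and S is the number of possible totals. The function names are hypothetical, and this is not the code used in the study.

```python
import numpy as np

# ASSUMED zero-based positions of the negatively phrased items in the usual
# GHQ-12 item order (items 2, 5, 6, 9, 10 and 11); adjust for other orderings.
NEGATIVE_ITEMS = [1, 4, 5, 8, 9, 10]

# Score given to each response option, options coded 0 (most favourable)
# to 3 (least favourable).
LIKERT = np.array([0, 1, 2, 3])
GHQ = np.array([0, 0, 1, 1])
CGHQ_NEGATIVE = np.array([0, 1, 1, 1])  # C-GHQ scoring of the negative items


def total_scores(responses, method):
    """Total score per respondent from an (n, 12) array of options 0..3."""
    r = np.asarray(responses, dtype=int)
    if method == "likert":
        items = LIKERT[r]
    elif method == "ghq":
        items = GHQ[r]
    elif method == "cghq":
        items = GHQ[r]  # positive items keep the GHQ 0-0-1-1 scoring
        items[:, NEGATIVE_ITEMS] = CGHQ_NEGATIVE[r[:, NEGATIVE_ITEMS]]
    else:
        raise ValueError(f"unknown scoring method: {method!r}")
    return items.sum(axis=1)


def ferguson_delta(totals, max_total):
    """Generalised Ferguson's delta for totals that can range over 0..max_total:
    observed 1 - sum(p^2) divided by its maximum, 1 - 1/S."""
    n_possible = max_total + 1
    _, counts = np.unique(totals, return_counts=True)
    p = counts / counts.sum()
    return (1.0 - np.sum(p ** 2)) / (1.0 - 1.0 / n_possible)


# Toy responses for three hypothetical respondents (not survey data).
responses = np.array([[0] * 12, [3] * 12, [1] * 12])
for method, max_total in (("likert", 36), ("ghq", 12), ("cghq", 12)):
    totals = total_scores(responses, method)
    print(method, totals.tolist(), round(ferguson_delta(totals, max_total), 3))
```

The toy respondents show how identical answers yield different totals: choosing the second option throughout scores 12 under Likert, 0 under GHQ and 6 under C-GHQ. Real use would pass the full n × 12 response matrix; the sketch does not handle missing responses.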



Matthew Hankins
Brighton & Sussex University Hospitals NHS Trust, Royal Sussex County Hospital, Brighton, UK
Department of Primary Care & Public Health, Brighton & Sussex Medical School, Brighton, UK
King's College London, Department of Psychology (at Guy's), Institute of Psychiatry, London, UK

Background

The twelve-item General Health Questionnaire (GHQ-12) is intended to screen for general (non-psychotic) psychiatric morbidity [1]. It has been widely used and, as a result, translated into many languages and extensively validated in general and clinical populations worldwide [2]. The validation process has been principally psychometric in nature, focusing on the reliability and validity of the data generated, with additional support coming from studies of the sensitivity and specificity of the measurement [2,3].

Despite this, the utility of self-report measures such as the GHQ-12 has been questioned: a recent review concluded that clinicians may find the low positive predictive value of this method unconvincing as a diagnostic aid [4]. This raises the question of whether psychometric validation alone is a sufficient basis for adopting the GHQ-12 as a screening instrument in clinical practice. In clinical practice, poor positive predictive value means that many of those screening positive do not have a psychiatric disorder but may nonetheless be deemed to warrant further investigation; in a research context it means that many participants will be misclassified, a form of measurement error that will bias subsequent analyses [5].

In classical test theory, a test or questionnaire is assessed for dimensionality, reliability and validity [6]. Dimensionality is assessed using factor analysis, a method based on the pattern of correlations between the questionnaire item scores.
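The concern about poor positive predictive value raised earlier in this section follows directly from Bayes' theorem: PPV = sensitivity × prevalence / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence)). The short Python sketch below uses purely hypothetical values for sensitivity, specificity and prevalence; they are not estimates for the GHQ-12.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disorder | screen positive), computed with Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical illustration only: a screen with 80% sensitivity and
# 80% specificity, used where 10% of people have the disorder.
ppv = positive_predictive_value(0.80, 0.80, 0.10)
print(f"PPV = {ppv:.2f}")
```

With these assumed figures the PPV is about 0.31, so roughly 69% of those who screen positive would not have the disorder. Because PPV falls as prevalence falls, the same screen performs worse in a general population than in a clinical one.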
If all items share moderate to strong correlations, this produces a single 'factor' and suggests that the scale measures a single dimension. Several groups of such items produce several factors, suggesting that several dimensions are being measured. Since the method depends on the inter-item correlations, anything that produces correlated items will be interpreted as a factor, and therefore caution should be exercised when interpreting factor structures as substantive dimensions [6].

Reliability is an estimate of the degree of measurement error entailed in measuring a single dimension with several items. If a questionnaire measures several dimensions, then each requires its own estimate of reliability. Several methods are commonly used to estimate reliability (for example, Cronbach's Alpha or test-retest correlations), but all rely on the correlation between items (Alpha) or between scale scores (test-retest). In addition, the interpretation of the resulting reliability coefficient depends on some strong assumptions being met; the most important in the context of the current study is that the measurement error of each item is random (i.e. uncorrelated with anything else).

Finally, validity refers to the extent to which the test or questionnaire measures what it is supposed to measure. This is commonly assessed with reference to some external criterion, but a questionnaire intended to measure a single dimension cannot be valid if it measures several dimensions, or if it produces data with a high proportion of measurement error. Hence, factor analysis and reliability estimates provide evidence of the adequacy of a measure, but do not guarantee it.

While psychometric evaluation of the GHQ-12 suggests that it is a valid measure of psychiatric morbidity (i.e. it measures what it purports to measure), and also a reliable measure (i.e.
measurement error is low), examination of the factor structure has repeatedly led to the conclusion that the GHQ-12 measures psychiatric morbidity in more than one domain [7]. These results have been interpreted as evidence that the GHQ-12 measures more than one dimension of psychiatric morbidity, although typically each dimension has been found to be reliable and the measurement error for each dimension acceptable. Currently the consensus appears to be that the GHQ-12 measures psychiatric dysfunction in three domains: social dysfunction, anxiety and loss of confidence [7-9]. However, because these domains were derived solely from factor analysis, both their utility and their clinical ontology remain unclear [10].

Another interpretation of this factor analytic evidence is that the apparent multidimensional nature of the GHQ-12 is simply an artefact of the method of analysis, rather than an aspect of the GHQ-12 itself [10]. The studies reporting that the GHQ-12 is multidimensional used either exploratory factor analysis (EFA) or confirmatory factor analysis by structural equation modelling (SEM), and it has long been known that these methods can produce spurious dimensions even when the measure in question is one-dimensional if the questionnaire c (...truncated)
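The claim that correlated measurement error can masquerade as an extra dimension, and inflate Cronbach's Alpha, can be checked on a model-implied correlation matrix. The Python sketch below assumes hypothetical standardised loadings: 0.6 on a single general factor for all twelve items, plus 0.5 on a response-bias factor for the six negatively phrased items (positions assumed from the usual item order). These are illustrative values, not estimates from the Health Survey for England data. The sketch counts eigenvalues greater than one (the Kaiser rule, used here only as a simple stand-in for the factor analyses discussed above), computes Alpha from the matrix, and computes the share of total-score variance due to the general factor alone (coefficient omega for the general factor, often called omega-hierarchical). That is one way of discounting response-bias variance; it is not necessarily the estimator used in this study.

```python
import numpy as np

N_ITEMS = 12
NEGATIVE = [1, 4, 5, 8, 9, 10]  # ASSUMED positions of the negatively phrased items
GENERAL_LOADING = 0.6           # hypothetical loading on the single substantive factor
BIAS_LOADING = 0.5              # hypothetical loading on the response-bias factor


def implied_correlations(bias_loading):
    """Model-implied correlations R = L L^T + diag(theta) for one general factor
    plus a response-bias factor that loads only on the negative items."""
    loadings = np.zeros((N_ITEMS, 2))
    loadings[:, 0] = GENERAL_LOADING
    loadings[NEGATIVE, 1] = bias_loading
    common = loadings @ loadings.T
    theta = 1.0 - np.diag(common)  # unique (random error) variance of each item
    return common + np.diag(theta), loadings, theta


def cronbach_alpha(cov):
    """Alpha computed from an item covariance (here correlation) matrix."""
    k = cov.shape[0]
    return k / (k - 1) * (1.0 - np.trace(cov) / cov.sum())


def omega_general(loadings, theta):
    """Share of total-score variance attributable to the general factor alone."""
    general = loadings[:, 0].sum() ** 2
    total = (loadings.sum(axis=0) ** 2).sum() + theta.sum()
    return general / total


def standard_error_of_measurement(sd, reliability):
    """Classical SEM: observed-score SD times sqrt(1 - reliability)."""
    return sd * np.sqrt(1.0 - reliability)


for b in (0.0, BIAS_LOADING):
    R, lam, theta = implied_correlations(b)
    n_factors = int((np.linalg.eigvalsh(R) > 1.0).sum())  # Kaiser rule
    alpha, omega = cronbach_alpha(R), omega_general(lam, theta)
    sd_total = np.sqrt(R.sum())  # SD of the sum of the 12 standardised items
    print(f"bias loading {b}: eigenvalues > 1 = {n_factors}, "
          f"alpha = {alpha:.2f}, omega_g = {omega:.2f}, "
          f"SEM via alpha = {standard_error_of_measurement(sd_total, alpha):.2f}, "
          f"SEM via omega_g = {standard_error_of_measurement(sd_total, omega):.2f}")
```

With these assumed loadings, the model without response bias gives one eigenvalue above one and identical Alpha and omega (about 0.87). Adding the bias factor produces a second eigenvalue above one (about 1.18), which an eigenvalue-based analysis would read as a second dimension, and raises Alpha to about 0.90 while the general-factor reliability drops to about 0.77, so an SEM based on Alpha understates the measurement error. This parallels, but does not reproduce, the pattern reported in the abstract.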


PDF: http://www.biomedcentral.com/content/pdf/1471-2458-8-355.pdf
Article home page: http://www.biomedcentral.com/1471-2458/8/355

Matthew Hankins. The reliability of the twelve-item general health questionnaire (GHQ-12) under realistic assumptions. BMC Public Health, 2008, 8:355. DOI: 10.1186/1471-2458-8-355