Commentary: Reporting standards are needed for evaluations of risk reclassification
Published by Oxford University Press on behalf of the International Epidemiological Association
© The Author 2011; all rights reserved. Advance Access publication 13 May 2011
International Journal of Epidemiology 2011;40:1106–1108
doi:10.1093/ije/dyr083
Margaret S Pepe1,2* and Holly Janes1
1Biostatistics and Biomathematics Program, Fred Hutchinson Cancer Research Center, Seattle, WA, USA and 2Biostatistics
Department, University of Washington, Seattle, WA, USA
*Corresponding author. Biostatistics and Biomathematics Program, Fred Hutchinson Cancer Research Center, Seattle, WA 98109,
USA. E-mail:
Accepted 14 April 2011
New approaches have been developed in recent years
to quantify the improvement in prediction performance gained by adding a novel marker to a set of
baseline predictors of risk. The paper by Tzoulaki
et al.1 concerns risk reclassification techniques and
focuses specifically on the net reclassification improvement (NRI) index. Their review shows that use
of risk reclassification analysis is extremely common
in practice, with 51 papers using the technique published in only 3 years since its introduction.
Unfortunately and alarmingly, the review shows that
the quality of reporting is dismal. Investigators seem
confused about the roles and interpretations of risk
reclassification metrics. Guidance on how to report
results of risk reclassification analysis would be helpful to authors, reviewers and the field in general.
The risk reclassification table was first introduced by
Cook.2 The table is constructed by choosing clinically
meaningful risk categories and cross-classifying individuals according to their risks calculated with the
baseline risk model and with the expanded risk
model. The top panel of Table 1 provides an illustration. Cook and Ridker3 developed a whole analysis
strategy around the risk reclassification table including new hypothesis tests and a new metric called ‘percent correct reclassification’. However, the value of
these analysis techniques is doubtful and results can
be misleading.4 Pencina et al.5 argued that the reclassification table itself was problematic, at least as proposed by Cook, because it did not distinguish between
subjects with events (cases) and subjects without
events (controls). They suggested constructing separate event and non-event reclassification tables as
shown in the middle and bottom panels of Table 1.
In the event table, entries above the diagonal correspond to risk categories that
are higher with the expanded than with the baseline model, representing improved prediction for subjects with
events. Correspondingly, entries below the diagonal
represent worse prediction for them. The event-NRI
is the difference between the proportions of subjects
above vs below the diagonal in the event reclassification table. Using a similar logic, the non-event-NRI is
calculated from the non-event reclassification table by
taking the difference between the proportions of subjects below vs above the diagonal. The NRI summary
index that gained immediate popularity in the literature following Pencina’s paper is the sum,
NRI = event-NRI + non-event-NRI.
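The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the event and non-event counts as we read them from Table 1 (rows are baseline categories, columns are expanded-model categories); the function and variable names are our own and do not come from any of the cited papers.

```python
# Counts transcribed from Table 1. Rows = baseline category, columns = expanded
# category, both ordered 0-5%, 5-20%, >20%.
EVENTS = [
    [72, 38, 4],
    [21, 105, 114],
    [0, 33, 630],
]
NON_EVENTS = [
    [5486, 399, 21],
    [1015, 990, 272],
    [40, 296, 464],
]


def moved_up_and_down(table):
    """Return (number moved to a higher category, number moved lower, total)."""
    k = len(table)
    up = sum(table[i][j] for i in range(k) for j in range(k) if j > i)
    down = sum(table[i][j] for i in range(k) for j in range(k) if j < i)
    total = sum(sum(row) for row in table)
    return up, down, total


def event_nri(table):
    """Proportion of events moving up minus proportion moving down."""
    up, down, total = moved_up_and_down(table)
    return (up - down) / total


def non_event_nri(table):
    """Proportion of non-events moving down minus proportion moving up."""
    up, down, total = moved_up_and_down(table)
    return (down - up) / total


e_nri = event_nri(EVENTS)
ne_nri = non_event_nri(NON_EVENTS)
nri = e_nri + ne_nri
print(f"event-NRI = {e_nri:.1%}, non-event-NRI = {ne_nri:.1%}, NRI = {nri:.1%}")
```

The printed values should agree with those reported with Table 1, up to rounding in the last digit.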
A prerequisite for considering risk reclassification is
that the risk models are well calibrated, in the sense
that the observed event rates for subgroups defined by
the predictors in the models are close to the values
calculated from the models. A poorly calibrated risk
model is considered invalid for calculating risk as a
function of the modelled predictors. It is of great concern therefore that almost half of the papers reporting
risk reclassification results do not report assessment
of model calibration. A second basic premise for
considering risk reclassification is that the chosen
categories of risk are clinically meaningful in the
sense that changing risk categories has clinical consequences. The review indicates that only 27% of papers
provided justification for the particular risk categories
used. This is a very poor state of affairs.
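A very coarse version of the calibration check described above can be run on a reclassification table alone. The sketch below assumes the marginal totals as we read them from Table 1 and simply asks whether the observed event rate within each risk category falls inside that category's range. This is only a necessary condition: a proper assessment would compare observed and predicted risks within finer subgroups, which requires individual-level predictions that are not available here. All names are hypothetical.

```python
# Marginal totals transcribed from Table 1, ordered 0-5%, 5-20%, >20%.
CATEGORY_BOUNDS = [(0.00, 0.05), (0.05, 0.20), (0.20, 1.00)]
SUBJECTS = {"baseline": [6020, 2517, 1463], "expanded": [6634, 1861, 1505]}
EVENTS = {"baseline": [114, 240, 663], "expanded": [93, 176, 748]}


def observed_rates(n_subjects, n_events):
    """Observed event rate within each risk category."""
    return [e / n for e, n in zip(n_events, n_subjects)]


def within_bounds(rates, bounds):
    """True for each category whose observed rate lies inside its risk range."""
    return [lo <= r <= hi for r, (lo, hi) in zip(rates, bounds)]


for model in ("baseline", "expanded"):
    rates = observed_rates(SUBJECTS[model], EVENTS[model])
    ok = within_bounds(rates, CATEGORY_BOUNDS)
    summary = ", ".join(f"{r:.1%}" for r in rates)
    print(f"{model}: observed rates {summary}; all within range: {all(ok)}")
```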
Even if the risk models are valid and the risk categories chosen are clinically relevant, is NRI a good
way of summarizing improvement in risk reclassification performance? We do not find this single numeric
summary very enlightening. Calculated as 17.4% in
our example, the NRI seems to fall short of the task
of gauging whether or not a substantial improvement
has been obtained. Somewhat more revealing are its
components, event-NRI and non-event-NRI. If only two risk categories are involved, the event-NRI is
the increase in the proportion of subjects with events who are classified as high risk when the
expanded model replaces the baseline model and, correspondingly, the non-event-NRI is the increase
in the proportion of subjects without events
who are deemed at low risk. These are simple useful summaries. However, with more than two categories
the interpretations are far less appealing because all upward movements of risk category are counted
equally and all downward movements are counted equally. Yet, the clinical implications are usually not
equal. For example, moving from the lowest to the highest category typically has very different
consequences from moving from the lowest to the intermediate category.

Table 1 Illustration of risk reclassification tables

                            Expanded model
Baseline model            0–5%     5–20%      >20%     Total

All subjects (n = 10 000)
0–5%                      5558       437        25      6020
5–20%                     1036      1095       386      2517
>20%                        40       329      1094      1463
Total                     6634      1861      1505     10000

Events only (n = 1017)
0–5%                        72        38         4       114
5–20%                       21       105       114       240
>20%                         0        33       630       663
Total                       93       176       748      1017

Non-events only (n = 8983)
0–5%                      5486       399        21      5906
5–20%                     1015       990       272      2277
>20%                        40       296       464       800
Total                     6541      1685       757      8983

The original table proposed by Cook2 included all subjects (top
panel). Pencina et al.5 proposed separate tables for events and
non-events (middle and bottom panels). Event-NRI = 10.0%,
non-event-NRI = 7.4% and NRI = 17.4%.
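The two-category interpretation can be checked numerically. The sketch below assumes the event counts as we read them from Table 1 and collapses the three categories into two at the 20% threshold; the choice of threshold and all names are ours, for illustration only. With two categories, the event-NRI coincides with the increase in the proportion of events classified as high risk.

```python
# Event counts transcribed from Table 1. Rows = baseline category,
# columns = expanded category, ordered 0-5%, 5-20%, >20%.
EVENTS = [
    [72, 38, 4],
    [21, 105, 114],
    [0, 33, 630],
]


def collapse(table, cut):
    """Merge categories with index < cut into 'low' and the rest into 'high'."""
    k = len(table)
    out = [[0, 0], [0, 0]]
    for i in range(k):
        for j in range(k):
            out[int(i >= cut)][int(j >= cut)] += table[i][j]
    return out


two = collapse(EVENTS, cut=2)  # 'high' is the >20% category (our assumption)
total = sum(sum(row) for row in two)

# Event-NRI on the 2x2 table: proportion moving up minus proportion moving down.
event_nri_2 = (two[0][1] - two[1][0]) / total

# Proportion of events classified as high risk by each model.
high_baseline = sum(two[1]) / total
high_expanded = (two[0][1] + two[1][1]) / total

print(f"two-category event-NRI = {event_nri_2:.1%}")
print(f"increase in proportion high risk = {high_expanded - high_baseline:.1%}")
```

With three categories no such identity holds, because a move from the lowest to the highest category contributes exactly as much to the event-NRI as a move to the adjacent category.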
Perhaps a single numeric summary is not needed or
at least should not be the main focus of analysis. One
alternative suggestion is to report the net changes in
proportions of subjects classified in each of the risk
categories.6 These are (−2.1%, −6.3%, 8.4%) for subjects with events and (7.1%, −6.6%, −0.5%) for subjects without events in Table 1, ordered from the lowest to the highest risk category. In other words, of
subjects with events, 8.4% more are in the high-risk
category and 2.1% fewer are in the low-risk category,
whereas of subjects without events, 7.1% more are in
the low-risk category and 0.5% fewer are in the
high-risk category. These are simple summaries of
reclassification performance that seem more clinically
relevant than the NRI index of 17.4%.
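These net changes follow directly from the margins of the event and non-event tables. The sketch below is a minimal illustration, assuming the counts as we read them from Table 1; the function name is hypothetical. A positive value means the expanded model places more subjects in that category than the baseline model does, and a negative value means fewer.

```python
# Counts transcribed from Table 1. Rows = baseline category, columns = expanded
# category, both ordered 0-5%, 5-20%, >20%.
EVENTS = [
    [72, 38, 4],
    [21, 105, 114],
    [0, 33, 630],
]
NON_EVENTS = [
    [5486, 399, 21],
    [1015, 990, 272],
    [40, 296, 464],
]


def net_category_changes(table):
    """Change in the proportion of subjects in each category, expanded minus baseline."""
    k = len(table)
    total = sum(sum(row) for row in table)
    baseline = [sum(table[i]) for i in range(k)]  # row totals
    expanded = [sum(table[i][j] for i in range(k)) for j in range(k)]  # column totals
    return [(e - b) / total for b, e in zip(baseline, expanded)]


for label, table in (("events", EVENTS), ("non-events", NON_EVENTS)):
    changes = net_category_changes(table)
    print(label + ": " + ", ".join(f"{c:+.1%}" for c in changes))
```

Because every subject is in exactly one category under each model, the changes within each table sum to zero.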
Although risk reclassification analysis with the NRI
has taken off like wildfire in applications, it is not yet
a highly developed rigorous statistical technique.
Unfortunately, this point is not widely appreciated
and it is not acknowledged in the review. In particular, statis (...truncated)