The Case for Using the Repeatability Coefficient When Calculating Test–Retest Reliability
Andreou P (2013) The Case for Using the Repeatability Coefficient When Calculating Test-Retest
Reliability. PLoS ONE 8(9): e73990. doi:10.1371/journal.pone.0073990
Editor: Susanne Hempel
Anne Elizabeth Passmore
The use of standardised tools is an essential component of evidence-based practice. Reliance on standardised tools places demands on clinicians to understand their properties, strengths, and weaknesses in order to interpret results and make clinical decisions. This paper makes a case for clinicians to consider measurement error (ME) indices, the Coefficient of Repeatability (CR) or the Smallest Real Difference (SRD), over relative reliability coefficients such as Pearson's (r) and the Intraclass Correlation Coefficient (ICC), when selecting tools to measure change and when inferring that change is true. The authors present statistical methods that are part of the current approach to evaluating the test–retest reliability of assessment tools and outcome measurements. Selected examples from a previous test–retest study are used to elucidate the added advantages of knowing the ME of an assessment tool in clinical decision making. The CR is computed in the same units as the assessment tool and sets the boundary of the minimal detectable true change that can be measured by the tool.
Reliability and Test–Retest Reliability
Reliability refers to the reproducibility of measurements. Measurements are considered reliable if they are stable over time in stable subjects, show adequate levels of measurement variability, and are sensitive (precise) enough to detect the Minimum Clinically Important Difference (MCID) [2,3]. Test–retest reliability, or reproducibility, is a method of estimating a tool's reliability by administering it to the same person or group of people, in the same way, on two or more different occasions, hours or days apart. Test–retest reliability provides clinicians with assurance that the tool measures the outcome the same way, in a stable client, each time it is used. Better reproducibility suggests better precision of single measurements, which is a requirement for better tracking of changes in measurements in research or practice settings.
There are two necessary assumptions in test–retest reliability. The first is that the true score does not change between administrations. The second is that the time period between administrations is long enough to prevent learning, carry-over effects, or recall. An understanding of the stability or variability in the outcome being measured, and the characteristics of participants involved in the reliability study, should guide the time interval between administrations.
Perfect test–retest reliability scores are rare, as all instruments respond with some error. Thus, any observed score (O) can be assumed to comprise a true score (T) and an error component (E) [O = T + E]. Since it is impossible to know T, the true reliability of any test is not calculable. Reliability can be defined using the statistical concept of variance. It is expressed as the ratio of the variance of T to the variance of O. If the error component is large, then the ratio (reliability coefficient) is close to zero, but it is close to one if the error is relatively small. T is the measurement of a person's actual ability or status, while O is the score reading provided by the tool. For example, in the case of functional independence, irrespective of the assessment tool used, an assumption is made that the client has a true functional independence score which reflects his/her functional abilities when perfectly measured. In theory, the same T would be obtained if a client were assessed an infinite number of times. Clinically, it is neither practical nor possible to take infinite measurements; hence, it is impossible to know whether the observed score is in fact T. Practitioners make an assumption that a single observation on a client (O) is an accurate estimate of the client's T (i.e., O = T).
Test–retest reliability is concerned with the repeatability of
observations made on or by individuals. It is assumed that O is
an accurate measurement of T. When a standardised tool is
used to measure an outcome, clinicians rely on the published
testretest reliability coefficient of the tool to guide the
confidence in their results.
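The classical test theory relationship O = T + E, and reliability as a variance ratio, can be illustrated with a small simulation. All numbers here are hypothetical (a true-score SD of 10 and error SD of 5, giving a theoretical reliability of 100/125 = 0.8), chosen only to show how a larger error component drives the ratio toward zero:

```python
# Sketch of classical test theory: O = T + E, with hypothetical SDs.
import random

random.seed(1)
n = 10_000
true = [random.gauss(50, 10) for _ in range(n)]     # T: latent ability, SD = 10
obs = [t + random.gauss(0, 5) for t in true]        # O = T + E, error SD = 5

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Reliability = var(T) / var(O); theoretical value 100 / (100 + 25) = 0.8
reliability = var(true) / var(obs)
print(round(reliability, 2))
```

The estimate hovers near the theoretical 0.8; shrinking the error SD pushes it toward 1, and inflating it pushes it toward 0.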
Quantifying Test–Retest Reliability
Relative reliability. Test–retest reliability can be estimated using relative and absolute indices. Relative reliability estimates concern the consistency, or association, of the position of individuals in a group relative to others. Pearson's Product-Moment Correlation coefficient [Pearson's (r)] and the Intraclass Correlation Coefficient (ICC) are the commonly used relative reliability indices. These correlations quantify the direction (+/−) and the strength of the relationship between test–retest scores by estimating their linear relationship, and lie between +1 and −1. Perfect correlation is one special case of this, but r = +1 is not necessarily an indication of complete agreement (interchangeability) between test–retest scores. The correlation coefficient is a reflection of how closely a set of paired observations (test–retest data in this case) follows a straight line, regardless of the slope of the line. For example, Figure 1 shows two fictional sets of data (black and red circles) which both exhibit a similar linear relationship. The line of best fit is the solid line in the graph and is the same for both datasets, but the black circles sit much closer to the line than the red circles, leading to a much higher correlation coefficient (r = 0.99 and 0.84, respectively). Neither set of circles is on the line of complete agreement (represented by the dashed line in the graph). The difference between correlation and agreement has been eloquently described by Bland & Altman (1999), and Figure 1 is a fictional example to demonstrate this concept. Other authors have also remarked on the limitations of the correlation coefficient as the sole index of test–retest reliability.
The main drawback of the Pearson's (r) value is that it does not provide clinicians with any insight into systematic errors that may be inherent in the measurement obtained with a specific assessment tool. For example, as shown in the hypothetical data set in Figure 1, Pearson's (r) gives a very high value of 0.99 for the black circles despite the divergence of the measurements from the line of agreement. Clinicians may mistake this excellent correlation for complete agreement between the scores, which is clearly not the case.
Although it is still not a measure of absolute agreement, the Intraclass Correlation Coefficient (ICC) is often reported in place of Pearson's (r). The ICC is frequently used to calculate the correlation between more than two sets of measurements (typically in the case of more than two clinicians completing an assessment on a set of individuals) [2,10].
Unlike Pearson's (r), the ICC accounts for both consistency of performances from test to retest (within-subject change) and change in the average performance of participants as a group over time (i.e., systematic change in the mean) [2,10,12]. There are numerous versions of the ICC, each appropriate to specific research design situations. Both Pearson's (r) and the ICC are influenced by how similarly (or differently) participants in the research study scored to each other on the outcome being measured (i.e., consistency in the research participants' scores). All else being equal, the more similarly participants score to each other as a group (i.e., the more homogeneous the group), the smaller the magnitude of Pearson's (r) and the ICC. The magnitude of both Pearson's (r) and the ICC is also influenced by outlier scores. When reading reliability studies, and before selecting a tool for use, it is therefore important that practitioners critically review the characteristics of the research participants involved in the reliability estimation study. For example, cognitive function scores of people with advanced Alzheimer's disease will be more similar to each other than those of a group of people with various neurological conditions at various times since diagnosis. Thus, practitioners need to carefully ensure that the tool selected for use has been tested on a sample group with similar characteristics.
Absolute reliability. Absolute reliability is concerned with variability due to random error. Consequently, an absolute reliability index is affected by the degree to which measurements vary, with the premise being that the less the variability, the higher the reliability. For example, in the case of goniometry, the margin of error is generally accepted to be 5 degrees for measurement of a joint's Range of Movement (ROM) in the hand, provided the measurements are taken by the same examiner using standardised techniques. This means that while using a goniometer for hand ROM, scores that differ by more than 5 degrees can be considered to reflect a real change.
The repeatability coefficient (CR), also referred to as the Smallest Real Difference (SRD), is a useful index that quantifies absolute reliability (ME) in the same units as the measurement tool [2,10,11]. The CR of a tool is directly related to the 95% Limits of Agreement (LOA) proposed by Bland and Altman (Figures 2 and 3), which contain 95% of the differences between repeated measurements on the same subjects [2,10,11].
The CR is the value below which the absolute difference between two measurements would lie with 0.95 probability [16,17]. It is calculated by multiplying the within-subject standard deviation (Sw), or the Standard Error of Measurement (SEM), by 2.77 (√2 × 1.96). Thus, CR = 2.77 Sw [2,11,17]. Both random and systematic errors are taken into account in the CR score. For example, in the case of goniometry, clinicians can be 95% confident that a 10-degree change in hand ROM represents at least 5 degrees of true change (because a goniometer has an established ME of ±5 degrees). Because the CR is quantified in the same units as the assessment tool, it lends itself to easy clinical interpretation, and can be used to guide decision making with individual clients on a day-to-day basis.
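Assuming paired test–retest scores, the CR = 2.77 Sw calculation and the Bland–Altman 95% limits of agreement can be sketched as below. The eight score pairs are invented for illustration; they are not the goniometry or SSRS-SSF data.

```python
# CR = 2.77 * Sw and Bland-Altman 95% LOA from hypothetical paired scores.
import math
import statistics

t1 = [14, 16, 12, 18, 15, 13, 17, 16]   # test occasion (invented)
t2 = [15, 15, 13, 17, 16, 12, 18, 17]   # retest occasion (invented)

diffs = [b - a for a, b in zip(t1, t2)]
bias = statistics.mean(diffs)            # systematic change in the mean
sd_diff = statistics.stdev(diffs)        # SD of the paired differences

sw = sd_diff / math.sqrt(2)              # within-subject SD (Sw = SEM here)
cr = 2.77 * sw                           # equivalently ~1.96 * sd_diff
loa = (bias - 1.96 * sd_diff, bias + 1.96 * sd_diff)

print(f"bias={bias:.2f}  CR=±{cr:.2f}  LOA={loa[0]:.2f} to {loa[1]:.2f}")
```

Because the CR comes out in the tool's own units, a clinician can read it directly: an observed change smaller than the CR cannot be distinguished from measurement error.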
To further test the case for the CR over Pearson's (r) and the ICC when selecting outcome tools to measure change and inferring that change is true, this paper uses test–retest data from a previous study using the Social Skills Rating System (SSRS-SSF).
Materials and Methods
A 4-week test–retest design was used. The secondary-level student version of the SSRS-SSF was administered to 187 Year 7 students (mean age = 12 years 3 months, SD = 3.93 months) from five randomly selected schools in metropolitan Perth, Western Australia. Detailed information on the present study's methodology and results has been published elsewhere. This study is based on secondary data analysis of a prior submission entitled "Internal consistency, test–retest reliability and measurement error of the self-report version of the Social Skills Rating System in a sample of Australian adolescents". Informed written consent was obtained from school principals, parents, and involved students. In situations where a student declined to participate, even with parental consent, they were not included. Students were made aware that they were not obliged to participate in the study, and were free to withdraw from the study at any time without justification or prejudice. At all stages, the study conformed to the approved National Health and Medical Research Council Ethics Guidelines. Full ethics approval was obtained from the Curtin University Health Research Ethics Committee (reference number HR 194/2005).
Analyses were undertaken using the SPSS version 17 and SAS version 9.2 software packages. Test–retest reliability estimates, such as Pearson's correlation coefficient (r), the two-way random effects model ICC (2,1), and the CR, were computed using standard formulae [2,11,17].
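For readers who wish to reproduce the ICC (2,1) (two-way random effects, absolute agreement, single measures, per Shrout and Fleiss), a hand-rolled sketch of the standard mean-squares formula follows. The 6 × 2 data matrix is hypothetical, not the SSRS-SSF scores.

```python
# ICC(2,1): two-way random effects, absolute agreement, single measures.
# ICC = (MSR - MSE) / (MSR + (k-1)*MSE + k*(MSC - MSE)/n)
def icc_2_1(scores):                      # scores: list of [time1, time2] rows
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_m = [sum(r) / k for r in scores]                      # subject means
    col_m = [sum(r[j] for r in scores) / n for j in range(k)] # occasion means
    msr = k * sum((m - grand) ** 2 for m in row_m) / (n - 1)  # between subjects
    msc = n * sum((m - grand) ** 2 for m in col_m) / (k - 1)  # between occasions
    sse = sum((scores[i][j] - row_m[i] - col_m[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))                           # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

data = [[12, 13], [15, 14], [18, 17], [10, 12], [16, 16], [14, 15]]  # invented
print(round(icc_2_1(data), 2))
```

Because the MSC term (systematic change between occasions) sits in the denominator, this "absolute agreement" form is penalised by a drift in the group mean, which a Pearson's r would ignore.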
Indices of relative reliability
For purposes of illustration, test–retest estimates from the empathy subscale for girls and the assertion subscale for boys are discussed. These subscales were chosen for purposes of graphical emphasis, as participants' mean scores differed significantly across administrations.
As shown in Table 1, the 4-week ICC of the empathy subscale (for girls) was 0.55 (Pearson's r = 0.55, n = 92), while that of the assertion subscale (for boys) was 0.79 (Pearson's r = 0.80, n = 84). If we refer to the thresholds suggested by Vincent that are typically recommended for individual decision-making in a clinical setting, the social skills scale and subscale test–retest indices were too low to permit reliable use of the SSRS-SSF in practice. However, if we refer to the lower thresholds typically referenced by health professionals, the assertion subscale test–retest reliability in boys would be considered excellent (r = 0.78), while the empathy subscale test–retest reliability for Year 7 girls would fall into the fair to good category (r = 0.54).
Indices of absolute reliability
As shown in Table 1, as a group, boys had higher assertion scores on retest (mean bias = 0.66, p = 0.005), despite the expectation that there should be no significant change in assertion scores over 4 weeks (see the Discussion section). The repeatability coefficient (CR) of the assertion subscale for Year 7 boys was ±4.21 [2,11]. The observed bias on the assertion subscale (0.66 units) was within the subscale's ME (±4.21). In the case of girls, there was no statistical change in mean empathy scores over time (mean bias = −0.38; p = 0.06), and the ME of the empathy subscale was ±3.81 units.
Figure 3. Bland and Altman difference plots using girls' Time 1 and Time 2 empathy frequency scores on the SSRS-SSF.

Table 1. Relative and absolute reliability indices.

Subscale | n | Time 1 Mean (SD) | Time 2 Mean (SD) | — | Pearson's r | ICC(2,1)
Assertion (boys) | 84 | 13.24 (3.11) | 13.90 (3.10) | 0.89 | 0.78 | 0.77
— | 74 | 12.86 (3.07) | 13.27 (3.07) | 0.84 | 0.72 | 0.72
— | 98 | 14.44 (2.95) | 13.95 (3.06) | 0.78 | 0.62 | 0.62
Empathy (girls) | 92 | 16.66 (1.93) | 16.27 (2.04) | 0.71 | 0.54 | 0.53

ICC(2,1) = Intraclass correlation coefficient: two-way random effects model (absolute agreement definition).
95% LOA LB (95% CI of the LOA) = Bland and Altman 95% limits of agreement, lower boundary (95% confidence intervals of the limits of agreement).
95% LOA UB (95% CI of the LOA) = Bland and Altman 95% limits of agreement, upper boundary (95% confidence intervals of the limits of agreement).
CR = 2.77 × SEM.

As presented in Table 1, the 4-week ICC for the empathy subscale (for girls) was 0.53, while that of the assertion subscale (for boys) was 0.77. This means that 53% of variance
in the observed empathy scores is attributable to variance in the true score, after adjustment for any real change over time or inconsistency in subject responses over time. The remaining 47% of the observed score variation at either Time 1 or Time 2 represents error, if we assume that no real change would have occurred in the outcome over this short time period. In the above example, based on the magnitude of the ICC of the empathy subscale, a clinician would be cautious in using the SSRS-SSF to measure change in empathy skills in another Year 7 Australian student, due to low confidence that empathy scores on reassessment reflect baseline scores. Because Pearson's (r) and the ICC are expressed in scale format and not the same units of measurement as the tool, they have limited clinical applicability beyond highlighting the psychometric rigor with which a tool measures test–retest reliability.
Staying with the above example, if one looks closely at the SDs of the empathy and assertion subscales, we note that girls' empathy scores were less spread around the mean (M1 = 16.66, SD1 = 1.93; M2 = 16.27, SD2 = 2.04) when compared to boys' assertion scores (M1 = 13.24, SD1 = 3.11; M2 = 13.90, SD2 = 3.10). So, as a group, girls scored more homogeneously on empathy than boys did on assertion behaviours (which were more heterogeneous). The wider spread of boys' scores on the assertion subscale resulted in the magnitude of Pearson's (r) and the ICC being greater [2,12,21]. Even a high value does not guarantee that test–retest scores are interchangeable. Practitioners should therefore be extremely judicious in selecting an outcome tool based on reported test–retest scores, as these could be misleading. Additionally, Pearson's (r) and the ICC do not quantify the unaccounted variation in scores in the measurement scale of the outcome measure (i.e., they do not explain the unaccounted 47% of empathy score variation in the measurement units of the empathy subscale), so the clinical interpretation of their scores is limited.
We computed the Coefficient of Repeatability (CR), or the Smallest Real Difference (SRD), to index the measurement error, or the smallest possible change in subscale and total social skills scale scores that represents true/real change [2,10,11,17]. The CR accounts for both random and systematic error in its scores. In the above example, the CR of the empathy subscale for girls was therefore smaller (±3.81) than that for boys on the assertion subscale (±4.21). Based on the CR, a practitioner using the SSRS-SSF empathy subscale with a Year 7 girl in Australia would need to see a change of at least 3.81 units at re-assessment to be 95% confident that the girl had, in fact, benefited from the intervention. A change of less than 3.81 might simply be due to the inherent measurement inaccuracy of the empathy subscale, which is unable to reliably detect a change of less than 3.81 units. The above example demonstrates the advantage of considering ME, as computed by the CR, over Pearson's (r) or the ICC.
A statistically significant difference in mean scores as a marker of change (as measured by the t-statistic) needs to be interpreted in light of the tool's rather large ME. Based on the t-statistic presented in Table 1, boys were found to be more assertive on retest (mean bias = 0.66; p = 0.005). The interpretation is that there was evidence of an increase in this scale from baseline to the 4-week retest, and this increase was statistically significantly different from zero. The CR (±4.21) means that any individual boy is expected to give readings on this scale which are within ±4.21 units of this bias. The bias would always need to be interpreted in terms of its clinical significance, and it may be that a difference of this amount, being small relative to the range of values expected for the individual, is of limited interest. In terms of implications, the statement of a tool's ME assists the researcher who wishes to develop an intervention addressing the tool's outcome. The tool's ME is essential in order to calculate the sample size required to demonstrate an effect of an intervention that is intended to reach some clinically relevant difference.
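As a sketch of this sample-size use of the ME, one common approximation recovers the SD of the paired differences from the CR (sd_diff ≈ CR / 1.96) and applies the usual normal-approximation formula for a paired pre–post comparison. The 2-unit target change below is an assumed example, not an established MCID for the assertion subscale.

```python
# Paired-design sample size from a tool's CR (normal approximation).
# n = (z_alpha + z_beta)^2 * sd_diff^2 / delta^2, with sd_diff = CR / 1.96.
import math

def n_paired(cr, delta, z_alpha=1.96, z_beta=0.84):   # 5% two-sided, 80% power
    sd_diff = cr / 1.96                   # SD of paired differences implied by CR
    return math.ceil(((z_alpha + z_beta) ** 2) * sd_diff ** 2 / delta ** 2)

# Hypothetical target: detect a 2-unit change using the assertion CR of 4.21.
print(n_paired(cr=4.21, delta=2.0))
```

The required n grows with the square of the ME, which is why a noisy tool makes an intervention study disproportionately expensive.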
To date, there exists no consensus on what the acceptable value of a correlation coefficient ought to be to inform tool selection [4,12]. Tool developers often cite Shrout and Fleiss's study on reliability to support claims that a clinically acceptable correlation is 0.75 or 0.80 or greater. Shrout and Fleiss's categorisation is critiqued in the sports sciences and medicine because they did not assess the utility of the recommended correlations. It has been suggested that, as a general rule, a value over 0.90 should be considered high, between 0.80 and 0.90 moderate, and 0.80 and below insufficient when using an instrument for individual decision-making. Whilst such a conservative stance has been adopted in the sports and medical sciences, it seems that sociological and behavioural scientists use lower relative reliability thresholds. As highlighted by the illustrations in this paper, even a high correlation value between test–retest scores could be misleading. Best practice guidelines in the measurement literature recommend use of both relative and absolute reliability (ME) indices [2,10,12].
Unlike relative reliability indices, to date there is no formulaic approach to benchmarking ME. This means that there exists no statistical method to decide whether an ME of ±4.21, in relation to the range of scores on the assertion subscale (range = 0–20 units), is wide or small. Thus, although the ME sets the boundaries of the minimal detectable true change of an outcome measure, it holds limited clinical importance beyond that function.
The ME helps clinicians decide, on a best practice level, whether the observed change in a client's performance is true. The value of the ME does not provide insight into the critical clinical question of "How large should a change in an outcome be, to be deemed clinically important (i.e., to have an impact on patient care)?" The latter is determined by a statistic called the Minimum Clinically Important Difference (MCID) of a tool. The MCID is related to responsiveness, or the ability of a tool to detect clinically relevant changes over time, and is decided on clinical grounds (not based on statistical analysis). An outcome measurement that shows high ME (i.e., variability within stable subjects) would be considered to have poor responsiveness. In that regard, reproducibility (test–retest reliability) is a necessary condition of responsiveness.
It is vital that the ME of an outcome tool be corroborated against its MCID before clinicians decide to use the tool to measure change in an intervention study. For example, Schuling et al. reported no changes in Sickness Impact Profile (SIP) scores during the first 6 months post-stroke, but during the same period the Barthel Activities of Daily Living scores changed significantly. The authors concluded that the SIP was not responsive (sensitive) enough to detect modest improvement in a consecutive cohort of acute stroke patients. In practice, if the ME of an outcome measure is wider than its MCID (i.e., CR > MCID), it is likely that the outcome measure will mask functional change [2,11] and inaccurately report that there is no true change in outcome when in fact the intervention had been effective (i.e., the occurrence of a Type II error, or false negative).
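The CR-versus-MCID screen described above reduces to a simple comparison. The MCID values below are invented for illustration, since no MCID for the SSRS-SSF is established in this paper.

```python
# Screening a tool before an intervention study: a tool can only credibly
# detect its MCID if its measurement error (CR) fits inside the MCID.
def change_is_interpretable(cr, mcid):
    """True when CR <= MCID, i.e. real clinically important change
    cannot be masked by measurement error (avoiding a Type II error)."""
    return cr <= mcid

print(change_is_interpretable(cr=4.21, mcid=6.0))   # ME sits inside the MCID
print(change_is_interpretable(cr=4.21, mcid=3.0))   # CR > MCID: change is masked
```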
Author Contributions

Conceived and designed the experiments: SV RP AEP. Performed the experiments: SV. Analyzed the data: SV RP PA. Contributed reagents/materials/analysis tools: SV AEP. Wrote the manuscript: SV RP TF AEP PA. Critically reviewed submission: TF RP.
References

1. Portney LG, Watkins MP (2000) Foundations of clinical research: Applications to practice. Upper Saddle River, NJ: Prentice Hall.
2. Lexell JE, Downham DY (2005) How to assess the reliability of measurements in rehabilitation. Am J Phys Med Rehabil 84: 719-723. doi:10.1097/01.phm.0000176452.17771.20. PubMed: 16141752.
3. Rothstein JM (1985) Measurement and clinical practice: Theory and application. In: JM Rothstein. Measurement in Physical Therapy. New York: Churchill Livingstone.
4. Hopkins WG (2000) Measures of reliability in sports medicine and science. Sports Med 30: 1-15. doi:10.2165/00007256-200030010-00001. PubMed: 10907753.
5. Allen MJ, Yen WM (1979) Introduction to measurement theory. Monterey (CA): Brooks/Cole.
6. Bruton A, Conway JH, Holgate ST (2000) Reliability: What is it and how is it measured? Physiotherapy 86: 94-99. doi:10.1016/S0031-9406(05)61211-4.
7. Deusen JV, Brunt D (1997) Assessment in Occupational Therapy and Physical Therapy. Philadelphia, Pennsylvania: W. B. Saunders Company.
8. Baumgarter TA (1989) Norm-referenced measurement: Reliability. In: MJ Safrit, TM Wood. Measurement concepts in physical education and exercise science. Champaign (IL): Human Kinetics. pp. 45-72.
9. Bland JM, Altman DG (1996) Statistics Notes: Measurement error and correlation coefficients. BMJ 313: 41-42. doi:10.1136/bmj.313.7048.41. PubMed: 8664775.
10. Bland JM, Altman DG (2003) Applying the right statistics: Analyses of measurement studies. Ultrasound Obstet Gynecol 22: 85-93. doi:10.1002/uog.122. PubMed: 12858311.
11. Beckerman H, Roebroeck ME, Lankhorst GJ, Becher JG, Bezemer PD et al. (2001) Smallest real difference: A link between reproducibility and responsiveness. Qual Life Res 10: 571-578. doi:10.1023/A:1013138911638. PubMed: 11822790.
12. Atkinson G, Nevill AM (1998) Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Med 26: 217-238. doi:10.2165/00007256-199826040-00002. PubMed: 9820922.
13. Shrout PE, Fleiss JL (1979) Intraclass correlations: Uses in assessing rater reliability. Psychol Bull 2: 420-428. PubMed: 18839484.
14. Bear-Lehman J, Abreu BC (1989) Evaluating the hand: Issues in reliability and validity. Physical Therapy 69: 1025-1033. PubMed: 2685841.
15. Vaz S, Parsons R, Passmore AE, Andreou P, Falkmer T (2013) Internal consistency, test-retest reliability and measurement error of the self-report version of the Social Skills Rating System in a sample of Australian adolescents. PLOS ONE.
16. British Standards Institution (1979) Precision of test methods 1: Guide for the determination and reproducibility of a standard test method (BS 5497, part 1). London: BSI.
17. Bland JM (2000) An Introduction to Medical Statistics. Oxford: Oxford University Press.
18. National Health and Medical Research Council [NHMRC] (2005) Human research ethics handbook: A research law collection.
19. Vincent WJ (1999) Statistics in Kinesiology. Champaign, IL: Human Kinetics.
20. Rankin G, Stokes M (1998) Reliability of assessment tools in rehabilitation: An illustration of appropriate statistical analyses. Clin Rehabil 12: 187-199. doi:10.1191/026921598672178340. PubMed: 9688034.
21. Bland JM, Altman DG (1999) Measuring agreement in method comparison studies. Stat Methods Med Res 8: 135-160. doi:10.1191/096228099673819272. PubMed: 10501650.
22. Meyers LS, Gamst G, Guarino AJ (2006) Applied multivariate research: Design and implication. CA: Sage Publications, Inc.
23. Guyatt G, Walter S, Norman G (1987) Measuring change over time: Assessing the usefulness of evaluative instruments. J Chronic Dis 40: 171-178. doi:10.1016/0021-9681(87)90069-5. PubMed: 3818871.
24. Schuling J, Greidanus J, Jong BM-D (1993) Measuring functional status of stroke patients with the Sickness Impact Profile. Disabil Rehabil 15: 19-23. doi:10.3109/09638289309165864. PubMed: 8431587.