The sensitivity of TIMSS country rankings in science achievement to differences in opportunity to learn at classroom level
Daus and Braeken Large-scale Assess Educ
Background: Fair comparisons of educational systems in large-scale assessments can be made only if the differences in curricula have little impact on the outcomes. This study investigated the sensitivity of science achievement rankings to varying degrees of curriculum implementation in the Trends in International Mathematics and Science Study (TIMSS). Methods: Country-specific teacher-reported curriculum implementation profiles across the TIMSS science domains were charted, including their within-country variability across the classrooms, for 33 participating countries of TIMSS 2015. A sensitivity test compared the original ranking to TIMSS curriculum implementation scenarios (a least-possible, a most-possible, and a more realistic country-specific median implementation). Results: In contrast to expectations, no support was found for a positive relationship between opportunity to learn and science achievement at the between-country level or the within-country level, with only minor exceptions. The sensitivity analysis under different curriculum implementation scenarios also suggests little impact on the rank order of the countries. Conclusions: Plausible explanations for this null finding are addressed; attention and research efforts should focus on improving the quality of curriculum implementation indicators in large-scale assessments.
Curriculum implementation; Country rankings; TIMSS; Science achievement
The recent move by Norway to shift its tested population on the Trends in International
Mathematics and Science Study (TIMSS) 2015 from grade 4 to grade 5 and from grade
8 to grade 9 might seem a bit surprising. Since most of the participating countries test
their eighth-grade pupils, why does Norway want its tested population to be
out-of-grade? Norway justifies this move by noting that the Norwegian first grade corresponds
to pre-school in most other countries. This means that, in terms of years of schooling,
the Norwegian ninth grade might be more comparable to the TIMSS eighth-grade target
population than Norwegian eighth graders would be.
© The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium,
provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and
indicate if changes were made.
As the International Association for the Evaluation of Educational Achievement (IEA)
originally intended to use the world as a big educational laboratory (Husén 1973, as cited
in Comber and Keeves 1973), its large-scale assessments were deeply rooted in a need
for comparisons on equal and fair terms. Researchers and policy-makers have adhered
to this principle when using international large-scale assessments such as the IEA’s
TIMSS to compare educational systems. Hence, the assessment framework in TIMSS
is centered around a shared curriculum across the participating countries.
From this perspective, curriculum implementation, focus, and sequencing would be
crucial for valid and contextualized interpretations of correlations between educational
inputs and outcomes.
In the late 1960s, the IEA established an influential interpretation of curriculum
alignment that considers the intended, implemented, and attained curriculum.
The intended curriculum is obtained from the national
standards, the implemented curriculum is obtained from teachers at the classroom level,
and the attained curriculum is obtained from the pupils’ achievement data. Up until
the Third International Mathematics and Science Study (1995), a vast amount of
information on curriculum alignment was collected. Although less attention has been
given to collecting such information in the recent TIMSS cycles, such information is
still collected and remains relevant with today’s attention toward country comparisons
and rankings. A particular concern within curriculum alignment research is whether the
pupils being tested have had opportunities to learn the tested material, which remains a
challenge in international educational surveys.
With more than 40 countries participating in TIMSS, it should come as no surprise
that most countries deviate from the commonly agreed-upon curriculum-based
assessment framework. For instance, only half of the participating countries have covered
reproduction, heredity and genetics, and human health by grade 8
(Mullis et al. 2016). These country-specific deviations are almost guaranteed when there is an attempt
to merge the curricula of the participating countries into the framework, while ensuring
that the framework’s two-dimensional content-by-cognitive-demand blueprint matrix
is filled with enough valid and reliable items. This raises the question of
to what extent such country-specific opportunity to learn deviations impact the country’s
achievement scores and rankings, which are used by educational policy-makers and often
reach the news headlines.
Hencke et al. (2009) investigated what would happen to the TIMSS 2003 achievement
scores in mathematics when accounting for which items had, and had not, been covered
in the respective country’s intended curriculum. The countries’ mathematics
achievement scores were recomputed based only on the items listed as covered for a country,
and subsequently correlated with the original achievement scores. Repeating this
procedure for each country’s list of covered items showed that these correlations between
the original mathematics scores and the intended-curriculum adjusted mathematics
scores were very high. The authors concluded that “even if countries had selected the
items covered in their intended curriculums, we would have found no statistically
significant effects across the countries’ international standings” (p. 111). This robustness of
the achievement country rankings might not come as a total surprise as most items are
developed and assembled after being approved by the participating countries, resulting
in a relatively large common denominator in the item pool. However, some caution
is warranted, as there are some clear limitations in the curriculum indicator used to
operationalize coverage of the item content.
Coarse‑grained intended curriculum information
When Hencke et al. recomputed the country scores, they based their analysis on the
intended curriculum information from the TIMSS curriculum matching analysis
(TCMA). The TCMA intended curriculum data is completed by each country’s National
Research Coordinator for TIMSS who must struggle with coarse-grained curriculum
information. For instance, regarding TIMSS 2015, only 9 of 40 countries had a
nationally-specified intended science curriculum for grade eight, or a grade range that ended
in grade eight (see Table 1, the “intended science curriculum grade range” [ICGR]
variable), whereas the test was conducted at the end of grade eight
(Mullis et al. 2016).
Moreover, it is important to note that even those countries with a national curriculum exhibit
wide variation in the level of prescription, ranging from a very detailed and prescribed
curriculum in countries like England, to a much higher-level and less detailed national
curriculum as in Australia. Consequently, in most of the countries involved, the data on
whether the national curriculum covered an item in the period leading up to the
assessment relied on expert judgement or textbook analyses, generalized to the entire country.
Differences in educational systems
Focusing on life science, Matsubara et al. (2016) compared the fourth-grade intended
curriculum of Japan with that of the international average in TIMSS 2011, and related
the findings to the relevant percent correct for the items. They then proposed changes
to the Japanese science curriculum. This is a reasonable approach in Japan, which has
a relatively centralized system with statewide-prescribed learning objectives,
instructional methods, and materials for science and mathematics, as well as specified
learning objectives for each grade (1–2, 3, 4, 5, 6, 7, and 8). Yet, 32 of the 56 participants
for fourth grade in TIMSS 2015 reported a lack of statewide-prescribed instructional
methods and materials in science (Mullis et al. 2016). In countries where there is more
autonomy in the educational system, instructional materials such as textbooks will vary
across authors and schools, and not all teachers will implement the intended curriculum
to the same extent.
To supplement the perspective offered by the system-level intended curriculum
indicator, we propose to move to a class-level implemented curriculum indicator.
Opportunity to learn as measured at the implementation level has usually included whether the
content was taught and how much it was covered, typically in terms of percentage of
class time. Some authors have attempted to include cognitive aspects and the quality of
instruction as well. However, such expansions of the construct risk crossing into
(Scheerens 2016, p. 20), in itself a large construct. Although opportunity
to learn is intuitively expected to have a relatively strong association with pupil
achievement, studies have not investigated how sensitive country-level scores and rankings are
to differences in this classroom-level opportunity to learn indicator.
The purpose of this paper is thus to investigate how sensitive the country achievement
scores and rankings are to opportunity to learn differences at the classroom level. We
chose the science component of TIMSS 2015 as a case study. There are generally many
more studies involving mathematics (or language) as outcome, some
of which have found a significant relationship between the implemented curriculum and
achievement within and between many countries in the mathematics data of TIMSS
1995, 2011 and 2015 (e.g. Luyten 2016; Schmidt et al. 2001, 2015). The lack of studies
in science suggests that science might be a less well-behaved subject to investigate.
Furthermore, whereas curriculum topics in mathematics can be considered relatively
“universal”, certain curriculum topics in science might be taught or omitted conditional on
the available natural resources, topography, or climate in a specific country. We begin by
charting the country-specific opportunity to learn profiles across the TIMSS 2015
science domains and their variability across the classrooms. We then investigate, between
and within countries, how achievement and opportunity to learn relate. Finally, we
conduct a sensitivity test to verify the robustness of TIMSS science country rankings when
considering different opportunity to learn profiles.
The TIMSS 2015 science data for grade 8 (or equivalent) were analyzed, excluding
benchmarking educational systems and countries with more than 50% missing values
on the curriculum information predictor variable for the overall subject and the
content domains. Many missing responses could be due to the teachers in that country
not being presented with the questions, as was the case with the Russian Federation
and Kazakhstan. Thus, 33 out of 40 countries were included. Table 1 shows the
country ISO-alpha codes used in subsequent tables and figures, the sample sizes of schools,
teachers, classes, and pupils across countries, whether it is included in the analysis, and
the intended science curriculum grade range (ICGR). In the TIMSS sampling design,
schools were randomly sampled, and entire classes with teachers were sampled within schools.
The TIMSS science assessment framework’s two-dimensional blueprint consists of a
cognitive dimension that includes knowing, applying, and reasoning; and a content
dimension that includes biology, chemistry, earth science, and physics. The latter four
content domains are further divided into a total of 18 topics (e.g., Ecosystems, Light and
Sound, or Chemical Change).
Opportunity to learn in the classroom was operationalized through a TIMSS
implemented curriculum score (TICS). TIMSS contains teacher responses on which of the
18 science topics the class has covered earlier than the present year, during the present
year, or not yet or just introduced. The teacher responses to whether and when each of
the topics was taught were dummy coded into 1 (taught this year or taught before this
year) and 0 (not yet taught or just introduced). Two topics were surveyed by an indicator
pair, and the two indicators were consequently averaged. To treat classes with multiple
and single science teachers alike, we identified the maximum value for each topic across
the pupil’s teachers. The final measure (the TICS) was obtained by averaging across
topics (within a domain, for a domain TICS) for each pupil. The TICS represents a coverage
ratio (0–1), where 0 indicates that none of the content topics that the TIMSS items relate
to were covered by the teacher in class and 1 implies that all the content topics were
covered. The same interpretation holds for the science domains, which vary in their number
of implemented curriculum indicators: biology (7), chemistry (6), earth science (4), and physics.
TICS was negatively skewed, so suitable robust statistics for central tendency and
spread of skewed variables, such as the median (Mdn), the median absolute deviation
(MAD), and absolute range (range = max − min), were used in descriptive statistics.
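The construction of the TICS and the robust descriptives described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the response codes (0 = taught before this year, 1 = taught this year, 2 = not yet taught or just introduced), the function names, and the `pairs` argument for the doubly surveyed topics are all hypothetical.

```python
import numpy as np

def dummy_code(responses):
    """Return 1 where a topic was taught (this year or earlier), else 0.

    Assumed codes: 0 = taught before this year, 1 = taught this year,
    2 = not yet taught or just introduced.
    """
    responses = np.asarray(responses)
    return (responses != 2).astype(float)

def class_tics(responses, pairs=()):
    """Coverage ratio (0-1) for one class.

    responses: 2-D array, rows = the class's science teachers,
               columns = topic indicators.
    pairs: tuples of column indices that survey the same topic;
           each pair is averaged into a single topic score.
    """
    coded = dummy_code(responses)
    paired = {j for pair in pairs for j in pair}
    # One column per topic: single indicators as-is, indicator pairs averaged.
    topics = [coded[:, j] for j in range(coded.shape[1]) if j not in paired]
    topics += [coded[:, list(pair)].mean(axis=1) for pair in pairs]
    topics = np.column_stack(topics)
    # Maximum across teachers for each topic, then mean across topics.
    return topics.max(axis=0).mean()

def mad(x):
    """Unscaled median absolute deviation from the median."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))
```

Taking the maximum across teachers before averaging means a topic counts as covered if any of the class's science teachers reports having taught it, which is how classes with one and with several teachers are put on the same scale.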
To ensure comparability with the international reports, we followed the design-based
statistical inference approach using plausible-value estimation of the science
achievement and science domain achievement measures accounting for TIMSS sampling design
features through total pupil sampling weight in combination with replicate weights to
obtain proper standard errors. Two models were fitted for each of the science domains
(including science overall). As a baseline reference, an unconditional multigroup model
was fitted to the TIMSS science achievement plausible values that reproduced the
country rankings of the international TIMSS report. A conditional multigroup model, with
science achievement regressed upon TICS, was used to investigate the impact of
opportunity to learn.
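The design-based approach described above combines two sources of uncertainty: sampling (through replicate weights) and measurement (through plausible values). The sketch below shows one common way to combine them for a weighted mean. It is an illustration only: the function names are hypothetical, the `scale` multiplier depends on the replication scheme and is left as an explicit assumption, and the analyses reported here were run in Mplus rather than with this code.

```python
import numpy as np

def weighted_mean(y, w):
    """Weighted mean of y with weights w."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    return np.sum(w * y) / np.sum(w)

def pv_mean_with_se(pvs, w, rep_w, scale=1.0):
    """Point estimate and standard error from plausible values.

    pvs:   (n_pupils, M) plausible values, M >= 2
    w:     (n_pupils,) total pupil sampling weights
    rep_w: (n_pupils, R) replicate weights
    scale: multiplier on the sum of squared replicate deviations
           (assumption; set according to the replication scheme used)
    """
    pvs = np.asarray(pvs, float)
    rep_w = np.asarray(rep_w, float)
    n_pv = pvs.shape[1]
    # One weighted estimate per plausible value.
    est = np.array([weighted_mean(pvs[:, m], w) for m in range(n_pv)])
    # Sampling variance per plausible value from the replicate estimates.
    samp_var = np.empty(n_pv)
    for m in range(n_pv):
        reps = np.array([weighted_mean(pvs[:, m], rep_w[:, r])
                         for r in range(rep_w.shape[1])])
        samp_var[m] = scale * np.sum((reps - est[m]) ** 2)
    point = est.mean()
    within = samp_var.mean()       # average sampling variance
    between = est.var(ddof=1)      # variance between plausible values
    total_var = within + (1 + 1 / n_pv) * between
    return point, np.sqrt(total_var)
```

The same combination rule applies to regression coefficients such as the TICS slope: the statistic is estimated once per plausible value and per replicate weight, and the two variance components are then added.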
Statistical analysis robustness checks
The sensitivity of the TICS recoding was explored with an alternative dummy coding
of the teacher responses to whether and when each of the topics was taught where 1
indicated it was taught this year and 0 indicated it was taught before this year, not yet
taught, or just introduced. As some schools may be influential outliers, identified as
having a Cook’s distance D > 4/n (Bollen and Jackman 1990), the main conditional model
was rerun without influential outlier schools. Linearity of the relationship between TICS
and achievement was explored by the addition of a quadratic TICS term to the
regression model and through residual plots.
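The outlier rule mentioned above (Cook's distance D > 4/n) can be illustrated with ordinary least squares. This is a simplified sketch: it assumes a single-level regression with one row per school, whereas the reported models were multigroup models fitted with sampling weights, and the function names are hypothetical.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation in an OLS fit of y on X.

    X must already contain the intercept column.
    """
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, p = X.shape
    # Hat matrix via the pseudo-inverse of X; its diagonal holds the leverages.
    H = X @ np.linalg.pinv(X)
    h = np.diag(H)
    resid = y - H @ y
    s2 = resid @ resid / (n - p)   # residual variance estimate
    return (resid ** 2 / (p * s2)) * h / (1 - h) ** 2

def flag_influential(X, y):
    """Boolean mask of observations with Cook's distance above 4/n."""
    d = cooks_distance(X, y)
    return d > 4 / len(d)
```

A robustness check in this spirit would refit the model on the rows where the mask is False and compare the TICS slope with the original estimate.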
Predicted score and rank
TICS-adjusted country achievement scores and ranks were computed based on the
parameter estimates of the conditional models. Next to providing the original rank
scenario (O), a least-possible TICS-adjusted score scenario (Zero) and a
most-possible TICS-adjusted score scenario (Full) were provided for comparing countries on an
equal footing, and a country-specific median TICS-adjusted score scenario (Med) was
provided for a more realistic comparison conditional on each country’s observed TICS
values. The country-level median achievement rank of these TICS-adjusted
predictions (with corresponding 95% inferential uncertainty intervals) was reported.
Simulated sampling distributions for statistics of interest were derived through 5000 Monte
Carlo draws from a multivariate normal distribution with mean vector set to the point
estimates of the regression parameters and variance–covariance matrix set to their
estimated variance–covariance matrix. The free statistical software environment R
was used in combination with Mplus 8 (Muthén and Muthén 1998) for all analyses.
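The Monte Carlo procedure for ranks described above can be sketched as follows. This is a hedged illustration, not the code used for the reported results: the layout of the parameter vector (intercept and slope alternating per country), the function name, and the use of percentile-based 95% intervals are assumptions made for the example.

```python
import numpy as np

def simulate_ranks(estimates, covariance, tics_values, n_draws=5000, seed=0):
    """Simulate country ranks under one TICS scenario.

    estimates:   1-D vector [a_1, b_1, a_2, b_2, ...] holding each country's
                 intercept and TICS slope (assumed layout).
    covariance:  variance-covariance matrix of `estimates`.
    tics_values: TICS value plugged in per country
                 (0 for Zero, 1 for Full, the country median for Med).
    """
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(estimates, covariance, size=n_draws)
    intercepts, slopes = draws[:, 0::2], draws[:, 1::2]
    predicted = intercepts + slopes * np.asarray(tics_values, float)
    # Rank 1 = highest predicted score within each draw.
    ranks = (-predicted).argsort(axis=1).argsort(axis=1) + 1
    median_rank = np.median(ranks, axis=0)
    lower, upper = np.percentile(ranks, [2.5, 97.5], axis=0)
    return median_rank, lower, upper
```

Calling the function once per scenario with the same estimates and covariance matrix gives ranks that differ only through the TICS value at which the countries are compared.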
Overall science implementation
Consistent with the consensus-seeking curriculum foundation of the TIMSS item design,
the TICS is generally high for most countries (median of country medians = .73), with
50% of the countries being within .11 absolute distance from this value (i.e., TICS = [.62,
.84]). There are two notable exceptions with median TICS below .50: New Zealand and
Norway’s grade 8. The previously mentioned move by Norway to shift its tested TIMSS
population by one school grade upwards can be seen in the light of its low TICS for
grade 8 (Mdn = .41) compared with grade 9 (Mdn = .64). The signs of a centralized
educational system in Japan, which were mentioned in the introduction, are also reflected
in it having a low spread in TICS (MAD = .05: at least 50% of the classes in Japan have
at most 1 topic [.05 TICS × 18 topics ≈ 1 topic] difference from the median TICS in
the country). The largest spread in TICS is in Malta (MAD = .20), which is roughly the
equivalent of 3 topics’ difference with the country’s median TICS.
Science domain implementation
The most implemented science domain across the countries was chemistry (Mdn = .83),
followed by physics (.80), earth science (.75), and biology (.71). The between-country
spread in how much the teachers implemented the TIMSS topics spanned from the more
evenly implemented chemistry and physics domains (MAD = .00 and .00, respectively)
to biology (MAD = .14) and the most unevenly implemented earth science (MAD = .25).
Countries at both ends of the TICS scale could be found in all domains ($\text{range}_{\text{biology}} = .57$,
$\text{range}_{\text{chemistry}} = .67$, $\text{range}_{\text{earth science}} = 1.00$, $\text{range}_{\text{physics}} = .80$).
TICS was quite high in biology for most countries, with the notable exceptions of
Norway (grade 8) and New Zealand (lowest, with Mdn = .43). TICS was very high in
chemistry, with all countries having median TICS above .50 except for Hong Kong (Mdn = .33).
TICS in earth science was characterized by a split between high median in many
countries and low median in several countries, namely Hong Kong, Ireland, Israel, Malaysia,
New Zealand, Chinese Taipei (Taiwan), and Singapore, all of which had a median below
.50. TICS in physics was generally high, with only Norway grade 8 (Mdn = .20) and grade
9 (Mdn = .40) being below .50. Thus, TICS is lower for Norway’s grade 8 than grade 9 in
overall science and all domains, and its grade 8 is lower than most other participating
countries. These findings support the claim that the Norwegian eighth school year is not
comparable with other countries’ eighth school year in terms of curriculum coverage,
whereas Norway’s grade 9 is more comparable.
Although countries that show high overall implementation will logically also have
high implementation across all four science domains, there are some distinct deviations
from the overall pattern. The earth science topics are, for instance, not taught by the
responding teachers before grade 9 in Chinese Taipei (Taiwan; Mdn = .00, MAD = .00),
even though the intended curriculum information from the TIMSS curriculum
matching analysis (TCMA) indicates complete coverage of all items there. The low
implementation of earth science topics in Singapore and Hong Kong is due to earth science being
taught in other subjects and not by the science teachers
(Mullis et al. 2016).
Within-country TICS profiles at school level
The boxplots in Fig. 1 that represent
spread in implemented curriculum scores for each domain are a good reflection of
the country-level curriculum implementation profile. Yet, one might wonder whether
they hide different within-country TICS profiles at school level. Schools within some
countries might vary in the extent to which they implement the content domains. For
instance, some schools might invest heavily in biology, whereas other schools might seek
a balance across domains. Moreover, in countries with federal structures, schools in
different states or provinces might follow different science curricula. Similarly, in countries
with selective lower-secondary education, schools of different types and intake
requirements likely follow different science curricula. Each line of the spaghetti plot in Fig. 2
depicts a school, and the plot shows how much a school has implemented a domain.
On the one hand, in Chinese Taipei (Taiwan) and Singapore, most schools vary greatly
across science domains in the degree of TICS. On the other hand, in the United States
and Jordan, most schools implement the same amount across all domains, as seen by the
flat lines profile.
However, these flat lines are also parallel, indicating that this heterogeneity across
domains is very similar across schools. For instance, the implementation of domains
seems parallel for most schools in the United Arab Emirates, England, and Japan, with
only differences in the TICS ‘intercepts’ of the patterns (i.e., level of implemented
curriculum scores). This implies that some schools generally implement more than other
schools across all the domains. In contrast, in countries such as Singapore and Chinese
Taipei (Taiwan), school-level profiles are less parallel and compared to the country’s
average profile, many schools tend to implement more of some topics at the cost of other topics.
The country-level analysis of the teacher-reported implementation of TIMSS topics
confirms that, although the implemented curriculum score is relatively high overall, there
are noticeable differences in TICS between the participating countries in TIMSS and
between schools within a country. The next logical question to then ask is to what extent
these differences impact the countries’ science achievement scores and rankings.
TIMSS implemented curriculum score (TICS) and achievement score
Logic dictates that we can expect the relationship between degree of TICS and
achievement to be positive: Countries whose curriculum is aligned with TIMSS and that
generally focus on width and depth of science education are expected to perform well
(i.e., between-country regression effect of TICS on achievement: $b_{TICS}^{(between)} > 0$).
Similarly, students in schools that have high implementation of the TIMSS curriculum are
expected to perform well (i.e., within-country regression effect of TICS on achievement:
$b_{TICS}^{(within)} > 0$ for all countries).
Regardless of the outcome with respect to the relation between TICS and
achievement, we investigated the sensitivity of the science achievement country rankings to
differences in TICS. Five rankings were compiled, beginning with the original
international TIMSS science achievement ranking, the ranking based on the predicted
country TIMSS science achievement score if all schools within the country had a TICS score
equal to 1 (i.e., full coverage), and the ranking based on the predicted country TIMSS
science achievement score if all schools within the country had a TICS score equal to
the median reported TICS in that country. The two other rankings were predictions
based on the TICS score equal to the within-country minimum and maximum reported
TICS score, respectively. The latter two rankings would reflect the relative comparative
performance of countries at their lowest and highest level of implemented curriculum,
whereas the median-based ranking can be regarded as a more realistic TICS-adjusted
ranking and the theoretical maximum TICS-adjusted ranking offers an absolute
comparison at a utopian equal footing.
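The TICS-adjusted rankings described above can be summarized in a short worked equation. The notation is introduced here for illustration only ($\hat{y}_c$, $a_c$, $b_c$, and $t$ are assumed symbols), and it presumes that the conditional model for country $c$ is a simple linear regression of achievement on TICS. The predicted country score at a chosen TICS value $t$ is then

```latex
\[
\hat{y}_c(t) = a_c + b_c\, t,
\qquad \text{for example } t = 0,\quad t = 1,\quad \text{or } t = \mathrm{Mdn}_c ,
\]
\[
b_c = \hat{y}_c(1) - \hat{y}_c(0).
\]
```

Under this reading, the slope $b_c$ is the predicted difference between full and zero implementation, and each ranking orders the countries by $\hat{y}_c(t)$ at the TICS value that defines the scenario.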
The four panels in Fig. 3a–d depict the between-country relationships for overall
science between the central tendency and spread of TICS and achievement. A simple linear
fit line is overlaid with 95% confidence intervals (white line on gray area). For instance,
Norway’s grade 8 pupils (NO8) have a low median implementation of the TIMSS
content that, combined with a mid-ranged average achievement score, makes them stand
out on the left side in Fig. 3a. Norway’s grade 9 pupils (NOR) have a somewhat higher
level of TIMSS content implementation and a higher average achievement score, which
hints at a positive link between TICS and achievement. Yet, counter to our expectations,
the regression of country-level mean achievement on median TICS shows a significant
negative slope, $b_{TICS}^{(between)} = -184$ [−342, −25] ($R^2 = .153$). A plausible explanation of this
pattern is that quite a few of the lower-performing countries have relatively young
educational systems with (reformed) curricula being influenced by or in line with the
international educational assessments (i.e., higher TICS), whereas the higher-performing
countries typically have more established educational systems with their own historical
traditions and less tight formal connection to the international educational assessments.
The observation that countries having implemented more of the TIMSS content have
more educational outcome inequality (see Fig. 3c) might lend further support for such
an interpretation. Notice that, more in line with expectations, countries with more
between-school differences in TIMSS content implementation tend to also have more
between-school differences in school average achievement (see Fig. 3d). Yet, most
countries have rather similar degrees of within-country variation in TIMSS content
implementation, with the countries with the least spread (Bahrain) and the most spread
(Malta) in TICS both having a rather average score on science achievement (see Fig. 3b).
The forest plot in Fig. 4 displays for each country the 95% confidence interval around
$b_{TICS}^{(within)}$, the within-country regression effect of TICS on science achievement. The
$b_{TICS}^{(within)}$ indicates the expected difference in science achievement points between a
school whose teachers have reported full implementation of the TIMSS content (i.e., all
18 TIMSS topics were taught) and a school whose teachers have reported zero
implementation of the TIMSS content (i.e., none of the 18 TIMSS topics were taught). For
instance, the expected science achievement score in Norway for grade 8 pupils with full
opportunity to learn the TIMSS content would be 16 [−20, 51] points higher than pupils
with no opportunity to learn the content; however, the change is not significantly
different from zero as its gray confidence interval overlaps with the dashed line. A
similar pattern occurs for Norway’s grade 9 and most other countries, with wide confidence
intervals around small point estimates for $b_{TICS}^{(within)}$ reflecting the large uncertainty around
these findings. Hence, counter to our expectations, a null finding is observed for the
within-country relation between TICS and achievement.
There are some exceptions (where orange confidence intervals with triangles do not
overlap with zero). Higher implementation of the TIMSS content is associated with
higher achievement in Qatar ($b_{TICS}^{(QAT)} = 153$ [50, 255], $R^2 = .05$), Turkey ($b_{TICS}^{(TUR)} = 120$ [6,
233], $R^2 = .02$), Singapore ($b_{TICS}^{(SGP)} = 78$ [11, 145], $R^2 = .03$), and Malta ($b_{TICS}^{(MLT)} = 22$ [3, 40],
$R^2 = .01$). However, even in these countries, TIMSS content implementation explains at
best a tiny part of the within-country variation in achievement.1
For the sensitivity analysis, the predicted achievement for one zero TICS (Zero) and one
full TICS (Full) scenario allows for absolute comparison across countries, whereas the
one country-specific median TICS (Med) scenario allows for a realistic relative
comparison. These scenarios were compared with the original scenario (O). Figure 5 illustrates
the expected country ranks under these five scenarios, where a rank of 1 corresponds to
the highest achievement score across all countries under the given condition. For
example, Norway’s original rank (O) among the included countries in this study is 17 for its
grade 8 and 13 for its grade 9. Irrespective of whether for all countries the schools have
the least possible (Zero), the most possible (Full), or each country’s median (Med) level
of TIMSS topics implementation, the ranks are quite stable. We do observe that
comparing countries at the least possible TICS level increases the width of the confidence
intervals and the uncertainty surrounding the ranking for all countries.
1 The general null findings remained stable under the statistical analysis robustness checks.
Stability across science domains
The forest plots for the science domains (see Appendix) also did not indicate much
support for a relationship between the degree of TIMSS content implementation and
achievement. Similarly, the ranks remained stable across the scenarios for each domain,
with only changes in the Zero TICS scenario (drop in rank for Qatar in biology and for
Singapore in chemistry; see Appendix).
TICS country profiles
This study partially supports Norway’s decision to shift its target population one
school year up. The analysis of the TICS revealed that the Norwegian grade 8 pupils
have experienced less opportunity to learn the science content that is tested in TIMSS
across all science domains, as compared with pupils in their grade 9 and compared with
pupils in most other participating countries. Yet, the analysis also revealed that New
Zealand’s eighth graders have an equally low TICS level as those in Norway across all
domains. New Zealand’s pupil sample is tested at an age ($M_{age} = 14.1$) and grade (8.5–9.5)
between Norway’s grade 8 and grade 9 (see Table 1), and its achievement score is
at the level of Norway’s grade 9. This raises a question of whether New Zealand and
other countries with low implementation relative to other participating countries can
or should make the same shift. Should more countries join the out-of-grade group of
countries in TIMSS, then country comparisons might become even more challenging
as the TIMSS participants could possibly lack both a common formal grade and a
common age link. Furthermore, analyses have yet to clarify whether such changes matter for
achievement based on the differences in degree of implementation of TIMSS content.
Despite the finding of an increase in country average achievement and TICS level
between Norwegian pupils in grade 8 and grade 9, there was generally no evidence of
a positive between-country relationship between implementation and achievement.
Instead, the relationship seemed negative: Countries with higher degrees of TIMSS
content implementation tended to have lower average achievement scores. The
plausible explanation raised for this pattern was that quite a few of the lower-performing
countries have relatively young educational systems with (reformed) curricula being
more influenced by or in line with the international educational assessments, whereas
the higher-performing countries typically have more established educational systems
with their own historical traditions and less tight formal connection to the international
educational assessments (as noted previously). Hence, the between-country relationship
might be driven by different factors than what goes on within countries.
There was essentially no evidence of a within-country relationship between science
achievement and TICS, with only minor exceptions. Hence, the support of Norway’s
decision to move is limited because the within-country relationship between
achievement and implementation of TIMSS curriculum is weak across domains, making it
generally difficult for countries to expect higher average achievement scores with higher
implementation of the TIMSS curriculum. Yet, a glance at the Norwegian data suggests
that a large increase does occur in both average achievement score and median TICS
between the eighth grade and the ninth grade. This suggests that there is more variation
in TIMSS curriculum implementation scores across grades than across schools within
a grade. However, the large increase in average achievement between cohorts might be
explained by increased age, maturity, or familiarity with formal science assessments.
The sensitivity analysis indicated that the science achievement ranks were very stable
across hypothetical scenarios compared with the original rank. In these scenarios, all
schools in each country have implemented the same level of the TIMSS content, based
on either the country-specific median or the least possible or most possible level of
TIMSS content implementation. This stability across scenarios is counter-intuitive, as
one would expect most countries to drop or climb in ranks if all schools in all
participating countries implemented the same level as the least or most possible TIMSS
content implementation. Albeit counter-intuitive, the findings are supported by previous
research that indicates that opportunity to learn might not matter much. Scheerens has
noted how the empirical evidence of the effect of opportunity to learn is often weaker
than first thought. In Scheerens and Bosker’s meta-analyses of
various experimental and non-experimental studies on instructional factors,
only “small to negligible effects” on achievement were found for
opportunity to learn. The lack of evidence seems particularly apparent in analyses of large-scale
assessment data. The previously discussed study by Hencke et al. on the sensitivity of
mathematics achievement scores and ranks in TIMSS 2003, using the TCMA
information on each item’s coverage in a country, showed stability in achievement scores and
ranks across countries. Hence, neither the use of intended curriculum information nor
implemented curriculum information from TIMSS seems to explain much of the
variation in achievement.
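Why ranks can remain stable under such scenarios can be illustrated with a minimal sketch. It assumes a simple linear projection (observed country mean plus a within-country slope times the change in implementation level), which is a simplification and not necessarily the model used in this study. All country labels, means, implementation scores, and slopes are hypothetical.

```python
def ranks(scores):
    """Rank 1 = highest score; assumes no ties (true for the toy data)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {country: i + 1 for i, country in enumerate(order)}

def spearman(r1, r2):
    """Spearman rank correlation for two untied rankings of the same countries."""
    n = len(r1)
    d2 = sum((r1[c] - r2[c]) ** 2 for c in r1)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical inputs: observed mean, median implementation score (0 to 1),
# and an assumed within-country slope (points per unit of implementation).
countries = {
    "A": {"mean": 560.0, "tics": 0.50, "slope": 5.0},
    "B": {"mean": 530.0, "tics": 0.70, "slope": 0.0},
    "C": {"mean": 500.0, "tics": 0.90, "slope": 10.0},
    "D": {"mean": 470.0, "tics": 0.60, "slope": -5.0},
}

def scenario_means(level):
    """Projected country means if every classroom implemented `level`."""
    return {c: v["mean"] + v["slope"] * (level - v["tics"])
            for c, v in countries.items()}

original = ranks({c: v["mean"] for c, v in countries.items()})
least = ranks(scenario_means(0.0))  # least possible implementation
most = ranks(scenario_means(1.0))   # most possible implementation

rho_least = spearman(original, least)
rho_most = spearman(original, most)
print(rho_least, rho_most)
```

When the assumed slopes are small relative to the gaps between country means, as in these invented numbers, the projected shifts are too small to reorder any countries and both rank correlations equal 1.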
The lack of evidence for a link between opportunity to learn and achievement could be
due to one or more plausible factors. A third-variable explanation is possible, but the
issue of operationalization of opportunity to learn and the validity of chosen indicators
is the crucial one in our opinion.
Conditional opportunity to learn effects
First, although there was a lack of evidence for a marginal relationship between TICS
and achievement, this might change depending on relevant contextual factors. For
instance, the effect of opportunity to learn might be conditional on socio-economic
status: Pupils from families of low socio-economic status might be more dependent on
opportunity to learn at school, whereas pupils from families of higher socio-economic
status have resources to counter poor teachers and insufficient coverage of topics.
Previous research has suggested a link between immigrant status and lower opportunity to
learn the core curriculum (Wang and Goldschmidt 1999), and between socio-economic
status, student-level acquaintance with content topics, and mathematics achievement
(Schmidt et al. 2015). Future research could explore the link between opportunity
to learn the TIMSS science content, indicators of socio-economic status, and science
achievement.
Opportunity to learn indicators
This study initially raised issues with the use of the TCMA data on intended curriculum.
The TCMA data, albeit precise on the content side of the test (i.e. the items), suffer from
imprecise national curriculum goals and are too general for the nuances in
implementation across teachers. The current study benefits from greater precision on the teacher
side, without too great loss of precision on the content side (i.e. topics). However, the
information on implemented curriculum is still dependent upon the exact survey
questions and the interpretation of these questions by the teacher.
TIMSS surveys only the science and math teachers of the sampled classes. However,
in some countries, certain science topics in TIMSS are covered by teachers that are not
surveyed. For instance, some earth science topics are covered in the geography subject
instead of the general science class in Norway, Taiwan, and England. This means that
there might be gaps in the implemented curriculum information for some countries.
The response categories for curriculum implementation use coarse categories (taught
earlier, taught this year, not yet taught) and lack nuance in qualitative degree and time
of content implementation. Varying standards can influence when a topic is considered
taught this year: Teacher A can argue that the topic was briefly mentioned in class and
decide to respond that the topic was “taught this year”, but teacher B might give the same
response only if a whole month was spent on the topic. Another factor is the level
of detail in the teaching of the topic. For example, the cells topic could be taught at a
very superficial level (e.g., only a plant cell) or at a more detailed level (e.g., multiple
cell types and cell organelles). Different teachers are likely to have different opinions on
whether they have “implemented” a topic or not depending on the level of detail with
which they have covered it in lessons. What does it mean to have “implemented a topic”
in a class across the different participating countries?
Furthermore, a science topic might cover a broad range of science curriculum content
that does not necessarily relate to a recognizable content grouping within the teachers’
own training and teaching practice. Has a TIMSS topic such as “electricity and
magnetism” been treated as a single didactical topic in the classroom? Aggregating these topics
across domains might further obscure their intended connection to classroom practice.
As research has already indicated that performance on topics within a TIMSS domain is
heterogeneous (Daus et al. under review), a differential opportunity to learn perspective
across more specific content groups might be more fruitful than seeking global effects at
the aggregated domain level.
Our suspicion that the opportunity to learn indicators in TIMSS are to
blame for our general lack of evidence might seem odd given the success of Schmidt
et al. (2001) in finding a relationship between opportunity to learn and achievement
using the TIMSS 1995 data. However, their findings were much weaker for science
than mathematics, and the difference between our findings and those of Schmidt et al.
might be related to the much richer and more diverse implemented curriculum
indicators available in TIMSS 1995. In TIMSS 1995, intended curriculum information was
collected on textbooks and curriculum guides with topic trace mapping of the TIMSS
framework content topics across curriculum grades as well as document coding of
curriculum documents using the TIMSS framework. Implemented curriculum
information was collected from adjacent grades on more than 20 mathematics topics and more
than 20 science topics regarding whether it was taught, how much it had been taught
the last year, whether it was the subject of the last lesson, and for some topics whether
four example items from the topic were appropriate for the class. However, TIMSS is
under continuous development and has reduced the extent of the implemented
curriculum information collection since 1995. This might be problematic because, in contrast to
the intention of a “real-life literacy skills” framework in the PISA study, TIMSS is largely
based on the common curriculum of the participating countries. Hence, analyses of the
TIMSS data should include the implemented curriculum. Moreover, despite the lack of
evidence for a relationship between TICS and achievement in this study, and the
potential issues with the implemented curriculum indicators, the value of these indicators
also comes from their capacity to document changes in curriculum across time within
countries and differences in curriculum between countries. Therefore, we would suggest
revaluing these implemented curriculum indicators in TIMSS by continuing to improve
their quality and scope.
Attention to opportunity to learn is important for fair comparisons of educational
systems. At first sight of the results in this study, one might thus be inclined to appreciate
that TIMSS achievement seems insensitive to differences in opportunity to learn within
countries, based on current indicators. Yet, learning clearly occurs across a child’s
development, so why is it so difficult to empirically connect the most obvious conceptual
relationship (i.e., opportunity to learn and achievement) using data from the
international educational assessments? Progress in research on the effects of curriculum
implementation can be gained only if more attention is placed on validity and precision of
the measures. One place to start the debugging is deeper scrutiny of the indicators and
instruments for opportunity to learn in TIMSS.
Abbreviations
TIMSS: Trends in International Mathematics and Science Study; TICS: TIMSS implemented curriculum score; TCMA: TIMSS
curriculum matching analysis; ICGR: intended curriculum grade range; Mdn: median; MAD: median absolute deviation.
Authors' contributions
SD: analysis and writing. JB: conceptualization and writing. Both authors read and approved the final manuscript.
Acknowledgements
The authors are grateful for feedback on an early draft from Dr Trude Nilsen, Department of Teacher Education and
School Research, University of Oslo.
Competing interests
We have read and understood Large-scale Assessments in Education’s policy on declaration of interests and declare that
we have no competing interests.
Availability of data and materials
The TIMSS datasets supporting the conclusions of this article are available in the TIMSS & PIRLS repository (http://timssandpirls.bc.edu/).
Additional materials are available in https://osf.io/4qbya/.
Ethics approval and consent to participate
The following plots are the corresponding plots from the main text for each of the
science domains biology, chemistry, earth science, and physics.
See Figs. 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 and 17.
Bollen, K., & Jackman, R. (1990). Regression diagnostics: An expository treatment of outliers and influential cases. In J. Fox
& J. Long (Eds.), Modern methods of data analysis (pp. 257–291). Newbury Park: Sage.
Comber, L. C., & Keeves, J. P. (1973). Science education in nineteen countries; an empirical study. New York: Wiley.
Hencke, J., Rutkowski, L., Neuschmidt, O., & Gonzalez, E. J. (2009). Curriculum coverage and scale correlation on TIMSS 2003. IERI Monograph Series Issues and Methodologies in Large Scale Assessments, 2(4), 85–112.
Husén, T., & Postlethwaite, T. N. (1996). A brief history of the International Association for the Evaluation of Educational Achievement (IEA). Assessment in Education: Principles, Policy and Practice, 3(2), 129–141. https://doi.org/10.1080/0969594960030202.
Luyten, H. (2016). Chapter 5: Predictive power of OTL measures in TIMSS and PISA. In J. Scheerens (Ed.), Opportunity to learn, curriculum alignment and test preparation: A research review (pp. 103–119). Dordrecht: Springer.
Matsubara, K., Hagiwara, Y., & Saruta, Y. (2016). A statistical analysis of the characteristics of the intended curriculum for Japanese primary science and its relationship to the attained curriculum. Large-scale Assessments in Education, 4(13), 1–18. https://doi.org/10.1186/s40536-016-0028-0.
Mullis, I. V. S. (2013). TIMSS 2015 assessment frameworks. Chestnut Hill: TIMSS and PIRLS International Study Center, Lynch School of Education, Boston College.
Mullis, I. V. S., Martin, M. O., Goh, S., & Cotter, K. (2016). TIMSS 2015 Encyclopedia: Education policy and curriculum in mathematics and science. Boston: Boston College, TIMSS & PIRLS International Study Center.
Muthén, L. K., & Muthén, B. O. (1998). Mplus User's Guide (8th ed.). Los Angeles: Muthén & Muthén.
R Core Team. (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.r-project.org.
Scheerens, J. (Ed.). (2016). Opportunity to learn, curriculum alignment and test preparation: A research review. Dordrecht: Springer.
Scheerens, J., & Bosker, R. J. (1997). The foundations of educational effectiveness (1st ed.). New York: Pergamon.
Schmidt, W. H., Burroughs, N. A., Zoido, P., & Houang, R. T. (2015). The role of schooling in perpetuating educational inequality: An international perspective. Educational Researcher, 44(7), 371–386. https://doi.org/10.3102/0013189x15603982.
Schmidt, W. H., McKnight, C. C., Houang, R. T., Wang, H., Wiley, D. E., Cogan, L. S., et al. (2001). Why schools matter: A cross-national comparison of curriculum and learning. San Francisco: Jossey-Bass.
Wang, J., & Goldschmidt, P. (1999). Opportunity to learn, language proficiency, and immigrant status effects on mathematics achievement. The Journal of Educational Research, 93(2), 101–111. https://doi.org/10.1080/00220679909597634.