Fixation duration on natural scenes is explained by memory encoding not processing demand
nature neuroscience
Article
https://doi.org/10.1038/s41593-026-02285-1
Received: 2 July 2025
Accepted: 27 March 2026
Published online: 25 May 2026
Philip Sulewski 1,2, Carmen Amme 1, Martin N. Hebart 2,3,4, Peter König 1,5 & Tim C. Kietzmann 1
Before each of the roughly 200,000 eye movements we make each day, the brain
decides how long to fixate before shifting gaze to new information. Here we
investigate this process using a large-scale scene-viewing experiment (4,080
natural scenes, five participants) that combines magnetoencephalography,
eye tracking and a semantic captioning task. Using multivariate analysis of
magnetoencephalography source-space patterns, behavioral analyses and
artificial neural network (ANN) modeling, we show that longer fixations do
not reflect prolonged visual processing but relate to downstream memory
encoding. First, temporal variability of ventral stream representational
dynamics did not explain variability in fixation duration. Second, fixation
durations were anticorrelated with ANN-estimated patch classification
difficulty. Third, fixation durations correlate positively with ANN-predicted
patch memorability and caption-inclusion and co-occur with increased
theta–gamma phase–amplitude coupling, particularly in frontal and
hippocampal regions. These results indicate that eye-movement timing
decisions are shaped by memory-encoding demands rather than by
perceptual processing limits.
Natural vision continuously involves deciding between maintaining
our current fixation or redirecting our gaze to explore new visual
information. Considering the high information density and dynamic
changes of the natural world, a fast sampling of varied information
seems preferable. However, our eyes remain at some locations for more
than 500 ms, whereas others are visited for less than 150 ms (refs. 1,2),
a striking delay given the brain’s remarkably fast processing speeds (refs. 3,4).
This large variation raises a fundamental question about the brain’s
underlying computational strategy.
Although previous research has described correlations between
fixation durations and local image features (for example, local contrast,
edge density), task demands and exploration sequence parameters (refs. 1,2,5–9),
our understanding of the underlying neural information processing
remains limited. We narrow this gap by developing process-related
predictors that test specific hypotheses about the neural information
processing mechanisms that guide fixation timing.
A prevalent theory of why our eyes rest longer at some locations is
based on the consideration that the brain may require varying amounts
of time to extract information. Indeed, in the case of static vision in
macaques, Kar et al. (ref. 10) demonstrated the need for recurrent information processing in cases of challenging stimuli, indicating that more
complex visual stimuli demand prolonged neural computational time.
Analogously, recurrent artificial neural network (ANN) models were
shown to align with human reaction times in an animacy classification
1Institute of Cognitive Science, Osnabrück University, Osnabrück, Germany.
2Vision and Computational Cognition Group, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany.
3Department of Medicine, Justus Liebig University Giessen, Giessen, Germany.
4Center for Mind, Brain and Behavior, Universities of Marburg, Giessen and Darmstadt, Marburg, Germany.
5Department of Neurophysiology and Pathophysiology, Center of Experimental Medicine, University Medical Center Hamburg-Eppendorf, Hamburg, Germany.
Nature Neuroscience | Volume 29 | June 2026 | 1488–1497
[Fig. 1 graphic, panels a–d. Labels in the graphic: AVS = MEG + active vision + Natural Scenes Dataset (NSD); example caption "A cat is relaxing on a computer ..."; after 25% of scenes: record semantic scene caption; fixation-aligned neural dynamics (eye movements, MEG signal, source projection, source vertices); 4,080 natural scenes per participant. Axes in d: caption LLM-embedding cluster versus proportion of cluster scenes in dataset (%).]
Fig. 1 | Understanding the neural dynamics of active scene viewing: the
Active Visual Semantics dataset. a, Experimental design combining MEG
recordings with natural scene exploration. Participants freely viewed 4,080
scenes (subsampled from the NSD; ref. 13) for 4 s each, with eye movements
continuously recorded. After a randomly selected 25% of trials, participants were
asked to verbally caption the scene. b, Analysis pipeline showing that MEG
signals were source-projected and analyzed time-locked to fixation events to
capture neural dynamics during natural viewing. c, Large language model (LLM)
caption embedding space visualization (t-distributed stochastic neighbor
embedding (t-SNE)) of the stimulus set. Each dot represents a scene, with colors
indicating semantic clusters used for balanced sampling from the NSD scenes
(gray dots). d, Distribution of scenes across 60 semantic clusters, demonstrating
semantically balanced sampling. Green bars show the proportion of scenes per
cluster in the Active Visual Semantics (AVS) dataset compared to the original NSD
distribution (gray). Photos from the COCO image dataset/Flickr (ref. 51).
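The semantically balanced sampling described in panels c and d can be illustrated with a minimal sketch. It assumes that balancing means drawing an equal quota of scenes from each semantic cluster, which the caption suggests but does not spell out; the helper `balanced_sample`, the synthetic pool and the cluster sizes are hypothetical and are not taken from the study.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(cluster_labels, n_total, seed=0):
    """Return sorted item indices with (near-)equal counts per cluster.

    Hypothetical helper for illustration; not the authors' procedure.
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        members[label].append(idx)

    clusters = sorted(members)
    # Equal quota per cluster; any remainder goes to the first clusters.
    quota, remainder = divmod(n_total, len(clusters))

    chosen = []
    for i, cluster in enumerate(clusters):
        k = quota + (1 if i < remainder else 0)
        if k > len(members[cluster]):
            raise ValueError(
                f"cluster {cluster} has {len(members[cluster])} items, need {k}"
            )
        # Sample without replacement inside the cluster.
        chosen.extend(rng.sample(members[cluster], k))
    return sorted(chosen)


# Synthetic, deliberately imbalanced pool (assumption): 60 clusters whose
# sizes grow from 100 to 690 items.
pool_labels = [c for c in range(60) for _ in range(100 + 10 * c)]

selected = balanced_sample(pool_labels, n_total=4080)
counts = Counter(pool_labels[i] for i in selected)
print(min(counts.values()), max(counts.values()))  # prints: 68 68
```

With 4,080 scenes and 60 clusters, an equal quota works out to 68 scenes per cluster, so every cluster contributes the same share of the subsample regardless of its size in the pool.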
task (ref. 11). The corresponding processing-demand hypothesis predicts
longer fixation durations for local image features that are comparatively challenging to recognize. On the neural level, this view proposes
a delayed convergence of representational dynamics when fixating
challenging targets compared to simpler ones.
However, an alternative account is also possible. Based on the
observation that ventral stream representations exhibit only limited
cross-fixation integration (ref. 12), ventral stream information is largely overwritten as the eyes move on to a new location. Prolonged fixations could
therefore reflect a strategic time allocation of the brain to actively stabilize the neural embedding of a fixated patch to support downstream
processes, such as memory encoding, before allowing a disruption by
the next saccade. We term this alternative account of fixation durations
the memory-facilitation hypothesis.
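The two hypotheses make different predictions about how fixation duration should covary with patch-level predictors: the processing-demand hypothesis expects longer fixations on patches that are harder to recognize, whereas the memory-facilitation hypothesis expects longer fixations on patches that are more likely to be remembered. The following minimal sketch shows how such sign predictions could be checked with a Spearman rank correlation. All data are synthetic and are constructed so that durations increase with memorability and decrease with difficulty; the variable names, effect sizes and noise level are assumptions for illustration, and the choice of Spearman correlation is not taken from the paper.

```python
import random
from statistics import mean


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Synthetic per-fixation predictors (assumption: both scaled to [0, 1]).
rng = random.Random(0)
n_fixations = 500
difficulty = [rng.random() for _ in range(n_fixations)]
memorability = [rng.random() for _ in range(n_fixations)]

# Synthetic durations in ms, built to rise with memorability and fall
# with difficulty (illustrative effect sizes, not estimates from data).
duration = [
    250 + 150 * m - 100 * d + rng.gauss(0, 40)
    for m, d in zip(memorability, difficulty)
]

rho_difficulty = spearman(duration, difficulty)
rho_memorability = spearman(duration, memorability)
print(f"duration vs difficulty:   rho = {rho_difficulty:+.2f}")
print(f"duration vs memorability: rho = {rho_memorability:+.2f}")
```

In this toy setting, a positive correlation with difficulty would favor the processing-demand hypothesis, while a negative correlation with difficulty together with a positive correlation with memorability is the pattern expected under the memory-facilitation hypothesis.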
To differentiate between these two hypotheses, we collected a
large-scale magnetoencephalography (MEG) dataset in which five
participants actively explored 4,080 natural scenes (Fig. 1). The
experiment consisted of a scene-captioning task during which participants freely explored each natural scene for 4 s, in 25% of cases
followed by a request to provide a verbal semantic description (Fig. 1a).
Throughout the experiment, we simultaneously recorded MEG signals and eye movements, allowing us to analyze the neural dynamics
time-locked to fixations (Fig. 1b). The scenes were subsampled from
the Natural Scenes Dataset (NSD; ref. 13). To create a semantically diverse,
yet balanced, stimulus set, we clustere (...truncated)