Fixation duration on natural scenes is explained by memory encoding not processing demand

Nature Neuroscience, May 2026

Before each of around 200,000 eye movements we make each day, the brain decides how long to fixate before shifting gaze to new information. Here we investigate this process using a large-scale scene-viewing experiment (4,080 natural scenes, five participants) that combines magnetoencephalography, eye tracking and a semantic captioning task. Using multivariate analysis of magnetoencephalography source-space patterns, behavioral analyses and artificial neural network (ANN) modeling, we show that longer fixations do not reflect prolonged visual processing but relate to downstream memory encoding. First, temporal variability of ventral stream representational dynamics did not explain variability in fixation duration. Second, fixation durations were anticorrelated with ANN-estimated patch classification difficulty. Third, fixation durations correlate positively with ANN-predicted patch memorability and caption-inclusion and co-occur with increased theta–gamma phase–amplitude coupling, particularly in frontal and hippocampal regions. These results indicate that eye-movement timing decisions are shaped by memory-encoding demands rather than by perceptual processing limits.
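The second and third findings can be read as a sign test on two correlations: the processing-demand account predicts that fixation duration rises with classification difficulty, whereas the memory-encoding account predicts that it rises with memorability. The sketch below illustrates that logic with a Spearman rank correlation. It is only an illustration under stated assumptions: the excerpt does not say which statistic the authors used, the helper names are hypothetical, and the numbers are invented toy values, not data from the study.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy values, invented for illustration only (NOT study data).
fixation_ms = [140, 180, 220, 260, 310, 380, 450, 520]
difficulty = [0.9, 0.7, 0.8, 0.5, 0.6, 0.3, 0.4, 0.1]
memorability = [0.2, 0.3, 0.25, 0.5, 0.45, 0.7, 0.65, 0.9]

rho_difficulty = spearman(fixation_ms, difficulty)
rho_memorability = spearman(fixation_ms, memorability)
print(f"duration vs difficulty:   {rho_difficulty:+.3f}")
print(f"duration vs memorability: {rho_memorability:+.3f}")
```

In this toy setup the difficulty correlation is negative and the memorability correlation is positive, which is the pattern the abstract reports. A rank-based statistic is used here because fixation durations are typically right-skewed, but that choice is an assumption of the sketch.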

Received: 2 July 2025. Accepted: 27 March 2026. Published online: 25 May 2026.
Nature Neuroscience, Volume 29, June 2026, 1488–1497. https://doi.org/10.1038/s41593-026-02285-1

Philip Sulewski (1,2), Carmen Amme (1), Martin N. Hebart (2,3,4), Peter König (1,5) & Tim C. Kietzmann (1)

1 Institute of Cognitive Science, Osnabrück University, Osnabrück, Germany.
2 Vision and Computational Cognition Group, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany.
3 Department of Medicine, Justus Liebig University Giessen, Giessen, Germany.
4 Center for Mind, Brain and Behavior, Universities of Marburg, Giessen and Darmstadt, Marburg, Germany.
5 Department of Neurophysiology and Pathophysiology, Center of Experimental Medicine, University Medical Center Hamburg-Eppendorf, Hamburg, Germany.

Natural vision continuously involves deciding between maintaining our current fixation or redirecting our gaze to explore new visual information. Considering the high information density and dynamic changes of the natural world, a fast sampling of varied information seems preferable. However, our eyes remain at some locations for more than 500 ms, whereas others are visited for less than 150 ms (refs. 1,2), a striking delay given the brain’s remarkably fast processing speeds (refs. 3,4). This large variation raises a fundamental question about the brain’s underlying computational strategy. Although previous research has described correlations between fixation durations and local image features (for example, local contrast, edge density), task demands and exploration sequence parameters (refs. 1,2,5–9), our understanding of the underlying neural information processing remains limited. We narrow this gap by developing process-related predictors that test specific hypotheses about the neural information processing mechanisms that guide fixation timing.

A prevalent theory of why our eyes rest longer at some locations is based on the consideration that the brain may require varying amounts of time to extract information. Indeed, in the case of static vision in macaques, Kar et al. (ref. 10) demonstrated the need for recurrent information processing in cases of challenging stimuli, indicating that more complex visual stimuli demand prolonged neural computational time. Analogously, recurrent artificial neural network (ANN) models were shown to align with human reaction times in an animacy classification task (ref. 11). The corresponding processing-demand hypothesis expects longer fixation durations for local image features that are comparably challenging to recognize. On the neural level, this view proposes a delayed convergence of representational dynamics when fixating challenging targets compared to simpler ones.

[Figure 1, panels a–d. Recoverable panel labels: eye movements, MEG signal, example caption "A cat is relaxing on a computer ..." (a); fixation-aligned neural dynamics, source projection, source vertices (b); 4,080 natural scenes per participant (c); axes "Caption LLM-embedding cluster" and "Proportion of cluster scenes in dataset (%)" (d).]

Fig. 1 | Understanding the neural dynamics of active scene viewing: the Active Visual Semantics dataset. a, Experimental design combining MEG recordings with natural scene exploration. Participants freely viewed 4,080 scenes (subsampled from the NSD dataset; ref. 13) for 4 s each, with eye movements continuously recorded. After 25% of randomly selected trials, participants were tasked to verbally caption the scene. b, Analysis pipeline showing that MEG signals were source-projected and analyzed time-locked to fixation events to capture neural dynamics during natural viewing. c, Large language model (LLM) caption embedding space visualization (t-distributed stochastic neighbor embedding (t-SNE)) of the stimulus set. Each dot represents a scene, with colors indicating semantic clusters used for balanced sampling from the NSD scenes (gray dots). d, Distribution of scenes across 60 semantic clusters, demonstrating semantically balanced sampling. Green bars show the proportion of scenes per cluster in the Active Visual Semantics (AVS) dataset compared to the original NSD distribution (gray). Photos from the COCO image dataset/Flickr (ref. 51).

However, an alternative account is also possible.
Based on the observation that ventral stream representations exhibit only limited cross-fixation integration (ref. 12), ventral stream information is largely overwritten as the eyes move on to a new location. Prolonged fixations could therefore reflect a strategic time allocation of the brain to actively stabilize the neural embedding of a fixated patch to support downstream processes, such as memory encoding, before allowing a disruption by the next saccade. We term this alternative account of fixation durations the memory-facilitation hypothesis.

To differentiate between these two hypotheses, we collected a large-scale magnetoencephalography (MEG) dataset in which five participants actively explored 4,080 natural scenes (Fig. 1). The experiment consisted of a scene-captioning task during which participants freely explored each natural scene for 4 s, in 25% of cases followed by a request to provide a verbal semantic description (Fig. 1a). Throughout the experiment, we simultaneously recorded MEG signals and eye movements, allowing us to analyze the neural dynamics time-locked to fixations (Fig. 1b). The scenes were subsampled from the Natural Scenes Dataset (NSD; ref. 13). To create a semantically diverse, yet balanced, stimulus set, we clustere (...truncated)
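
Analyzing neural dynamics time-locked to fixations (Fig. 1b) amounts to cutting a window around each fixation onset out of the continuous recording. The following is a minimal sketch of that idea for a single channel, using plain Python lists. It is not the authors' pipeline: the function names, the window limits, and the choice to skip epochs that run past the recording are all assumptions made for illustration, and the source projection step is not shown.

```python
def fixation_locked_epochs(signal, fixation_onsets, fs, tmin, tmax):
    """Cut one epoch per fixation onset from a single-channel time series.

    signal: list of samples; fixation_onsets: onset times in seconds;
    fs: sampling rate in Hz; tmin, tmax: window limits relative to onset (s).
    Epochs that would extend beyond the recording are skipped (an assumption).
    """
    start_offset = round(tmin * fs)
    stop_offset = round(tmax * fs)
    epochs = []
    for onset in fixation_onsets:
        center = round(onset * fs)
        start, stop = center + start_offset, center + stop_offset
        if start < 0 or stop > len(signal):
            continue
        epochs.append(signal[start:stop])
    return epochs


def average_epochs(epochs):
    """Sample-wise mean across epochs, i.e. a fixation-locked average."""
    n = len(epochs)
    return [sum(samples) / n for samples in zip(*epochs)]


# Hypothetical example: a 10 s ramp sampled at 100 Hz and three fixation onsets.
fs = 100
signal = [float(i) for i in range(10 * fs)]
onsets_s = [1.0, 2.5, 9.99]  # the last onset is too close to the end to epoch

epochs = fixation_locked_epochs(signal, onsets_s, fs, tmin=-0.1, tmax=0.3)
evoked = average_epochs(epochs)
print(len(epochs), len(evoked))
```

Each epoch covers 100 ms before to 300 ms after onset, so later analyses can compare the same latency across fixations of different durations. In practice a dedicated toolbox would usually handle this step across all channels or source vertices.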


This is a preview of a remote PDF: https://www.nature.com/articles/s41593-026-02285-1.pdf
Article home page: https://www.nature.com/articles/s41593-026-02285-1

Philip Sulewski, Carmen Amme, Martin N. Hebart, Peter König, Tim C. Kietzmann. Fixation duration on natural scenes is explained by memory encoding not processing demand, Nature Neuroscience, 2026, DOI: 10.1038/s41593-026-02285-1