Swin-Diff: a single defocus image deblurring network based on diffusion model
Complex & Intelligent Systems (2025) 11:170
https://doi.org/10.1007/s40747-025-01789-w
ORIGINAL ARTICLE
Hanyan Liang1,2,3 · Shuyao Chai1,2,3 · Xixuan Zhao1,2 · Jiangming Kan1,2,3
Received: 27 August 2024 / Accepted: 27 January 2025 / Published online: 17 February 2025
© The Author(s) 2025
Abstract
Single Image Defocus Deblurring (SIDD) remains challenging due to spatially varying blur kernels, particularly in processing
high-resolution images where traditional methods often struggle with artifact generation, detail preservation, and computational efficiency. This paper presents Swin-Diff, a novel architecture integrating diffusion models with Transformer-based
networks for robust defocus deblurring. Our approach employs a two-stage training strategy where a diffusion model generates prior information in a compact latent space, which is then hierarchically fused with intermediate features to guide the
regression model. The architecture incorporates a dual-dimensional self-attention mechanism operating across channel and
spatial domains, enhancing long-range modeling capabilities while maintaining linear computational complexity. Extensive
experiments on three public datasets (DPDD, RealDOF, and RTF) demonstrate Swin-Diff’s superior performance, achieving
average improvements of 1.37% in PSNR, 3.6% in SSIM, 2.3% in MAE, and 25.2% in LPIPS metrics compared to state-of-the-art methods. Our results validate the effectiveness of combining diffusion models with hierarchical attention mechanisms
for high-quality defocus blur removal.
Keywords Diffusion model · Single image defocus deblurring · Image restoration
Introduction
Defocus blur, a prevalent degradation in image acquisition systems, impairs both the perceptual quality of images and the performance of downstream computer vision tasks, including object detection, object
identification, semantic segmentation, and quality assessment [27, 29, 50, 56]. Single Image Defocus Deblurring
(SIDD) has emerged as a fundamental research problem in
computer vision, with the objective of recovering sharp, well-focused images from their defocused counterparts. Recent
progress in deep learning architectures has established a
robust framework for diverse real-world image restoration applications, including military surveillance, medical
imaging diagnostics, intelligent transportation systems, and
beyond [1, 3–7, 14, 25, 31, 32, 51, 54], underscoring the
extensive applicability of defocus deblurring methods.
Hanyan Liang and Shuyao Chai contributed equally.
Corresponding author: Xixuan Zhao
1 School of Technology, Beijing Forestry University, Beijing 100083, China
2 State Key Laboratory of Efficient Production of Forest Resources, Beijing 100083, China
3 Key Laboratory of National Forestry and Grassland Administration on Forestry Equipment and Automation, Beijing 100083, China
In defocus deblurring, there are two
primary paradigms: end-to-end approaches [41, 57] and
two-stage methods [19, 38, 42]. The two-stage paradigm
first detects defocused regions through
edge detection, frequency domain analysis, or deep learning techniques, and then applies targeted deblurring
to the identified regions [20, 37]. It exhibits inherent limitations, including interstage error propagation, computational
inefficiency, and limited scene generalization, which have led to its
gradual supersession by single-stage end-to-end architectures. A pivotal advancement emerged through the work of
Abuolaim and Brown [2], who introduced a pioneering UNet-based
architecture for all-in-focus image prediction utilizing
dual-pixel sensor data, exemplifying the potential of end-to-end approaches; however, the method’s dependence on dual-view inputs inherently constrains its widespread adoption.
Similarly, Restormer [52], another notable regression-based
end-to-end framework, achieves remarkable performance in image restoration tasks but shows limitations in
handling complex, spatially varying blur kernels, particularly
in scenarios requiring fine detail reconstruction in severely
defocused regions. These inherent limitations of regression-based methods have catalyzed research interest in leveraging
pre-trained large models for supervised deblurring tasks,
where diffusion models (DM) have emerged as a particularly
promising framework, exhibiting exceptional capability in
incorporating conditional constraints for image restoration
while enabling progressive recovery of high-quality images
from noise distributions.
Diffusion models (DM) [15] have demonstrated remarkable efficacy in image synthesis [35, 40] and restoration [33,
36]. Their ability to model complex natural image details
through iterative denoising of Gaussian white noise with
parameter sharing presents significant potential. However,
practical implementation faces challenges due to computational overhead and the requirement for multiple inference
steps. Moreover, diffusion models tend to introduce artifacts, which necessitates careful consideration. Building on
previous work [9], we address these limitations by implementing the DM in a low-dimensional latent space, combining
regression-based and DM-based approaches to optimize
both computational efficiency and performance.
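As a rough illustration of why running the diffusion process on a compact latent is cheaper than running it on pixels, the sketch below applies the standard forward noising step of DDPM [15], z_t = sqrt(ᾱ_t) z_0 + sqrt(1 − ᾱ_t) ε, to a short latent vector. It is not the Swin-Diff implementation: the linear noise schedule, the latent length (LATENT_DIM), the number of steps (NUM_STEPS), and all function names are assumptions chosen for illustration only.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced noise variances beta_1..beta_T (illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def diffuse_latent(z0, t, alpha_bars, rng):
    """Draw z_t from q(z_t | z_0) = N(sqrt(alpha_bar_t) * z_0, (1 - alpha_bar_t) * I)."""
    a = alpha_bars[t]
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [signal * z + noise * rng.gauss(0.0, 1.0) for z in z0]


rng = random.Random(0)
LATENT_DIM = 16  # hypothetical size of the compact latent prior
NUM_STEPS = 8    # hypothetical number of diffusion steps

betas = linear_beta_schedule(NUM_STEPS)
alpha_bars = cumulative_alphas(betas)
z0 = [rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)]  # stand-in for an encoded prior
zT = diffuse_latent(z0, NUM_STEPS - 1, alpha_bars, rng)

print(f"values touched per step: {LATENT_DIM} (latent) vs {256 * 256 * 3} (a 256x256 RGB image)")
```

Each denoising step in such a setup operates on LATENT_DIM values rather than on every pixel, which is the source of the efficiency gain described above; the restored image itself is still produced by the regression network.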
While CNN-based attention mechanisms demonstrate
excellence in pattern extraction from large-scale data, their
inherent limitations in receptive field size and fixed weight
structures fundamentally constrain their effectiveness in
deblurring applications [8, 24]. Self-attention (SA) mechanisms [10, 44, 46, 53] and regression-based features,
although capable of capturing long-range pixel interactions,
face significant scalability challenges in high-resolution
image processing due to their quadratic computational complexity. To address these limitations, we propose an efficient
Transformer architecture that incorporates mixed spatial and
channel attention mechanisms, utilizing localized 8×8 spatial windows to optimize performance while maintaining
computational efficiency.
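The scaling argument can be made concrete with a small counting sketch. It compares the number of query-key pairs evaluated by global self-attention with the number evaluated when attention is restricted to non-overlapping 8×8 windows, and shows how a feature map is split into such windows. The function names and feature-map sizes are hypothetical, and the sketch covers only the spatial window partition; it does not model the channel attention branch or any other detail of the proposed architecture.

```python
def global_attention_pairs(height, width):
    """Query-key pairs when every pixel attends to every other pixel."""
    tokens = height * width
    return tokens * tokens


def window_attention_pairs(height, width, window=8):
    """Query-key pairs when attention is restricted to non-overlapping windows."""
    if height % window or width % window:
        raise ValueError("height and width must be multiples of the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


def window_partition(feature_map, window=8):
    """Split an H x W grid (a list of rows) into flattened window token lists."""
    height, width = len(feature_map), len(feature_map[0])
    if height % window or width % window:
        raise ValueError("height and width must be multiples of the window size")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([feature_map[r][c]
                            for r in range(top, top + window)
                            for c in range(left, left + window)])
    return windows


# Hypothetical feature-map sizes, chosen only to show the scaling trend.
for side in (64, 128, 256):
    print(side, global_attention_pairs(side, side), window_attention_pairs(side, side))

# A 16 x 16 grid of pixel indices splits into four windows of 64 tokens each.
grid = [[row * 16 + col for col in range(16)] for row in range(16)]
windows = window_partition(grid)
```

Doubling the side length multiplies the global count by 16 but the windowed count by only 4, i.e. the windowed cost grows linearly with the number of pixels.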
We present Swin-Diff, a novel defocus blur restoration
model employing a two-phase training strategy. The first
phase focuses on latent compression and basic regression
models, utilizing a hierarchical Transformer architecture
with mixed attention mechanisms. A latent encoder (LE)
compresses images into compact latent representations,
which are then integrated with intermediate features through
a hierarchical integration module (HIM). The second phase
leverages these LE-generated prior features to train the
diffusion model, guiding blur removal through HIM during
prediction. Our key contributions include:
1. The development of Swin-Diff for multi-scale defocus
deblurring, combining diffusion models with hier (...truncated)