Swin-Diff: a single defocus image deblurring network based on diffusion model (pdf)


Complex & Intelligent Systems (2025) 11:170
https://doi.org/10.1007/s40747-025-01789-w

ORIGINAL ARTICLE

Swin-Diff: a single defocus image deblurring network based on diffusion model

Hanyan Liang1,2,3 · Shuyao Chai1,2,3 · Xixuan Zhao1,2 · Jiangming Kan1,2,3

Received: 27 August 2024 / Accepted: 27 January 2025 / Published online: 17 February 2025
© The Author(s) 2025

Abstract

Single Image Defocus Deblurring (SIDD) remains challenging due to spatially varying blur kernels, particularly in processing high-resolution images where traditional methods often struggle with artifact generation, detail preservation, and computational efficiency. This paper presents Swin-Diff, a novel architecture integrating diffusion models with Transformer-based networks for robust defocus deblurring. Our approach employs a two-stage training strategy where a diffusion model generates prior information in a compact latent space, which is then hierarchically fused with intermediate features to guide the regression model. The architecture incorporates a dual-dimensional self-attention mechanism operating across channel and spatial domains, enhancing long-range modeling capabilities while maintaining linear computational complexity. Extensive experiments on three public datasets (DPDD, RealDOF, and RTF) demonstrate Swin-Diff’s superior performance, achieving average improvements of 1.37% in PSNR, 3.6% in SSIM, 2.3% in MAE, and 25.2% in LPIPS metrics compared to state-of-the-art methods. Our results validate the effectiveness of combining diffusion models with hierarchical attention mechanisms for high-quality defocus blur removal.
Keywords Diffusion model · Single image defocus deblurring · Image restoration

Hanyan Liang and Shuyao Chai contributed equally.

✉ Xixuan Zhao
Hanyan Liang
Shuyao Chai
Jiangming Kan

1 School of Technology, Beijing Forestry University, Beijing 100083, China
2 State Key Laboratory of Efficient Production of Forest Resources, Beijing 100083, China
3 Key Laboratory of National Forestry and Grassland Administration on Forestry Equipment and Automation, Beijing 100083, China

Introduction

Defocus blur, a prevalent degradation artifact in image acquisition systems, poses significant challenges to both the perceptual quality of images and the performance of downstream computer vision tasks, including object detection, object identification, semantic segmentation, and quality assessment [27, 29, 50, 56]. Single Image Defocus Deblurring (SIDD) has emerged as a fundamental research problem in computer vision, with the objective of recovering sharp, well-focused images from their defocused counterparts. Recent progress in deep learning architectures has established a robust framework for diverse real-world image restoration applications, including military surveillance, medical imaging diagnostics, intelligent transportation systems, and beyond [1, 3–7, 14, 25, 31, 32, 51, 54], underscoring the extensive applicability of defocus deblurring methods.

In the field of defocus deblurring, there are two primary paradigms: end-to-end approaches [41, 57] and two-stage methods [19, 38, 42]. The two-stage paradigm first detects defocused regions through edge detection, frequency-domain analysis, or deep learning techniques, and then applies targeted deblurring to the identified regions [20, 37]. It exhibits inherent limitations, including inter-stage error propagation, computational inefficiency, and limited scene generalization, which have led to its gradual replacement by single-stage end-to-end architectures.
A pivotal advancement emerged through the work of Abuolaim and Brown [2], who introduced a pioneering UNet-based architecture for all-in-focus image prediction utilizing dual-pixel sensor data, exemplifying the potential of end-to-end approaches; however, the method’s dependence on dual-view inputs inherently constrains its widespread adoption. Similarly, Restormer [52], another notable regression-based end-to-end framework, achieves remarkable performance in image restoration tasks but demonstrates limitations in handling complex, spatially varying blur kernels, particularly in scenarios requiring fine detail reconstruction in severely defocused regions. These inherent limitations of regression-based methods have catalyzed research interest in leveraging pre-trained large models for supervised deblurring tasks, where diffusion models (DMs) have emerged as a particularly promising framework, exhibiting exceptional capability in incorporating conditional constraints for image restoration while enabling progressive recovery of high-quality images from noise distributions.

Diffusion models [15] have demonstrated remarkable efficacy in image synthesis [35, 40] and restoration [33, 36]. Their ability to model complex natural image details through iterative denoising of Gaussian white noise with parameter sharing presents significant potential. However, practical implementation faces challenges due to computational overhead and the requirement for multiple inference steps. Moreover, the tendency of diffusion models to introduce artifacts necessitates careful consideration. Building on previous work [9], we address these limitations by implementing the DM in a low-dimensional latent space, synthesizing regression-based and DM-based approaches to optimize both computational efficiency and performance.
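To make the idea of running diffusion in a compact latent space concrete, the following minimal sketch applies the standard DDPM forward (noising) process, z_t = sqrt(ᾱ_t) z_0 + sqrt(1 − ᾱ_t) ε, to a small latent vector. It is an illustration under stated assumptions, not the authors' implementation: the linear schedule, the endpoint values 0.1 and 0.99, the 8 steps, the 4-dimensional toy latent, and all function names are hypothetical choices made here for readability.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=0.1, beta_end=0.99):
    """Linearly spaced noise variances beta_1..beta_T (illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def diffuse_latent(z0, t, alpha_bars, rng):
    """Sample z_t from N(sqrt(alpha_bar_t) * z0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * z + noise * rng.gauss(0.0, 1.0) for z in z0]


rng = random.Random(0)                    # fixed seed for repeatability
alpha_bars = cumulative_alphas(linear_beta_schedule(num_steps=8))
z0 = [0.5, -1.0, 0.25, 2.0]               # toy latent prior (hypothetical)
z_mid = diffuse_latent(z0, 3, alpha_bars, rng)   # partially noised latent
z_last = diffuse_latent(z0, 7, alpha_bars, rng)  # almost pure Gaussian noise
```

Because the latent has only a handful of dimensions, each noising or denoising step touches far fewer values than a step on the full image would, which is the efficiency argument made above. With this aggressive schedule, ᾱ at the final step is close to zero, so a short chain is enough to reach approximately Gaussian noise.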
While CNN-based attention mechanisms demonstrate excellence in pattern extraction from large-scale data, their inherent limitations in receptive field size and fixed weight structures fundamentally constrain their effectiveness in deblurring applications [8, 24]. Self-attention (SA) mechanisms [10, 44, 46, 53] and regression-based features, although capable of capturing long-range pixel interactions, face significant scalability challenges in high-resolution image processing due to their quadratic computational complexity. To address these limitations, we propose an efficient Transformer architecture that incorporates mixed spatial and channel attention mechanisms, utilizing localized 8×8 spatial windows to optimize performance while maintaining computational efficiency.

We present Swin-Diff, a novel defocus blur restoration model employing a two-phase training strategy. The first phase focuses on latent compression and basic regression models, utilizing a hierarchical Transformer architecture with mixed attention mechanisms. A latent encoder (LE) compresses images into compact latent representations, which are then integrated with intermediate features through a hierarchical integration module (HIM). The second phase leverages these LE-generated prior features to train the diffusion model, guiding blur removal through HIM during prediction. Our key contributions include:

1. The development of Swin-Diff for multi-scale defocus deblurring, combining diffusion models with hier (...truncated)