Selfredepth (pdf)


Journal of Real-Time Image Processing (2024) 21:124
https://doi.org/10.1007/s11554-024-01491-z

RESEARCH

Selfredepth
Self-supervised real-time depth restoration for consumer-grade sensors

Alexandre Duarte1 · Francisco Fernandes2 · João M. Pereira1,2 · Catarina Moreira2,4 · Jacinto C. Nascimento1,3 · Joaquim Jorge1,2

Received: 14 September 2023 / Accepted: 3 June 2024
© The Author(s) 2024

Abstract

Depth maps produced by consumer-grade sensors suffer from inaccurate measurements and missing data caused by either system or scene-specific sources. Data-driven denoising algorithms can mitigate such problems; however, they require vast amounts of ground truth depth data. Recent research has tackled this limitation using self-supervised learning techniques, but it requires multiple RGB-D sensors. Moreover, most existing approaches focus on denoising single isolated depth maps or specific subjects of interest, highlighting a need for methods that can effectively denoise depth maps in real-time dynamic environments. This paper extends state-of-the-art approaches for denoising depth data from commodity depth devices, proposing SelfReDepth, a self-supervised deep learning technique for depth restoration through denoising and hole-filling by inpainting of full depth maps captured with RGB-D sensors. The algorithm targets depth data in video streams, utilizing multiple sequential depth frames coupled with color data to achieve high-quality depth videos with temporal coherence. Finally, SelfReDepth is designed to be compatible with various RGB-D sensors and usable in real-time scenarios as a pre-processing step before applying other depth-dependent algorithms. Our results on real-world datasets demonstrate the approach's real-time performance and show that it outperforms state-of-the-art methods in denoising and restoration performance at over 30 fps on commercial depth cameras, with potential benefits for augmented and mixed-reality applications.
Keywords Deep learning · Self-supervised learning · Image denoising · Image reconstruction · RGB-D sensors

Mathematics Subject Classification 68T07 · 94A08

* Joaquim Jorge

1 Instituto Superior Técnico, Universidade de Lisboa (IST-UL), 1000‑029 Lisbon, Portugal
2 Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento (INESC-ID), 1000‑029 Lisbon, Portugal
3 Institute for System and Robotics (ISR), Instituto Superior Técnico, Universidade de Lisboa (IST-UL), 1049‑001 Lisbon, Portugal
4 Human Technology Institute, University of Technology Sydney, Sydney, Australia

1 Introduction

Depth information is pivotal in many applications, from digital entertainment to virtual and augmented reality [21]. It is the backbone for digital object and environment modeling [8, 42] and cost-effective motion capture solutions [18]. Pose estimation derived from depth data finds utility in diverse fields such as physiotherapy [5, 17], video surveillance [34, 63], and human–computer interaction [46]. Depth data also aids autonomous navigation [15] and enhances security measures through facial recognition [43].

Consumer depth devices, often employing low-cost LiDAR, structured light, or time-of-flight technologies, are instrumental in these applications. Among these, the Microsoft Kinect v2 stands out for its balance of quality, availability, and affordability. However, consumer-grade sensors like the Kinect v2 still produce noisy and incomplete data.

Efforts to address these quality issues range from traditional smoothing techniques to data-driven machine learning algorithms. Many adopt supervised learning with neural networks, training models on noisy-clean data pairs $(\hat{x}, y)$ to minimize empirical risk. However, acquiring clean training data is non-trivial.
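As a concrete illustration of supervised training on noisy-clean pairs, the following minimal sketch fits a denoiser by empirical risk minimization with a squared-error loss. It is not the paper's model: a five-tap linear filter stands in for the neural network, synthetic sinusoidal rows stand in for depth data, and every name (`windows`, `empirical_risk`, `theta`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: N clean 1-D rows y_i and their noisy observations x_hat_i.
N, width, taps = 200, 64, 5
t = np.linspace(0.0, 1.0, width)
phases = rng.uniform(0.0, 2.0 * np.pi, size=(N, 1))
clean = 1.5 + 0.5 * np.sin(2.0 * np.pi * t + phases)    # targets y_i, shape (N, width)
noisy = clean + rng.normal(0.0, 0.1, size=clean.shape)  # inputs x_hat_i


def windows(signal_batch, taps):
    """Return every length-`taps` neighborhood (edge-padded) as one row."""
    pad = taps // 2
    padded = np.pad(signal_batch, ((0, 0), (pad, pad)), mode="edge")
    view = np.lib.stride_tricks.sliding_window_view(padded, taps, axis=1)
    return view.reshape(-1, taps)


def empirical_risk(pred, target):
    """Mean squared error over all samples (the empirical risk for an L2 loss)."""
    return float(np.mean((pred - target) ** 2))


# Fit the filter weights theta that minimize the empirical risk on (noisy, clean) pairs.
A = windows(noisy, taps)   # shape (N * width, taps)
b = clean.reshape(-1)      # shape (N * width,)
theta, *_ = np.linalg.lstsq(A, b, rcond=None)

risk_noisy = empirical_risk(noisy, clean)
risk_denoised = empirical_risk((A @ theta).reshape(clean.shape), clean)
print(f"risk before: {risk_noisy:.5f}, after: {risk_denoised:.5f}")
```

Because the identity filter belongs to this hypothesis class, the fitted filter cannot have a higher training risk than the raw noisy input. The sketch also makes the practical obstacle visible: the fit needs the `clean` array, which a real sensor does not provide.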
Recent attention has thus shifted towards self-supervised techniques, such as Noise2Noise [27], which leverages noisy-noisy data pairs $(\hat{x}, \hat{y})$ for training, minimizing the cost function $g(\theta) = \arg\min_{\theta} \sum_{i}^{N} L(f_{\theta}(\hat{x}_i), \hat{y}_i)$, where the network $f_{\theta}(\hat{x}_i)$ is parameterized by $\theta$.

Despite their efficacy in various domains, self-supervised methods for depth data restoration remain underexplored, largely due to the intricate noise patterns in consumer-grade sensors. Our paper introduces SelfReDepth (SReD), a novel self-supervised, real-time depth data restoration technique optimized for the Kinect v2. SelfReDepth introduces a convolutional autoencoder architecture inspired by U-Net, specifically designed to process sequential depth frames efficiently. This design choice directly responds to the need for maintaining temporal coherence in dynamic scenes, a gap often left unaddressed by traditional single-frame denoising approaches. Furthermore, SelfReDepth incorporates RGB data into the depth restoration process, which significantly improves the inpainting of missing pixels by supplying color context that depth data alone lacks.

Our contributions are fourfold: (1) We employ a convolutional autoencoder with an architecture akin to U-Net [47] to process sequential frames. (2) Our method achieves real-time performance and temporal coherence by adopting a video-centric approach. (3) We incorporate RGB data to guide an inpainting algorithm during training, enhancing the model’s ability to complete missing depth pixels. (4) Our approach maintains a 30 fps real-time rate while outperforming state-of-the-art techniques.

2 Background and related work

In recent years, depth-sensing technology has emerged as a pivotal tool in various applications, from gaming to augmented reality and robotics.
The promise of capturing the third dimension, depth, has opened up new horizons in computer vision, augmented reality, and human–computer interaction. Next, we introduce some concepts and methodologies related to the present work.

Denoising vs. inpainting: The distinction between denoising and inpainting deserves emphasis, as both terms are used throughout this work and name important stages of the proposed methodology. Denoising and inpainting are two core image processing problems. As the name suggests, denoising removes noise from an observed noisy image, while inpainting aims to estimate missing image pixels. Both are inverse problems: the common goal is to infer an underlying image from incomplete or imperfect observations. Formally, in both problems the observed image $Y \in \mathbb{R}^{M' \times N'}$ is modeled as $Y = F(X) + \eta$, where $X \in \mathbb{R}^{M \times N}$ is the unknown (original) image and $\eta$ is the observed noise. The difference between denoising and inpainting emerges from the mapping $F : \mathbb{R}^{M \times N} \mapsto \mathbb{R}^{M' \times N'}$ that expresses a linear degradation operator that could represent a convolu (...truncated)