Selfredepth
Journal of Real-Time Image Processing
(2024) 21:124
https://doi.org/10.1007/s11554-024-01491-z
RESEARCH
SelfReDepth
Self-supervised real-time depth restoration for consumer-grade sensors
Alexandre Duarte1 · Francisco Fernandes2 · João M. Pereira1,2 · Catarina Moreira2,4 · Jacinto C. Nascimento1,3 ·
Joaquim Jorge1,2
Received: 14 September 2023 / Accepted: 3 June 2024
© The Author(s) 2024
Abstract
Depth maps produced by consumer-grade sensors suffer from inaccurate measurements and missing data from either system or scene-specific sources. Data-driven denoising algorithms can mitigate such problems; however, they require vast
amounts of ground truth depth data. Recent research has tackled this limitation using self-supervised learning techniques,
but it requires multiple RGB-D sensors. Moreover, most existing approaches focus on denoising single isolated depth maps
or specific subjects of interest, highlighting a need for methods that can effectively denoise depth maps in real-time dynamic
environments. This paper extends state-of-the-art approaches for depth-denoising commodity depth devices, proposing
SelfReDepth, a self-supervised deep learning technique for depth restoration, via denoising and hole-filling by inpainting of
full-depth maps captured with RGB-D sensors. The algorithm targets depth data in video streams, utilizing multiple sequential
depth frames coupled with color data to achieve high-quality depth videos with temporal coherence. Finally, SelfReDepth
is designed to be compatible with various RGB-D sensors and usable in real-time scenarios as a pre-processing step before
applying other depth-dependent algorithms. Our results on real-world datasets demonstrate that our approach runs in real time
and outperforms state-of-the-art methods in denoising and restoration performance at over 30 fps on commercial depth cameras, with potential benefits for augmented and mixed-reality applications.
Keywords Deep learning · Self-supervised learning · Image denoising · Image reconstruction · RGB-D sensors
Mathematics Subject Classification 68T07 · 94A08
* Joaquim Jorge
Alexandre Duarte
Francisco Fernandes
João M. Pereira
Catarina Moreira
1
Instituto Superior Técnico, Universidade de Lisboa (IST-UL), 1000‑029 Lisbon, Portugal
2
Instituto de Engenharia de Sistemas e Computadores,
Investigação e Desenvolvimento (INESC-ID),
1000‑029 Lisbon, Portugal
3
Institute for System and Robotics (ISR), Instituto
Superior Técnico, Universidade de Lisboa (IST-UL),
1049‑001 Lisbon, Portugal
4
Human Technology Institute, University of Technology
Sydney, Sydney, Australia
Jacinto C. Nascimento
1 Introduction
Depth information is pivotal in many applications, from
digital entertainment to virtual and augmented reality [21].
It is the backbone for digital object and environment modeling [8, 42] and cost-effective motion capture solutions [18].
Pose estimation derived from depth data finds utility
in diverse fields such as physiotherapy [5, 17], video
surveillance [34, 63], and human–computer interaction
[46]. Depth data also aids autonomous navigation [15]
and enhances security measures through facial recognition
[43].
Consumer depth devices, often employing low-cost
LiDAR, structured light, or time-of-flight technologies,
are instrumental in these applications. Among these, the
Microsoft Kinect v2 stands out for its balance of quality,
availability, and affordability. However, consumer-grade sensors like Kinect v2 still grapple with noisy and
incomplete data issues.
Efforts to address these quality issues range from traditional
smoothing techniques to data-driven machine learning
algorithms. Many adopt supervised learning with neural
networks, training models on noisy-clean data pairs $(\hat{x}, y)$
to minimize empirical risk.
However, acquiring clean training data is non-trivial.
Recent attention has thus shifted towards self-supervised
techniques, such as Noise2Noise [27], which leverages
noisy-noisy data pairs $(\hat{x}, \hat{y})$ for training,
minimizing the cost function
$g(\theta) = \arg\min_{\theta} \sum_{i}^{N} L\left(f_{\theta}(\hat{x}_i), \hat{y}_i\right)$,
where the network $f_{\theta}(\hat{x}_i)$ is parameterized by $\theta$.
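To make the noisy-noisy training idea concrete, the following is a minimal sketch, not the SelfReDepth network or the Noise2Noise implementation. It assumes a toy setting chosen purely for illustration: scalar "depth" values, a linear model $f_{\theta}(\hat{x}) = a\hat{x} + b$, an L2 loss, and independent zero-mean Gaussian noise with a hypothetical level `sigma`. Under those assumptions, fitting against noisy targets should give nearly the same parameters as fitting against clean targets.

```python
import random


def fit_line(inputs, targets):
    """Closed-form least squares for f_theta(x) = a * x + b (the L2-loss minimizer)."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_t = sum(targets) / n
    cov = sum((x - mean_x) * (t - mean_t) for x, t in zip(inputs, targets))
    var = sum((x - mean_x) ** 2 for x in inputs)
    a = cov / var
    b = mean_t - a * mean_x
    return a, b


rng = random.Random(0)   # fixed seed so the sketch is repeatable
n_samples = 20000
sigma = 0.1              # assumed noise level, illustrative only

# Clean values y_i (unavailable in practice) and two independent noisy observations.
clean = [rng.uniform(0.0, 1.0) for _ in range(n_samples)]
noisy_input = [y + rng.gauss(0.0, sigma) for y in clean]    # x-hat_i
noisy_target = [y + rng.gauss(0.0, sigma) for y in clean]   # y-hat_i

# Supervised fit uses (x-hat, y); Noise2Noise-style fit uses (x-hat, y-hat).
theta_supervised = fit_line(noisy_input, clean)
theta_n2n = fit_line(noisy_input, noisy_target)

print("supervised  (a, b):", theta_supervised)
print("noisy-noisy (a, b):", theta_n2n)
```

The two parameter pairs agree closely because the target noise is zero-mean and independent of the input, so it does not shift the L2 minimizer in expectation. Whether this carries over to real sensor noise depends on how well that assumption holds.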
Despite their efficacy in various domains, self-supervised methods for depth data restoration remain
underexplored, largely due to the intricate noise patterns
in consumer-grade sensors.
Our paper introduces SelfReDepth (SReD), a novel
self-supervised, real-time depth data restoration technique
optimized for the Kinect v2. SelfReDepth introduces a
convolutional autoencoder architecture inspired by U-Net,
specifically designed to process sequential depth frames
efficiently. This design choice directly responds to the
need for maintaining temporal coherence in dynamic
scenes, a gap often left unaddressed by traditional single-frame denoising approaches. Furthermore, SelfReDepth
incorporates RGB data into the depth restoration process
to improve the inpainting of missing pixels, supplying
contextual color information that depth data alone
lacks. Our contributions are fourfold: (1) We employ a
convolutional autoencoder with an architecture akin to
U-Net [47] to process sequential frames. (2) Our method
achieves real-time performance and temporal coherence
by adopting a video-centric approach. (3) We incorporate
RGB data to guide an inpainting algorithm during training,
enhancing the model’s ability to complete missing depth
pixels. (4) Our approach maintains a 30 fps real-time rate
while outperforming state-of-the-art techniques.
2 Background and related work
In recent years, depth-sensing technology has emerged
as a pivotal tool in various applications, from gaming to
augmented reality and robotics. The promise of capturing
the third dimension, depth, has opened up new horizons in
computer vision, augmented reality, and human–computer
interaction. Next, we introduce some concepts and
methodologies related to the present work.
Denoising vs. inpainting: We stress the distinction between
denoising and inpainting, since both terms are used
throughout this work and correspond to important stages
of the proposed methodology. Denoising
and inpainting are two core image processing problems.
As the name suggests, denoising removes noise from an
observed noisy image, while inpainting aims to estimate
missing image pixels. Both denoising and inpainting are
inverse problems: the common goal is to infer an underlying
image from incomplete/imperfect observations. Formally, in
both problems the observed image $Y \in \mathbb{R}^{M' \times N'}$ is modeled
as $Y = F(X) + \eta$, where $X \in \mathbb{R}^{M \times N}$ is the unknown (original)
image and $\eta$ is the observed noise. The difference between
denoising and inpainting emerges from the mapping
$F : \mathbb{R}^{M \times N} \mapsto \mathbb{R}^{M' \times N'}$ that expresses a linear degradation
operator that could represent a convolu (...truncated)