A reverberation-time-aware DNN approach leveraging spatial information for microphone array dereverberation
Wu et al., EURASIP Journal on Advances in Signal Processing
Bo Wu¹, Minglei Yang¹, Chin-Hui Lee³, Kehuang Li³, Zhen Huang³, Sabato Marco Siniscalchi²,³, Tong Wang¹

¹ National Laboratory of Radar Signal Processing, Xidian University, Xi'an, China
² Department of Telecommunications, University of Enna Kore, Enna, Italy
³ School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, USA
A reverberation-time-aware deep neural network (DNN)-based multi-channel speech dereverberation framework is proposed to handle a wide range of reverberation times (RT60s). There are three key steps in designing a robust system. First, to accomplish simultaneous speech dereverberation and beamforming, we propose a framework, namely DNNSpatial, that selectively concatenates log-power spectral (LPS) input features of reverberant speech from multiple microphones in an array and maps them to the LPS features of the anechoic reference speech with a single DNN. Next, the temporal auto-correlation function of the received signals at different RT60s is investigated, showing that RT60-dependent temporal-spatial contexts are needed in feature selection at the DNNSpatial training stage in order to optimize system performance in diverse reverberant environments. Finally, the RT60 is estimated to select the proper temporal and spatial contexts before feeding the LPS features to the trained DNNs for speech dereverberation. The experimental evidence gathered in this study indicates that the proposed framework outperforms both the state-of-the-art weighted prediction error (WPE) signal processing dereverberation algorithm and conventional DNNSpatial systems that do not take the reverberation time into account, even in extremely weak and severe reverberation conditions. The proposed technique generalizes well to unseen room sizes, array geometries, and loudspeaker positions, and is robust to reverberation-time estimation errors.
Keywords: Deep neural networks (DNNs); Simultaneous speech dereverberation and beamforming; Auto-correlation function; Temporal and spatial contexts; Reverberation-time-aware (RTA)
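As a concrete illustration of the feature construction described in the abstract, the sketch below stacks per-channel LPS frames over a symmetric temporal context and concatenates them across microphone channels. The helper names, the symmetric window, and the edge-clamping behavior are our own illustrative assumptions, not details specified by the paper:

```python
import numpy as np

def lps(stft_frame, eps=1e-12):
    """Log-power spectrum of one complex STFT frame."""
    return np.log(np.abs(stft_frame) ** 2 + eps)

def stack_context(frames, t, context):
    """Concatenate LPS frames t-context..t+context, clamping indices
    at the utterance edges."""
    T = len(frames)
    idx = [min(max(i, 0), T - 1) for i in range(t - context, t + context + 1)]
    return np.concatenate([frames[i] for i in idx])

def dnnspatial_input(channel_lps, t, context):
    """Input vector for frame t: temporal context within each channel,
    then concatenation across the selected microphone channels."""
    return np.concatenate([stack_context(ch, t, context) for ch in channel_lps])
```

In an RT60-aware setting, `context` (and the number of channels concatenated) would be chosen per the estimated reverberation time before the vector is fed to the trained DNN.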
1 Introduction
In hands-free speech communication systems, the acoustic environment can crucially affect the quality and intelligibility of the speech signal acquired by the microphone(s). In fact, the speech signal propagates through the air and is reflected by the walls, the floor, the ceiling, and any object in the room before being picked up by the microphone(s). This propagation results in signal attenuation and spectral distortion, called reverberation, that seriously degrade speech quality and intelligibility.
Many dereverberation techniques have thus been proposed in the past (e.g., [1–5]). One direct way is to estimate an inverse filter of the room impulse response (RIR) [6] to deconvolve the reverberant signal. Wu and Wang [1] and Mosayyebpour [2] designed inverse filters of the RIR by maximizing the kurtosis and the skewness, respectively, of the linear prediction (LP) residual to reduce early reverberation. However, a minimum-phase assumption is often needed, which is almost never satisfied in practice [6]. The RIR can also vary in time and be hard to estimate [7]. Kinoshita et al. [3] estimated the late reverberation using long-term multi-step linear prediction and then reduced the late-reverberation effect by employing spectral subtraction.
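The spectral-subtraction step in the approach above can be sketched as follows. This is a minimal illustration, not Kinoshita et al.'s exact formulation: the power-domain subtraction, the spectral floor value, and the helper name are our assumptions. Given an estimate of the late-reverberation power in each time-frequency bin, the bin magnitude is attenuated while the reverberant phase is kept:

```python
import numpy as np

def suppress_late_reverb(stft, late_power_est, floor=0.05, eps=1e-12):
    """Spectral subtraction: remove the estimated late-reverberation power
    from each T-F bin, floor the result to avoid negative power, and
    reuse the reverberant phase."""
    power = np.abs(stft) ** 2
    gain = np.maximum(power - late_power_est, floor * power) / (power + eps)
    return np.sqrt(gain) * stft
```

The floor keeps strongly reverberant bins from being zeroed out entirely, which would otherwise introduce musical-noise artifacts.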
Recently, due to their strong regression capabilities, deep neural networks (DNNs) [8, 9] have also been utilized in speech dereverberation. In [10, 11], a DNN-based single-microphone dereverberation system was proposed that adopts a sigmoid activation function at the output layer and min-max normalization of the target features. An improved DNN dereverberation system we proposed recently [12] adopts a linear output layer and globally normalizes the target features to zero mean and unit variance, achieving state-of-the-art performance.
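The global target normalization mentioned above can be sketched as follows, with statistics pooled over the whole training corpus; the function names are ours, not from the paper:

```python
import numpy as np

def global_mvn(targets, eps=1e-8):
    """Normalize target LPS features to zero mean and unit variance
    using global (corpus-level) statistics."""
    mean = targets.mean(axis=0)
    std = targets.std(axis=0) + eps
    return (targets - mean) / std, mean, std

def denormalize(normed, mean, std):
    """Map DNN outputs (from a linear output layer) back to the LPS domain."""
    return normed * std + mean
```

A linear output layer pairs naturally with these unbounded normalized targets, whereas a sigmoid output layer requires squashing the targets into [0, 1] via min-max normalization, as in [10, 11].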
Microphone array signal processing, which utilizes spatial information, is another fundamentally important way to enhance speech acquisition in noisy environments [13, 14]. It has recently been shown that exploiting the time-varying nature of speech signals can achieve high-quality speech dereverberation based on multi-channel linear prediction (MCLP) [15–17]. Its efficient implementation in the time-frequency domain is often referred to as the weighted prediction error (WPE) method [15, 16, 18]. The work in [19] designed a feed-forward neural network for mapping a microphone array's spatial features into a T-F mask, and [20] utilized a DNN-based multi-channel speech enhancement technique, where the speec (...truncated)