A reverberation-time-aware DNN approach leveraging spatial information for microphone array dereverberation

EURASIP Journal on Advances in Signal Processing, Dec 2017

A reverberation-time-aware deep-neural-network (DNN)-based multi-channel speech dereverberation framework is proposed to handle a wide range of reverberation times (RT60s). There are three key steps in designing a robust system. First, to accomplish simultaneous speech dereverberation and beamforming, we propose a framework, namely DNNSpatial, that selectively concatenates log-power spectral (LPS) input features of reverberant speech from multiple microphones in an array and maps them into the expected output LPS features of anechoic reference speech with a single deep neural network (DNN). Next, the temporal auto-correlation function of the received signals at different RT60s is investigated to show that RT60-dependent temporal-spatial contexts in feature selection are needed in the DNNSpatial training stage in order to optimize system performance in diverse reverberant environments. Finally, the RT60 is estimated to select the proper temporal and spatial contexts before feeding the LPS features to the trained DNNs for speech dereverberation. The experimental evidence gathered in this study indicates that the proposed framework outperforms the state-of-the-art signal-processing dereverberation algorithm, weighted prediction error (WPE), as well as conventional DNNSpatial systems that do not take the reverberation time into account, even under extremely weak and severe reverberant conditions. The proposed technique generalizes well to unseen room sizes, array geometries, and loudspeaker positions, and is robust to reverberation time estimation error.
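The first step above, building the DNN input by stacking LPS features from several microphones over a temporal context window, can be sketched as follows. This is a minimal numpy illustration under common assumptions (Hann-windowed frames, symmetric context, edge padding); the function names and parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def lps_features(x, frame_len=512, hop=256, eps=1e-12):
    """Log-power spectrum of a signal, framed with a Hann window."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    return np.log(np.abs(spec) ** 2 + eps)          # (n_frames, n_bins)

def concat_context(channel_lps, n_ctx):
    """Concatenate LPS features from all microphones, then stack a
    symmetric temporal context of +/- n_ctx frames per DNN input."""
    stacked = np.concatenate(channel_lps, axis=1)   # (T, n_mics * n_bins)
    padded = np.pad(stacked, ((n_ctx, n_ctx), (0, 0)), mode='edge')
    T = stacked.shape[0]
    return np.stack([padded[t:t + 2 * n_ctx + 1].ravel() for t in range(T)])

# two simulated microphone channels, 1 s at 16 kHz (white noise stand-in)
rng = np.random.default_rng(0)
mics = [rng.standard_normal(16000) for _ in range(2)]
feats = [lps_features(m) for m in mics]
X = concat_context(feats, n_ctx=3)   # (frames, 7 frames * 2 mics * 257 bins)
print(X.shape)
```

In a reverberation-time-aware setup, `n_ctx` (and the subset of microphones concatenated) would be chosen as a function of the estimated RT60 rather than fixed.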


Alternatively, you can download the file and open it with any standalone PDF reader:

https://link.springer.com/content/pdf/10.1186%2Fs13634-017-0516-6.pdf


Bo Wu (1), Minglei Yang (1), Chin-Hui Lee (3), Kehuang Li (3), Zhen Huang (3), Sabato Marco Siniscalchi (2, 3), Tong Wang (1)

(1) National Laboratory of Radar Signal Processing, Xidian University, Xi'an, China
(2) Department of Telecommunications, University of Enna Kore, Enna, Italy
(3) School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, USA
Keywords: Deep neural networks (DNNs); Simultaneous speech dereverberation and beamforming; Auto-correlation function; Temporal and spatial contexts; Reverberation-time-aware (RTA)

1 Introduction

In hands-free speech communication systems, the acoustic environment can crucially affect the quality and intelligibility of the speech signal acquired by the microphone(s). The speech signal propagates through the air and is reflected by the walls, the floor, the ceiling, and any object in the room before being picked up by the microphone(s). This propagation results in signal attenuation and spectral distortion, called reverberation, which seriously degrades speech quality and intelligibility. Many dereverberation techniques have thus been proposed in the past (e.g., [1-5]). One direct approach is to estimate an inverse filter of the room impulse response (RIR) [6] and deconvolve the reverberant signal with it. Wu and Wang [1] and Mosayyebpour et al. [2] designed inverse filters of the RIR by maximizing, respectively, the kurtosis and the skewness of the linear prediction (LP) residual to reduce early reverberation. However, a minimum-phase assumption is often needed, which is almost never satisfied in practice [6]. The RIR can also vary over time and be hard to estimate [7]. Kinoshita et al. [3] estimated the late reverberation using long-term multi-step linear prediction and then reduced its effect by spectral subtraction.
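The multi-step (delayed) linear prediction idea of Kinoshita et al. [3] can be sketched in a few lines: predict each sample only from samples at least a fixed delay in the past, treat the prediction as a late-reverberation estimate, and subtract its magnitude spectrum frame by frame. This is a simplified single-channel numpy sketch, not the paper's implementation; the function names, orders, and the spectral floor are illustrative assumptions.

```python
import numpy as np

def delayed_lp_late_reverb(x, order=20, delay=32):
    """Estimate late reverberation via multi-step linear prediction:
    predict x[n] only from samples at least `delay` taps in the past."""
    T = len(x)
    start = delay + order - 1
    A = np.stack([x[n - delay - order + 1:n - delay + 1][::-1]
                  for n in range(start, T)])
    w, *_ = np.linalg.lstsq(A, x[start:], rcond=None)
    late = np.zeros(T)
    late[start:] = A @ w
    return late

def spectral_subtract(x, late, frame=512, hop=256, floor=0.1):
    """Subtract the late-reverberation magnitude spectrum per frame,
    keeping the reverberant phase, with overlap-add synthesis."""
    win = np.hanning(frame)
    out = np.zeros(len(x))
    norm = np.zeros(len(x)) + 1e-8
    for i in range(1 + (len(x) - frame) // hop):
        s = slice(i * hop, i * hop + frame)
        X, L = np.fft.rfft(x[s] * win), np.fft.rfft(late[s] * win)
        mag = np.maximum(np.abs(X) - np.abs(L), floor * np.abs(X))
        out[s] += np.fft.irfft(mag * np.exp(1j * np.angle(X)), frame) * win
        norm[s] += win ** 2
    return out / norm

rng = np.random.default_rng(1)
x = rng.standard_normal(8000)          # stand-in for a reverberant signal
late = delayed_lp_late_reverb(x)
y = spectral_subtract(x, late)
```

The `delay` parameter separates the early reflections (left untouched) from the late tail targeted by the predictor.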
Recently, owing to their strong regression capabilities, deep neural networks (DNNs) [8, 9] have also been applied to speech dereverberation. In [10, 11], a DNN-based single-microphone dereverberation system was proposed that adopts a sigmoid activation function at the output layer and min-max normalization of the target features. An improved DNN dereverberation system we proposed recently [12] adopted a linear output layer and globally normalized the target features to zero mean and unit variance, achieving state-of-the-art performance. Microphone array signal processing, which utilizes spatial information, is another fundamentally important way to enhance speech acquisition in noisy environments [13, 14]. It has recently been shown that exploiting the time-varying nature of speech signals enables high-quality speech dereverberation based on multi-channel linear prediction (MCLP) [15-17]. Its efficient time-frequency-domain implementation is often referred to as the weighted prediction error (WPE) method [15, 16, 18]. The work in [19] designed a feed-forward neural network that maps a microphone array's spatial features into a T-F mask, and [20] utilized a DNN-based multichannel speech enhancement technique, where the speec (...truncated)
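To make the WPE baseline mentioned above concrete, the following is a minimal single-channel, single-frequency-bin sketch of the iterative scheme: alternately estimate the desired signal's per-frame variance and solve a variance-weighted delayed linear prediction in the STFT domain. This is a didactic simplification of WPE, not the multi-channel implementation evaluated in the paper; tap count, delay, and iteration count are illustrative.

```python
import numpy as np

def wpe_single_bin(Y, taps=10, delay=3, iters=3, eps=1e-8):
    """WPE-style dereverberation for one STFT frequency bin (1 channel)."""
    T = len(Y)
    # delayed tap matrix: column k holds Y shifted by (delay + k) frames
    Ybar = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        Ybar[delay + k:, k] = Y[:T - delay - k]
    D = Y.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(D) ** 2, eps)   # desired-signal variance
        W = Ybar / lam[:, None]
        R = W.conj().T @ Ybar                   # weighted correlation matrix
        p = W.conj().T @ Y
        g = np.linalg.solve(R + eps * np.eye(taps), p)
        D = Y - Ybar @ g                        # subtract predicted late tail
    return D

# synthetic bin: a direct component plus a crude predictable "late" tail
rng = np.random.default_rng(2)
d = rng.standard_normal(200) + 1j * rng.standard_normal(200)
Y = d.copy()
Y[5:] += 0.7 * d[:-5]
D = wpe_single_bin(Y, taps=8, delay=3)
```

The prediction delay keeps the early part of the response (direct path and early reflections) out of the predictor, so only the late tail is removed.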



Bo Wu, Minglei Yang, Kehuang Li, Zhen Huang, Sabato Marco Siniscalchi, Tong Wang, Chin-Hui Lee. A reverberation-time-aware DNN approach leveraging spatial information for microphone array dereverberation. EURASIP Journal on Advances in Signal Processing, Volume 2017, Issue 1, Article 81, 2017. DOI: 10.1186/s13634-017-0516-6