An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition

PLOS ONE, Aug 2017

Language recognition systems based on bottleneck features have recently become the state of the art in this research field, as shown by their success in the last Language Recognition Evaluation (LRE 2015) organized by NIST (U.S. National Institute of Standards and Technology). This type of system is based on a deep neural network (DNN) trained to discriminate between phonetic units, i.e. trained for the task of automatic speech recognition (ASR). This DNN compresses information in one of its layers, known as the bottleneck (BN) layer, which is used to obtain a new frame-level representation of the audio signal. This representation has proven useful for the task of language identification (LID). Thus, bottleneck features are used as input to the language recognition system instead of a classical parameterization of the signal based on cepstral feature vectors such as MFCCs (Mel Frequency Cepstral Coefficients). Despite the success of this approach in language recognition, there is a lack of studies systematically analyzing how the topology of the DNN influences the performance of bottleneck feature-based language recognition systems. In this work, we try to fill this gap, analyzing language recognition results for different topologies of the DNN used to extract the bottleneck features, comparing them with each other and against a reference system based on a more classical cepstral representation of the input signal with a total variability model. This way, we obtain useful knowledge about how the DNN configuration influences the performance of bottleneck feature-based language recognition systems.
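To make the extraction pipeline concrete, below is a minimal PyTorch sketch of such a DNN: a feed-forward network trained to classify phonetic units, with a narrow linear bottleneck layer whose per-frame activations are reused as features for the language recognizer. The layer sizes, activations, and number of phonetic targets are illustrative assumptions, not the exact topologies evaluated in the paper.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """Feed-forward ASR DNN with a bottleneck (BN) layer.

    Trained with cross-entropy against frame-level phonetic targets;
    after training, the classification head is discarded and the
    bottleneck activations become the new frame representation.
    All sizes below are illustrative, not the paper's configurations.
    """
    def __init__(self, n_inputs=440, n_hidden=1500, n_bottleneck=80, n_phones=3000):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(n_inputs, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
        )
        self.bottleneck = nn.Linear(n_hidden, n_bottleneck)  # the BN layer
        self.head = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(n_bottleneck, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_phones),    # logits over phonetic units
        )

    def forward(self, frames):               # used during ASR training
        return self.head(self.bottleneck(self.front(frames)))

    @torch.no_grad()
    def extract_bn_features(self, frames):   # used at LID feature-extraction time
        return self.bottleneck(self.front(frames))
```

In this scheme, the per-frame bottleneck outputs replace the MFCC vectors as input to the total variability (i-vector) back end of the language recognition system.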


Alicia Lozano-Diez, Ruben Zazo, Doroteo T. Toledano, Joaquin Gonzalez-Rodriguez
Audias-UAM, Universidad Autonoma de Madrid, Madrid, Spain

Editor: Juan Tu, Nanjing University, China

Funding: … partir de la Voz (TEC2015-68172-C2-1-P). Both projects are funded by Ministerio de Economía y Competitividad, Spain. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The task of language recognition or language identification (LID) is defined as identifying the language spoken in a given audio segment [1]. Automatic LID systems aim to perform this task automatically, learning from a given dataset the parameters needed to identify the language of new spoken data. This technology has multiple applications: for example, call centers that need to classify a call according to the language spoken, speech processing systems that deal with multilingual inputs, multimedia content indexing, or security applications such as tracking people by their language or accent.

Moreover, language recognition shares important modules with many other systems from closely related fields such as speaker recognition (identifying the person who is speaking in a given utterance), speech recognition (transcribing audio segments), and speech signal processing in general. Furthermore, not only the speech signal processing research area is involved, but also techniques from the machine learning field. In fact, the successful application and adaptation of machine learning tools is currently one of the main lines of research in language recognition.
NIST language recognition evaluations

Research in the field of language recognition has been driven to a large extent by the Language Recognition Evaluation (LRE) series organized by NIST (U.S. National Institute of Standards and Technology) approximately every two years from 1996 up to 2015. These technology evaluations provide a common framework (making data available to all participants) for evaluating a given recognition task. Each participant sends results to the organizers, who later provide comparative results; final conclusions are shared during a workshop. Each evaluation differs in the specific tasks participants have to address, such as dealing with different test durations, various languages, channel variability, or noise conditions. The last two evaluations (2011 and 2015) focused on identifying similar languages (dialects or highly related languages) and, especially, on testing short audio segments (less than 10 seconds), which has become a main concern nowadays. In particular, the last NIST LRE 2015 grouped languages into clusters of similar languages, and this is the task addressed in this work.

As mentioned before, machine learning techniques constitute a major line of research in the field of language recognition. In this context, two of the evaluations organized by NIST in 2014 and 2015 [2, 3], known as i-vector challenges, skipped all the audio processing up to the i-vector (a fixed-length vector re (...truncated)
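For readers unfamiliar with i-vectors, the total variability model represents a whole utterance as the posterior mean of a low-dimensional latent variable given the utterance's Baum-Welch statistics over a universal background model (UBM). The following is a minimal NumPy sketch of that posterior-mean computation, assuming a pre-trained total variability matrix and precomputed statistics; all names and dimensions are illustrative, not the configuration used in the paper.

```python
import numpy as np

def extract_ivector(T, Sigma, N, F):
    """Posterior-mean i-vector from Baum-Welch statistics.

    T     : (C*D, R) total variability matrix (trained beforehand)
    Sigma : (C*D,)   diagonal UBM covariances, stacked per Gaussian
    N     : (C,)     zeroth-order stats (soft frame counts per Gaussian)
    F     : (C*D,)   centered first-order stats, stacked per Gaussian
    Returns the R-dim i-vector  w = (I + T' S^-1 N T)^-1 T' S^-1 F.
    """
    C, (CD, R) = N.shape[0], T.shape
    D = CD // C
    N_exp = np.repeat(N, D)              # one count per feature dimension
    TS = T / Sigma[:, None]              # Sigma^{-1} T (Sigma is diagonal)
    precision = np.eye(R) + TS.T @ (N_exp[:, None] * T)
    return np.linalg.solve(precision, TS.T @ F)

# Toy usage with random statistics (dimensions are illustrative).
rng = np.random.default_rng(0)
C, D, R = 256, 20, 400                   # Gaussians, feature dim, i-vector dim
T = 0.01 * rng.standard_normal((C * D, R))
w = extract_ivector(T, np.ones(C * D), 10 * rng.random(C), rng.standard_normal(C * D))
assert w.shape == (R,)
```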


Full text (PDF): https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0182580&type=printable

Alicia Lozano-Diez, Ruben Zazo, Doroteo T. Toledano, Joaquin Gonzalez-Rodriguez. An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition, PLOS ONE, 2017, Volume 12, Issue 8, DOI: 10.1371/journal.pone.0182580