An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition
Alicia Lozano-Diez
Ruben Zazo
Doroteo T. Toledano
Joaquin Gonzalez-Rodriguez
Audias-UAM, Universidad Autonoma de Madrid, Madrid, Spain
Editor: Juan Tu, Nanjing University, CHINA
Language recognition systems based on bottleneck features have recently become the state of the art in this research field, as shown by their success in the last Language Recognition Evaluation (LRE 2015) organized by NIST (U.S. National Institute of Standards and Technology). This type of system is based on a deep neural network (DNN) trained to discriminate between phonetic units, i.e., trained for the task of automatic speech recognition (ASR). This DNN compresses information in one of its layers, known as the bottleneck (BN) layer, which is used to obtain a new frame-level representation of the audio signal. This representation has proven useful for the task of language identification (LID). Thus, bottleneck features are used as input to the language recognition system, instead of a classical parameterization of the signal based on cepstral feature vectors such as MFCCs (Mel Frequency Cepstral Coefficients). Despite the success of this approach in language recognition, there is a lack of studies analyzing in a systematic way how the topology of the DNN influences the performance of bottleneck feature-based language recognition systems. In this work, we try to fill in this gap, analyzing language recognition results obtained with different topologies of the DNN used to extract the bottleneck features, comparing them against each other and against a reference system based on a more classical cepstral representation of the input signal with a total variability model. This way, we obtain useful knowledge about how the DNN configuration influences the performance of bottleneck feature-based language recognition systems.
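The extraction process described above can be sketched in a few lines of code. The following is a minimal, illustrative example only: the layer sizes, the 39-dimensional input, the 64-unit bottleneck, and the 40 phonetic targets are hypothetical choices (not the configurations studied in this paper), and the weights are random rather than trained for ASR.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical topology (illustrative only): 39-dim acoustic frames ->
# two 512-unit hidden layers -> 64-unit bottleneck -> 512-unit layer ->
# output over (say) 40 phonetic units.
layer_sizes = [39, 512, 512, 64, 512, 40]
bottleneck_index = 3  # stop after the 64-unit bottleneck layer

# Random weights stand in for a DNN trained to discriminate phonetic units.
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def extract_bottleneck(frames):
    """Forward frames through the net and return bottleneck activations."""
    h = frames
    for i, (w, b) in enumerate(zip(weights, biases), start=1):
        h = np.maximum(h @ w + b, 0.0)  # ReLU hidden layers
        if i == bottleneck_index:
            return h  # one bottleneck feature vector per input frame
    return h

utterance = rng.standard_normal((100, 39))  # 100 frames of 39-dim features
bn_features = extract_bottleneck(utterance)
print(bn_features.shape)  # (100, 64)
```

The resulting per-frame bottleneck vectors would then replace MFCCs as input to the downstream language recognition system.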
Introduction
Language Recognition or Language Identification (LID) is the task of identifying the language spoken in a given audio segment [1]. Automatic systems for LID aim to perform this task automatically, learning from a given dataset the parameters necessary to identify new spoken data.
There are multiple applications of this technology, such as call centers that need to classify a call according to the language spoken, speech processing systems that deal with multilingual inputs, multimedia content indexing, or security applications such as tracking people depending on their language or accent.
Moreover, language recognition shares important modules with many other systems from closely related fields like speaker recognition (the task of identifying the person who is speaking in a given utterance), speech recognition (transcribing audio segments), or, in general, speech signal processing. Furthermore, not only the speech signal processing research area is involved, but also techniques from the machine learning field. In fact, the successful application and adaptation of machine learning tools is one of the main lines of research in language recognition nowadays.
NIST language recognition evaluations
Research in the field of language recognition has been driven to a large extent by the Language
Recognition Evaluation (LRE) series organized by NIST (U.S. National Institute of Standards
and Technology) approximately every two years from 1996 up to 2015. These technology
evaluations provide a common framework (making data available to all participants) to
evaluate a given recognition task. Each participant sends results to the organization, which later
provides comparative results, and final conclusions are shared during a workshop.
Each evaluation differs in the specific tasks that participants have to address, such as dealing
with different test durations, various languages, channel variability, or noise conditions.
The last two evaluations (corresponding to 2011 and 2015) have focused on the task of identifying similar languages (dialects or highly related languages) and, especially, on testing short audio segments (less than 10 seconds), which has become a main concern nowadays. In particular, the last NIST LRE 2015 divided languages into clusters of similar languages, which is the task addressed in this work.
As mentioned before, machine learning techniques constitute a major line of research in the field of language recognition. In this context, two of the evaluations organized by NIST in 2014 and 2015 [2, 3], known as i-vector challenges, skipped all the audio processing up to the
i-vector (a fixed length vector re (...truncated)