Quality prediction of synthesized speech based on tensor structured EEG signals
RESEARCH ARTICLE
Hayato Maki*, Sakriani Sakti, Hiroki Tanaka, Satoshi Nakamura
Graduate School of Information Sciences, Nara Institute of Science and Technology, Ikoma, Nara, Japan
OPEN ACCESS
Citation: Maki H, Sakti S, Tanaka H, Nakamura S
(2018) Quality prediction of synthesized speech
based on tensor structured EEG signals. PLoS ONE
13(6): e0193521. https://doi.org/10.1371/journal.pone.0193521
Editor: Christos Papadelis, Boston Children’s
Hospital / Harvard Medical School, UNITED
STATES
Abstract
This study investigates quality prediction methods for synthesized speech using EEG.
Training a predictive model using EEG is challenging due to a small number of training trials,
a low signal-to-noise ratio, and a high correlation among independent variables. When a
predictive model is trained with a machine learning algorithm, the features extracted from
multi-channel EEG signals are usually organized as a vector and their structures are ignored
even though they are highly structured signals. This study predicts the subjective rating
scores of synthesized speeches, including their overall impression, valence, and arousal, by
creating tensor structured features instead of vectorized ones to exploit the structure of the
features. We extracted various features to construct a tensor feature that maintained their
structure. Vectorized and tensorial features were used to predict the rating scales, and the
experimental results showed that prediction with tensorial features achieved better predictive performance. Among the features, the alpha and beta bands were particularly
effective for prediction, which agrees with previous neurophysiological
studies.
Received: August 14, 2017
Accepted: February 13, 2018
Published: June 14, 2018
Copyright: © 2018 Maki et al. This is an open
access article distributed under the terms of the
Creative Commons Attribution License, which
permits unrestricted use, distribution, and
reproduction in any medium, provided the original
author and source are credited.
Data Availability Statement: The PhySyQX data,
which was used in this study, is third-party data
from the following published paper: Gupta R,
Banville HJ, Falk TH. PhySyQX: A database for
physiological evaluation of synthesised speech
quality-of-experience. Proceedings of IEEE
Workshop on Applications of Signal Processing to
Audio and Acoustics (WASPAA’15). 2015;1–5.
DOI: 10.1109/WASPAA.2015.7336888.
Researchers interested in accessing this data may
contact Dr. Rishabh Gupta (grishabhg@gmail.com).
Introduction
Text-to-Speech (TTS) systems, which convert written text into speech, are becoming
more widely implemented in mobile phones, car navigation systems, and other consumer electronics. Such systems play a critical role in many applications because speech is the most fundamental and easiest communication tool for human beings. Therefore, synthesized speech
must sound natural for good machine-to-human communication.
Research on TTS systems needs reasonable criteria for evaluating the quality of synthesized speech. Current evaluation methods fall into three approaches, each with its own advantages and disadvantages: (1) subjective ratings [1–3], (2) analysis of the speech signal itself [4–6], and (3) measurement of
the physiological responses of listeners to speech [7–14].
Funding: Part of this work was supported by JSPS
KAKENHI (Grant Numbers JP17H06101 to SN,
JP17K00237 to SS, and JP16K16172 to HT). The
funders had no role in study design, data collection
and analysis, decision to publish, or preparation of
the manuscript. No additional external funding
received for this study.
Competing interests: The authors have declared
that no competing interests exist.
In the first approach, the two most common aspects for quality judgment are naturalness
and intelligibility. Naturalness describes how close synthesized speech is to human speech, and
intelligibility reflects how well the speech content can be heard. The former is usually measured by a mean opinion score (MOS) test [1], and the latter is gauged by semantically unpredictable sentences (SUS) [3]. In addition, valence and arousal are often used to evaluate the
subjective impressions of speech [11, 13, 15, 16] and to model emotions [17–20]. Valence
reflects a positive or a negative emotion. Arousal reflects the degree of intensity or activation.
In a MOS test, subjects listen to speech and rate its relative perceived quality on a
scale, for example, “excellent,” “good,” “fair,” “poor,” “bad.” Then the scores are averaged
across subjects. This is a well-established method for which references on how to perform it are
available [2], making it the only standard way to evaluate the naturalness of synthesized
speech. However, its appropriateness has not been fully proven because high inter- and
intra-subject inconsistencies are often observed in the ratings, resulting in poor reproducibility
[21].
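The averaging step of a MOS test can be illustrated with a minimal sketch. The mapping of the five labels to the integers 1 to 5, the function names, and the example ratings below are assumptions made for illustration only; they are not taken from this study or from the referenced test procedure.

```python
import statistics

# Assumed mapping of the five category labels to integers (higher is better).
# The numeric values are an illustrative convention, not specified in the text.
SCALE = {"bad": 1, "poor": 2, "fair": 3, "good": 4, "excellent": 5}


def mean_opinion_score(ratings):
    """Average the categorical ratings given by several subjects to one stimulus."""
    return statistics.mean(SCALE[r] for r in ratings)


def rating_spread(ratings):
    """Sample standard deviation across subjects, a rough indicator of
    inter-subject inconsistency in the ratings."""
    return statistics.stdev(SCALE[r] for r in ratings)


# Hypothetical ratings from five subjects for a single synthesized utterance.
ratings = ["good", "fair", "excellent", "good", "poor"]
mos = mean_opinion_score(ratings)
spread = rating_spread(ratings)
print(f"MOS = {mos:.2f}, spread = {spread:.2f}")
```

A large spread relative to the scale width would be one simple symptom of the inter-subject inconsistency mentioned above, although it says nothing about intra-subject consistency, which would require repeated ratings from the same subject.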
In the second approach, speech quality is automatically evaluated at the signal level by software that takes a speech file as input and outputs the estimated speech quality. Advantages of these
methods include complete reproducibility and low time cost once such software has been
developed. However, their appropriateness is difficult to prove because the exact relationship
between the acoustic features and the quality of speech perceived by a listener is not well
understood [21]. In fact, speech quality must be evaluated not only physically but also psychologically because it is commonly defined as the result of an assessment in which a listener compares his/her perceptions with expectations [22, 23].
Finally, quality estimation methods that measure the physiological responses of
a listener are emerging [24]. Even though these methods have not been established yet, they are worth investigating because physiological signals can be recorded automatically and continuously to provide insight into a listener’s cognitive states without the interruptions caused by directly asking
him/her to answer questions. Among existing non-invasive physiological response measures,
electroencephalography (EEG) has especially great potential to estimate a listener’s perceived
speech quality for the following reasons. EEG can be recorded at a higher temporal resolution, e.g., in the millisecond range, than hemodynamic measures, including functional magnetic
resonance imaging (fMRI) and functional near-infrared spectroscopy (fNIRS), both of which
analyze changes in blood flow that inherently take a few seconds before a brain response can
be recorded. Temporal resolution is important for evaluating speech quality since the temporal
structure of speech largely affects its perceived quality. In (...truncated)