Quality prediction of synthesized speech based on tensor structured EEG signals

RESEARCH ARTICLE

Hayato Maki*, Sakriani Sakti, Hiroki Tanaka, Satoshi Nakamura

Graduate School of Information Sciences, Nara Institute of Science and Technology, Ikoma, Nara, Japan

Citation: Maki H, Sakti S, Tanaka H, Nakamura S (2018) Quality prediction of synthesized speech based on tensor structured EEG signals. PLoS ONE 13(6): e0193521. https://doi.org/10.1371/journal.pone.0193521

Editor: Christos Papadelis, Boston Children’s Hospital / Harvard Medical School, UNITED STATES

Abstract

This study investigates quality prediction methods for synthesized speech using EEG. Training a predictive model using EEG is challenging due to a small number of training trials, a low signal-to-noise ratio, and a high correlation among independent variables. When a predictive model is trained with a machine learning algorithm, the features extracted from multi-channel EEG signals are usually organized as a vector, and their structure is ignored even though the signals are highly structured. This study predicts the subjective rating scores of synthesized speech, including overall impression, valence, and arousal, by creating tensor structured features instead of vectorized ones to exploit the structure of the features. We extracted various features and constructed a tensor feature that maintained their structure. Vectorized and tensorial features were used to predict the rating scales, and the experimental results showed that prediction with tensorial features achieved better predictive performance. Among the features, the alpha and beta bands are more effective for prediction than the other features, which agrees with previous neurophysiological studies.

Received: August 14, 2017
Accepted: February 13, 2018
Published: June 14, 2018

Copyright: © 2018 Maki et al.
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: The PhySyQX data, which was used in this study, is third-party data from the following published paper: Gupta R, Banville HJ, Falk TH. PhySyQX: A database for physiological evaluation of synthesised speech quality-of-experience. Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA’15). 2015;1–5. DOI: 10.1109/WASPAA.2015.7336888. Researchers interested in accessing this data may contact Dr. Rishabh Gupta (grishabhg@gmail.com).

Introduction

Text-to-Speech (TTS) systems, which convert written text into speech, are becoming more widely implemented in mobile phones, car navigation systems, and other consumer electronics. Such systems play a critical role in many applications because speech is the most fundamental and easiest communication tool for human beings. Therefore, synthesized speech must sound natural for good machine-to-human communication.

Research on TTS systems needs reasonable criteria for evaluating the quality of synthesized speech. Current evaluation methods, each with its own advantages and disadvantages, fall into three approaches: (1) subjective ratings [1–3], (2) analyzing the speech signal itself [4–6], and (3) measuring the physiological responses of listeners to speech [7–14].

In the first approach, the two most common aspects for quality judgment are naturalness and intelligibility. Naturalness describes how close synthesized speech is to human speech, and intelligibility reflects how well the speech content can be heard. The former is usually measured by a mean opinion score (MOS) test [1], and the latter is gauged by semantically unpredictable sentences (SUS) [3].
In addition, valence and arousal are often used to evaluate the subjective impressions of speech [11, 13, 15, 16] and to model emotions [17–20]. Valence reflects a positive or a negative emotion. Arousal reflects the degree of intensity or activation.

Funding: Part of this work was supported by JSPS KAKENHI (Grant Numbers JP17H06101 to SN, JP17K00237 to SS, and JP16K16172 to HT). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No additional external funding was received for this study.

Competing interests: The authors have declared that no competing interests exist.

In a MOS test, subjects listen to speech and rate its perceived quality on a scale, for example, “excellent,” “good,” “fair,” “poor,” or “bad.” Then the scores are averaged across subjects. This is a well-established method with published guidance on how to perform it [2], making it the only standard way to evaluate the naturalness of synthesized speech. However, its appropriateness has not been fully proven because high inter- and intra-subject inconsistencies are often observed in the ratings, resulting in poor reproducibility [21].

In the second approach, speech quality is automatically evaluated at the signal level by software that takes a speech file as input and outputs the estimated speech quality. Advantages of these methods include complete reproducibility and low time cost once such software is developed. However, their appropriateness is difficult to prove because the exact relationship between the acoustic features and the quality of speech perceived by a listener is not well understood [21].
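As a small illustration of the MOS procedure described earlier in this section, the following sketch averages category ratings across listeners. The numeric coding of the five labels (5 for “excellent” down to 1 for “bad”), the function name, and the example ratings are assumptions made for illustration; they are not taken from the study or its data.

```python
# Assumed numeric coding of the five category labels (5 = best, 1 = worst).
SCALE = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}


def mean_opinion_score(ratings):
    """Average the category ratings given by several listeners to one stimulus."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(SCALE[label] for label in ratings) / len(ratings)


# Hypothetical ratings from four listeners for two synthesized utterances.
ratings_by_utterance = {
    "utt_a": ["good", "fair", "good", "excellent"],
    "utt_b": ["poor", "fair", "bad", "poor"],
}

# One MOS value per utterance.
mos = {utt: mean_opinion_score(r) for utt, r in ratings_by_utterance.items()}
print(mos)
```

Because the score is a plain average, disagreement among listeners is hidden in the result, which is one way to see why rating inconsistencies limit reproducibility.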
In fact, speech quality must be evaluated not only physically but also psychologically because it is commonly defined as the result of an assessment in which a listener compares his/her perceptions with expectations [22, 23].

Last, quality estimation methods that measure the physiological responses of a listener are emerging [24]. Even though these methods are not yet established, they are worth investigating because physiological signals can be recorded automatically and continuously, providing insight into a listener’s cognitive states without the interruptions caused by directly asking him/her to answer questions.

Among existing non-invasive physiological response measures, electroencephalography (EEG) has especially great potential for estimating a listener’s perceived speech quality, for the following reasons. EEG can be recorded at a higher temporal resolution, e.g., in the millisecond range, than hemodynamic measures, including functional magnetic resonance imaging (fMRI) and functional near-infrared spectroscopy (fNIRS), both of which analyze changes in blood flow that inherently take a few seconds before a brain response can be recorded. Temporal resolution is important for evaluating speech quality since the temporal structure of speech largely affects its perceived quality. In (...truncated)