Tibetan speech synthesis based on an improved neural network


MATEC Web of Conferences 336, 06012 (2021) CSCNS2020
https://doi.org/10.1051/matecconf/202133606012

Yuntao Ding1,3,4,*, Rangzhuoma Cai1,2,3,4, and Baojia Gong1,3,4

1College of Computer Science and Technology, Qinghai Normal University, Qinghai Xining 810016, China
2School of Computer Science and Technology, Southwest Minzu University, Sichuan Chengdu 610041, China
3Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province, Qinghai Xining 810008, China
4Key Laboratory of Tibetan Information Processing, Ministry of Education, Qinghai Xining 810008, China

Abstract. Neural-network-based methods have become the mainstream approach to Tibetan speech synthesis. Among vocoders, the Griffin-Lim vocoder is widely used in Tibetan speech synthesis because it is relatively simple. To address the low fidelity of the Griffin-Lim vocoder, this paper uses a WaveNet vocoder instead for Tibetan speech synthesis. The paper first uses convolution operations and an attention mechanism to extract sequence features, then uses linear projection and a feature amplification module to predict the mel spectrogram, and finally uses the WaveNet vocoder to synthesize the speech waveform. Experimental data show that our model performs better in Tibetan speech synthesis.

1 Introduction

Neural-network-based speech synthesis greatly reduces the error rate of speech synthesis, because neural network units can learn independently and are trained by back propagation, and the synthesized speech is closer to the human voice. It has therefore become the mainstream speech synthesis method worldwide [1, 2, 3, 4]. As an important part of Chinese information processing, Tibetan speech synthesis is also a key and difficult problem in intelligent Tibetan human-computer interaction.
Although it started late, Tibetan speech synthesis has gradually moved from waveform-concatenation-based synthesis [5] and statistical-parametric synthesis [6] to neural-network-based synthesis [7, 8]. In 2019, [7] first proposed neural-network-based speech synthesis for Tibetan, which brought Tibetan speech synthesis into a new era. Building on [7], this paper proposes a Tibetan speech synthesis method based on an improved neural network: an improved neural network is constructed, and a WaveNet vocoder [9] is used to synthesize Tibetan speech. Subjective and objective experiments show that our model performs better in Tibetan speech synthesis.

* Corresponding author:
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).

2 Improved neural network structure

Amdo Tibetan has no tonal characteristics [10], and several of its 30 consonants have similar pronunciations, such as ཅ and ཇ, or ཨ and འ. To better distinguish similar pronunciations and make the synthesized Tibetan speech more natural, this paper proposes an improved neural network for Tibetan speech synthesis. The structure consists of three parts: a sequence feature extraction module, a spectrum prediction module, and a waveform synthesis module. The sequence feature extraction module extracts sequence feature information by performing a convolution operation on the preprocessed Tibetan word vectors and assigning attention weights to the result. The spectrum prediction module predicts the spectral features by applying a nonlinear transformation to the feature information and then using linear projection and convolution operations.
The waveform synthesis module uses the autoregressive property of the WaveNet vocoder to recover the phase information and then synthesizes the speech waveform. The components of the model are shown in Figure 1.

Fig. 1. Improved neural network.

2.1 Sequence feature extraction

Sequence features are indispensable to speech synthesis. This paper therefore first uses character embedding to preprocess the sequence, then uses 3 convolution layers to extract initial sequence features, and finally uses an attention mechanism to assign weights to those features, which completes sequence feature extraction.

2.2 Spectrum prediction

The mel spectrogram is close to the response of the human auditory system and, as a low-level acoustic representation of the audio signal, is more direct to use in speech synthesis [11]. This paper therefore chooses the mel spectrogram as the spectral feature for spectrum prediction. An autoregressive neural network is used to achieve multi-frame prediction of the spectrum. The main steps are as follows:

1) Predict one frame of the spectrum vector through linear projection of the sequence feature matrix;
2) Pass the spectrum vector into the post-net to amplify useful spectral feature information;
3) Pass the spectrum vector into the pre-net to apply a nonlinear transformation;
4) Combine the spectrum matrix and the sequence feature matrix into a new sequence feature matrix.

After completing step 4), return to step 1) and repeat the steps until the mel spectrogram prediction is complete.

2.3 Waveform synthesis

Compared with Tibetan speech synthesized by the Griffin-Lim vocoder, the speech waveform produced by the WaveNet vocoder is smoother and closer to the original sound waveform. This paper therefore uses the WaveNet vocoder for waveform synthesis.
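The four-step prediction loop of Section 2.2, whose output is what the vocoder consumes, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`linear`, `relu`, `predict_mel`), the toy 2-dimensional sizes, the fixed weight matrices, the residual form of the post-net, the ReLU pre-net, and the fixed frame count used as the stopping rule are all hypothetical, since the paper does not specify them.

```python
def linear(matrix, vector):
    """Matrix-vector product: one output value per row of `matrix`."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise rectifier, assumed here as the pre-net nonlinearity."""
    return [max(0.0, v) for v in vector]


def predict_mel(seq_features, proj_w, post_w, pre_w, n_frames):
    """Autoregressively predict `n_frames` toy mel frames.

    seq_features: sequence feature vector from the feature extraction stage
    proj_w: projection weights, shape (mel bins) x (len(seq_features) + pre-net units)
    post_w: post-net weights, shape (mel bins) x (mel bins), used as a residual (assumed)
    pre_w:  pre-net weights, shape (pre-net units) x (mel bins)
    """
    # Before the first frame there is no previous spectrum, so pad with zeros.
    state = list(seq_features) + [0.0] * len(pre_w)
    frames = []
    for _ in range(n_frames):
        # Step 1: linear projection of the current feature state to one frame.
        frame = linear(proj_w, state)
        # Step 2: post-net refines the frame (modelled as an additive correction).
        correction = linear(post_w, frame)
        frame = [f + c for f, c in zip(frame, correction)]
        # Step 3: pre-net applies a nonlinear transformation to the frame.
        hidden = relu(linear(pre_w, frame))
        # Step 4: combine the transformed spectrum with the sequence features.
        state = list(seq_features) + hidden
        frames.append(frame)
    return frames


# Hypothetical toy weights: 2 sequence features, 2 mel bins, 2 pre-net units.
seq_features = [1.0, 0.5]
proj_w = [[0.5, 0.0, 0.1, 0.0], [0.0, 0.5, 0.0, 0.1]]
post_w = [[0.1, 0.0], [0.0, 0.1]]
pre_w = [[1.0, 0.0], [0.0, -1.0]]

mel = predict_mel(seq_features, proj_w, post_w, pre_w, n_frames=3)
for i, frame in enumerate(mel):
    print(i, [round(v, 4) for v in frame])
```

In this sketch each frame depends on the previous one only through the pre-net output fed back in step 4, which is the feedback path that makes the prediction autoregressive.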
WaveNet is composed of causal convolutions, one-dimensional convolution layers, dilated convolution layers, and gated activation functions (tanh, sigmoid, relu). Causal convolution and dilated convolution are its key components: causal convolution preserves the temporal order of the spectrum information, and dilated convolution enlarges the receptive field of the convolution.

The WaveNet process is briefly as follows. First, it generates a new spectrum matrix from the predicted spectrum matrix through causal convolution, and then passes this matrix through dilated convolution and a series of gated activation functions, so that the network effectively performs coarse-grained convolution. Second, it uses its autoregressive property to recover the lost phase information. Finally, the posterior probability of each sampling point is output through the softmax function [12]. The autoregressive property means that sampling point t is predicted from the previous t-1 sampling points, as given by formula (1):

$$p(x) = \prod_{t=1}^{T} p(\boldsymbol{x}_t \mid \boldsymbol{x}_1, \ldots, \boldsymbol{x}_{t-1}) \qquad (1)$$

(...truncated)
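The roles of causal and dilated convolution described in Section 2.3 can be illustrated with a small pure-Python sketch. It is only an illustration under assumptions, not the WaveNet vocoder used in the paper: the function names, the kernel size of 2, the dilation schedule 1, 2, 4, 8, and the all-ones weights are hypothetical choices that make the receptive field easy to see. The gate uses the tanh-times-sigmoid form from the original WaveNet description.

```python
import math


def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution with a dilation factor.

    weights[0] multiplies the current sample and weights[k] the sample
    k * dilation steps in the past. Samples before the start of the signal
    are treated as zero, so output[t] never depends on inputs after t.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def gated_activation(filter_out, gate_out):
    """Element-wise tanh(filter) * sigmoid(gate)."""
    return [math.tanh(f) / (1.0 + math.exp(-g))
            for f, g in zip(filter_out, gate_out)]


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


dilations = [1, 2, 4, 8]          # hypothetical schedule, doubling per layer
signal = [1.0] + [0.0] * 19       # unit impulse at t = 0

# Push the impulse through a purely linear stack (no gate) so that the span
# of nonzero outputs shows how far one input sample reaches.
h = signal
for d in dilations:
    h = causal_dilated_conv(h, [1.0, 1.0], d)

reached = [t for t, v in enumerate(h) if v != 0.0]
print(len(reached), receptive_field(2, dilations))  # prints: 16 16

# Causality check: changing the last input leaves earlier outputs unchanged.
a = causal_dilated_conv([1.0, 2.0, 3.0, 4.0], [0.5, 0.5], 1)
b = causal_dilated_conv([1.0, 2.0, 3.0, 99.0], [0.5, 0.5], 1)
assert a[:3] == b[:3]

# Gate demo: a gate input of 0 gives sigmoid = 0.5, halving tanh(filter).
gated = gated_activation([0.0, 1.0], [0.0, 0.0])
```

With four layers of kernel size 2 the stack covers 16 samples, whereas four undilated layers of the same kernel size would cover only 5; this is the sense in which dilation enlarges the receptive field without adding layers.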