Tibetan speech synthesis based on an improved neural network
MATEC Web of Conferences 336, 06012 (2021)
CSCNS2020
https://doi.org/10.1051/matecconf/202133606012
Yuntao Ding 1,3,4,*, Rangzhuoma Cai 1,2,3,4, and Baojia Gong 1,3,4
1 College of Computer Science and Technology, Qinghai Normal University, Qinghai Xining 810016, China
2 School of Computer Science and Technology, Southwest Minzu University, Sichuan Chengdu 610041, China
3 Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province, Qinghai Xining 810008, China
4 Key Laboratory of Tibetan Information Processing, Ministry of Education, Qinghai Xining 810008, China
Abstract. Neural networks have become the mainstream approach to Tibetan
speech synthesis. Within this approach, the Griffin-Lim vocoder is widely
used because it is relatively simple to apply. To address the low fidelity
of the Griffin-Lim vocoder, this paper uses a WaveNet vocoder in its place
for Tibetan speech synthesis. The model first uses convolution operations
and an attention mechanism to extract sequence features, then uses linear
projection and a feature amplification module to predict the mel
spectrogram, and finally uses the WaveNet vocoder to synthesize the speech
waveform. Experimental results show that our model achieves better
performance in Tibetan speech synthesis.
1 Introduction
Neural-network-based speech synthesis greatly reduces the error rate of speech synthesis,
because neural network units can learn independently through backpropagation, and the
synthesized speech is closer to the human voice. As a result, neural-network-based synthesis
has become the mainstream approach to speech synthesis worldwide [1, 2, 3, 4].
As an important part of Chinese information processing, Tibetan speech synthesis is also
a key challenge in intelligent Tibetan human-computer interaction. Although it started late,
it has gradually progressed from waveform-concatenation-based synthesis [5] and
statistical-parametric synthesis [6] to neural-network-based synthesis [7, 8]. In 2019, [7]
first proposed neural-network-based speech synthesis for Tibetan, which brought Tibetan
speech synthesis into a new era.
Building on [7], this paper proposes a Tibetan speech synthesis method based on an
improved neural network, which uses the WaveNet vocoder [9] to synthesize Tibetan speech.
Subjective and objective experiments show that our model achieves better performance in
Tibetan speech synthesis.
* Corresponding author:
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons
Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).
2 Improved neural network structure
Amdo Tibetan has no tonal characteristics [10], and its 30 consonants include
similar-sounding pairs such as ཅ and ཇ, or ཨ and འ. To better distinguish similar
pronunciations and make the synthesized Tibetan speech more natural, this paper proposes
an improved neural network for Tibetan speech synthesis. The structure consists of three
parts: a sequence feature extraction module, a spectrum prediction module, and a waveform
synthesis module. The sequence feature extraction module extracts sequence feature
information by applying convolution to the preprocessed Tibetan word vectors and assigning
attention weights to the result. The spectrum prediction module predicts the spectral
features by applying a nonlinear transformation to the feature information, together with
linear projection and convolution operations. The waveform synthesis module uses the
autoregressive property of the WaveNet vocoder to recover the phase information and then
synthesizes the speech waveform. The components of the model are shown in Figure 1:
Fig. 1. Improved neural network.
2.1 Sequence feature extraction
Sequence features are indispensable to the speech synthesis process. Therefore, this paper
first uses character embedding to preprocess the sequence, then uses three convolutional
layers to perform an initial extraction of sequence features, and finally uses an attention
mechanism to assign a weight to each sequence feature, which completes sequence feature
extraction.
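The three stages above (embedding, three convolution layers, attention weighting) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the paper's implementation: the embedding table, the fixed smoothing kernel, the dot-product form of attention, and all function names are hypothetical, since the paper does not specify them.

```python
import math

def embed(chars, table, dim):
    """Character embedding: look up one vector per character (zeros if unknown)."""
    return [list(table.get(c, [0.0] * dim)) for c in chars]

def conv1d_same(seq, kernel):
    """1-D convolution along the sequence axis with zero padding, so the output
    has the same length as the input. One scalar kernel is shared by all channels
    (an assumption made to keep the sketch short)."""
    half = len(kernel) // 2
    dim = len(seq[0])
    out = []
    for i in range(len(seq)):
        vec = [0.0] * dim
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(seq):
                for d in range(dim):
                    vec[d] += w * seq[j][d]
        out.append(vec)
    return out

def attention(features, query):
    """Dot-product scores, softmax weights, and the weighted sum of the features."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, features))
               for d in range(dim)]
    return weights, context

def extract_sequence_features(chars, table, query,
                              kernel=(0.25, 0.5, 0.25), layers=3):
    """Embedding -> `layers` convolutions -> attention weighting."""
    feats = embed(chars, table, dim=len(query))
    for _ in range(layers):
        feats = conv1d_same(feats, kernel)
    return attention(feats, query)

# Hypothetical 4-dimensional embeddings for two similar-sounding consonants.
table = {"ཅ": [1.0, 0.0, 0.0, 0.0], "ཇ": [0.0, 1.0, 0.0, 0.0]}
weights, context = extract_sequence_features("ཅཇ", table, query=[1.0, 0.0, 0.0, 0.0])
```

In a trained model the kernels and the attention query would be learned; here they are fixed so that the data flow is easy to follow.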
2.2 Spectrum prediction
The mel spectrogram is close to the response of the human auditory system and, as a
low-level acoustic representation of the audio signal, is more direct to use in speech
synthesis [11]. Therefore, this paper chooses the mel spectrogram as the spectral feature
for spectrum prediction.
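To make the link between the mel scale and human hearing concrete, the sketch below converts between hertz and mel using one widely used formula, m = 2595 log10(1 + f / 700). The paper does not state which mel formula or how many mel bands it uses, so the formula choice, the 80 bands, and the 8000 Hz upper limit are assumptions for illustration only.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mel (common 2595 * log10(1 + f/700) variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_mels):
    """Return n_mels + 2 edge frequencies in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

# Assumed settings: 80 mel bands between 0 Hz and 8000 Hz.
edges = mel_band_edges(0.0, 8000.0, 80)
```

Because the edges are equally spaced in mel, the bands are narrow at low frequencies and wide at high frequencies, which mirrors the finer frequency resolution of human hearing at low pitch.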
In this paper, an autoregressive neural network is used to achieve multi-frame prediction
of the spectrum. The main steps are as follows:
1) Predict one frame of the spectrum vector through linear projection of the sequence
feature matrix;
2) Pass the spectrum vector into the post-net to amplify useful spectral feature
information;
3) Pass the spectrum vector into the pre-net to apply a nonlinear transformation;
4) Combine the spectrum matrix and the sequence feature matrix into a new sequence feature
matrix.
After completing step 4), return to step 1) and repeat the steps until the mel spectrogram
prediction is complete.
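The loop formed by steps 1) to 4) can be sketched as below. This is a toy illustration under stated assumptions, not the paper's network: the post-net is replaced by a simple gain, the pre-net by a tanh, the stopping rule by a fixed frame count, and the initial feedback frame by zeros. All names and numbers are hypothetical.

```python
import math

def linear_projection(state, weights):
    """Step 1: project the current feature vector to one spectrum frame."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]

def post_net(frame, gain=1.5):
    """Step 2: placeholder for the post-net (here just a constant gain)."""
    return [gain * x for x in frame]

def pre_net(frame):
    """Step 3: placeholder nonlinear transformation (tanh)."""
    return [math.tanh(x) for x in frame]

def predict_mel(seq_features, weights, n_frames):
    """Autoregressive multi-frame prediction.
    `weights` has one row per mel bin and len(seq_features) + n_mels columns,
    because each step sees the sequence features plus the previous frame."""
    n_mels = len(weights)
    state = list(seq_features) + [0.0] * n_mels      # assumed all-zero start frame
    frames = []
    for _ in range(n_frames):                        # assumed fixed stopping rule
        frame = post_net(linear_projection(state, weights))  # steps 1 and 2
        frames.append(frame)
        hidden = pre_net(frame)                      # step 3
        state = list(seq_features) + hidden          # step 4: new feature vector
    return frames

# Toy example: 3 sequence features, 2 mel bins, 4 frames.
weights = [[0.1, 0.2, 0.3, 0.5, 0.0],
           [0.3, 0.2, 0.1, 0.0, 0.5]]
frames = predict_mel([0.1, 0.2, 0.3], weights, n_frames=4)
```

Each predicted frame feeds back into the next prediction, which is what makes the procedure autoregressive.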
2.3 Waveform synthesis
Compared with the Griffin-Lim vocoder, the WaveNet vocoder produces a speech waveform that
is smoother and closer to the original sound waveform. Therefore, this paper uses the
WaveNet vocoder for waveform synthesis.
Internally, WaveNet is composed of causal convolution layers, one-dimensional convolution
layers, dilated convolution layers, and gated units built from several activation
functions (tanh, sigmoid, ReLU). Of these, causal convolution and dilated convolution are
the key components. Causal convolution preserves the temporal order of the spectrum
information, and dilated convolution enlarges the receptive field of the convolution over
the spectrum. The following briefly introduces the specific process of WaveNet.
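The two key operations can be illustrated with a short plain-Python sketch. It is a simplified one-channel version written for this explanation, not the WaveNet implementation used in the paper; the kernel values and dilation schedule are assumptions. The gated unit follows the tanh-times-sigmoid form described in the WaveNet literature.

```python
import math

def causal_dilated_conv(signal, kernel, dilation=1):
    """y[t] = sum_k kernel[k] * x[t - k * dilation].
    Only current and past samples are used (causality); samples before t = 0
    are treated as zero."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * signal[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input samples that influence one output of a stack of layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def gated_activation(filter_out, gate_out):
    """Gated unit: tanh(filter) * sigmoid(gate), applied element-wise."""
    return [math.tanh(f) * (1.0 / (1.0 + math.exp(-g)))
            for f, g in zip(filter_out, gate_out)]

# An impulse at t = 3 only affects outputs at t = 3 and t = 5 (dilation 2),
# never an earlier time step.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
response = causal_dilated_conv(impulse, [1.0, 1.0], dilation=2)

# Doubling the dilation per layer grows the receptive field exponentially.
print(receptive_field(2, [1, 2, 4, 8]))  # 16
```

With kernel size 2, four layers with dilations 1, 2, 4, 8 already cover 16 samples, whereas four undilated layers would cover only 5.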
First, WaveNet generates a new spectrum matrix from the predicted spectrum matrix through
causal convolution, and then passes this matrix through dilated convolution and a series
of gated activation functions, so that the network effectively performs coarse-grained
convolution. Second, it uses its autoregressive property to recover the lost phase
information. Finally, it outputs the posterior probability of each sample point through
the softmax function [12]. The autoregressive property means that the t-th sample point is
predicted from the previous t-1 sample points, as expressed by formula (1):
$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$
(1 (...truncated)
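The factorization in formula (1) can be checked numerically with a toy model. In the sketch below, `toy_conditional` is a hypothetical stand-in for the network's softmax output (it simply favours repeating the previous sample over 4 quantization levels), so only the chain-rule bookkeeping reflects the formula.

```python
import math

def toy_conditional(history, n_classes=4):
    """Hypothetical p(x_t | x_1..x_{t-1}): softmax that favours the last value."""
    logits = [0.0] * n_classes
    if history:
        logits[history[-1]] = 1.0
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(samples, conditional=toy_conditional):
    """log p(x) = sum over t of log p(x_t | x_1, ..., x_{t-1})."""
    total = 0.0
    for t, x_t in enumerate(samples):
        probs = conditional(samples[:t])   # condition on all earlier samples
        total += math.log(probs[x_t])
    return total

# A sequence that repeats its first value is more likely under the toy model.
repeat = sequence_log_prob([1, 1])
change = sequence_log_prob([1, 2])
```

Summing the conditional log-probabilities is equivalent to taking the logarithm of the product in formula (1), which is how such models are usually evaluated in practice to avoid numerical underflow.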