A language model for Amdo Tibetan speech recognition
MATEC Web of Conferences 336, 06016 (2021)
CSCNS2020
https://doi.org/10.1051/matecconf/202133606016
Taiben Suan1,3,4,5*, Rangzhuoma Cai1,2,4,5, Zhijie Cai1,2,4,5, Ba Zu1,4,5, and Baojia Gong1,4,5
1College of Computer Science and Technology, Qinghai Normal University, Xining, Qinghai 810016, China
2School of Computer Science and Technology, Southwest Minzu University, Chengdu, Sichuan 610041, China
3Xinlong County Meteorological Bureau, Xinlong County, Sichuan 626800, China
4Qinghai Provincial Key Laboratory of Tibetan Information Processing and Machine Translation, Xining, Qinghai 810008, China
5Key Laboratory of Tibetan Information Processing, Ministry of Education, Xining, Qinghai 810008, China
Abstract. We built a language model based on the Transformer network
architecture, which relies on attention mechanisms and dispenses with
recurrence and convolutions entirely. Tibetan text was transliterated into the
International Phonetic Alphabet (IPA), and the language model was trained
with the syllables and phonemes of Tibetan words as modeling units to
predict the corresponding Tibetan sentences from the contextual semantics
of the IPA sequence. The language model was then combined with an acoustic
model to form a Tibetan speech recognition system, which was compared with
end-to-end Tibetan speech recognition.
1 Introduction
Research on Tibetan language models is still in its infancy [1]. Some work is based on
deep neural networks [2-3], but little of it targets speech recognition. In speech
recognition, deep learning algorithms can achieve end-to-end recognition with words or
phrases as the modeling unit [4-7]. Neural network models outperform traditional models
in speech recognition, but they depend on large volumes of speech data for training to
realize their potential. Tibetan is a minority language with a relatively small population
of speakers in China, and it is mainly divided into three dialects: U-Tsang, Kamba, and
Amdo. The speech data required to train an end-to-end Tibetan speech recognition model
is therefore more difficult to collect than a text corpus. For this reason, Tibetan speech
recognition still uses syllables or phonemes as modeling units, and the combination of an
acoustic model and a language model performs better. This paper presents a language
model for Amdo Tibetan speech recognition: it describes how to transliterate Tibetan
sentences into the corresponding IPA and how to train language models with syllables or
phonemes as modeling units for speech recognition tasks.
*Corresponding author: aiswoboo@gmail.com
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons
Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).
2 Transformer component
The Transformer was originally used in the field of machine translation [8]. Unlike
RNN (Recurrent Neural Network) and CNN (Convolutional Neural Network) structures, it
uses a self-attention mechanism that relates different positions of a sequence in order to
compute a representation of each word in the sequence while processing the sequence in
parallel. Its entire model framework is built from attention mechanisms and feed-forward
neural networks, and the Transformer's training speed and performance are much better
than those of an RNN [9].
2.1 Multi-head attention
First, the dot product of the query sequence with all the key sequences is divided by the
scaling factor $\sqrt{d_k}$, and then a softmax function is applied to obtain the weights on
the value sequence; this computes the scaled dot-product attention. The scaling factor
plays an adjustment role: without it, the dot products grow large and push the softmax
function into a region with extremely small gradients. The output matrix is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$
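Equation (1) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in this paper: the helper names (`matmul`, `transpose`, `softmax`, `scaled_dot_product_attention`) and the toy matrices are assumptions chosen for the example, and matrices are stored as lists of rows.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = len(k[0])                      # key dimension
    scores = matmul(q, transpose(k))     # one row of scores per query
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input (hypothetical values): two positions, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the rows of `V`; the query that matches the first key places more weight on the first value.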
Multi-head attention can be understood as performing scaled dot-product attention
multiple times without sharing parameters: $Q$, $K$, and $V$ are projected through $h$
different linear transformations, the $h$ results are concatenated, and the output is
produced by a final linear mapping. Multi-head attention is computed as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \tag{2}$$

where

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{3}$$

and the projections are the parameter matrices
$W_i^{Q} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$,
$W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$, and $W^{O} \in \mathbb{R}^{h d_v \times d_{model}}$.
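The shapes in Eqs. (2) and (3) are easiest to check with a small sketch. The code below is illustrative only: the random initialisation, the choice $d_k = d_v = d_{model}/h$, and all function and variable names are assumptions made for the example, not details taken from the model trained in this paper.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention, Eq. (1)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(q, k, v, head_params, w_o):
    """head_params holds one (W_q, W_k, W_v) triple per head, Eq. (3)."""
    head_outputs = [attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
                    for w_q, w_k, w_v in head_params]
    # Concatenate the heads along the feature axis, then project with W^O, Eq. (2).
    concat = [[value for head in head_outputs for value in head[t]]
              for t in range(len(q))]
    return matmul(concat, w_o)

def random_matrix(rows, cols, rng):
    """Hypothetical initialiser used only for this example."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, num_heads = 8, 2
d_k = d_v = d_model // num_heads          # assumed split of d_model across heads
x = random_matrix(3, d_model, rng)        # 3 positions, self-attention input
head_params = [(random_matrix(d_model, d_k, rng),
                random_matrix(d_model, d_k, rng),
                random_matrix(d_model, d_v, rng)) for _ in range(num_heads)]
w_o = random_matrix(num_heads * d_v, d_model, rng)
y = multi_head_attention(x, x, x, head_params, w_o)
```

The output `y` has one row per input position and $d_{model}$ columns, so the multi-head layer preserves the shape of its input and layers can be stacked.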
2.2 Feed-forward neural network and positional encoding
The feed-forward neural network consists of two linear transformations with a ReLU
activation in between.
$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \tag{4}$$

Here $x$ represents the input; $W_1$ and $b_1$ represent the parameter matrix and bias
vector of the first linear transformation; $W_2$ and $b_2$ represent the parameter matrix
and bias vector of the second linear transformation.
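As a worked instance of Eq. (4), the following sketch applies the two linear transformations and the ReLU to a single input vector. The function name `ffn` and the numeric weights are hypothetical values chosen so that the arithmetic can be followed by hand.

```python
def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2."""
    # First linear transformation followed by ReLU.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*w1), b1)]
    # Second linear transformation.
    return [sum(hj * w for hj, w in zip(hidden, col)) + b
            for col, b in zip(zip(*w2), b2)]

# Toy parameters (hypothetical): input size 2, hidden size 3, output size 2.
x = [1.0, -2.0]
W1 = [[1.0, -1.0, 0.5],
      [0.5, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [2.0, -1.0]]
b2 = [0.1, 0.2]

result = ffn(x, W1, b1, W2, b2)
```

Here $xW_1 + b_1 = (0, -3, 2.5)$, the ReLU maps it to $(0, 0, 2.5)$, and the second transformation gives approximately $(5.1, -2.3)$.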
The Transformer model does not contain any RNN or CNN structure, yet the position of
each word is closely related to the final output. The position of each word is therefore
encoded so that the model can use sequence order information. The calculation is
expressed as:

$$PE(pos, 2i) = \sin\left(pos / 1000^{2i/d_{model}}\right) \tag{5}$$

$$PE(pos, 2i+1) = \cos\left(pos / 1000^{2i/d_{model}}\right) \tag{6}$$
where pos is the position of the word in the sequence, i indexes the dimensions of the
word vector, and $d_{model}$ is the dimension of the word vector. Encoding position
information with sin and cos allows the encoding to express both the absolute and the
relative position of a word.
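A minimal sketch of the sinusoidal encoding in Eqs. (5) and (6) follows. The function name `positional_encoding` and the `base` parameter are assumptions for illustration. The default of 1000 follows the constant as printed in Eqs. (5) and (6); the original Transformer paper [8] uses 10000, so the base is left as a parameter.

```python
import math

def positional_encoding(num_positions, d_model, base=1000.0):
    """Return a num_positions x d_model table of sinusoidal position codes."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        # two_i runs over the even dimension indices 0, 2, 4, ... (the "2i" of Eq. 5).
        for two_i in range(0, d_model, 2):
            angle = pos / (base ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)          # even dimension, Eq. (5)
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)  # odd dimension, Eq. (6)
    return pe

pe = positional_encoding(num_positions=4, d_model=6)
```

Position 0 is encoded as alternating 0 and 1 values because sin(0) = 0 and cos(0) = 1, and every entry stays within [-1, 1], so the codes can be added to word vectors without changing their scale much.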
3 Tibetan language model
3.1 Tibetan phonetic transcription
For the language model to predict Tibetan sentences from the contextual semantics of
IPA, and then to work together with the acoustic model in speech recognition, we need
to transliterate Tibetan into the corresponding IPA sequence as the input of the
Transformer. Tibetan words are used as the output of the Transformer to train the
model. Tibetan script is a two-dimensional (horizontal and vertical) phonetic script
composed of consonants and vowels. It is composed of 7 basic components according to
strict Tibetan grammar rules. In spelling order, they are the Prefix, Superscript,
Root Consonant, Subscript, Vowel sign, Suffix, and Second Suffix [10]. There is a
many-to-one mapping between Tibetan words and the corresponding phonetic
symbols. Tibetan words are separated by a tsek ‘་’. Usually a Tibetan word is a syllable [11],
consisting of single or multiple consonants and monophones or a combination of
monophones and final consonants [12]. In this paper, the four components of the Tibetan
syllables:Prefix, Su (...truncated)