A language model for Amdo Tibetan speech recognition

MATEC Web of Conferences 336, 06016 (2021), CSCNS2020
https://doi.org/10.1051/matecconf/202133606016

Taiben Suan (1,3,4,5,*), Rangzhuoma Cai (1,2,4,5), Zhijie Cai (1,2,4,5), Ba Zu (1,4,5), and Baojia Gong (1,4,5)

(1) College of Computer Science and Technology, Qinghai Normal University, Xining, Qinghai 810016, China
(2) School of Computer Science and Technology, Southwest Minzu University, Chengdu, Sichuan 610041, China
(3) Xinlong County Meteorological Bureau, Xinlong County, Sichuan 626800, China
(4) Qinghai Provincial Key Laboratory of Tibetan Information Processing and Machine Translation, Xining, Qinghai 810008, China
(5) Key Laboratory of Tibetan Information Processing, Ministry of Education, Xining, Qinghai 810008, China

Abstract. We built a language model based on the Transformer network architecture, which uses attention mechanisms and dispenses with recurrence and convolutions entirely. Tibetan text was transliterated into the International Phonetic Alphabet (IPA), and the language model was trained with the syllables and phonemes of Tibetan words as modeling units, so that it predicts the corresponding Tibetan sentences from the contextual semantics of the IPA sequence. Combined with an acoustic model, it forms a Tibetan speech recognition system, which was compared with end-to-end Tibetan speech recognition.

1 Introduction

Research on Tibetan language models is still in its infancy [1]. There is some research based on deep neural networks [2-3], but little of it targets speech recognition. In speech recognition, deep learning algorithms can achieve end-to-end recognition with words or phrases as the modeling unit [4-7]. Neural network models outperform traditional models in speech recognition, but they depend on large volumes of speech data for training to realize their potential. Tibetan is a minority language in China with a relatively small number of speakers.
It is mainly divided into three dialects: U-Tsang, Kamba, and Amdo. Consequently, the speech data required to train an end-to-end Tibetan speech recognition model is more difficult to collect than a corpus of text data. Tibetan speech recognition therefore still uses syllables or phonemes as modeling units, and the combination of an acoustic model and a language model performs better. This paper presents a language model for Amdo Tibetan speech recognition: how to transliterate Tibetan sentences into the corresponding IPA, and how to train language models with syllables or phonemes as modeling units for speech recognition tasks.

* Corresponding author: aiswoboo@gmail.com

© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).

2 Transformer component

The Transformer was originally used in the field of machine translation [8]. Unlike RNN (Recurrent Neural Network) and CNN (Convolutional Neural Network) structures, it uses a self-attention mechanism that relates different positions of a sequence in order to compute a representation of each word in the sequence, while processing the sequence in parallel. Its model framework is built entirely from attention mechanisms and feed-forward neural networks, and the Transformer's training speed and performance are much better than those of an RNN [9].

2.1 Multi-head attention

First, the dot product of the query with all keys is divided by the scaling factor $\sqrt{d_k}$, and then a softmax function is applied to obtain the weights on the values; this computes the scaled dot-product attention. The scaling factor keeps the dot products from growing large, which would otherwise push the softmax function into a region with extremely small gradients.
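The scaled dot-product step described above can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `softmax` and `scaled_dot_product_attention` and the list-of-lists matrix layout are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Hypothetical helper: Q is n x d_k, K is m x d_k, V is m x d_v.

    Matrices are lists of rows. Returns an n x d_v list of rows.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)  # the scaling factor sqrt(d_k)
    d_v = len(V[0])
    output = []
    for q in Q:
        # Dot product of this query with every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        # Softmax turns the scores into weights over the values.
        weights = softmax(scores)
        # Weighted sum of the value rows.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        )
    return output


# Two identical keys receive equal weight, so the output is the mean of the values.
result = scaled_dot_product_attention(
    [[1.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 4.0]],
)
print(result)  # [[1.0, 2.0]]
```

Dividing by `scale` before the softmax is the only difference from plain dot-product attention; without it, scores grow with `d_k` and the softmax saturates.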
In matrix form, the output of scaled dot-product attention is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$

Multi-head attention can be understood as performing scaled dot-product attention multiple times without sharing parameters: $Q$, $K$, and $V$ are projected through $h$ different linear transformations, the results are concatenated, and the final output is produced through a linear mapping. Multi-head attention is computed as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O} \tag{2}$$

where

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{3}$$

and the projections are the parameter matrices $W_i^{Q} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$, and $W^{O} \in \mathbb{R}^{h d_v \times d_{model}}$.

2.2 Feed-forward neural network and positional encoding

The feed-forward neural network consists of two linear transformations with a ReLU activation in between:

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \tag{4}$$

Here $x$ is the input, $W_1$ and $b_1$ are the parameter matrix and bias vector of the first linear transformation, and $W_2$ and $b_2$ are the parameter matrix and bias vector of the second linear transformation.

The Transformer model does not contain any RNN or CNN structure, but the position of each word is closely related to the final output. The position of each word is therefore encoded so that the model can use sequence order information. The encoding is computed as:

$$PE(pos, 2i) = \sin\left(\frac{pos}{1000^{2i/d_{model}}}\right) \tag{5}$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{1000^{2i/d_{model}}}\right) \tag{6}$$

Here $pos$ is the position of the word in the sequence, $i$ represents the $i$-th dimension of the word vector, and $d_{model}$ is the dimension of the word vector. Position information is encoded with sine and cosine functions; this encoding can express both the absolute and the relative position of a word.
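The sine and cosine position encoding above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `positional_encoding` and its `base` parameter are assumptions for this example. `base` defaults to the constant 1000 written in the formulas above; the original Transformer formulation uses 10000, so the value is left configurable.

```python
import math


def positional_encoding(max_len, d_model, base=1000.0):
    """Hypothetical helper: returns a max_len x d_model table of encodings.

    Even columns (2i) hold sin(pos / base^(2i / d_model));
    odd columns (2i + 1) hold the cosine of the same angle.
    """
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for k in range(0, d_model, 2):  # k plays the role of 2i
            angle = pos / (base ** (k / d_model))
            table[pos][k] = math.sin(angle)
            if k + 1 < d_model:  # guard for an odd d_model
                table[pos][k + 1] = math.cos(angle)
    return table


pe = positional_encoding(max_len=3, d_model=4)
print(pe[0])  # position 0 encodes as [0.0, 1.0, 0.0, 1.0]
```

Each pair of columns shares one frequency, so the encoding of position `pos + k` is a fixed rotation of the encoding of `pos`; this is what lets the encoding express relative as well as absolute position.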
3 Tibetan language model

3.1 Tibetan phonetic transcription

For the language model to predict Tibetan sentences from the contextual semantics of IPA, and then work together with the acoustic model in speech recognition, Tibetan must be transliterated into the corresponding IPA sequence, which serves as the input of the Transformer. Tibetan words are used as the output of the Transformer to train the model.

Tibetan script is a two-dimensional phonetic script, written both horizontally and vertically, composed of consonants and vowels. It is composed of 7 basic components according to strict Tibetan grammar rules. In spelling order, they are Prefix, Superscript, Root Consonant, Subscript, Vowel sign, Suffix, and Second Suffix [10]. There is a many-to-one mapping relationship between Tibetan words and the corresponding phonetic symbols. Tibetan words are separated by a tsek ‘་’. Usually a Tibetan word is a syllable [11], consisting of single or multiple consonants and monophones, or a combination of monophones and final consonants [12]. In this paper, the four components of the Tibetan syllables: Prefix, Su (...truncated)