Analysis on types of spelling errors in true Tibetan characters

MATEC Web of Conferences, Jan 2021

Spelling error checking is a challenging research topic with a wide range of applications, such as text editing, word processing, spell checking, and teaching. Because Tibetan is an alphabetic language, its spelling errors can be categorized into three types: non-true character errors, true character errors, and punctuation misuse. To study true Tibetan syllable spelling errors in more depth, the article analyses their types based on Tibetan word formation rules, grammar, and semantic features, laying a foundation for research on Tibetan spelling error checking.

MATEC Web of Conferences 336, 06019 (2021) CSCNS2020
https://doi.org/10.1051/matecconf/202133606019

Maocuo San1,2,3,*, Zhijie Cai1,2,3,4, Rangzhuoma Cai1,2,3,4, and Jizhaxi Dao1,2,3

1 College of Computer Science and Technology, Qinghai Normal University, Qinghai Xining, China
2 Key Laboratory of Tibetan Information Processing, Ministry of Education, Qinghai Xining, China
3 Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province, Qinghai Xining, China
4 School of Computer Science and Technology, Southwest Minzu University, Sichuan Chengdu 610041, China

1 Introduction

With the rapid growth in the amount of Tibetan text available online, Tibetan spelling error checking has become an urgent need and has attracted considerable interest in both research and applications. The more detailed and thorough the analysis of spelling error types, the more effective the design of spell-checking strategies. Analyzing the types of errors in Tibetan texts, and summarizing and categorizing their rules and commonalities, is therefore essential for developing in-depth and effective spell-checking methods. The spelling of Tibetan text involves three aspects: non-true characters, true characters, and punctuation.
In recent years, researchers have studied spelling checking for Tibetan non-true characters and obtained many valuable results [1-3]. True-character spelling checking is also an important part of Tibetan text spelling checking, and scholars have begun to pay attention to it. Analyzing the error types is the basic work of true-character spelling checking, but no related literature exists, which hinders the development of spell-checking technology for Tibetan text. This article takes Tibetan word formation rules, grammar, and semantics as the starting point, analyzes the types of spelling errors in Tibetan true characters, and provides data support for the study of Tibetan true-character spelling checking technology.

* Corresponding author:
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).

2 Research status

In 1967, the British linguist Corder [4] proposed the concept of error analysis for the first time. He systematically analyzed the errors in collected text corpora and studied their nature and types, which opened the era of text error analysis. Because of the complexity of language itself, text errors come in many types and are difficult to analyze. To support in-depth analysis of error types, the Special Interest Group on Natural Language Learning (SIGNLL) of the Association for Computational Linguistics (ACL) has organized shared tasks on error detection and correction at its conference, CoNLL.
The goal of the CoNLL-2014 shared task [5] was to automatically detect all types of grammatical errors in short English texts written by non-native English speakers and return the corrected text. Inspired by this shared task, much research on the analysis of spell-checking error types has been carried out in China, and the field has received extensive attention from researchers. The International Conference on Natural Language Processing and Chinese Computing (NLPCC) added a Chinese grammatical error correction task whose goal is to detect and correct grammatical errors in Chinese sentences written by non-native Chinese speakers [6]. In the NLPCC 2018 evaluation, six teams from Alibaba, Peking University, and other institutions achieved good results. In 2018, Tan et al. analyzed five types of errors that ESL learners often make (noun singular and plural errors, verb form errors, subject-predicate inconsistency errors, article errors, and preposition errors) and proposed a grammatical error correction method based on LSTM and N-gram models [7]. In 2020, Liang et al. classified and analyzed the spelling errors of English learners and designed an automatic spelling check system for the corresponding types [8].

Since the beginning of the 21st century, scholars have begun to analyze Tibetan spelling errors, focusing mainly on the types of non-true character spelling errors. In 2009, Dorje Dolma elaborated on the diversity of spelling errors in Tibetan texts and used an n-gram model to check Tibetan syllables [9]. In 2011, Guan Bai analyzed the types of errors in Tibetan characters and designed a corresponding method for proofreading Tibetan syllables [10]. In 2013, Zhu Jie et al., working from five defined types of Tibetan text errors, discussed the spelling check of Tibetan syllables, the error check of Sanskrit transliteration, the check of continuous relations, and the error check of Tibetan words in a text proofreading system [2]. In 2017, Liu et al. counted the types of non-true character spelling errors in a corpus of more than 90 million syllables from Tibetan web pages according to predetermined rules, and analyzed the causes of those errors [3].

Analyzing the error types is the basic work of true-character spelling checking, but there is no relevant literature yet. This article takes Tibetan word formation rules, grammar, and semantics as the starting point, analyzes the types of spelling errors in Tibetan true characters, and provides data support for the study of Tibetan true-character spelling checking technology.

3 Types of spelling errors in true Tibetan characters

3.1 Classification of spelling errors in Tibetan text

In Tibetan, letters form syllables, syllables form words, words form phrases, and phrases form sentences. Spelling errors can therefore occur at the letter level, the word level, the grammatical level, the semantic level, and in punctuation. Non-true character errors are Tibetan typos that do not conform to Tibetan grammar. For example (...truncated)
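
As a rough illustration of the distinction drawn in Section 3.1, the Python sketch below splits text into syllables at the tsheg (U+0F0B) and shad (U+0F0D) marks and labels each syllable against a toy lexicon. The lexicon, the names VALID_SYLLABLES, split_syllables, label_syllables, and has_repeated_tsheg, and the repeated-tsheg punctuation check are all assumptions made for illustration, not the authors' method. A syllable missing from the lexicon is flagged as a non-true candidate. A syllable found in the lexicon is only well-formed: it may still be a true-character error, which a lookup alone cannot detect.

```python
import re
from enum import Enum

TSHEG = "\u0f0b"  # Tibetan syllable delimiter
SHAD = "\u0f0d"   # Tibetan clause/sentence-ending mark


class Label(Enum):
    NON_TRUE = "non-true candidate"  # not a known well-formed syllable
    WELL_FORMED = "well-formed"      # may still be a true-character error


# Hypothetical toy lexicon of well-formed syllables (assumption):
# "bod" and "yig", written with escapes to avoid encoding problems.
VALID_SYLLABLES = {"\u0f56\u0f7c\u0f51", "\u0f61\u0f72\u0f42"}

# Split on runs of tsheg, shad, or whitespace.
_DELIMS = re.compile("[" + TSHEG + SHAD + r"\s]+")


def split_syllables(text):
    """Return the non-empty syllables of a Tibetan string."""
    return [s for s in _DELIMS.split(text) if s]


def label_syllables(text, valid=VALID_SYLLABLES):
    """Label each syllable by lexicon membership only (no context)."""
    return [
        (s, Label.WELL_FORMED if s in valid else Label.NON_TRUE)
        for s in split_syllables(text)
    ]


def has_repeated_tsheg(text):
    """Assumed punctuation check: two tsheg marks in a row."""
    return TSHEG * 2 in text


if __name__ == "__main__":
    # "bod", "yig", then an ill-formed syllable, closed by a shad.
    sample = ("\u0f56\u0f7c\u0f51" + TSHEG + "\u0f61\u0f72\u0f42"
              + TSHEG + "\u0f40\u0f40\u0f40" + SHAD)
    for syllable, label in label_syllables(sample):
        print(syllable, label.value)
```

Detecting true-character errors would require context beyond this lookup, for example word-level, grammatical, or semantic information of the kind the article analyses.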


This is a preview of a remote PDF: https://www.matec-conferences.org/articles/matecconf/pdf/2021/05/matecconf_cscns20_06019.pdf
Article home page: https://doaj.org/article/d1e05e84ef6446bca9d0da1f5bbf7a84

San Maocuo, Cai Zhijie, Cai Rangzhuoma, Dao Jizhaxi. Analysis on types of spelling errors in true Tibetan characters, MATEC Web of Conferences, 2021, pp. 06019, Issue 336, DOI: 10.1051/matecconf/202133606019