Analysis on types of spelling errors in true Tibetan characters
MATEC Web of Conferences 336, 06019 (2021)
CSCNS2020
https://doi.org/10.1051/matecconf/202133606019
Maocuo San1,2,3,*, Zhijie Cai1,2,3,4, Rangzhuoma Cai1,2,3,4, and Jizhaxi Dao1,2,3
1 College of Computer Science and Technology, Qinghai Normal University, Qinghai Xining, China
2 Key Laboratory of Tibetan Information Processing, Ministry of Education, Qinghai Xining, China
3 Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province,
Qinghai Xining, China
4 School of Computer Science and Technology, Southwest Minzu University, Sichuan Chengdu
610041, China
Abstract. Spelling error checking is a challenging research topic with a
wide range of applications, such as text editing, word processing, and
teaching. Because Tibetan is written with an alphabetic script, its spelling
errors can be categorized into three types: non-true-character errors,
true-character errors, and punctuation misuse. To study true-character
syllable spelling errors in more depth, this article analyses their types
on the basis of Tibetan word formation rules, grammar, and semantic
features, laying a foundation for research on Tibetan spelling error
checking.
1 Introduction
With the rapid growth of Tibetan text available online, Tibetan spelling error checking has
become an urgent need and has attracted considerable interest in both research and
application. The more detailed and thorough the analysis of spelling error types, the more
effective the design of spell-checking strategies. Analyzing the types of errors in Tibetan
texts and summarizing and categorizing their rules and commonalities are therefore
essential for developing effective spell-checking methods. The spelling of Tibetan text
involves three aspects: non-true characters, true characters, and punctuation. In recent
years, researchers have studied spell checking for Tibetan non-true characters and obtained
many valuable results [1-3]. True-character spell checking is also an important part of
Tibetan text spell checking, and scholars have begun to pay attention to it. Analyzing the
types of true-character spelling errors is the groundwork for true-character spell checking,
but no literature on the subject exists yet, which hinders the development of spell-checking
technology for Tibetan text. This article takes Tibetan word formation rules, grammar, and
semantics as its starting point, analyzes the types of spelling errors in Tibetan true
characters, and provides data support for the study of true-character spell-checking
technology.
*
Corresponding author:
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons
Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).
2 Research status
In 1967, the British linguist Corder [4] proposed the concept of error analysis for the first
time. He systematically analyzed the errors in a collected text corpus and studied their
nature and types, which opened the era of text error analysis. Because language itself is
complex, text errors are of many types and are difficult to classify. To support in-depth
analysis of error types, the Special Interest Group on Natural Language Learning (SIGNLL)
of the Association for Computational Linguistics (ACL) organizes shared tasks at its
Conference on Computational Natural Language Learning (CoNLL). The goal of the
CoNLL-2014 shared task [5] was to automatically detect all types of grammatical errors in
short English texts written by non-native English speakers and return the corrected text.
Inspired by this shared task for English, much research on the analysis of error types has
been carried out in China, and the field has received extensive attention from researchers.
The International Conference on Natural Language Processing and Chinese Computing
(NLPCC) added a Chinese grammatical error correction task whose goal is to detect and
correct grammatical errors in Chinese sentences written by non-native Chinese speakers [6].
At the NLPCC 2018 evaluation, six teams from Alibaba, Peking University, and other
institutions achieved good results. In 2018, Tan et al. analyzed five types of errors that ESL
learners often make (noun number errors, verb form errors, subject-predicate disagreement
errors, article errors, and preposition errors) and proposed a grammatical error correction
method based on LSTM and n-grams [7]. In 2020, Liang et al. classified and analyzed the
spelling errors of English learners and designed an automatic spell-checking system for the
corresponding error types [8].
Since the beginning of the 21st century, scholars have begun to analyze Tibetan spelling
errors, focusing mainly on the types of non-true-character errors. In 2009, Dorje Dolma
elaborated on the diversity of spelling errors in Tibetan texts and used an n-gram model to
check Tibetan syllables [9]. In 2011, Guan Bai analyzed the types of errors in Tibetan
characters and designed a corresponding method for proofreading Tibetan syllables [10]. In
2013, Zhu Jie et al. defined five types of Tibetan text errors and, on that basis, discussed
the spell checking of Tibetan syllables, the error checking of Sanskrit transliterations, the
checking of continuous relations, and the error checking of Tibetan words in a text
proofreading system [2]. In 2017, Liu et al. counted the types of non-true-character spelling
errors in a corpus of more than 90 million syllables from Tibetan web pages according to
predetermined rules and analyzed the causes of the errors [3]. Analyzing the types of
true-character spelling errors is the groundwork for true-character spell checking, but there
is no relevant literature yet. This article takes Tibetan word formation rules, grammar, and
semantics as its starting point, analyzes the types of spelling errors in Tibetan true
characters, and provides data support for the study of true-character spell-checking
technology.
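The difference between the two syllable-level error classes discussed above can be illustrated with a minimal Python sketch. It is not the method of [9] or of any other cited work. The toy lexicon, the toy corpus, the function names, and the labels are all hypothetical. The sketch assumes that a non-true-character error is a syllable absent from a list of legal syllables, and that a true-character error is a legal syllable in an unattested context, approximated here by unseen syllable bigrams.

```python
from collections import Counter

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (syllable delimiter)
SHAD = "\u0f0d"   # Tibetan clause-final mark

# Toy data (assumption): three legal syllables, written with escapes.
BOD = "\u0f56\u0f7c\u0f51"   # "bod"
YIG = "\u0f61\u0f72\u0f42"   # "yig"
SKAD = "\u0f66\u0f90\u0f51"  # "skad"


def split_syllables(text):
    """Split Tibetan text into syllables on the tsheg, ignoring shad marks."""
    cleaned = text.replace(SHAD, TSHEG)
    return [s for s in cleaned.split(TSHEG) if s.strip()]


def train_bigrams(corpus_lines):
    """Count adjacent syllable pairs in a (trusted) corpus."""
    counts = Counter()
    for line in corpus_lines:
        syls = split_syllables(line)
        counts.update(zip(syls, syls[1:]))
    return counts


def classify(text, valid_syllables, bigram_counts):
    """Label each syllable as 'non-true', 'suspect-true', or 'ok'."""
    syls = split_syllables(text)
    labels = []
    for i, syl in enumerate(syls):
        if syl not in valid_syllables:
            # Not a legal syllable at all: non-true-character error.
            labels.append((syl, "non-true"))
            continue
        # Collect bigrams with neighbours that are themselves legal.
        pairs = []
        if i > 0 and syls[i - 1] in valid_syllables:
            pairs.append((syls[i - 1], syl))
        if i + 1 < len(syls) and syls[i + 1] in valid_syllables:
            pairs.append((syl, syls[i + 1]))
        attested = any(bigram_counts[p] > 0 for p in pairs)
        # Legal syllable, but no attested context: candidate true-character error.
        suspect = bool(pairs) and not attested
        labels.append((syl, "suspect-true" if suspect else "ok"))
    return labels


valid = {BOD, YIG, SKAD}
counts = train_bigrams([BOD + TSHEG + YIG, BOD + TSHEG + SKAD])
print(classify(YIG + TSHEG + BOD + SHAD, valid, counts))
```

A lexicon lookup alone cannot flag the swapped pair in the example, because both syllables are legal; only the context check does. A real checker would need the word formation, grammatical, and semantic rules that this article sets out to analyze, not just bigram counts.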
3 Types of spelling errors in true Tibetan characters
3.1 Classification of spelling errors in Tibetan text
In Tibetan, letters form syllables, syllables form words, words form phrases, and phrases
form sentences. Spelling errors can therefore occur at the letter level, the word level, the
grammatical level, and the semantic level, as well as in punctuation. Non-true character
errors refer to
Tibetan typos that do not conform to the Tibetan grammar. For example (...truncated)