Use of prompt-based learning for code-mixed and code-switched text classification

World Wide Web (2024) 27:63
https://doi.org/10.1007/s11280-024-01302-2

Pasindu Udawatta¹ · Indunil Udayangana¹ · Chathulanka Gamage¹ · Ravi Shekhar² · Surangika Ranathunga³

Received: 10 April 2024 / Revised: 18 August 2024 / Accepted: 20 August 2024
© The Author(s) 2024

Abstract

Code-mixing and code-switching (CMCS) are prevalent phenomena observed in social media conversations and various other modes of communication. When developing applications such as sentiment analysers and hate-speech detectors that operate on this social media data, CMCS text poses challenges. Recent studies have demonstrated that prompt-based learning of pre-trained language models outperforms full fine-tuning across various tasks. Despite the growing interest in classifying CMCS text, the effectiveness of prompt-based learning for the task remains unexplored. This paper presents an extensive exploration of prompt-based learning for CMCS text classification and the first comprehensive analysis of the impact of the script on classifying CMCS text. Our study reveals that the performance in classifying CMCS text is significantly influenced by the inclusion of multiple scripts and the intensity of code-mixing. In response, we introduce a novel method, Dynamic+AdapterPrompt, which employs distinct models for each script, integrated with adapters. While DynamicPrompt captures the script-specific representation of the text, AdapterPrompt emphasizes capturing the task-oriented functionality. Our experiments on Sinhala-English, Kannada-English, and Hindi-English datasets for sentiment classification, hate-speech detection, and humour detection tasks show that our method outperforms strong fine-tuning baselines and basic prompting strategies.
Keywords Code-mixing · Code-switching · Prompt-based learning · Pre-trained language models · XLM-R · Text classification · Language script · Adapters · Sinhala · Kannada · Hindi

1 Introduction

Code-mixing involves borrowing words from one language and incorporating them into another without affecting the context [1, 2]. Code-switching, or language alternation, occurs when individuals alternate between two or more languages within a single conversation or situation [3]. In the context of code-mixed and code-switched (CMCS) text, we distinguish two subtypes: (1) text comprising words that alternate between two languages, and (2) text transitioning from one script to another by substituting letters in a predictable manner, known as transliteration [4].

Code-mixing and code-switching are intricate phenomena of linguistic behaviour, characterized by the intentional or spontaneous alternation of languages within a single discourse. Another characteristic of CMCS data is lexical borrowing, where words or phrases from one language are used in another. Grammatical hybridity [5], a distinct feature of CMCS, results in blending grammatical structures from different languages. Furthermore, CMCS is influenced by linguistic, social, and cultural constraints, leading to a specific contextual framework.

CMCS is commonly observed in online conversations. A thorough understanding of CMCS data is pivotal for effective communication, advertising, sentiment analysis, and fostering inclusivity across language boundaries. However, the inherent characteristics of CMCS data introduce unique challenges to Natural Language Processing (NLP) systems. In particular, the inclusion of multiple scripts and lexical patterns, together with the potential misidentification of transliterated tokens, poses challenges even to modern NLP systems when processing such text.
These challenges are particularly pronounced when working with low-resource languages [6, 7].

In recent years, the domain of NLP has witnessed remarkable advancements, notably propelled by the emergence of pre-trained language models (PLMs) [8, 9]. These PLMs are trained on extensive datasets and remain agnostic to the specific tasks for which they will later be used. To leverage the extensive knowledge embedded in PLMs for diverse NLP tasks, the PLM has to be fine-tuned with task-specific data [10]. This "pre-train and fine-tune" paradigm has been able to activate and harness the comprehensive knowledge within PLMs, leading to very promising results across various downstream tasks such as text classification and named entity recognition [10, 11]. However, the paradigm suffers from the disparity between pre-training and fine-tuning objectives, which makes the use of PLMs across diverse tasks inefficient: fine-tuned models may be unstable in low-resource settings and less transferable to new tasks [10–13].

Prompt-based learning has recently been demonstrated to yield promising results compared to full fine-tuning of PLMs for many downstream tasks [13], even in low-resource scenarios [14]. This paradigm involves redefining downstream tasks using textual prompts, encompassing both prompt engineering and answer engineering [11]. In contrast to fine-tuning, prompt-based learning leverages the existing knowledge of PLMs by redefining downstream tasks as pre-training objectives [10, 11, 15]. This removes the need for extensive parameter updates in PLMs, thus preserving their transferability across various tasks. Prompt-based learning has been extended to incorporate pre-trained multilingual language models (PMLMs) as well, enabling experimentation in languages beyond English [16–18].

Existing research on CMCS text classification mainly focuses on the full fine-tuning of PMLMs for downstream tasks [6, 19].
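The reformulation of a classification task as a pre-training objective, described above, can be illustrated with a minimal sketch. It is not the method proposed in this paper. The template wording, the label words, the helper names (`build_prompt`, `classify`, `toy_scorer`), and the `<mask>` token string are all assumptions made for illustration, and the masked language model is replaced by a toy scoring function so that the example runs without any model.

```python
from typing import Callable, Dict, List

# Prompt engineering (hypothetical template): the input is wrapped in a cloze
# sentence so that predicting the label becomes a masked-token prediction.
TEMPLATE = "{text} The sentiment of this text is {mask}."

# Answer engineering (hypothetical verbalizer): label words that the masked
# language model can predict, grouped by the class they stand for.
VERBALIZER: Dict[str, List[str]] = {
    "positive": ["good", "great"],
    "negative": ["bad", "terrible"],
}

# A scorer takes a prompt and candidate words, and returns a score per word.
# In practice this role would be played by a masked language model.
MaskScorer = Callable[[str, List[str]], Dict[str, float]]


def build_prompt(text: str, mask_token: str = "<mask>") -> str:
    """Wrap the raw input in the cloze template."""
    return TEMPLATE.format(text=text, mask=mask_token)


def classify(text: str, score_mask: MaskScorer, mask_token: str = "<mask>") -> str:
    """Pick the class whose label words score highest at the mask position."""
    prompt = build_prompt(text, mask_token)
    candidates = [word for words in VERBALIZER.values() for word in words]
    word_scores = score_mask(prompt, candidates)
    # Average the scores of the label words belonging to each class.
    class_scores = {
        label: sum(word_scores[w] for w in words) / len(words)
        for label, words in VERBALIZER.items()
    }
    return max(class_scores, key=class_scores.get)


def toy_scorer(prompt: str, candidates: List[str]) -> Dict[str, float]:
    """Stand-in for a masked language model (an assumption, not a real model).

    It favours the positive label words when the prompt contains "love",
    and the negative label words otherwise.
    """
    cue = 1.0 if "love" in prompt.lower() else 0.0
    positive_words = set(VERBALIZER["positive"])
    return {w: (cue if w in positive_words else 1.0 - cue) for w in candidates}
```

Calling `classify("I love this film", toy_scorer)` routes the input through the template and the verbalizer. Swapping `toy_scorer` for a function backed by a multilingual masked language model would leave the rest of the sketch unchanged, which is the sense in which the model's parameters need no task-specific classification head.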
On the other hand, while prompt-based learning has shown success over full fine-tuning for monolingual text, its application to CMCS data has not been explored. Given that prompt-based learning relies on textual prompts, designing effective prompts for CMCS text remains an open question. In particular, a prompt formulated in one language might not be suitable for effectively classifying CMCS data. The absence of multilingual prompts poses a challenge in inducing knowledge from PMLMs effectively, and the potential misidentification of transliterated tokens adds further complexity to accurate classification. These challenges are even more pronounced for low-resource languages. Therefore, addressing these unique challenges is crucial for advancing CMCS text classification through prompt-based learning.

In this study, we focus on prompt-based learning for CMCS text classification. To the best of our knowledge, we are the first to explore prompt-based learning for CMCS text classification. Therefore, we first delve into the challenges surrounding CMCS text classification and the intricacies introduc (...truncated)