Use of prompt-based learning for code-mixed and code-switched text classification
World Wide Web
(2024) 27:63
https://doi.org/10.1007/s11280-024-01302-2
Pasindu Udawatta1 · Indunil Udayangana1 · Chathulanka Gamage1 ·
Ravi Shekhar2 · Surangika Ranathunga3
Received: 10 April 2024 / Revised: 18 August 2024 / Accepted: 20 August 2024
© The Author(s) 2024
Abstract
Code-mixing and code-switching (CMCS) are prevalent phenomena observed in social media
conversations and various other modes of communication. When developing applications
such as sentiment analysers and hate-speech detectors that operate on this social media data,
CMCS text poses challenges. Recent studies have demonstrated that prompt-based learning
of pre-trained language models outperforms full fine-tuning across various tasks. Despite the
growing interest in classifying CMCS text, the effectiveness of prompt-based learning for
the task remains unexplored. This paper presents an extensive exploration of prompt-based
learning for CMCS text classification and the first comprehensive analysis of the impact of the
script on classifying CMCS text. Our study reveals that the performance in classifying CMCS
text is significantly influenced by the inclusion of multiple scripts and the intensity of code-mixing. In response, we introduce a novel method, Dynamic+AdapterPrompt, which employs
distinct models for each script, integrated with adapters. While DynamicPrompt captures
the script-specific representation of the text, AdapterPrompt emphasizes capturing the task-oriented functionality. Our experiments on Sinhala-English, Kannada-English, and Hindi-English datasets for sentiment classification, hate-speech detection, and humour detection
tasks show that our method outperforms strong fine-tuning baselines and basic prompting
strategies.
Keywords Code-mixing · Code-switching · Prompt-based learning · Pre-trained language
models · XLM-R · Text classification · Language script · Adapters · Sinhala · Kannada ·
Hindi
1 Introduction
Code-mixing involves borrowing words from one language and incorporating them into
another without affecting the context [1, 2]. Code-switching, or language alternation, occurs
when individuals alternate between two or more languages within a single conversation or
situation [3]. In the context of code-mixed and code-switched (CMCS) text, we distinguish
two subtypes: (1) text comprising words that alternate between two languages, and (2) text
transitioning from one script to another by substituting letters in a predictable manner, known
as transliteration [4].
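The distinction between the two subtypes can be made concrete with a minimal sketch that labels each token by its dominant Unicode script. This is an illustration only and not the method used in this paper: the helper names `token_script` and `script_profile` are hypothetical, and the example sentence is an assumed romanized Sinhala phrase. The sketch identifies the script of a token, not its language, so a transliterated token written in Latin letters is reported as Latin.

```python
import unicodedata
from collections import Counter


def token_script(token: str) -> str:
    """Return the dominant script of a token, taken from Unicode character names.

    The first word of a character's Unicode name (e.g. 'LATIN', 'SINHALA',
    'KANNADA', 'DEVANAGARI') is used as a rough proxy for its script.
    """
    counts = Counter()
    for ch in token:
        # Count letters and combining marks (vowel signs are marks, not letters).
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "OTHER"


def script_profile(text: str) -> dict:
    """Count whitespace-separated tokens per script."""
    return dict(Counter(token_script(tok) for tok in text.split()))


# Assumed example: two romanized tokens around one token in Sinhala script.
example = "mama \u0d9c\u0dd9\u0daf\u0dbb yanawa"
print(script_profile(example))
```

A profile with more than one script indicates a mixed-script input. Because romanized tokens are indistinguishable from English at the script level, such a profile cannot by itself resolve the misidentification of transliterated tokens.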
Code-mixing and code-switching are intricate phenomena of linguistic behaviour, characterized by the intentional or spontaneous alternation of languages within a single discourse.
Another characteristic of CMCS data is lexical borrowing, where words or phrases from
one language are used in another. Grammatical hybridity [5], a distinct feature of CMCS,
results in blending grammatical structures from different languages. Furthermore, CMCS
is influenced by linguistic, social, and cultural constraints, leading to a specific contextual
framework.
CMCS is commonly observed in online conversations. A thorough understanding of
CMCS data is pivotal for effective communication, advertising, sentiment analysis, and
fostering inclusivity across language boundaries. However, the inherent characteristics of
CMCS data introduce unique challenges to Natural Language Processing (NLP) systems. In particular, the inclusion of multiple scripts and lexical patterns and the potential misidentification of transliterated tokens
pose challenges even to modern NLP systems when processing such text. These challenges are particularly pronounced when working with low-resource
languages [6, 7].
In recent years, the domain of NLP has witnessed remarkable advancements, notably propelled by the emergence of pre-trained language models (PLMs) [8, 9]. These PLMs have
been trained on extensive datasets, preserving a task-agnostic stance regarding the specific
tasks for which they will be later used. To leverage the extensive knowledge embedded in
PLMs for diverse NLP tasks, the PLM has to be fine-tuned with task-specific data [10]. This
“pre-train and fine-tune” paradigm has been able to activate and harness the comprehensive
knowledge within PLMs, leading to very promising results across various downstream tasks
such as text classification and named entity recognition [10, 11]. However, this
paradigm suffers from the disparity between pre-training and fine-tuning objectives, which leads to inefficient use of PLMs across diverse tasks: fine-tuned models may be unstable
in low-resource settings and less transferable to new tasks [10–13].
Prompt-based learning has recently been demonstrated to yield promising results compared to full fine-tuning of PLMs for many downstream tasks [13], even in low-resource
scenarios [14]. This paradigm involves redefining downstream tasks using textual prompts,
encompassing both prompt engineering and answer engineering [11]. In contrast to fine-tuning, prompt-based learning leverages the existing knowledge of PLMs by redefining
downstream tasks as pre-training objectives [10, 11, 15]. This removes the need for extensive parameter updates in PLMs, thus preserving their transferability across various tasks.
Prompt-based learning has been extended to incorporate pre-trained multilingual language
models (PMLMs) as well, enabling experimentation in languages beyond English [16–18].
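As a minimal sketch of what prompt engineering and answer engineering involve, the snippet below wraps an input in a cloze-style template and maps the words predicted at the masked position back to class labels. It is illustrative only: the template wording, the label words in `VERBALIZER`, and the probabilities passed to `classify` are assumptions rather than the prompts or outputs used in this paper, and no model is called. The `<mask>` string is the mask token of XLM-R; other PLMs use different tokens.

```python
MASK = "<mask>"  # mask token of XLM-R; other PLMs use a different string

# Answer engineering: hypothetical label words for each class.
VERBALIZER = {
    "positive": ["good", "great"],
    "negative": ["bad", "terrible"],
}


def build_prompt(text: str, template: str = "{text} The sentiment is {mask}.") -> str:
    """Prompt engineering: recast classification as a masked-word prediction."""
    return template.format(text=text, mask=MASK)


def classify(mask_scores: dict, verbalizer: dict = VERBALIZER) -> str:
    """Pick the label whose label words get the highest mean score.

    `mask_scores` maps candidate words to the probability a masked language
    model assigns them at the masked position (made-up values in this sketch).
    """
    label_scores = {
        label: sum(mask_scores.get(word, 0.0) for word in words) / len(words)
        for label, words in verbalizer.items()
    }
    return max(label_scores, key=label_scores.get)


prompt = build_prompt("nice movie")
# In practice the scores would come from a PLM's prediction for the masked slot.
label = classify({"good": 0.4, "great": 0.1, "bad": 0.05})
```

Because the prediction reuses the masked-language-modelling objective, the mapping from text to label requires no new classification head. For CMCS input, the open question is which language and script the template and label words should be written in.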
Existing research on CMCS text classification mainly focuses on the full fine-tuning of
PMLMs for downstream tasks [6, 19]. On the other hand, while prompt-based learning has
shown success over full fine-tuning for monolingual text, its application to CMCS data has
not been explored. Given that prompt-based learning relies on textual prompts, designing
effective prompts for CMCS text remains an open question. In other words, a prompt formulated in one language might not be suitable for effectively classifying CMCS data. The
absence of multilingual prompts poses a challenge in inducing knowledge from PMLMs
effectively, and the potential misidentification of transliterated tokens adds further complexity to accurate classification. These challenges are even more pronounced for low-resource
languages. Therefore, addressing these unique challenges is crucial for advancing CMCS
text classification through prompt-based learning.
In this study, we focus on prompt-based learning for CMCS text classification. To the
best of our knowledge, we are the first to explore prompt-based learning for
CMCS text classification. Therefore, we first delve into the challenges surrounding CMCS
text classification and the intricacies introduc (...truncated)