Evaluation of Indonesian Language Stemmer Algorithms: A Comparative Analysis


E-ISSN: 2807-9035
Volume 5, Number 1, May 2025
https://doi.org/10.47709/brilliance.v5i1.5679

Fitrah Rumaisa¹*
¹Universitas Widyatama, Jl Cikutra 204 A, Bandung, 40125, Indonesia
*Corresponding Author

Article History: Submitted: 13-03-2025; Accepted: 21-03-2025; Published: 28-03-2025

Keywords: Indonesian; stemming algorithms; text analysis; natural language processing; review.

Brilliance: Research of Artificial Intelligence is licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

ABSTRACT

Bahasa Indonesia, with its rich linguistic structure and agglutinative morphology, presents significant challenges for natural language processing, particularly in stemming. Stemming is a crucial process in text analysis, aimed at reducing words to their root forms for better information retrieval and text classification. This study evaluates and compares several stemming algorithms developed specifically for the Indonesian language, including the Nazief and Adriani stemmer, Asian’s algorithm, Arifin and Setiono’s method, and the Enhanced Confix Stripping (ECS) stemmer. The research examines these algorithms based on accuracy, processing speed, and efficiency in handling affixes. The findings indicate that the ECS stemmer performs best in terms of accuracy, effectively handling complex affixation structures. The Nazief and Adriani algorithm follows closely, demonstrating robust affix removal but at a slower processing speed. Meanwhile, Arifin and Setiono’s algorithm provides a balance between accuracy and computational efficiency, while Asian’s algorithm, despite its statistical approach, is limited by its reliance on large training corpora. This study highlights the need for continuous refinement of Indonesian stemming algorithms to accommodate linguistic variations and evolving language usage.
Future research should explore hybrid approaches integrating machine learning to enhance adaptability and precision. These advancements will contribute significantly to the development of more effective natural language processing tools for the Indonesian language.

INTRODUCTION

The Indonesian language, also known as Bahasa Indonesia, is the official language of Indonesia and is spoken by over 200 million people. It belongs to the Austronesian language family and has been heavily influenced by other languages such as Sanskrit, Arabic, Dutch, and Chinese. With its large number of speakers and diverse vocabulary, there is a growing need for efficient text processing methods for Indonesian. Among these methods is stemming, a process that reduces words to their root form, or stem. Stemming is widely used because it improves information retrieval systems, text classification, and other natural language processing tasks.

Stemming involves the removal of affixes from words to obtain their stem forms. Affixes are morphemes attached to a base word that can change its meaning or grammatical category. The two most common types are prefixes, attached at the beginning of a word, and suffixes, attached at the end.

One of the main challenges of stemming in Indonesian is its agglutinative morphology: a word can carry several affixes at once, which produces long derived forms whose roots are not obvious from the surface string. For example, the word "pengawasan" (supervision) is built from the root "awas" (watchful; watch out) with the prefix "peng-" and the suffix "-an". Removing only the suffix yields "pengawas" (supervisor), which is itself a valid word, while removing both affixes yields the root "awas". Splitting the word at the wrong boundary, for instance as "pen-" plus "gawasan", produces strings that are not Indonesian words at all. A single surface form can therefore lead to several candidate stems, only one of which is the intended root.
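The affix-removal idea described above can be illustrated with a short Python sketch. It is a minimal illustration only and does not reproduce any of the published algorithms reviewed in this paper. The root dictionary, the affix lists, and the function names `stripped_forms` and `stem_candidates` are all assumptions introduced here for the example.

```python
# Toy root dictionary (assumption: a real stemmer uses a full Indonesian lexicon).
ROOTS = {"awas", "ajar", "makan", "baca"}

# Simplified affix inventories (assumption: real rule sets are far larger and
# also handle spelling changes at the prefix boundary).
PREFIXES = ("peng", "meng", "pem", "mem", "pen", "men",
            "pe", "me", "ber", "di", "ter", "ke")
SUFFIXES = ("kan", "an", "i")


def stripped_forms(word: str) -> set[str]:
    """Return every string obtainable by removing at most one suffix and
    then at most one prefix from `word` (the word itself is included)."""
    forms = {word}
    # Optional suffix removal.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            forms.add(word[: -len(suffix)])
    # Optional prefix removal, applied to the word and its suffix-less form.
    for form in list(forms):
        for prefix in PREFIXES:
            if form.startswith(prefix) and len(form) > len(prefix):
                forms.add(form[len(prefix):])
    return forms


def stem_candidates(word: str) -> list[str]:
    """Keep only the stripped forms that appear in the root dictionary."""
    return sorted(stripped_forms(word) & ROOTS)


if __name__ == "__main__":
    print(sorted(stripped_forms("pengawasan")))
    print(stem_candidates("pengawasan"))
```

For "pengawasan", `stripped_forms` yields several intermediate strings, including both "pengawas" and "awas", and only the dictionary lookup in `stem_candidates` decides which of them is accepted as a root.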
This complexity makes it difficult for traditional stemming algorithms developed for European languages to handle Indonesian words accurately. Stemming in the Indonesian language presents unique challenges due to its complex affixation system. Indonesian words often include prefixes, suffixes, infixes, and confixes, making it difficult to reduce them accurately to their root forms. Additionally, the language has many homographs, words that are spelled the same but have different meanings, which further complicates the stemming process (Rumaisa et al., 2019).

To address this issue, several researchers have proposed stemming methods specifically tailored for Indonesian, based on linguistic rules or machine learning techniques. Examples include rule-based, corpus-based, and hybrid stemmers. These methods have shown promising results in terms of precision, recall, and F1 score when evaluated on different datasets.

The purpose of this paper is not to determine which method is the best but to raise awareness about the complexities of stemming in Indonesian and to provide insights for future improvements in this area. This review focuses on several prominent stemming algorithms developed specifically for Indonesian, highlighting the contributions of Nazief and Adriani, Asian, Arifin and Setiono, and the Enhanced Confix Stripping (ECS) stemmer. By examining these algorithms, we can better understand their methodologies, efficacy, and applications.

LITERATURE REVIEW

Stemming is a crucial preprocessing step in natural language processing (NLP) that reduces words to their root forms.
In the context of the Indonesian language, stemming is particularly challenging due to the language's complex morphological structures. Various studies have explored different stemming algorithms to improve text processing accuracy. This literature review examines recent research on Indonesian stemming algorithms, focusing on their methodologies, advantages, and limitations.

The first study explores a non-deterministic approach to stemming to enhance accuracy. Instead of relying solely on rule-based methods, the algorithm generates multiple possible root words and selects the most appropriate one based on linguistic context. The study reports improved stemming accuracy but notes that computational complexity increases because multiple candidate stems must be considered (Rifai, 2019).

The next study applies the Nazief-Adriani stemming algorithm to detect word similarity in academic titles. By normalizing words to their base forms, the algorithm helps identify closely related titles, reducing redundancy and improving document classification. The study concludes that the Nazief-Adriani algorithm is effective for this application but may require additional enhancements for highly inflected words (Wisuda Sardjono et al., 2018).

Then, the next research develops a system for checking Indonesia (...truncated)