Evaluation of Indonesian Language Stemmer Algorithms: A Comparative Analysis
E-ISSN : 2807-9035
Volume 5, Number 1, May 2025
https://doi.org/10.47709/brilliance.v5i1.5679
Fitrah Rumaisa1*
1 Universitas Widyatama, Jl. Cikutra 204 A, Bandung, 40125, Indonesia
*Corresponding Author
Article History:
Submitted: 13-03-2025
Accepted: 21-03-2025
Published: 28-03-2025
Keywords:
Indonesian; stemming algorithms;
text analysis; natural language
processing; review.
Brilliance: Research of
Artificial Intelligence is licensed
under a Creative Commons
Attribution-NonCommercial 4.0
International (CC BY-NC 4.0).
ABSTRACT
Bahasa Indonesia, with its rich linguistic structure and agglutinative
morphology, presents significant challenges for natural language processing,
particularly in stemming. Stemming is a crucial process in text analysis, aimed
at reducing words to their root forms for better information retrieval and text
classification. This study evaluates and compares several stemming algorithms
developed specifically for the Indonesian language, including the Nazief and
Adriani stemmer, Asian’s algorithm, Arifin and Setiono’s method, and the
Enhanced Confix Stripping (ECS) stemmer. The research examines these
algorithms based on accuracy, processing speed, and efficiency in handling
affixes. The findings indicate that the ECS stemmer performs best in terms of
accuracy, effectively handling complex affixation structures. The Nazief and
Adriani algorithm follows closely, demonstrating robust affix removal but at a
slower processing speed. Meanwhile, Arifin and Setiono’s algorithm provides a
balance between accuracy and computational efficiency, while Asian’s
algorithm, despite its statistical approach, is limited by its reliance on large
training corpora. This study highlights the need for continuous refinement of
Indonesian stemming algorithms to accommodate linguistic variations and
evolving language usage. Future research should explore hybrid approaches
integrating machine learning to enhance adaptability and precision. These
advancements will contribute significantly to the development of more effective
natural language processing tools for the Indonesian language.
INTRODUCTION
The Indonesian language, also known as Bahasa Indonesia, is the official language of Indonesia and is spoken by
over 200 million people. It belongs to the Austronesian language family and is heavily influenced by other languages
such as Sanskrit, Arabic, Dutch, and Chinese. With its large number of speakers and diverse vocabulary, there has been
a growing need for efficient text processing methods in Indonesian. Among these methods is stemming, a linguistic
process that aims to reduce words to their root form or stem.
Stemming has gained popularity in recent years due to its effectiveness in improving information retrieval
systems, text classification, and natural language processing tasks. Stemming involves the removal of affixes from
words to obtain their stem forms. Affixes are morphemes attached to the base word that can change its meaning or
grammatical category. The most common types are prefixes, attached at the beginning of a word, and suffixes, attached
at the end; Indonesian also uses infixes and confixes.
One of the main challenges of stemming in Indonesian is its agglutinative morphology: a word can carry several
affixes at once, producing long derived forms whose affix boundaries are easy to misidentify. For example, the word
"pengawasan" (supervision) consists of the prefix "peng-", the root "awas" (watchful; watch out), and the suffix
"-an", which forms an abstract noun. A stemmer that removes only the suffix returns "pengawas" (supervisor), a valid
derived word but not the root, while splitting the prefix at the wrong position, for example as "pen-", leaves the
non-word "gawasan". Only removing both affixes correctly yields the root "awas". This complexity makes it difficult
for traditional stemming algorithms developed for European languages to handle Indonesian words accurately.
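The affix-boundary problem can be illustrated with a minimal sketch of dictionary-backed affix stripping. This is not any of the algorithms reviewed in this paper: the root list, the affix lists, and the function name `stem` are all illustrative assumptions, and the sketch ignores nasal assimilation and other morphophonemic rules that a real Indonesian stemmer must handle.

```python
# Toy root dictionary (assumption): a real stemmer would consult a full lexicon.
ROOTS = {"awas", "ajar", "baca", "makan"}

# Small, incomplete samples of Indonesian affixes (illustrative only).
PREFIXES = ["peng", "meng", "pem", "mem", "pen", "men", "pe", "me", "ber", "di", "ke"]
SUFFIXES = ["kan", "an", "i"]


def stem(word: str) -> str:
    """Return a dictionary root reachable by stripping at most one prefix and one suffix."""
    if word in ROOTS:
        return word
    # Forms to try: the word itself, plus the word with each matching suffix removed.
    forms = [word] + [
        word[: -len(s)] for s in SUFFIXES if word.endswith(s) and len(word) > len(s)
    ]
    for form in forms:
        if form in ROOTS:
            return form
        for p in PREFIXES:
            # Accept a prefix split only if the remainder is a known root.
            if form.startswith(p) and form[len(p):] in ROOTS:
                return form[len(p):]
    return word  # no known root found; leave the word unchanged


print(stem("pengawasan"))  # awas
print(stem("membaca"))     # baca
```

The dictionary check is what rejects the wrong split "pen-" + "gawasan": the remainder is not a known root, so the sketch moves on to "peng-" + "awas".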
Stemming in the Indonesian language presents unique challenges due to its complex affixation system.
Indonesian words often include prefixes, suffixes, infixes, and confixes, making it difficult to accurately reduce words
to their root forms. Additionally, the language has many homographs, words that are spelled the same but have
different meanings, which further complicates the stemming process (Rumaisa et al., 2019).
To address this issue, several researchers have proposed various stemming methods specifically tailored for
Indonesian based on linguistic rules or machine learning techniques. Examples include rule-based, corpus-based, and
hybrid stemmers. These methods have shown promising results in terms of precision, recall, and F1 score when
evaluated on different datasets.
The purpose of this research paper is not to determine which method is the best but to raise awareness about the
complexities of stemming in Indonesian and provide insights for future improvements in this area. This review focuses
on several prominent stemming algorithms that have been developed specifically for Indonesian, highlighting the
contributions made by Nazief and Adriani, Asian, Arifin and Setiono, and the Enhanced Confix Stripping (ECS)
stemmer. By examining these algorithms, we can better understand their methodologies, efficacy, and applications.
LITERATURE REVIEW
Stemming is a crucial preprocessing step in natural language processing (NLP) that reduces words to their root
forms. In the context of the Indonesian language, stemming is particularly challenging due to its complex
morphological structures. Various studies have explored different stemming algorithms to improve text processing
accuracy. This literature review examines recent research on Indonesian stemming algorithms, focusing on their
methodologies, advantages, and limitations.
The first study explores a non-deterministic approach to stemming as a way to enhance accuracy.
Instead of relying solely on rule-based methods, the algorithm generates multiple possible root words and selects the
most appropriate one based on linguistic context. The study reports improved stemming accuracy but notes that
computational complexity increases because multiple candidate stems must be generated and compared (Rifai, 2019).
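The generate-then-select idea can be sketched as follows. This is a hedged illustration rather than the algorithm of Rifai (2019): the affix lists, the toy lexicon, the `RELATED` table, and the overlap-based scoring are all assumptions made for the example. The word "berikan" is ambiguous between "beri" + "-kan" (give) and "ber-" + "ikan" (having fish).

```python
from __future__ import annotations

# Illustrative affix lists and toy lexicon (assumptions, not taken from the cited study).
PREFIXES = ["", "ber", "me", "di"]
SUFFIXES = ["", "kan", "an", "i"]
ROOTS = {"beri", "ikan", "uang", "beruang"}

# Hypothetical context model: words assumed to co-occur with each root.
RELATED = {"beri": {"tolong", "hadiah"}, "ikan": {"laut", "kolam"}}


def candidate_roots(word: str) -> set[str]:
    """Collect every dictionary root reachable by one prefix/suffix combination."""
    found = set()
    for p in PREFIXES:
        for s in SUFFIXES:
            if word.startswith(p) and word.endswith(s) and len(word) > len(p) + len(s):
                core = word[len(p): len(word) - len(s)]
                if core in ROOTS:
                    found.add(core)
    return found


def select_root(word: str, context: set[str], related: dict[str, set[str]]) -> str:
    """Pick the candidate whose related words overlap most with the context words."""
    candidates = candidate_roots(word)
    if not candidates:
        return word
    # Sorting first makes ties resolve alphabetically, so results are reproducible.
    return max(sorted(candidates), key=lambda r: len(related.get(r, set()) & context))


print(sorted(candidate_roots("berikan")))                   # ['beri', 'ikan']
print(select_root("berikan", {"tolong", "buku"}, RELATED))  # beri
print(select_root("berikan", {"kolam"}, RELATED))           # ikan
```

Enumerating every prefix and suffix combination is what drives up the cost: the candidate set grows with the size of the affix inventory, which is consistent with the complexity concern noted above.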
The next study applies the Nazief-Adriani stemming algorithm to detect word similarity in academic titles. By
normalizing words to their base forms, the algorithm helps identify closely related titles, reducing redundancy and
improving document classification. The study concludes that the Nazief-Adriani algorithm is effective for this
application but may require additional enhancements for highly inflected words (Wisuda Sardjono et al., 2018).
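How stemming supports title comparison can be sketched as follows. The lookup table below is only a stand-in for a real stemmer such as Nazief-Adriani, and the use of Jaccard similarity is an assumption made for illustration; the cited study may use a different similarity measure.

```python
# Stand-in for a real stemmer (assumption): maps a few derived forms to their roots.
STEM_TABLE = {
    "perancangan": "rancang",
    "merancang": "rancang",
    "pengembangan": "kembang",
    "mengembangkan": "kembang",
    "penjualan": "jual",
}


def stem(word: str) -> str:
    """Look up the root of a word, falling back to the word itself."""
    return STEM_TABLE.get(word, word)


def title_stems(title: str) -> set:
    """Lower-case a title, split it on whitespace, and stem every token."""
    return {stem(token) for token in title.lower().split()}


def title_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the stemmed token sets of two titles."""
    stems_a, stems_b = title_stems(a), title_stems(b)
    union = stems_a | stems_b
    return len(stems_a & stems_b) / len(union) if union else 0.0


t1 = "Perancangan Sistem Informasi Penjualan"
t2 = "Merancang Sistem Informasi Penjualan"
print(title_similarity(t1, t2))  # 1.0
```

Without stemming, the two titles would share only three of five distinct tokens; after stemming, "perancangan" and "merancang" both reduce to "rancang", so the titles are treated as equivalent.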
Then, the next research develops a system for checking Indonesia (...truncated)