Boosting Stemmer Performance Using Cache Method
Jurnal Matematika Dan Ilmu Pengetahuan Alam LLDikti Wilayah 1 (JUMPA), 1 (1) (2021) 06-09
Published by: LLDIKTI WILAYAH 1
Journal homepage: www.lldikti1.ristekdikti.go.id/jurnal/index.php/jumpa
Muhammad Fadly Tanjung
Fakultas Ilmu Komputer Dan Teknologi Informasi, Universitas Sumatera Utara, Indonesia
Article Info
Article history:
Received: Jan 26, 2021
Revised: Feb 19, 2021
Accepted: Mar 22, 2021
Keywords:
Stemmer Performance;
Cache Method;
Improve Performance.
ABSTRACT
Stemming is the process of reducing a word to its base form by
removing its affixes. This is important for supporting better
information retrieval. Research on stemming algorithms includes the
Nazief & Adriani algorithm, Confix Stripping, Enhanced Confix
Stripping, and the Arifin and Porter algorithms. Stemming algorithms
for Bahasa Indonesia fall into two groups: those that use a
dictionary and those that do not. Some studies have shown that
dictionary-based stemmers have high accuracy but low processing
speed, while stemmers without a dictionary have lower accuracy
but higher processing speed. This study uses two methods, a stemmer
with a cache and a stemmer without a cache, to compare the
processing speed of dictionary-based stemmers. The test data is
text obtained from a corpus site. The analysis measures the speed,
memory usage, and CPU usage of each method and then compares the
methods. Results on the test data showed that the cache method
improved stemmer performance.
This is an open access article under the CC BY-NC license.
Corresponding Author:
Muhammad Fadly Tanjung,
Teknologi Informasi,
Universitas Sumatera Utara,
Jl. Dr. T. Mansur No.9, Padang Bulan, Medan
Email:
1. INTRODUCTION
Stemming algorithms for Bahasa Indonesia have been developed before, including the Nazief-Adriani
algorithm and Porter's algorithm (Mardiana et al., 2016)(Nurida Ahsanti, 2016). The first stemming
algorithm applied to Indonesian was the Nazief-Adriani algorithm, which refers to the Porter stemmer
algorithm used for English (Mardiana et al., 2016)(Hidayatullah et al., 2016). Stemming algorithms
continue to be developed to reduce existing deficiencies (Zhao et al., 2007)(Jalbert & Weimer,
2008)(Xiang-zhou et al., 2004); after the Nazief-Adriani algorithm came the Vega algorithm, the
Arifin-Setiono algorithm, and the Confix Stripping Stemmer algorithm (Lee et al., 2007)(Baltussen et
al., 2004). The effectiveness of a stemming algorithm can be measured by several parameters, such as
processing speed, accuracy, and the number of stemming errors (Jivani, 2011)(Al-Shammari & Lin,
2008)(Kumar & Rana, 2011). For example, the Nazief-Adriani algorithm has a relatively high accuracy
of 92.8%, but its processing speed is fairly slow compared to other algorithms. Other stemming
algorithms, by contrast, are faster but relatively less accurate (Jivani, 2011). The cache technique
is applied to the stemming algorithm so that stemming speed can be improved (Pfaff et al.,
2015)(Chakrabarti et al., 2003).
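The idea of caching stemmer output can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the root-word set, the affix lists, and the function names `slow_stem` and `cached_stem` are hypothetical stand-ins, and the toy lookup is far simpler than a real dictionary-based stemmer. The sketch only shows how memoization lets repeated words skip the dictionary search.

```python
from functools import lru_cache

# Hypothetical toy data: a real dictionary-based stemmer would load a full
# root-word dictionary and apply far more affix rules than these.
ROOT_WORDS = frozenset({"makan", "baca", "main"})
PREFIXES = ("", "mem", "me", "di", "ber")
SUFFIXES = ("", "kan", "an", "i")


def slow_stem(word: str) -> str:
    """Try prefix/suffix combinations until the remainder is a known root."""
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        for suffix in SUFFIXES:
            if not word.endswith(suffix):
                continue
            core = word[len(prefix):len(word) - len(suffix)]
            if core in ROOT_WORDS:
                return core
    return word  # no known root found; leave the word unchanged


@lru_cache(maxsize=None)
def cached_stem(word: str) -> str:
    """Memoized wrapper: each distinct word is stemmed only once."""
    return slow_stem(word)


tokens = ["dimakan", "membaca", "dimakan", "makanan", "dimakan"]
stems = [cached_stem(token) for token in tokens]
print(stems)
print(cached_stem.cache_info())
```

With these five tokens, only the three distinct words reach `slow_stem`; the two repeats of "dimakan" are served from the cache. In natural text, where frequent words recur often, this is the kind of saving a cache is meant to provide.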
JUMPA
e-ISSN 2807-3142
Previous research tested the accuracy of several stemming algorithms for Indonesian, namely the
Nazief & Adriani algorithm, the Arifin & Setiono algorithm, the Vega algorithm, and the Ahmad,
Yusoff, and Sembok algorithm. Further research was conducted by J. Asian under the title Effective
Techniques for Indonesian Text Retrieval (Adriani et al., 2007)(Fam & Grohs, 2007). That study
explained the text information retrieval system as a whole, including the stemming process and the
algorithms used (Burris, 2011)(Hernandez, 2015). Another study tested the influence of cache
algorithms on the efficiency and effectiveness of data exchange (Jing et al., 2013)(Chen et al.,
2009). Its results showed that, in general, increased efficiency decreases the effectiveness of
search results (Auh & Menguc, 2005). Another study compared the speed and accuracy of two
Indonesian stemming algorithms, the Porter stemming algorithm and the Nazief & Adriani
algorithm, and summarized the advantages and disadvantages of each. A further study tested
whether the Modified Enhanced Confix Stripping Stemmer algorithm affected stemmer
performance.
2. RESEARCH METHODS
The data used is the 2016 Indonesian corpus from the site http://wortschatz.unileipzig.de/de/download.
The Leipzig Corpus is a site that collects data from various links and summarizes it, over the
course of a year, in text files that can be downloaded directly (Quasthoff et al., 2014). System
testing used 1 million words of corpus data downloaded from the Leipzig Corpus website. The corpus
was crawled from 113,845 links across many Indonesian sites.
General Architecture. Processing the tested text documents involves several steps: tokenization,
which splits each sentence into individual words (tokens), and stemming, which strips affixes from
each token. After these steps, the base words of the document are obtained. Each stage is explained
in more detail in the next section. This study uses the Enhanced Confix Stripping algorithm (ECS
Stemmer), a development of Confix Stripping (CS Stemmer), which is a stemming method for the
Indonesian language introduced by Jelita Asian that refers to the Nazief-Adriani algorithm (Wahyudi
et al., 2017). The general architecture describing each step of the method used in this study is
shown in Figure 1.
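The two processing steps described above, tokenization followed by stemming, can be sketched in Python as below. This is an illustration under stated assumptions, not the ECS Stemmer: `tokenize`, `toy_stem`, `process`, the two-word root set, and the single prefix and suffix are all hypothetical, and the tokenizer assumes plain Latin-alphabet text.

```python
import re
from typing import Callable, List

# Hypothetical miniature dictionary; the ECS Stemmer relies on a full
# Indonesian root-word dictionary and a much larger rule set.
ROOTS = frozenset({"buku", "baca"})


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(word: str) -> str:
    """Strip one assumed prefix ('mem') and/or suffix ('nya') if a root remains."""
    for prefix in ("", "mem"):
        for suffix in ("", "nya"):
            if word.startswith(prefix) and word.endswith(suffix):
                core = word[len(prefix):len(word) - len(suffix)]
                if core in ROOTS:
                    return core
    return word  # unknown word: returned unchanged


def process(text: str, stem: Callable[[str], str]) -> List[str]:
    """Run the pipeline: tokenize the document, then stem every token."""
    return [stem(token) for token in tokenize(text)]


base_words = process("Membaca bukunya, membaca!", toy_stem)
print(base_words)
```

Passing the stemmer in as a function keeps the pipeline independent of the stemming algorithm, so a cached or uncached stemmer could be swapped in without touching the tokenization step.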
Figure 1. General Architecture
3. RESULT AND DISCUSSION

Input file   Number of words   Process time (seconds)   Average Memory Usage (KB)   Average CPU Usage (%)
1            10,000            0.59                     2,864                       0.1
2            50,000            0.99                     6,548                       0.1
3            100,000           1.48                     10,388                      4.17
4            200,000           2.51                     64,484                      5.56
5            500,000           5.59                     68,200                      6.67
6            1,000,000         10.71                    79,700                      7.7
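The timing figures in the table above can also be read as throughput. The short Python sketch below divides each word count by its process time; the word counts and times are copied from the table, while the variable names are illustrative only.

```python
# (number of words, process time in seconds), copied from the results table
runs = [
    (10_000, 0.59),
    (50_000, 0.99),
    (100_000, 1.48),
    (200_000, 2.51),
    (500_000, 5.59),
    (1_000_000, 10.71),
]

# Throughput in words per second for each input file
throughput = [words / seconds for words, seconds in runs]

for (words, seconds), rate in zip(runs, throughput):
    print(f"{words:>9,} words in {seconds:>5.2f} s -> {rate:,.0f} words/s")
```

On these figures, throughput rises from roughly 17 thousand words per second on the smallest file to roughly 93 thousand on the largest, so process time grows more slowly than the number of words.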
The Porter stemmer test results show that its stemming speed is very high compared to the other
three stemmer methods. Although the Porter stemmer was fast in this study, it cannot be compared
directly with the three previous stemmer methods because it uses a different stemming algorithm.
The Porter stemmer only strips affixes without checking a dictionary, because it is not a
dictionary-based algorithm. The three stemmers above, by contrast, all use the same
dictionary-based stemmer, namely the Enhanced Confix Stripping (ECS) Stemmer. This study still
included the Porter stemmer in testing to determine whether the cache method could
exceed the speed of porter stemmer despite differen (...truncated)