Boosting Stemmer Performance Using Cache Method

Jurnal Matematika Dan Ilmu Pengetahuan Alam LLDikti Wilayah 1 (JUMPA), Mar 2021

Stemming is the process of reducing a word to its base form by removing its affixes. It is important for supporting better information retrieval. Stemming algorithms studied in prior research include the Nazief & Adriani algorithm, Confix Stripping, Enhanced Confix Stripping, the Arifin algorithm, and the Porter algorithm. Stemming algorithms for Bahasa Indonesia fall into two groups: those that use a dictionary and those that do not. Some studies have shown that dictionary-based stemmers have high accuracy but low processing speed, while stemmers without a dictionary have lower accuracy but higher processing speed. This study uses two methods, a stemmer with a cache and a stemmer without a cache, to compare the processing speed of dictionary-based stemmers. The test data is text obtained from a corpus site. The analysis measures the speed, memory usage, and CPU usage of each method and then compares the methods. Tests on this data showed that the cache method improved stemmer performance.

Jurnal Matematika Dan Ilmu Pengetahuan Alam LLDikti Wilayah 1 (JUMPA), 1 (1) (2021) 06-09
Published by: LLDIKTI WILAYAH 1
Journal homepage: www.lldikti1.ristekdikti.go.id/jurnal/index.php/jumpa
e-ISSN 2807-3142

Muhammad Fadly Tanjung
Fakultas Ilmu Komputer Dan Teknologi Informasi, Universitas Sumatera Utara, Indonesia

Article history: Received: Jan 26, 2021; Revised: Feb 19, 2021; Accepted: Mar 22, 2021
Keywords: Stemmer Performance; Cache Method; Improve Performance.

This is an open access article under the CC BY-NC license.

Corresponding Author: Muhammad Fadly Tanjung, Teknologi Informasi, Universitas Sumatera Utara, Jl. Dr. T. Mansur No.9, Padang Bulan, Medan
Email:

1. INTRODUCTION

Stemming algorithms for Bahasa Indonesia have been developed before, including the Nazief-Adriani algorithm and Porter's algorithm (Mardiana et al., 2016)(Nurida Ahsanti, 2016). The first stemming algorithm developed for Indonesian was the Nazief-Adriani algorithm, which refers to the Porter stemmer algorithm used for English (Mardiana et al., 2016)(Hidayatullah et al., 2016). Stemming algorithms continue to be developed to reduce the shortcomings of existing ones (Zhao et al., 2007)(Jalbert & Weimer, 2008)(Xiang-zhou et al., 2004); after the Nazief-Adriani algorithm came the Vega algorithm, the Arifin-Setiono algorithm, and the Confix Stripping Stemmer algorithm (Lee et al., 2007)(Baltussen et al., 2004).

The effectiveness of a stemming algorithm can be measured by several parameters, such as processing speed, accuracy, and how well stemming errors are minimized (Jivani, 2011)(Al-Shammari & Lin, 2008)(Kumar & Rana, 2011). For example, the Nazief-Adriani algorithm has a relatively high accuracy of 92.8%, but its processing speed is fairly slow compared to other algorithms, whereas other stemming algorithms are faster but relatively less accurate (Jivani, 2011). The cache technique is applied to the stemming algorithm so that stemming speed can be improved (Pfaff et al., 2015)(Chakrabarti et al., 2003).

Previous research tested the accuracy of several stemming algorithms for Indonesian, namely the Nazief & Adriani algorithm, the Arifin & Setiono algorithm, the Vega algorithm, and the Ahmad, Yusoff, and Sembok algorithm. Further research was conducted by Asian J. under the title Effective Techniques for Indonesian Text Retrieval (Adriani et al., 2007)(Fam & Grohs, 2007). That study explained the text information retrieval system as a whole, down to the stemming process and the algorithms used (Burris, 2011)(Hernandez, 2015).
Another study tested the influence of cache algorithms on the efficiency and effectiveness of data exchange (Jing et al., 2013)(Chen et al., 2009). In general, its results showed that increased efficiency decreases the effectiveness of search results (Auh & Menguc, 2005). Another study compared the speed and accuracy of two Indonesian stemming algorithms, the Porter stemming algorithm and the Nazief & Adriani algorithm, and summarized the advantages and disadvantages of each. A further study tested whether the Modified Enhanced Confix Stripping Stemmer algorithm affected stemmer performance.

2. RESEARCH METHODS

The data used is the 2016 Indonesian corpus from the site http://wortschatz.unileipzig.de/de/download. The Leipzig Corpus is a site that collects data from various links and summarizes it over the course of a year as text files that can be downloaded directly (Quasthoff et al., 2014). System testing used 1 million words of corpus data downloaded from the Leipzig Corpus website. The corpus was crawled from 113,845 links across many sites in Indonesia.

General Architecture. Processing the tested text documents involves several steps: tokenization, which splits each sentence into individual words (tokens), and stemming, which cuts off affixes. After these steps, the base words in the document are obtained. Each stage is explained in more detail in the next section. This study used the Enhanced Confix Stripping algorithm (ECS Stemmer), a development of Confix Stripping (CS Stemmer), a stemming method for the Indonesian language introduced by Jelita Asian that refers to the Nazief-Adriani algorithm (Wahyudi et al., 2017). The general architecture describing each step of the method used in this study is shown in Figure 1.
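The tokenization and stemming steps described above, combined with the cache idea this study evaluates, can be sketched roughly as follows. This is a minimal illustration and not the author's implementation: the tiny BASE_WORDS dictionary, the affix lists, and the function names tokenize, stem_uncached, stem_cached, and stem_text are all hypothetical, and the affix-stripping rule is a toy stand-in for the full ECS rule set.

```python
import re
from functools import lru_cache

# Hypothetical toy dictionary of base words; a real stemmer loads thousands.
BASE_WORDS = {"makan", "minum", "baca", "main"}

# Toy affix lists, a stand-in for the full ECS rules (assumption).
PREFIXES = ("mem", "me", "ber", "di")
SUFFIXES = ("kan", "an", "i")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem_uncached(word):
    """Dictionary-based stemming: try affix removals until a base word is found."""
    if word in BASE_WORDS:
        return word
    candidates = [word]
    for prefix in PREFIXES:
        if word.startswith(prefix):
            candidates.append(word[len(prefix):])
    # Try removing each suffix from the word and from every prefix-stripped form.
    for cand in list(candidates):
        for suffix in SUFFIXES:
            if cand.endswith(suffix):
                candidates.append(cand[:-len(suffix)])
    for cand in candidates:
        if cand in BASE_WORDS:
            return cand
    return word  # unknown word: return it unchanged


@lru_cache(maxsize=None)
def stem_cached(word):
    """Same result as stem_uncached, but repeated words are answered from the cache."""
    return stem_uncached(word)


def stem_text(text, stem=stem_cached):
    """Tokenize the text, then stem every token."""
    return [stem(token) for token in tokenize(text)]
```

Because natural-language text repeats the same words many times, a cache keyed on the surface word lets most tokens skip the affix-stripping and dictionary checks. Passing stem=stem_uncached to stem_text gives the no-cache variant for comparison.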
Figure 1. General Architecture

3. RESULT AND DISCUSSION

Input file   Number of words   Process time (seconds)   Average Memory Usage (KB)   Average CPU Usage (%)
1            10,000            0.59                     2,864                       0.1
2            50,000            0.99                     6,548                       0.1
3            100,000           1.48                     10,388                      4.17
4            200,000           2.51                     64,484                      5.56
5            500,000           5.59                     68,200                      6.67
6            1,000,000         10.71                    79,700                      7.7

The Porter stemmer test results show that its stemming speed is very high compared to the other three stemmer methods. In this study, the Porter stemmer did have high speed, but it cannot be compared directly to the three previous stemmer methods because its stemming algorithm is different. The Porter stemmer only cuts affixes without checking a dictionary, because it is not a dictionary-based algorithm. The three stemmers above, by contrast, all use the same dictionary-based stemmer, Enhanced Confix Stripping (ECS). This study still included the Porter stemmer in testing to determine whether the cache method could exceed the speed of the Porter stemmer despite differen (...truncated)
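Process time and memory figures like those in the table above could be collected along the following lines. This is only a sketch under stated assumptions, not the measurement setup used in the study: slow_stem is a hypothetical stand-in for a dictionary-based stemmer, measure is an illustrative helper, the word list is made up, and CPU usage is left out of the sketch.

```python
import time
import tracemalloc
from functools import lru_cache


def slow_stem(word):
    """Hypothetical stand-in for a dictionary-based stemmer (not ECS itself)."""
    for suffix in ("kan", "an", "i"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


# Cached variant: repeated words are served from memory.
cached_stem = lru_cache(maxsize=None)(slow_stem)


def measure(stem, words):
    """Return (stems, elapsed seconds, peak traced memory in KB)."""
    tracemalloc.start()
    start = time.perf_counter()
    stems = [stem(w) for w in words]
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return stems, elapsed, peak / 1024


if __name__ == "__main__":
    # Made-up input: four words repeated to imitate a large, repetitive corpus.
    words = ["makanan", "minuman", "bacaan", "mainkan"] * 25_000
    for name, fn in (("no cache", slow_stem), ("cache", cached_stem)):
        _, seconds, peak_kb = measure(fn, words)
        print(f"{name}: {seconds:.3f} s, peak {peak_kb:.0f} KB")
```

With a stand-in this cheap, the cached run may not come out faster; the benefit is expected when each uncached call performs many dictionary lookups, as a dictionary-based stemmer does.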


This is a preview of a remote PDF: https://lldikti1.kemdikbud.go.id/jurnal/index.php/jumpa/article/download/34/14
Article home page: https://lldikti1.kemdikbud.go.id/jurnal/index.php/jumpa/article/view/34/14

Tanjung Muhammad Fadly. Boosting Stemmer Performance Using Cache Method, Jurnal Matematika Dan Ilmu Pengetahuan Alam LLDikti Wilayah 1 (JUMPA), 2021, pp. 6-9,