Exploring The Impact of Stemming on Text Topic-Based Classification Accuracy

JOURNAL OF LINGUISTICS, CULTURE AND COMMUNICATION
Vol.02, No.02, 2024: December: 204-224, E-ISSN:2988-1641
https://jolcc.org/index.php/jolcc/index

Exploring The Impact of Stemming on Text Topic-Based Classification Accuracy

Refat Aljumily
Independent researcher
Email:

Submission Track: Received: 13-04-2024, Final Revision: 28-06-2024, Available Online: 30-06-2024

Copyright © 2024 Authors. This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

ABSTRACT

Text classification attempts to assign written texts to specific group types that share the same linguistic features. One class of features that has been widely employed for a wide range of classification tasks is lexical features. This study explores the impact of stemming on text classification using lexical features. To this end, the study is based on a corpus of thirty texts written by six authors, with topics that focus on politics, history, science, prose, sport, and food. These texts are stemmed using a light stemming algorithm. In order to classify the texts by topic by means of lexical features, linear hierarchical clustering and non-linear clustering (SOM) are carried out on the stemmed and unstemmed texts. Although both clustering methods are able to classify the texts by topic, and both models produce accurate and stable results, the results suggest that light stemming has no effect on the accuracy of text classification by topic. Accuracy is neither increased nor decreased on the stemmed texts, while the stemming algorithm helped reduce the dimensionality of the feature vector space model.
Keywords: stemming, classification, clustering, hierarchical, SOM, topic, content words

INTRODUCTION

The task of quantitative topic classification of written texts has become popular with the huge increase in, and variety of, written texts of all kinds, which may vary according to use, subject matter, author’s knowledge, textual variety, or events. All of this has led to the study of different text types, such as narrative, non-fiction, poetry and so on, all with their own lexical and syntactic patterns. A quantitative topic classification relies on methods developed in natural language processing and machine learning to analyse textual documents. Textual documents must be converted into a quantitative form before they can be analysed, and several conceptual issues in data creation may hinder any quantitative textual data analysis. For example, the text data can in general be very sparse because of the large number of redundant lexical features. This can be attributed to the fact that the English language has several morphological variants of a single word.

Pre-processing procedures, such as cleaning and preparing raw texts for analysis and word stemming, are commonly carried out before applying an analytical method to build a robust pattern. The principle is that it is essential to adjust text data by removing repetition and transforming words to their common base or root form through stemming. This reduces the dimensionality of the feature space, which makes the text easier to analyse and process, and it helps to group variations of words together, which can be useful for tasks like text classification or clustering. However, a word stemmer is known to produce nonsense or incomplete words, and this is very likely to skew the text data and therefore the classification results based on it.
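The two effects of stemming described above, merging word variants and producing incomplete stems, can be illustrated with a minimal sketch. The suffix list, the length threshold, and the function name `light_stem` are assumptions made for illustration only; this is not the light stemming algorithm used in the study.

```python
# Toy light stemmer: strips a few common English suffixes.
# Hypothetical illustration only; NOT the stemming algorithm used in the study.
SUFFIXES = ("ing", "ies", "es", "ed", "s")  # assumed list, checked in this order
MIN_STEM_LEN = 3  # assumed threshold to avoid over-stripping short words


def light_stem(word: str) -> str:
    """Remove the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


words = ["classify", "classifies", "classified", "classifying", "cats", "cat"]
stems = [light_stem(w) for w in words]

# Some variants merge (fewer distinct types), others become incomplete stems.
print(stems)
print(len(set(words)), "word types ->", len(set(stems)), "stem types")
```

In this sketch the number of distinct types falls, which is the dimensionality reduction at issue, but some of the resulting stems are not words and do not merge with their relatives, which is the kind of distortion the paragraph above warns about.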
By way of explanation, this study is based on a corpus of thirty texts that focus on the topics of politics, history, science, prose, sport, and food, written by six authors. Multivariate analytical methods are used to extract a set of lexical features that define each text so that the thirty texts can be classified using linear hierarchical clustering and the non-linear clustering method SOM. In topic classification by lexical features, the time and complexity of the classification process are two important problems that affect data analysis. Although these matter, quick and easy processing should not be accepted at the cost of classification accuracy. Thus, this study is designed to examine the impact of stemming on topic-based text classification by analysing the thirty texts with and without stemming to determine which course is the more accurate. This will be discussed in detail in the subsequent sections.

Research Problems

Text classification attempts to assign written texts to specific group types that share the same linguistic features. To do so, the common approach is to look at lexical words and their frequencies in a given text. The analyst takes the text to be classified, counts the frequencies of the words, and selects the most distinguishing words of the text, followed by some text pre-processing steps to keep the resulting data matrix at a manageable size. Because lexical words and their frequencies play a role in clustering-based text classification, this can cause conceptual issues in text data creation in at least two ways: (1) the curse of dimensionality and (2) lexical redundancy/ambiguity. Dimensionality is a key issue for data analysis in any given application (Moisl, 2015).
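The word-counting step and the resulting data matrix can be sketched as follows. The three short texts are hypothetical stand-ins, not the study's corpus, and the selection of the most distinguishing words is omitted; the sketch only shows how each distinct word type becomes a dimension and how quickly zero cells accumulate.

```python
from collections import Counter

# Three tiny stand-in "texts"; hypothetical data, not the study's corpus.
texts = {
    "sport": "the team won the match and the fans cheered",
    "food": "the chef cooked the soup and the guests ate",
    "science": "the lab tested the sample and the results held",
}

# Count word frequencies per text.
counts = {name: Counter(text.split()) for name, text in texts.items()}

# The vocabulary (one column per distinct word type) fixes the dimensions.
vocabulary = sorted(set().union(*counts.values()))

# Build the data matrix: one row per text, one frequency per vocabulary word.
matrix = [[counts[name][word] for word in vocabulary] for name in texts]

# Sparsity: share of cells in the matrix that are zero.
cells = len(matrix) * len(vocabulary)
zeros = sum(value == 0 for row in matrix for value in row)
print(len(vocabulary), "dimensions;", f"{zeros / cells:.0%} of cells are zero")
```

Even with three short texts, most cells are zero because each text uses only a small part of the shared vocabulary; every additional word type adds a column that is empty for most texts.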
In this application the vector space model is used to represent texts and lexical features as vectors in a multi-dimensional space. Each dimension represents the frequency of a unique lexical feature in the entire corpus of texts. For example, when analysing written texts, verbs, adjectives, nouns, adverbs, prefixes, suffixes, word length, word frequency, word clusters, high-frequency word distribution, and so on could each be a dimension. Each dimension corresponds to a unique feature, while each text is represented as a vector within that space. As the number of lexical features, and thus the number of dimensions, increases from low to high, text data starts to behave differently and analysis becomes more challenging, as shown in Figure (1) below.

Figure 1. Lexical features plotted on a 2-dimensional space

For example, lexical items such as ‘cat’, ‘cats’, ‘catty’, and ‘cattery’, which are recognized as distinct lexical types although they are morphological variants of the same word ‘CAT’, will be assigned four dimensions in the data matrix. If each of the four variables takes integer values in the range 1...10 and there are ten data points, the ratio of data points to possible values is 10/(10 x 10 x 10 x 10) = 0.001, that is, the data points occupy 0.1% of the data space. It is, therefore, clear that lexical frequency text data will, in general, be very sparse on account of (...truncated)