Exploring The Impact of Stemming on Text Topic-Based Classification Accuracy
JOURNAL OF LINGUISTICS, CULTURE AND COMMUNICATION
Vol.02, No.02, 2024: December: 204-224, E-ISSN:2988-1641
https://jolcc.org/index.php/jolcc/index
Refat Aljumily
Independent researcher
Email:
Submission Track:
Received: 13-04-2024, Final Revision: 28-06-2024, Available Online: 30-06-2024
Copyright © 2024 Authors
This work is licensed under a Creative Commons Attribution-Share Alike 4.0
International License.
ABSTRACT
Text classification attempts to assign written texts to specific group types that share the
same linguistic features. One class of features that has been widely employed for a wide
range of classification tasks is lexical features. This study explores the impact of stemming
on text classification using lexical features. To this end, the study is based on a corpus of
thirty texts written by six authors on topics that focus on politics, history, science,
prose, sport, and food. These texts are stemmed using a light stemming algorithm. In
order to classify these texts by topic by means of lexical features, linear
hierarchical clustering and non-linear clustering (SOM) are carried out on the stemmed
and unstemmed texts. Although both clustering methods are able to classify the texts by topic,
and both models produce accurate and stable results, the results suggest that the impact
of light stemming on the accuracy of text classification by topic is negligible. The
accuracy is neither increased nor decreased on the stemmed texts, although the
stemming algorithm helped reduce the dimensionality of the vector space model.
Keywords: stemming, classification, clustering, hierarchical, SOM, topic, content words
INTRODUCTION
Quantitative topic classification of written texts has grown in popularity
with the huge increase in the volume and variety of written texts of all kinds, which may vary
according to use, subject matter, author’s knowledge, textual variety, or event.
All of this has led to the study of different text types, such as narrative, non-fiction, poetry
and so on, all with their own lexical and syntactic patterns. Quantitative topic
classification relies on methods developed in natural language processing and machine
learning to analyse textual documents. Because textual documents must be converted into
a quantitative form before they can be analysed, several conceptual issues in data creation
may hinder any quantitative textual data analysis. For example, the text data can in
general be very sparse because of the large number of redundant lexical features. This
can be attributed to the fact that the English language has several morphological variants of a
single word. Pre-processing procedures, such as cleaning and preparing raw texts for
analysis and word stemming, are commonly carried out before applying an analytical
method to build a robust pattern. The principle is that it is essential to adjust text data by
removing repetition and transforming words to their common base or root form through
stemming. This reduces the dimensionality of the feature space, which makes the text easier
to analyse and process and helps group variants of a word together, which can
be useful for tasks like text classification or clustering. However, word stemmers are known
to produce nonsense or incomplete words, and this is very likely to skew the text data and
therefore the classification results based on it. To investigate this, the study is based
on a corpus of thirty texts written by six authors on the topics of politics, history, science, prose, sport,
and food. Multivariate analytical methods are used to extract a set
of lexical features that define each text so that the thirty texts can be classified using linear
hierarchical clustering and the non-linear clustering method SOM. In topic classification by
lexical features, the time and complexity of the classification process are two important
problems that affect data analysis. Although these are crucial, easy and short processing
should not be accepted at the cost of classification accuracy. Thus, this study is designed
to examine the impact of stemming on topic-based text classification by analysing the
thirty texts with and without stemming to determine which course is more accurate
than the other. This will be discussed in detail in the subsequent sections.
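The trade-off described above, fewer lexical types at the price of nonsense or incomplete stems, can be illustrated with a minimal Python sketch. The suffix list, the function names light_stem and vocabulary, and the toy tokens are assumptions made for illustration only; this is not the light stemming algorithm applied in the study.

```python
from collections import Counter

# Hypothetical suffix list for illustration only; it is NOT the light
# stemming algorithm used in the study.
SUFFIXES = ("ies", "es", "s", "ing", "ed")


def light_stem(word: str) -> str:
    """Strip at most one suffix, keeping a stem of at least three letters."""
    word = word.lower()
    # Try longer suffixes first so that "ies" is preferred over "s".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def vocabulary(tokens, stem=False):
    """Count word types, optionally after stemming each token."""
    return Counter(light_stem(t) if stem else t.lower() for t in tokens)


# Toy tokens invented for this sketch.
tokens = "cats cat voted voting votes vote policies policy".split()
unstemmed = vocabulary(tokens)           # one type per distinct surface form
stemmed = vocabulary(tokens, stem=True)  # variants partly collapsed

print(len(unstemmed), "types before stemming;", len(stemmed), "after")
print(sorted(stemmed))
```

In this toy run the number of types falls, but stems such as "vot" and "polic" are not words, and "vote" and "policy" are left unmerged with their variants, which is the kind of distortion the study sets out to assess.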
Research Problems
Text classification attempts to assign written texts to specific group types that share
the same linguistic features. To do so, the common approach is to look at lexical
words and their frequencies in a given text. The analyst takes the text to be classified,
counts the frequencies of the words, and selects the most distinguishing words of that
text, followed by some text pre-processing steps to keep the resulting data matrix to a
manageable size. Because lexical words and their frequencies play a role in text classification
based on clustering, this can cause conceptual issues in text data creation in at least two
ways: (1) the curse of dimensionality and (2) lexical redundancy/ambiguity.
Dimensionality is a key issue for data analysis in any given application (Moisl, 2015). In
this application the vector space model is used to represent texts and lexical features as
vectors in a multi-dimensional space. Each dimension represents the frequency of a unique
lexical feature across the entire corpus of texts. For example, when analysing written texts, verbs,
adjectives, nouns, adverbs, prefixes, suffixes, word length, word frequency, word clusters,
high-frequency word distribution, etc. could each be a dimension. Each dimension
corresponds to a unique feature, while each text can be represented as a vector within
that space. As the number of lexical features, and thus the number of
dimensions, increases and the data moves from low- to high-dimensional spaces, text data starts to behave
differently and makes analysis more challenging, as shown in Figure (1) below.
Figure 1. Lexical features plotted on a 2-dimensional space
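The vector space representation described above can be sketched in a few lines of Python. The three toy texts and the variable names are invented for illustration and are not drawn from the study's corpus; the sketch only shows, under those assumptions, how each word type becomes a dimension and how quickly the resulting frequency matrix fills with zeros.

```python
from collections import Counter

# Toy texts invented for illustration; they are NOT from the study's corpus.
texts = {
    "text_a": "the cat chased the cats",
    "text_b": "the cattery housed a catty cat",
    "text_c": "voters elected the senate",
}

# Word frequencies per text.
counts = {name: Counter(body.split()) for name, body in texts.items()}

# One dimension per distinct word type found anywhere in the corpus.
features = sorted(set().union(*counts.values()))

# Each row is the frequency vector of one text in that feature space.
matrix = [[counts[name][f] for f in features] for name in texts]

# Share of matrix cells that hold a non-zero frequency.
nonzero = sum(1 for row in matrix for cell in row if cell > 0)
density = nonzero / (len(matrix) * len(features))

print(features)
for name, row in zip(texts, matrix):
    print(name, row)
print(f"{len(features)} dimensions, density = {density:.2f}")
```

Even with three short texts, fewer than half of the cells are non-zero, and the morphological variants of "cat" each occupy a separate column.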
For example, lexical items such as ‘cat’, ‘cats’, ‘catty’, and ‘cattery’, which are
recognized as distinct lexical types or as morphological variants of the same word ‘CAT’,
will be assigned four dimensions in the data matrix. If each of the four variables takes
integer values in the range 1...10, the ratio of data points to possible values is 10/(10 × 10
× 10 × 10) = 0.001, that is, the data points occupy 0.1% of the data space. It is, therefore,
clear that lexical frequency text data will, in general, be very sparse on account of (...truncated)