Semiparametric Latent Topic Modeling on Consumer-Generated Corpora
Annals of Data Science
https://doi.org/10.1007/s40745-025-00587-y
ORIGINAL ARTICLE
Dominic B. Dayta¹ · Erniel B. Barrios²
Received: 17 October 2023 / Revised: 18 December 2024 / Accepted: 17 January 2025
© The Author(s) 2025
Abstract
Common methods used for topic modeling have generally suffered from overfitting, leading to diminished predictive performance, as well as a weakness in
reconstructing sparse topic structures that involve only a few critical words to aid
in interpretation. Considering the text typically contained in customer feedback, this
paper proposes a semiparametric topic model utilizing a two-step approach: (1) use
nonnegative matrix factorization to recover topic distributions based on word
co-occurrences; and (2) use semiparametric regression to identify factors driving the
expression of particular topics in the documents given additional auxiliary information such as location, time of writing, and other features of the author. This approach
provides a generative model that can be useful for predicting topics in new documents
based on these auxiliary variables, and is demonstrated to accurately identify topics
even for documents limited in length or size of vocabulary. In an application to real
customer feedback, the topics provided by our model are shown to be as interpretable
and useful for downstream analysis tasks as those produced by current legacy
methods.
Keywords: Topic modelling · Semiparametric regression · Latent Dirichlet allocation · Nonnegative matrix factorization · Customer complaint
Erniel B. Barrios
Dominic B. Dayta
1 School of Statistics, University of the Philippines Diliman, Quezon City, Philippines
2 School of Business, Monash University Malaysia, Selangor, Malaysia
1 Introduction
The fields of natural language processing and information retrieval saw a productive
past two decades due largely to the emergence and worldwide adoption of two modern
technologies: large-scale document indexing and storage facilities, of which perhaps
the two most prominent brands are JSTOR and Google Books, and social networking
sites that allow individual users to create and distribute various types of content, a
considerable fraction of which exists in the form of text (status updates, blog posts, and
tweets). All these have led to a relentless growth in information-rich but unstructured
collections of text data (referred to as corpora in natural language terminology) in
terms of volume, velocity, and frequency such that manual approaches to document
indexing and classification are quickly becoming obsolete.
Outside the context of online archives, methods that enable automated classification
and analysis of voluminous corpora would prove to be valuable technology. Such methods have been
applied to legal research [1] and to analyzing patterns behind railroad accidents [2].
In the commercial space, companies can take advantage of thousands of posts being
contributed by users on a daily basis about their products and services on social media
and review aggregator websites like Yelp and TripAdvisor.
Among the core functions of Customer Relations Management (CRM) departments
in customer-facing industries is capturing what they call the Voice of the Customers
(VOC). VOC refers to feedback self-reported by customers in the form of verbatim
complaints, comments, inquiries, and the like, sent in via one or more points of capture.
Presently, the standard approach that industries have taken towards the capture and
analysis of VOC is to employ a Business Process Outsourcing (BPO)
partner that would, in turn, deploy customer care representatives to receive and process
feedback. Representatives are trained in handling customers and are oriented towards
categorizing feedback according to subject. Through this arrangement, previously
unstructured data from call transcripts, e-mails, SMS, social media posts, and other
possible venues are transformed into structured (i.e., tabular) summaries which are
then sent to the client company for resolution, actions, and further analysis.
The proliferation of social networking and micro-blogging services on the Internet has given consumers an inexhaustible variety of platforms through which they
may voice out satisfaction or dissatisfaction towards these companies’ products and
services. All these have led to a relentless growth of VOC in terms of volume, velocity, and frequency such that manual approaches to feedback management are quickly
becoming inefficient. Successful formulation of a new methodology for automated
complaints classification would not only impact businesses directly concerned, but
also their outsourced service providers as this would ease the growing tedium of
manual feedback capture systems and allow for better, more strategic allocation and
management of manpower.
This need for automation is hardly novel in the literature. Hotel reviews on certain travel websites have been analyzed to identify the factors driving
customer satisfaction [3]. This was accomplished by grouping together known
words appearing in the reviews under general tokens that identify thematic similarities between customers’ complaints and commendations. Other, more sophisticated
approaches involve the use of fuzzy algorithms [4].
Both approaches can be seen as forerunners to the use of topic modeling for analyzing
VOC, wherein the topic structures were defined a priori by the researchers (or, in the
latter case, through fuzzy logic). In true topic modeling, these structures are discovered
rather than pre-determined by the analyst, and this discovery provides the objective
of the algorithm. In [3], Latent Semantic Analysis (LSA) was used to extract linguistic
characteristics from customer complaints, and these characteristics were later used as
features in a classification model.
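As a rough illustration of what LSA-style feature extraction involves, the sketch below builds a document-term count matrix and extracts its leading singular direction. It is a minimal sketch under stated assumptions, not the procedure of [3]: the four-document corpus is hypothetical, the function names `document_term_matrix` and `leading_singular_triplet` are illustrative, and plain power iteration stands in for a full SVD routine (in practice a library call such as `numpy.linalg.svd` would normally be used).

```python
import math
from collections import Counter


def document_term_matrix(docs):
    """Return (vocab, matrix); matrix[i][j] counts vocab[j] in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    column = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        row = [0.0] * len(vocab)
        for word, count in Counter(tokens).items():
            row[column[word]] = float(count)
        matrix.append(row)
    return vocab, matrix


def leading_singular_triplet(a, iters=200):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # uniform starting vector
    for _ in range(iters):
        av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        atav = [sum(a[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in atav))
        v = [x / norm for x in atav]
    av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av]  # document scores on the leading dimension
    return sigma, u, v


# Hypothetical toy corpus: two delivery complaints, two hotel commendations.
docs = [
    "late delivery late refund",
    "refund for late delivery",
    "friendly staff clean room",
    "clean room friendly service",
]

vocab, matrix = document_term_matrix(docs)
sigma, u, v = leading_singular_triplet(matrix)

# Words with the largest loadings describe the leading "semantic" dimension.
top_words = sorted(zip(vocab, v), key=lambda pair: -pair[1])[:3]
print("leading singular value:", round(sigma, 4))
print("top words:", [word for word, _ in top_words])
print("document scores:", [round(x, 4) for x in u])
```

The document scores in `u` are the kind of low-dimensional features that could then be passed to a downstream classifier; further dimensions would be obtained from a full or truncated SVD rather than from this single-vector sketch.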
LSA uses the method of Singular Value Decomposition (SVD) to discover underlying
“semantic structures” defined by word co-occurrences [5]. This method, like the
others that would succeed it, depended on a specific representation of the corpus
in a matrix form that is much more suited for statistical analysis. By representing
each document as a vector defined by its frequencies across a set of unique words,
the document vectors together formed a matrix for the entire corpus, which could be
subjected to factorization via SVD. This would be refined with Probabilistic Latent
Semantic Analysis or PLSA [6] which provided a more interpretable framework by
defining the topics as probability distributions over words, and replacing SVD with a
more formal estimation procedure via the Expectation–Maximization (EM) algorithm.
Later, Latent Dirichlet Allocation (LDA) addressed some of PLSA’s shortcomings by
giving it a Bayesian flavor [7]. Nevertheless, LSA has arguably set the directi (...truncated)