Semiparametric Latent Topic Modeling on Consumer-Generated Corpora

Annals of Data Science, Jan 2025

Common methods used for topic modeling have generally suffered from overfitting, which diminishes predictive performance, and from difficulty reconstructing sparse topic structures that involve only a few critical words to aid in interpretation. Considering the text typically contained in customer feedback, this paper proposes a semiparametric topic model utilizing a two-step approach: (1) using nonnegative matrix factorization to recover topic distributions based on word co-occurrences, and (2) using semiparametric regression to identify factors driving the expression of particular topics in the documents given auxiliary information such as location, time of writing, and other features of the author. This approach provides a generative model that can be useful for predicting topics in new documents based on these auxiliary variables, and is demonstrated to accurately identify topics even for documents limited in length or size of vocabulary. In an application to real customer feedback, the topics provided by our model are shown to be as interpretable and as useful for downstream analysis tasks as those produced by current legacy methods.


Dominic B. Dayta (1) · Erniel B. Barrios (2)

(1) School of Statistics, University of the Philippines Diliman, Quezon City, Philippines
(2) School of Business, Monash University Malaysia, Selangor, Malaysia

Received: 17 October 2023 / Revised: 18 December 2024 / Accepted: 17 January 2025
© The Author(s) 2025

Keywords: Topic modelling · Semiparametric regression · Latent Dirichlet allocation · Nonnegative matrix factorization · Customer complaint

1 Introduction

The fields of natural language processing and information retrieval have seen a productive past two decades, due largely to the emergence and worldwide adoption of two modern technologies: large-scale document indexing and storage facilities, of which perhaps the two most prominent brands are JSTOR and Google Books, and social networking sites that allow individual users to create and distribute various types of content, a considerable fraction of which exists in the form of text (status updates, blog posts, and tweets). All these have led to a relentless growth in information-rich but unstructured collections of text data (referred to as corpora in natural language terminology) in terms of volume, velocity, and frequency, such that manual approaches to document indexing and classification are quickly becoming obsolete.

Outside the context of online archives, methods that enable automated classification and analysis of voluminous corpora would prove to be valuable technology. Such methods have been applied to legal research [1] and to analyzing patterns behind railroad accidents [2]. In the commercial space, companies can take advantage of the thousands of posts that users contribute daily about their products and services on social media and on review aggregator websites like Yelp and TripAdvisor. Among the core functions of Customer Relations Management (CRM) departments in customer-facing industries is capturing what they call the Voice of the Customers (VOC). VOC refers to feedback self-reported by customers in the form of verbatim complaints, comments, inquiries, and the like, sent in via one or more points of capture.
Presently, the standard approach that industries have taken towards the capture and analysis of VOC is the employment of a Business Process Outsourcing (BPO) partner, which in turn deploys customer care representatives to receive and process feedback. Representatives are trained in handling customers and are oriented towards categorizing feedback according to subject. Through this arrangement, previously unstructured data from call transcripts, e-mails, SMS, social media posts, and other possible venues are transformed into structured (i.e., tabular) summaries, which are then sent to the client company for resolution, action, and further analysis.

The proliferation of social networking and micro-blogging services on the Internet has given consumers an inexhaustible variety of platforms through which they may voice satisfaction or dissatisfaction with these companies’ products and services. All these have led to a relentless growth for VOC in terms of volume, velocity, and frequency, such that manual approaches to feedback management are quickly becoming inefficient. Successful formulation of a new methodology for automated complaints classification would impact not only the businesses directly concerned but also their outsourced service providers, as it would ease the growing tedium of manual feedback capture systems and allow for better, more strategic allocation and management of manpower.

This need for automation is hardly novel in the literature. Hotel reviews on certain travel websites have been analyzed to identify the factors driving customer satisfaction [3]. This was accomplished by grouping together known words appearing in the reviews under general tokens that identify thematic similarities between customers’ complaints and commendations. Other, more sophisticated approaches involve the use of fuzzy algorithms [4].
Both approaches can be seen as forerunners to the use of topic modeling for analyzing VOC, wherein the topic structures were defined a priori by the researchers (or, in the latter case, through fuzzy logic). In true topic modeling, these structures are discovered rather than pre-determined by the analyst, and this discovery is the objective of the algorithm. In [3], Latent Semantic Analysis (LSA) was used to extract linguistic characteristics from customer complaints, and these characteristics were later used as features in a classification model. LSA applies Singular Value Decomposition (SVD) to discover underlying “semantic structures” defined by word co-occurrences [5]. This method, like the others that would succeed it, depends on a specific representation of the corpus in a matrix form that is much better suited to statistical analysis. By representing each document as a vector of its frequencies across a set of unique words, the document vectors together form a matrix for the entire corpus, which can be subjected to factorization via SVD. This was refined by Probabilistic Latent Semantic Analysis, or PLSA [6], which provided a more interpretable framework by defining the topics as probability distributions over words and replacing SVD with a more formal estimation procedure via the Expectation–Maximization (EM) algorithm. Later, Latent Dirichlet Allocation (LDA) addressed some of PLSA’s shortcomings by giving it a Bayesian flavor [7]. Nevertheless, LSA has arguably set the directi (...truncated)
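
The matrix representation and SVD step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the function names `document_term_matrix` and `lsa`, whitespace tokenization, raw word counts, and the choice of two latent dimensions are all assumptions made for illustration.

```python
import numpy as np


def document_term_matrix(docs):
    """Build a count matrix: one row per document, one column per unique word."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: j for j, word in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for word in doc.lower().split():
            X[i, index[word]] += 1
    return X, vocab


def lsa(X, k):
    """Rank-k truncated SVD of the document-term matrix."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    doc_coords = U[:, :k] * s[:k]  # rows: documents in the k-dimensional latent space
    word_loadings = Vt[:k, :]      # rows: vocabulary weights for each latent dimension
    return doc_coords, word_loadings


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy feedback: two delivery complaints, two service complaints.
docs = [
    "late delivery late refund",
    "refund delivery delay",
    "rude staff rude service",
    "staff service slow",
]

X, vocab = document_term_matrix(docs)
doc_coords, word_loadings = lsa(X, k=2)

print(vocab)
print("delivery vs delivery:", round(cosine(doc_coords[0], doc_coords[1]), 3))
print("delivery vs service: ", round(cosine(doc_coords[0], doc_coords[2]), 3))
```

In this toy corpus the two groups of documents share no words, so documents within a group point in the same direction in the latent space while documents from different groups are orthogonal. Real feedback has overlapping vocabulary, and the separation would be less clean.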


This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1007/s40745-025-00587-y.pdf
Article home page: https://link.springer.com/article/10.1007/s40745-025-00587-y

Dayta, Dominic B., Barrios, Erniel B. Semiparametric Latent Topic Modeling on Consumer-Generated Corpora, Annals of Data Science, 2025, pp. 1-23, DOI: 10.1007/s40745-025-00587-y