Identifying the Intents Behind Website Visits by Employing Unsupervised Machine Learning Models
Annals of Data Science
https://doi.org/10.1007/s40745-024-00586-5
REVIEW ARTICLE
Judah Soobramoney1 · Retius Chifurira1 · Temesgen Zewotir1 ·
Knowledge Chinhamu1
Received: 8 October 2022 / Revised: 17 October 2023 / Accepted: 19 November 2023
© The Author(s) 2025
Abstract
With digitisation globally on the rise, corporates are compelled to better understand the
usage of their websites. In doing so, corporates will be empowered to better understand
consumers, and make necessary adjustments to ultimately improve the corporate’s
stance in the competitive global landscape of this modern age. However, the online
website visit data has proven to be highly complex, big in data volume, and highly
transactional with users expressing unique behaviours. Thus, extracting insight can be
a complex problem to solve. This study aimed to employ unsupervised machine learning models to identify the intentions behind the visits on the observed website. The
data studied was sourced from the Google Analytics tracking tool that was deployed
on a corporate informative website. The study employed k-means, hierarchical and
DBSCAN unsupervised machine learning models to understand the intents behind visits to the studied website. All three models detected five major intents that were
expressed within the observed data. The intents identified were labelled as “accidentals”, “drop-offs”, “engrossed”, “get-in-touch” and “seekers”. On the observed data,
all three unsupervised machine learning methods performed well. However, in
the context of the study, which investigated the intents that drove online visits, the hierarchical clustering method yielded superior results by maintaining the best balance
between cluster homogeneity (stronger silhouette coefficients) and cluster size.
Keywords Cubic clustering criteria · Dbscan · Dendrogram · Google analytics
tracking · Hierarchical clustering · K-means · Online website visits · Silhouette’s
coefficient · Unsupervised machine learning
1
Judah Soobramoney
School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Durban,
South Africa
1 Introduction
In a digital world, corporates leverage web analytics to optimize their offerings,
image and operations [1, 2]. Web analytics entails the study of a website’s users’ online
activity for business intelligence. Tracking tools such as Google Analytics empower
website owners to see the volume of visitors that enter the website, the web
pages they visit, the page path followed, the time spent on the website, the device type
used, the device brand, the operating system, the coordinates of the device, the number
of times the device has accessed the website, the entry point to the website and much
more [3].
Data science techniques can be employed to make sense of the often big and complex web visit data [4]. Data science is an interdisciplinary field
that employs statistics, specialized programming and domain knowledge to analyse
big and complex data such as web visit data. In this study, the board of directors of
an engineering and engineering-training corporate (TEKmation) were interested in the
activity on its website. The business needed intelligence on what visitors were doing on its
website [5, 6]. With a high volume of visitors, each entering the website with different intentions and following unique page paths, the data at face value proved highly
complex. To understand the website usage in a reasonably digestible manner,
the study employed three unsupervised
machine learning models to cluster the web traffic and reveal the underlying
intentions inherent within the data. The intents that emerged from the three
methods are compared within this study.
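As a rough illustration of clustering web traffic and judging the result, the sketch below runs a plain k-means and scores each clustering with the mean silhouette coefficient. It is not the study's pipeline: the two session features (pages viewed, minutes on site), the toy values, and the helper names `kmeans` and `silhouette` are assumptions made for this example, and only k-means stands in for the three models named above. A single random initialisation can settle in a local optimum, so the scores are indicative only.

```python
import math
import random

# Hypothetical per-session features: (pages viewed, minutes on site).
sessions = [
    (1.0, 0.1), (1.0, 0.3), (2.0, 0.4), (1.0, 0.2),   # brief visits
    (4.0, 2.0), (5.0, 2.5), (4.0, 3.0), (5.0, 2.2),   # moderate visits
    (8.0, 6.0), (9.0, 7.5), (7.0, 6.5), (8.0, 7.0),   # long visits
]

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels

def silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points.

    a = mean distance to the other members of the point's own cluster,
    b = smallest mean distance to the members of any other cluster.
    """
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Compare candidate cluster counts by their mean silhouette coefficient.
for k in (2, 3, 4):
    print(k, round(silhouette(sessions, kmeans(sessions, k)), 3))
```

Values closer to 1 indicate tighter, better-separated clusters. In practice the features would be standardised first so that no single unit dominates the distance.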
1.1 Related Work
With the digital marketplace growing globally, web analytics is becoming increasingly important [7, 8]. Idowu and Kattukottai employed several clustering
methods to segment online purchase data following a recency, frequency and monetary
model. The study found that the hierarchical clustering method performed best [9].
Porsche et al. [10] conducted a study to understand reading behaviour
within online books. The main purpose of the study was to assess the performance of
the Google Analytics tracking tool as a means of advanced tracking of reading behaviour within webbooks, prescribe measurements for the reading behaviour of
webbooks and present the results of a pilot study. Through the analytics conducted, the
study suggested deployment procedures for webbooks and presented possible methods of webbook performance evaluation. Furthermore, the study concluded that
Google Analytics was a valuable tool for tracking traffic to individual books and
quantifying the traffic to the entire webbook collection on the observed data, through
the use of the unique custom and advanced metrics that were proposed [10].
Domazet and Simovic [11] conducted a study to measure the performance of an
online informal educational institution through web analytics. The study
sought to determine the best-performing acquisition channel for non-formal educational institutions and the aggregate visitor profile of this kind of educational program
by means of visitor acquisition and behaviour data. The key metrics employed to assess
the performance of the various acquisition channels were the conversion rate, average
session duration, and bounce rate. In addition, visitor demographics (such as gender and
age) supplemented the visitor-specific data. The study concluded favourably and
suggested that its findings could apply to other non-formal educational institutions [11].
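The three channel metrics named above can be sketched from raw session records as follows. The records, the field layout, and the function name `channel_metrics` are hypothetical. The definitions are the commonly used ones and are assumed here, not taken from [11]: a bounce is a single-page session, and the conversion rate is the share of sessions that converted.

```python
from collections import defaultdict

# Hypothetical session records:
# (acquisition channel, pages viewed, duration in seconds, converted?)
sessions = [
    ("organic", 1, 0, False),
    ("organic", 4, 180, True),
    ("social", 1, 0, False),
    ("social", 1, 0, False),
    ("social", 3, 90, False),
    ("email", 5, 300, True),
]

def channel_metrics(records):
    """Aggregate per-channel conversion rate, average duration and bounce rate."""
    grouped = defaultdict(list)
    for channel, pages, duration, converted in records:
        grouped[channel].append((pages, duration, converted))

    metrics = {}
    for channel, rows in grouped.items():
        n = len(rows)
        metrics[channel] = {
            "sessions": n,
            # Share of sessions that ended in a conversion.
            "conversion_rate": sum(1 for _, _, c in rows if c) / n,
            # Mean time on site, in seconds.
            "avg_session_duration": sum(d for _, d, _ in rows) / n,
            # Share of sessions that viewed exactly one page (assumed definition).
            "bounce_rate": sum(1 for p, _, _ in rows if p == 1) / n,
        }
    return metrics

for channel, values in channel_metrics(sessions).items():
    print(channel, values)
```

Channels can then be ranked on any of the three values, for example by descending conversion rate.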
Jonathan et al. [12] employed web analytics to understand church members’ activities on the “Church Cast” application that hosted online sermons. It was believed that
the study would ultimately increase user knowledge and interaction. “Church Cast”
was developed to make the sermons of Gospel ministries more conveniently available to their
church members through the digital channels of the internet and mobile devices. It
was believed that low capacity, time constraints and religious
restrictions in certain parts of the world have left church members unable
to physically attend their places of worship to listen to or watch their ministers.
The application, being web-based, tracked visitors’ usage and thereby informed the
administrator of visitors’ activities on the application [12].
Semeradova and Weinlich [13] sought to examine web traffic of user-friendly websites and thereby proposed an analytical procedure based on data sourced fro (...truncated)