Efficient continuous kNN join over dynamic high-dimensional data
World Wide Web
https://doi.org/10.1007/s11280-023-01204-9
Nimish Ukey1 · Guangjian Zhang1 · Zhengyi Yang1 · Binghao Li2 · Wei Li3 ·
Wenjie Zhang1
Received: 27 February 2023 / Revised: 22 August 2023 / Accepted: 22 August 2023
© The Author(s) 2023
Abstract
Given a user dataset U and an object dataset I, a kNN join query in high-dimensional space
returns the k nearest neighbors of each object in dataset U from the object dataset I. The kNN
join is a basic and necessary operation in many applications, such as databases, data mining,
computer vision, multi-media, machine learning, recommendation systems, and many more.
In the real world, datasets frequently update dynamically as objects are added or removed. In
this paper, we propose novel methods of continuous kNN join over dynamic high-dimensional
data. We first propose the HDR+ Tree, which supports more efficient insertion, deletion,
and batch update. Observing further that existing methods rely on globally correlated
datasets for effective dimensionality reduction, we then propose the HDR Forest. It clusters
the dataset and constructs multiple HDR Trees to capture local correlations among the data.
As a result, our HDR Forest is able to process non-globally correlated datasets efficiently. Two
novel optimisations are applied to the proposed HDR Forest, including the precomputation of
the PCA states of data items and pruning-based kNN recomputation during item deletion. For
the completeness of the work, we also present the proof of computing distances in reduced
dimensions of PCA in HDR Tree. Extensive experiments on real-world datasets show that
the proposed methods and optimisations outperform the baseline algorithms of naive RkNN
join and HDR Tree.
Keywords K nearest neighbors · KNN join · Dynamic data · High-dimensional data
1 Introduction
The k-Nearest Neighbor (kNN) join problem is fundamental in many data analytic and data
mining applications, such as classification [1–3], clustering [4, 5], outlier detection [6–10],
similarity search [11–13], etc. It can also be applied in some applications of the healthcare
Nimish Ukey and Guangjian Zhang contributed equally to this work.
✉ Zhengyi Yang
Extended author information available on the last page of the article
domain, such as anomaly detection in healthcare data [14], multiclass classification [15],
emotion classification [16], similarity search [17], and the detection of autism spectrum
disorder (ASD) in children [18]. Given a query dataset U and an object dataset I in high-dimensional space,
a kNN join query returns the k nearest neighbors in I of every object in U. For example,
social media platforms such as YouTube, Netflix, Twitter, and Facebook
represent people and content as feature vectors in a high-dimensional space and use kNN join to make
suggestions based on what people like. E-commerce recommendation systems use kNN join
similarly to suggest products that customers are more likely to buy.
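The definition above can be illustrated with a minimal brute-force sketch. It is not the method proposed in this paper; it only shows what a kNN join returns. The function name `knn_join`, the dictionary layout, and the toy two-dimensional vectors are assumptions made for illustration, and Euclidean distance is assumed as the metric.

```python
import heapq
import math


def knn_join(users, items, k):
    """Brute-force kNN join: for each user vector, the ids of its k nearest items.

    users and items map an id to a coordinate tuple (hypothetical layout).
    """
    result = {}
    for uid, u in users.items():
        # Keep the k (id, vector) pairs with the smallest Euclidean distance to u.
        nearest = heapq.nsmallest(
            k, items.items(), key=lambda kv: math.dist(u, kv[1])
        )
        result[uid] = [iid for iid, _ in nearest]
    return result


# Toy data, invented for this example only.
users = {"u1": (0.0, 0.0), "u2": (5.0, 5.0)}
items = {"a": (1.0, 0.0), "b": (4.0, 4.0), "c": (0.0, 2.0), "d": (6.0, 5.0)}
print(knn_join(users, items, k=2))
```

This scan costs one distance computation per user and item pair, which is exactly the cost that becomes prohibitive when the join has to be repeated after every update.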
In many modern uses of kNN join, such as those listed above, data is created at a very
fast rate. According to Twitter, approximately 350,000 tweets were sent per minute [19].
To utilise the newly generated data and provide an up-to-date and timely response, there is a demand
for an efficient kNN join over highly dynamic data.
The vast majority of existing kNN join approaches [9,
20–24] work with static data. For these methods to work with dynamic data, the kNN join has to be
recalculated from scratch every time the object dataset is updated, such as when a new object
is added or an old one is removed. This leads to massive processing time and extremely
high latency. Yu et al. [25] devised the high-dimensional kNNJoin+ algorithm to dynamically
update new data points, enabling incremental updates on kNN join results. However, because it is a
disk-based technique, it cannot meet the real-time needs of real-world applications. Further
work by Yang et al. [26] proposes the index structure of High-dimensional R-tree (HDR
Tree) on dynamic kNN join (DkNNJ). It identifies data nodes whose kNN are affected by the
inserted data and updates only the affected data points to avoid redundant computation. In
addition, HDR Tree performs dimensionality reduction through principal component analysis
(PCA) and clustering to further prune candidates.
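The idea of updating only the affected data points on insertion can be sketched as follows. This is a simplified illustration, not the HDR Tree itself: it checks every user linearly, whereas the index described above prunes candidates using PCA and clustering. The name `insert_item` and the list-of-pairs layout are assumptions.

```python
import math


def insert_item(knn, users, item_id, item_vec, k):
    """Update kNN lists in place for one inserted item; return affected user ids.

    knn maps a user id to a list of (distance, item id) pairs sorted by
    ascending distance (hypothetical layout).
    """
    affected = []
    for uid, u in users.items():
        d = math.dist(u, item_vec)
        neighbours = knn[uid]
        # A user is affected only if its list is not yet full or the new item
        # is closer than its current k-th nearest neighbour.
        if len(neighbours) < k or d < neighbours[-1][0]:
            neighbours.append((d, item_id))
            neighbours.sort()
            del neighbours[k:]  # keep only the k closest
            affected.append(uid)
    return affected


# Toy data, invented for this example only.
users = {"u1": (0.0, 0.0), "u2": (5.0, 5.0)}
knn = {
    "u1": [(1.0, "a"), (2.0, "c")],
    "u2": [(1.0, "d"), (math.sqrt(2.0), "b")],
}
print(insert_item(knn, users, "e", (0.5, 0.0), k=2))
```

In this toy run only `u1` is touched, because the new item is farther from `u2` than the second-nearest neighbour of `u2`; users that fail the test keep their lists unchanged.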
Insertion and deletion are the most fundamental update operations. Existing techniques primarily focus on insertion. For every deletion
of a data item, they have to recompute the kNN for all query points as in static solutions,
which results in high time complexity and inefficiency. Existing algorithms also update the results of a kNN join
on every update operation, and none of them supports batch updates.
Considering the fast growth of high-velocity streaming data, these approaches significantly
limit the performance of dynamic kNN join on large datasets. To address these issues, we
proposed lazy updates, batch updates, and optimised deletions in our previous work [27].
We design a lazy update mechanism. It identifies the users whose kNN should be updated
on insertions and deletions and marks them as “dirty” nodes in the HDR Tree. The actual
updating computation is delayed until the kNN values of the affected users are required. In
batch updates, for a given batch of updates (i.e., insertions and deletions), we propose not
to update the results immediately for each new item. Instead, we find out which users are
affected by the batch of updates before we update them. It helps avoid redundant computation. Item deletions in kNN join are costly operations. We need to search all affected users
and update their kNN list for any deletion operation. Thus, we propose to maintain a reverse
kNN table for all items to speed up the process of searching for affected users.
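The three mechanisms described in this paragraph (lazy updates, batch updates, and a reverse kNN table for deletions) can be sketched together in a small in-memory toy. It is an illustration under stated assumptions, not the implementation from [27]: the class `LazyKnnJoin`, its methods, and the flat dictionaries are hypothetical, and no HDR Tree is involved, so the dirty marks are kept per user rather than per tree node.

```python
import heapq
import math


class LazyKnnJoin:
    """Toy kNN join with a reverse kNN table and lazy recomputation."""

    def __init__(self, users, items, k):
        self.users, self.items, self.k = dict(users), dict(items), k
        self.knn = {}                 # user id -> item ids, nearest first
        self.rknn = {}                # item id -> users listing it as a neighbour
        self.dirty = set(self.users)  # users whose kNN list is stale

    def _recompute(self, uid):
        # Drop the user's old entries from the reverse table, then rebuild.
        for iid in self.knn.get(uid, []):
            self.rknn.get(iid, set()).discard(uid)
        u = self.users[uid]
        best = heapq.nsmallest(
            self.k, self.items, key=lambda i: math.dist(u, self.items[i])
        )
        self.knn[uid] = best
        for iid in best:
            self.rknn.setdefault(iid, set()).add(uid)
        self.dirty.discard(uid)

    def apply_batch(self, inserts=(), deletes=()):
        """Apply a batch of updates; only mark affected users, do not recompute."""
        for iid in deletes:
            del self.items[iid]
            # Reverse kNN lookup: only users that listed this item are affected.
            self.dirty |= self.rknn.pop(iid, set())
        for iid, vec in inserts:
            self.items[iid] = vec
            for uid, u in self.users.items():
                if uid in self.dirty:
                    continue  # already scheduled for recomputation
                nbrs = self.knn[uid]
                kth = math.dist(u, self.items[nbrs[-1]]) if nbrs else math.inf
                if len(nbrs) < self.k or math.dist(u, vec) < kth:
                    self.dirty.add(uid)

    def query(self, uid):
        # The actual recomputation is delayed until the result is requested.
        if uid in self.dirty:
            self._recompute(uid)
        return self.knn[uid]


# Toy data, invented for this example only.
users = {"u1": (0.0, 0.0), "u2": (5.0, 5.0)}
items = {"a": (1.0, 0.0), "b": (4.0, 4.0), "c": (0.0, 2.0), "d": (6.0, 5.0)}
join = LazyKnnJoin(users, items, k=2)
join.query("u1"), join.query("u2")
join.apply_batch(inserts=[("e", (0.5, 0.0))], deletes=["d"])
print(sorted(join.dirty), join.query("u1"), join.query("u2"))
```

A user affected by several updates in the same batch is marked once and recomputed once, which is where the saving over per-update recomputation comes from in this sketch.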
This paper extends our conference paper, Efficient kNN Join over Dynamic High-dimensional
Data [27]. Compared to the conference version, we further identify and address the following problems in existing solutions for kNN join over dynamic high-dimensional data:
1. Non-globally correlated data. Existing algorithms [25–27] heavily rely on global correlation in the datasets for effective dimensionality reduction. However, real-world datasets
are usually not globally correlated [28, 29]. Consequently, existing algorithms may fail
to capture distinct features on non-globally correlated data.
2. Redundant PCA Computation. Previously, whenever an item was inserted or deleted, we
had to recompute the transformed representation of that item according to the dimensionality of
the level concerned. This creates redundant PCA computation, which is very costly.
3. Inefficient (...truncated)