Efficient continuous kNN join over dynamic high-dimensional data

World Wide Web, Sep 2023

Given a user dataset U and an object dataset I, a kNN join query in high-dimensional space returns the k nearest neighbors of each object in dataset U from the object dataset I. The kNN join is a basic and necessary operation in many applications, such as databases, data mining, computer vision, multimedia, machine learning, and recommendation systems. In the real world, datasets are frequently updated as objects are added or removed. In this paper, we propose novel methods for continuous kNN join over dynamic high-dimensional data. We first propose the HDR+ Tree, which supports more efficient insertion, deletion, and batch update. Observing further that existing methods rely on globally correlated datasets for effective dimensionality reduction, we then propose the HDR Forest. It clusters the dataset and constructs multiple HDR Trees to capture local correlations among the data. As a result, the HDR Forest is able to process non-globally correlated datasets efficiently. Two novel optimisations are applied to the proposed HDR Forest: precomputation of the PCA states of data items, and pruning-based kNN recomputation during item deletion. For completeness, we also present the proof for computing distances in the PCA-reduced dimensions of the HDR Tree. Extensive experiments on real-world datasets show that the proposed methods and optimisations outperform the baseline algorithms of naive RkNN join and HDR Tree.

World Wide Web, https://doi.org/10.1007/s11280-023-01204-9

Nimish Ukey¹ · Guangjian Zhang¹ · Zhengyi Yang¹ · Binghao Li² · Wei Li³ · Wenjie Zhang¹

Received: 27 February 2023 / Revised: 22 August 2023 / Accepted: 22 August 2023
© The Author(s) 2023
Nimish Ukey and Guangjian Zhang contributed equally to this work. Corresponding author: Zhengyi Yang.

Keywords K nearest neighbors · KNN join · Dynamic data · High-dimensional data

1 Introduction

The k-Nearest Neighbor (kNN) join problem is fundamental in many data analytics and data mining applications, such as classification [1–3], clustering [4, 5], outlier detection [6–10], and similarity search [11–13]. It can also be applied in the healthcare domain, for example for anomaly detection in healthcare data [14], multiclass classification [15], emotion classification [16], similarity search [17], and detecting children with autism spectrum disorder (ASD) [18].

Given a query dataset U and an object dataset I in high-dimensional space, a kNN join query returns the kNN of all objects in dataset U from dataset I. For example, social media platforms such as YouTube, Netflix, Twitter, and Facebook represent people and content as feature vectors in a high-dimensional space and use kNN join to make suggestions based on what people like. E-commerce recommendation systems use kNN join similarly to suggest products that customers are more likely to buy.

In many modern uses of kNN join, like the ones listed above, data is being created at a very fast rate. According to Twitter, approximately 350,000 tweets were sent per minute [19]. To utilise the newly generated data and provide up-to-date and timely responses, there is demand for an efficient kNN join over highly dynamic data. The vast majority of existing kNN join approaches [9, 20–24] work with static data.
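As a point of reference, the static kNN join defined above can be sketched as a brute-force scan. This is only an illustrative baseline, not the method proposed in the paper; the function name `knn_join`, the dictionary layout, and the use of Euclidean distance are assumptions made for the example.

```python
import heapq
import math


def knn_join(users, items, k):
    """Brute-force kNN join (illustrative baseline, not the paper's method).

    users, items: dicts mapping an id to a coordinate tuple (assumed layout).
    Returns {user_id: [(distance, item_id), ...]} sorted by distance.
    """
    result = {}
    for uid, u in users.items():
        # Every user scans every item, so the cost grows with |U| * |I| * d.
        scored = ((math.dist(u, v), iid) for iid, v in items.items())
        result[uid] = heapq.nsmallest(k, scored)
    return result


users = {"u1": (0.0, 0.0), "u2": (10.0, 10.0)}
items = {"a": (1.0, 0.0), "b": (0.0, 2.0), "c": (9.0, 9.0), "d": (20.0, 20.0)}
joined = knn_join(users, items, k=2)
print({uid: [iid for _, iid in nn] for uid, nn in joined.items()})
```

Because each call scans all of I for every user, rerunning such a join after every single insertion or deletion quickly becomes the bottleneck.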
For these methods to work with dynamic data, the kNN join has to be recalculated from scratch every time the object dataset is updated, such as when a new object is added or an old one is removed. This leads to massive processing time and extremely high latency. Yu et al. [25] devised the high-dimensional kNNJoin+ algorithm to dynamically update new data points, enabling incremental updates of kNN join results. However, because it is a disk-based technique, it cannot meet the real-time needs of real-world applications. Further work by Yang et al. [26] proposes the High-dimensional R-tree (HDR Tree) index structure for dynamic kNN join (DkNNJ). It identifies the data nodes whose kNN are affected by the inserted data and updates only the affected data points to avoid redundant computation. In addition, the HDR Tree performs dimensionality reduction through principal component analysis (PCA) and clustering to further prune candidates.

Insertion and deletion are the most fundamental update operations, yet existing techniques primarily focus on insertion. For every deletion of a data item, they have to recompute the kNN for all query points as in static solutions, which results in high time complexity and inefficiency. Existing algorithms also update the kNN join results on every update operation, and none of them supports batch updates. Considering the fast growth of high-velocity streaming data, these limitations significantly restrict the performance of dynamic kNN join on large datasets.

To address these issues, we proposed lazy updates, batch updates, and optimised deletions in our previous work [27]. The lazy update mechanism identifies the users whose kNN should be updated on insertions and deletions and marks them as “dirty” nodes in the HDR Tree. The actual update computation is delayed until the kNN values of the affected users are required.
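The lazy update idea can be illustrated with a minimal sketch. It is a simplification under stated assumptions: a linear scan over users stands in for the HDR Tree search, "dirty" marks are kept per user in a set rather than on tree nodes, and the class name `LazyKnnJoin` and its methods are hypothetical.

```python
import heapq
import math


class LazyKnnJoin:
    """Toy lazy-update kNN join; a linear scan replaces the HDR Tree (assumption)."""

    def __init__(self, users, items, k):
        self.users, self.items, self.k = dict(users), dict(items), k
        self.knn = {}                 # user id -> sorted [(distance, item id)]
        self.dirty = set(self.users)  # nothing has been computed yet

    def _recompute(self, uid):
        u = self.users[uid]
        scored = ((math.dist(u, v), iid) for iid, v in self.items.items())
        self.knn[uid] = heapq.nsmallest(self.k, scored)
        self.dirty.discard(uid)

    def insert_item(self, iid, vec):
        self.items[iid] = vec
        for uid, u in self.users.items():
            if uid in self.dirty:
                continue  # already scheduled for recomputation
            cur = self.knn[uid]
            # Affected only if the new item beats the current k-th neighbour.
            if len(cur) < self.k or math.dist(u, vec) < cur[-1][0]:
                self.dirty.add(uid)

    def delete_item(self, iid):
        del self.items[iid]
        for uid, cur in self.knn.items():
            if any(j == iid for _, j in cur):
                self.dirty.add(uid)  # mark only; no recomputation yet

    def get_knn(self, uid):
        if uid in self.dirty:  # the deferred work happens on demand
            self._recompute(uid)
        return [iid for _, iid in self.knn[uid]]
```

In this sketch an update only marks users, and the cost of recomputation is paid when `get_knn` is called for a marked user, so users that are never queried between updates are never recomputed.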
In batch updates, for a given batch of updates (i.e., insertions and deletions), we propose not to update the results immediately for each new item. Instead, we first find out which users are affected by the batch of updates and then update them, which avoids redundant computation. Item deletions are costly operations in kNN join: for any deletion, we need to search for all affected users and update their kNN lists. Thus, we propose to maintain a reverse kNN table for all items to speed up the search for affected users.

This paper extends the conference paper Efficient kNN Join over Dynamic High-dimensional Data [27]. Compared to the conference version, we further identify and address the following problems in existing solutions for kNN join over dynamic high-dimensional data:

1. Non-globally correlated data. Existing algorithms [25–27] heavily rely on global correlation in the datasets for effective dimensionality reduction. However, real-world datasets are usually not globally correlated [28, 29]. Consequently, existing algorithms may fail to capture distinct features on non-globally correlated data.
2. Redundant PCA computation. Earlier, every time an item was inserted or deleted, we had to recompute the transformed (reduced-dimension) representation of that item based on the dimension of that level. This creates redundant PCA computation, which is very costly.
3. Inefficient (...truncated)
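The reverse kNN table and batch handling described earlier in this section can be pictured as an inverted index from items to the users that currently list them as neighbours. The sketch below is a hedged illustration only; the function names and the dictionary-of-sets layout are assumptions, and how the paper stores this table inside the HDR Tree is not shown in this excerpt.

```python
from collections import defaultdict


def build_reverse_table(knn):
    """Invert {user: [item, ...]} into {item: {user, ...}} (assumed layout)."""
    rknn = defaultdict(set)
    for uid, neighbours in knn.items():
        for iid in neighbours:
            rknn[iid].add(uid)
    return rknn


def users_affected_by_deletions(rknn, deleted_items):
    """Collect, once per batch, the users whose kNN list contains a deleted item."""
    affected = set()
    for iid in deleted_items:
        affected |= rknn.get(iid, set())
    return affected


knn = {"u1": ["a", "b"], "u2": ["c", "b"], "u3": ["c", "d"]}
rknn = build_reverse_table(knn)
print(sorted(users_affected_by_deletions(rknn, ["a", "d"])))
```

With such a table, a batch of deletions requires one lookup per deleted item instead of a scan over every user's kNN list, and a user touched by several deletions in the same batch is collected only once.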


This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1007/s11280-023-01204-9.pdf
Article home page: https://link.springer.com/article/10.1007/s11280-023-01204-9

Ukey, Nimish, Zhang, Guangjian, Yang, Zhengyi, Li, Binghao, Li, Wei, Zhang, Wenjie. Efficient continuous kNN join over dynamic high-dimensional data, World Wide Web, 2023, pp. 1-36, DOI: 10.1007/s11280-023-01204-9