A Novel Accuracy and Similarity Search Structure Based on Parallel Bloom Filters
Chunyan Shuai,1 Hengcheng Yang,1 Xin Ouyang,2 Siqi Li,1 and Zheng Chen3
1Faculty of Electric Power Engineering, Kunming University of Science and Technology, Kunming 650051, China
2Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650051, China
3Faculty of Transportation Engineering, Kunming University of Science and Technology, Kunming 650051, China
Received 21 April 2016; Revised 25 September 2016; Accepted 26 October 2016
Academic Editor: Hong Man
Copyright © 2016 Chunyan Shuai et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
In high-dimensional spaces, accurate and similarity search at low computation and storage cost is a long-standing research challenge, and there is an inherent trade-off between efficiency and accuracy. In this paper, we propose a new structure, Similar-PBF-PHT, to represent the items of a high-dimensional set and to retrieve both exact and similar items. The Similar-PBF-PHT consists of three parts: parallel Bloom filters (PBFs), parallel hash tables (PHTs), and a bitmatrix. Experiments show that the Similar-PBF-PHT is effective for both membership query and K-nearest neighbors (K-NN) search. For exact queries, it achieves a low hit false positive probability (FPP) at acceptable memory cost. For K-NN queries, the average overall ratio and rank-i ratio under the Hamming distance are accurate, and the ratios under the Euclidean distance are acceptable. Retrieval of exact and similar items costs only CPU time, without I/O, and the structure can handle data formats beyond purely numerical values.
1. Introduction
In high-dimensional spaces, exact search methods such as kd-trees and Q-grams are suitable only for small vectors because of their heavy computational cost. Approximate similarity search algorithms, by contrast, can drastically improve search speed while maintaining good precision [1]; they include VA-files, best-bin-first, space-filling curves, K-means (see [2] and references therein), the NV-tree [3], K-nearest neighbors (K-NN), and locality-sensitive hashing (LSH) [4]. Most K-NN methods adopt the Euclidean distance and therefore assume that all coordinates are numerical and share the same units and semantics. In some applications, however, a dimension may be a string or a category, which makes the Euclidean distance questionable and artificial.
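To illustrate why the Euclidean distance is ill-suited to string or categorical dimensions, consider the Hamming distance, which simply counts the coordinates in which two items differ and so applies uniformly to any data format. The following sketch (with made-up example records, not data from the paper) shows this:

```python
def hamming_distance(a, b):
    """Number of coordinates in which two equal-length tuples differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# Mixed-type records: a numerical field, a string field, a category field.
# The Euclidean distance is undefined here, but the Hamming distance is not.
p = (30, "Kunming", "gold")
q = (30, "Beijing", "gold")
print(hamming_distance(p, q))  # 1 (only the city differs)
```

This distance is the one the Similar-PBF-PHT structure proposed below is built around.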
Among query tools, the Bloom filter [5] (BF), a space-efficient random data structure with constant query delay, has been widely applied to represent a large set and answer membership queries [6]. However, a BF can represent only 1-dimensional elements of a set; references [7–9] extended it to high-dimensional and dynamic sets, but these methods answer only the membership query, not the similarity query. In [10, 11], LSH functions replace the random hash functions of the BF to implement similarity search, but [10, 11] can handle only numerical coordinates and return the elements whose distance from the query is at most CR in Euclidean space, which leads to a false negative probability (FNP).
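For readers unfamiliar with the underlying structure, a minimal Bloom filter can be sketched as follows. This is a generic illustration, not the paper's construction: it uses an m-bit array and k indices derived by double hashing from a SHA-1 digest (a common implementation trick, assumed here for simplicity).

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m-bit array, k hash functions (illustrative sketch)."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for clarity over compactness

    def _indices(self, item):
        # Derive k bit positions via double hashing over one SHA-1 digest.
        digest = hashlib.sha1(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        # May report a false positive, but never a false negative.
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.add("apple")
print("apple" in bf)   # True
print("durian" in bf)  # almost certainly False
```

The false positive probability grows with the number of inserted items and shrinks with the bit-array size m, which is the efficiency/accuracy trade-off the paper analyzes.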
Here, by computing the Hamming distance, we propose a new structure, called Similar-PBF-PHT, based on BFs and hash tables (HTs), which answers membership queries as well as K-NN queries regardless of the radius CR. The Similar-PBF-PHT comprises PBFs, PHTs, and a bitmatrix: the PBFs and PHTs use BFs and HTs to store the individual dimensions, while the bitmatrix records the dependencies among the dimensions. Experiments show that the Similar-PBF-PHT outperforms other methods in Hamming spaces. For K-NN search it achieves balanced performance and can process different data formats, whereas other LSH-based methods can handle only numerical values.
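The core idea of storing each dimension in its own Bloom filter can be sketched in a deliberately simplified form. The toy class below (names and structure are our own, not the paper's; the PHTs and the bitmatrix, which recover which dimensions belong to which stored item, are omitted) shows how per-dimension filters support a Hamming-style comparison on mixed data formats:

```python
import hashlib

def indices(value, m=256, k=3):
    """k bit positions for a value of any printable type (illustrative)."""
    digest = hashlib.sha1(str(value).encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % m for i in range(k)]

class ParallelBFs:
    """One Bloom filter per dimension (PBF part only, no PHTs or bitmatrix)."""

    def __init__(self, dims, m=256):
        self.m = m
        self.filters = [bytearray(m) for _ in range(dims)]

    def insert(self, item):
        for dim, value in enumerate(item):
            for idx in indices(value, self.m):
                self.filters[dim][idx] = 1

    def differing_dims(self, query):
        # A dimension whose value misses its filter certainly differs from
        # every stored item, so this count lower-bounds the true Hamming
        # distance to the nearest stored item (false positives can only
        # make it smaller, never larger).
        return sum(
            not all(self.filters[dim][i] for i in indices(v, self.m))
            for dim, v in enumerate(query)
        )

pbf = ParallelBFs(dims=3)
pbf.insert((30, "Kunming", "gold"))
print(pbf.differing_dims((30, "Kunming", "gold")))    # 0
print(pbf.differing_dims((30, "Beijing", "silver")))  # 2 with high probability
```

The full structure additionally needs the PHTs and the bitmatrix to tie per-dimension hits back to concrete items; without them, hits in different filters might come from different stored elements.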
2. Related Work
There are different kinds of approximate search algorithms; we divide them into three categories for discussion.
The first is the space partition family, including IDistance [12] and MedRank [13]. IDistance [12] clusters all high-dimensional elements into multiple partitions and maps them into a 1-dimensional space. It costs linear space and supports data insertion and deletion; however, if the data are distributed uniformly or the dimensions are anisotropic, space partitioning and center selection become difficult. MedRank [13] is a rank aggregation, instance-optimal algorithm that aggregates the given dataset into sorted lists, where every element has an entry of the form (Id, key). The number of lists depends on the number of elements, and by probing the lists MedRank finds approximate NN items. MedRank achieves the best (linear) space bound, but inserting or deleting an element requires updating the lists, and every list must be sorted again.
The LSH and its variants are other famous K-NN (...truncated)