Nearest Neighbor Queries in Metric Spaces

Discrete & Computational Geometry, Jul 1999

Abstract. Given a set S of n sites (points), and a distance measure d, the nearest neighbor searching problem is to build a data structure so that given a query point q, the site nearest to q can be found quickly. This paper gives data structures for this problem when the sites and queries are in a metric space. One data structure, D(S), uses a divide-and-conquer recursion. The other data structure, M(S,Q), is somewhat like a skip list. Both are simple and implementable. The data structures are analyzed when the metric space obeys a certain sphere-packing bound, and when the sites and query points are random and have distributions with an exchangeability property. This property implies, for example, that query point q is a random element of \( S\cup\{q\} \). Under these conditions, the preprocessing and space bounds for the algorithms are close to linear in n. They depend also on the sphere-packing bound, and on the logarithm of the distance ratio \( \Upsilon(S) \) of S, the ratio of the distance between the farthest pair of points in S to the distance between the closest pair. The data structure M(S,Q) requires as input data an additional set Q, taken to be representative of the query points. The resource bounds of M(S,Q) have a dependence on the distance ratio of \( S\cup Q \). While M(S,Q) can return wrong answers, its failure probability can be bounded, and is decreasing in a parameter K. Here \( K\le |Q|/n \) is chosen when building M(S,Q). The expected query time for M(S,Q) is \( O(K\log n)\log\Upsilon(S\cup Q) \), and the resource bounds increase linearly in K. The data structure D(S) has expected \( O(\log n)^{O(1)} \) query time, for fixed distance ratio. The preprocessing algorithm for M(S,Q) can be used to solve the all nearest neighbor problem for S in \( O(n(\log n)^2(\log\Upsilon(S))^2) \) expected time.


K. L. Clarkson, Bell Laboratories, Lucent Technologies, Murray Hill, NJ 07974, USA

1. Introduction

This paper addresses algorithmic questions related to the post office or nearest neighbor problem: Let (V, d) be a metric space, where V is a set and d is a distance measure on V. Given a set S ⊆ V of n sites (points), build a data structure so that given a query point q ∈ V, the nearest site to q can be found quickly.

Two data structures are given here for this problem. One data structure, M(S, Q), requires an additional set Q of m points, taken to be representative of typical query points. The data structure M(S, Q) may fail to return a correct answer, but the failure probability can be made arbitrarily small; this decrease requires a proportional increase in query time and data structure space. The other data structure, D(S), always returns a correct answer and does not need the set Q, but its provable resource bounds are worse.

The data structure M(S, Q) has been implemented, with some preliminary tests. For example, when the points of S, Q, and q are uniformly distributed in a square, and the distance measure is Euclidean (ℓ₂), a version of the data structure gives the correct answer in all of 500 tests. For this version, searching requires about 21 distance evaluations for |S| = 2000, and the space required is about 8 integers per site. (The data structure implemented uses K = 1000, γ = 1.2, and an "early termination" to a collection of sites to be searched by brute force.)
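The algorithms treat the distance measure purely as a black box. As a baseline for comparison, the naive algorithm answers a query with n distance evaluations and no preprocessing; a minimal sketch in Python (names hypothetical):

```python
def nearest_site(S, q, dist):
    """Naive nearest neighbor search: one distance evaluation per site,
    no preprocessing, and no assumption on dist beyond being a function."""
    return min(S, key=lambda p: dist(q, p))
```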
Note that the algorithm uses the distance measure as a "black box." With 4000 similarly distributed points in ℝ²⁰, an average search time of 604 distance evaluations gives a site for which about 0.6 sites are closer, on average, and at an average distance 2% farther than the true nearest neighbor distance.

The provable bounds require some general conditions on the data, and on the metric space. The failure probability bounds hold for M(S, Q) when q is a random element of Q ∪ {q}; we call this requirement on q and Q an exchangeability condition. When Q and q are each generated by some random process, each with its own probability distribution, the exchangeability condition means that if we are given Q and q, and then take a random element of Q ∪ {q}, then that element will have the same probability distribution as q. The bounds on the query time hold when Q, S, and {q} satisfy the exchangeability condition that they are each random subsets of Q ∪ S ∪ {q}.

The proofs of the bounds also require that the metric space have certain nearest neighbor bounds, which are implied by sphere-packing bounds; such properties are described in Section 3.4. In particular, ℝ^k has these properties under L_p metrics. The constants associated with these properties appear in the bounds for the algorithms, and are in general exponential in k.

Many of the provable bounds also involve the distance ratio, denoted ϒ(T) for a set T, which is the ratio of the distance between the farthest pair of points in T to the distance between the closest pair in T. The quantity ϒ(S) appears in bounds for D(S), and the quantity ϒ(S ∪ Q) appears in bounds for M(S, Q). The dependence is relatively mild, generally \( O(\log\Upsilon)^{O(1)} \). It is reasonable in practice to assume the rough relation ϒ = n^{O(1)}, implying that "all logs are equal."

The preprocessing for D(S) needs \( O(n)(\lg n)^{O(\lg\lg\Upsilon(S))} \) expected time, resulting in a data structure that answers queries q with q ∈_R S ∪ {q}, that is, q a random element of S ∪ {q}, in expected \( O((\lg n)^{O(1)+2\lg\lg\Upsilon(S)}) \) time, and needing \( O(n(\lg n)^{O(1)+2\lg\lg\Upsilon(S)}) \) expected space. (The dependence on the sphere-packing bounds for the metric space is suppressed here.) The preprocessing for M(S, Q) needs \( O(Kn(\log n)^2(\log\Upsilon(S\cup Q))^2) \) expected time, yielding a data structure that answers queries in expected time \( O(K\ln n)\lg\Upsilon(S\cup Q) \), with failure probability \( O(\log^2 n)/K \), and needing space \( O(Kn\log\Upsilon(S\cup Q)) \).

Either preprocessing method also solves the all nearest neighbor problem: find, for each site in S, the nearest other site. Algorithms for the all nearest neighbor problem that need O(n log n) time have been known since 1983 for ℝ^k [C1], [V], but the algorithms given here are the first with near-linear bounds that use the distance measure alone; in particular, they do not use quadtrees or quadtree-like spatial subdivisions, as in previous work.

1.1. Why Metric Spaces?

Recall that if (V, d) is a metric space, then, for all a, b, c ∈ V, d(a,a) = 0, d(a,b) = d(b,a), and the triangle inequality d(a,c) ≤ d(a,b) + d(b,c) holds. The approach taken here is to find algorithms that can be applied to general metric spaces, but that have provable properties for important special cases. There are several reasons for considering the nearest neighbor problem in this generality.
While closest-point problems have applications in statistics, data compression, information retrieval, and other areas, many such applications are in a high-dimensional setting, for which almost all known solutions either give a slowdown over the naive algorithm, or use too much space. On the other hand, high-dimensional data often has useful structure: it may lie in a lower-dimensional hyperplane or flat, or manifold, for example. Still, such structure is not always readily apparent, as when the data lies in a manifold that is not flat, or has fractal properties [FC]. Hence it is desirable to seek algorithms whose complexity does not depend strictly on the dimension, or on using most coordinates, but rather on intrinsic properties of the data.

There are other reasons to consider general metric spaces: some distance measures are not applied to spaces with real coordinates, such as edit distance on strings. (We should note, however, that strings do not seem to have a useful sphere-packing bound.) Moreover, some spaces of sites and queries have coordinates that are restricted to small sets, so that intuitions or constructions from Euclidean space are not necessarily helpful. If all coordinates are 0 or 1, then distance evaluations between points are faster on real machines, using bit-twiddling, than other coordinatewise operations. An algorithm that uses the distance measure, and only as a "black box," will probably not be too complicated: there are no operations on which such complexity can depend. Finally, from a theoretical point of view, it seems reasonable to strip the problem down to its essentials, and find the minimal properties needed for fast algorithms.

Algorithms operating in general metric spaces have been considered for some time [FS], and there are some more recent contributions as well [B], [U], [Y]. Most of the proposed data structures are trees, and some are variations on the k-d tree, perhaps the most important practical method for nearest neighbor searching in ℝ^k, for small dimension k. The data structure D(S) uses divide-and-conquer in a somewhat similar way; in contrast, the data structure M(S, Q) is somewhat akin to a skip list.

1.2. Why Q?

The need for the set Q for data structure M(S, Q) contrasts unfavorably with most earlier work on the post office problem, where there is no such limitation. Also, the analysis requirement that q have the same distribution as Q and S is restrictive. However, in some significant applications, such as vector quantization or nearest neighbor classification, the queries have the same or similar distribution as the sites, and so the set Q is simply more training data. In other cases, say for coordinate spaces, the representative queries that are needed can at the least be generated uniformly within a region containing the sites. Also, such a set Q is available in a "deferred data structuring" setting [MR]: here no preprocessing is done, and a single query is answered with the naive algorithm. As more queries are answered, however, the set Q is built up, and the data structures described here can be used to speed up subsequent queries, under the assumption that future queries have a similar distribution to past ones.

2. Outline

After some motivating remarks, the next section describes M(S, Q), and gives failure probability bounds, first for general metric spaces, and then for spaces that have a certain γ-dominator bound.
Such a bound is implied by sphere-packing bounds, discussed in Section 3.4, which also discusses γ-nearest neighbor bounds, likewise implied by sphere-packing bounds. Such γ-nearest neighbor bounds are used in Theorem 9, and in the following subsection addressing space bounds for M(S, Q). A provably fast preprocessing algorithm for M(S, Q) is discussed in Section 4, including some interesting relations holding for γ-nearest neighbors of β-nearest neighbors. The data structure D(S) is discussed in Section 5, which also includes the use of the data structure for inverse queries, where the sites that have a given point as a nearest neighbor are wanted.

3. Data Structure M(S, Q)

3.1. Motivation

The original idea for the data structure goes back to a paper on the Euclidean version of the problem [C2], and probably further than that. The idea is to use a small subset R ⊆ S to bound the search: for a given query q, the distance of q to its nearest neighbor in R gives an upper bound on the distance to its nearest neighbor in S. That is, the nearest site to q in S is contained within the nearest neighbor ball B(q, R) of q with respect to R.

Definition (B(q, R)). For metric space (V, d) with R ⊆ V, the nearest neighbor ball of q with respect to R is
\( B(q,R) \equiv \{x\in V \mid d(q,x) \le d(q,y) \text{ for all } y\in R\setminus\{q\}\}. \)

(See Fig. 1.) For a site p ∈ R, let C(p, R) be the union of the balls B(q, R) over all (potential) query points q with p closest in R. Then, for a given query point q, the knowledge that q is in the Voronoi region Vor(p) of p, that is, that p is closest to q in R, allows the search for nearest sites to q to be restricted to C(p, R) ∩ S. (See Fig. 2.)

This suggests a recursive construction, building a data structure for each C(p, R) ∩ S in a similar fashion. This is the idea of "RPO trees" [C2]. Such a construction is predicated on the hope that each C(p, R) ∩ S is small; this is false in general. However, in the Euclidean case, when R is a random subset of S, and the Voronoi region Vor(p) is split up into combinatorially simple regions, such a region T will have C(T) ∩ S small, with high probability [C2]. However, the latter construction seems to need a triangulation of the Voronoi diagram of R, which is hard to extend to arbitrary metric spaces.

Here we take a simpler approach, and approximate C(p, R) with the union
\( C'(p,R) \equiv \bigcup_{q'\in \mathrm{Vor}(p)\cap Q} B(q',R) \)
of nearest neighbor balls of members of Q. That is, we approximate the construction of the region C(p, R) by approximating the Voronoi region of p with Vor(p) ∩ Q. Our hope is that, for most of the query points q that will be encountered, if q is in Vor(p), then the set C′(p, R) ∩ S will contain the nearest site to q. (See Fig. 3.)

To compensate for using Vor(p) ∩ Q to find C′(p, R), rather than Vor(p), we expand each nearest neighbor ball by a given factor γ: we approximate C(p, R) by C′_γ(p, R), the union of the balls B_γ(q, R). Such a ball is the expansion of B(q, R) by the factor γ.

Definition (B_γ(q, R)). For metric (V, d) with R ⊆ V, the γ-nearest neighbor ball of q with respect to R is
\( B_\gamma(q,R) \equiv \{x\in V \mid d(q,x) \le \gamma\,d(q,y) \text{ for all } y\in R\setminus\{q\}\}. \)

Heuristically, this seems like a reasonable way to be more likely to get a correct answer. We can say more, however, as the following lemma states.

Lemma 1. Suppose the measure d satisfies the triangle inequality. If points q and q′ have p nearest in R, and d(q, p) < d(q′, p), then B(q, R) ⊆ B₃(q′, R).

Proof. Suppose p′ is closer to q than p. Then d(p′, q′) ≤ d(p′, q) + d(q, p) + d(p, q′) ≤ 3 d(p, q′).
If, for some query point q, the construction of C′₃(p, R) included a point q′ in Vor(p) ∩ Q with d(q′, p) > d(q, p), then C′₃(p, R) ∩ S contains the nearest site to q. (See Fig. 4.) When the query points are distributed similarly to the points of Q, the probability that no C′₃(p, R) will contain B(q, R) is bounded by r/|Q|, with r = |R|: roughly, a random member of Q has this probability of being one of the points that "cover" the others in this sense, and the query point will behave the same way.

3.2. The Data Structure

The data structure M(S, Q) is based on these ideas, but uses a more incremental approach. First, some definitions.

Definition (Nearest Neighbors). For metric space (V, d), set R ⊆ V, and point q ∈ V, the nearest neighbor distance of q with respect to R is
\( d(q,R) \equiv \min_{p\in R\setminus\{q\}} d(q,p). \)
A point p ∈ R realizing that distance is a nearest neighbor of q in R.

Definition (γ-Nearest Neighbors). Say that p ∈ V is a γ-nearest neighbor of q ∈ V with respect to R if d(q, p) ≤ γ d(q, R). That is, p is a γ-nearest neighbor of q with respect to R if and only if p ∈ B_γ(q, R).

It will be convenient at times to use still another notation for proximity.

Definition (q →^γ p). In a metric space (V, d), for q, p ∈ V, and R ⊆ V, let q →^γ p denote the condition that p is a γ-nearest neighbor of q with respect to R, that is, d(q, p) ≤ γ d(q, R), and so p ∈ B_γ(q, R). Also write p ←^γ q if and only if q →^γ p.

Armed with all these ways of saying the same thing, here is a succinct definition of M(S, Q).

Construction of M(S, Q). The data structure M(S, Q) for nearest neighbor searching is, for each site p_j ∈ S, a list of sites A_j. The data structure has a real parameter γ, picked before construction. Let (p₁, p₂, …, p_n) be a random permutation of S, and let R_i ≡ {p₁, p₂, …, p_i}, so that R_i is a random subset of S. Shuffle Q with a random permutation as well, so that Q_j is a random subset of Q of size j, for j = 1…m = |Q|. It will be helpful to define Q_j ≡ Q for j > m. Define
A_j ≡ { p_i | i > j, and there exists q ∈ Q_{Ki} with p_j ←^1 q →^γ p_i, with respect to R_{i−1} }.
The sites in A_j are in increasing order of index i. This is the whole definition of M(S, Q).

While it is satisfying that M(S, Q) can be so succinctly defined, a more detailed discussion may be helpful. Consider the subsets R_i to be built incrementally, by picking p_i randomly from S∖R_{i−1}, and adding it to R_{i−1}, yielding R_i. (Hereafter the numbering of the sites will be fixed in this random order.) Each list A_j starts out empty. When adding site p_i to make R_i, append p_i to A_j if there is some q ∈ Q_{Ki} with p_j nearest to q in R_{i−1}, and p_i a γ-nearest neighbor of q. That is, for q ∈ Q, maintain the nearest neighbor to q in R_j, for j = 1…n. When adding p_i, check each q ∈ Q, and record p_i as nearest where appropriate, and also add p_i to A_j if q ∈ Q_{Ki} and p_i is γ-nearest to q.

What does this construction mean? Note that in the terminology of the last subsection, we add p_i to A_j when p_i ∈ C′_γ(p_j, R_{i−1}), our approximation to the set of sites that may be nearer to some query point q if q has p_j nearest in R_{i−1}. Consider the special case where γ = 1, S ⊆ ℝ^k, Q = ℝ^k, K → ∞, and the distributions of S and Q are uniform in a box. Here the construction puts p_i on A_j just when p_i takes away part of the Voronoi region of p_j, as witnessed by a member of Q_{Ki}; for large enough K, p_i is a Delaunay neighbor of p_j in R_i.
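To make the incremental construction concrete, here is a minimal sketch in Python, under stated assumptions: it is the naive construction just described (not the faster preprocessing of Section 4), the metric is a black-box function `dist`, and all names are hypothetical.

```python
import random

def build_M(S, Q, dist, gamma, K):
    """Naive incremental construction of M(S,Q); O(n|Q|) distance evaluations."""
    S, Q = S[:], Q[:]
    random.shuffle(S)                 # R_i is the prefix S[:i]
    random.shuffle(Q)                 # Q_j is the prefix Q[:j]
    n, m = len(S), len(Q)
    A = [[] for _ in range(n)]        # A[j]: indices i of sites appended to A_j
    nn = [0] * m                      # index of the site nearest to Q[t] in the prefix
    for i in range(1, n):             # add p_i = S[i] to R_{i-1}
        for t in range(m):            # maintain nearest neighbors of all of Q
            q = Q[t]
            j = nn[t]                 # p_j is nearest to q in R_{i-1}
            d_near = dist(q, S[j])
            d_new = dist(q, S[i])
            # q in Q_{Ki} witnesses p_j <-1- q -gamma-> p_i: append p_i to A_j
            if t < K * i and d_new <= gamma * d_near:
                if not A[j] or A[j][-1] != i:   # avoid a duplicate entry for p_i
                    A[j].append(i)
            if d_new < d_near:        # record p_i as the new nearest neighbor of q
                nn[t] = i
    return S, A
```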
The Search Procedure for M(S, Q). The search procedure is as follows: given query point q, start with site p₁ as the candidate closest site to q. Walk down A₁, until a site p_j closer to q than p₁ is found. Now p_j is the candidate closest site to q; do a similar walk down A_j. Repeat until the A_k list for some site p_k is searched, and no site on that list is closer than p_k. Terminate the search and return p_k as the closest site. This is the search procedure.

Note that, when K = ∞, so that Q_{Ki} = Q for all i, this search returns by construction the correct closest site for all elements of Q. When the search procedure is applied to some q ∈ Q, if p_i appears as a candidate closest site to q, then p_i is closest to q in R_i. The query time and the probability of returning the correct nearest site increase with K and γ. The average number of members of Q_{Ki} with p_j nearest in R_i is K.

There are two uses of randomness here: for picking R_i, and for picking Q_{Ki}. The use of random subsets of the sites is similar to its use in some other randomized geometric algorithms [M], [C3], and helps to speed up the answering of queries. The random subset Q_{Ki}, on the other hand, serves as a proxy for members of some universe of queries, and aids correctness; intuitively, any possible query point q will have a member q′ of Q_{Ki} not too far away, and hence a site nearest to q will be near to q′.
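A sketch of the search walk, under the same assumptions as the construction sketch above (`S` and `A` as returned by `build_M`):

```python
def search_M(S, A, q, dist):
    """Walk the A_j lists; may return a wrong answer with small probability."""
    j = 0                             # candidate closest site: p_1 = S[0]
    improved = True
    while improved:
        improved = False
        for i in A[j]:                # walk down A_j in increasing index order
            if dist(q, S[i]) < dist(q, S[j]):
                j = i                 # closer site found: new candidate
                improved = True
                break                 # restart the walk on the new list A_j
    return S[j]
```

The walk terminates because the candidate's distance to q strictly decreases at each change of candidate.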
3.3. Failure Probability Analysis

The following bound on the failure probability holds without any restrictions on the metric space.

Theorem 2. Suppose Q and q are such that q is a random element of Q ∪ {q}. Assume Kn < m = n^{O(1)}. If M(S, Q) is built with γ = 3, then the probability that it fails to return a nearest site to q in S is O(log² n)/K.

The proof depends on Lemma 1 and the following lemma.

Lemma 3. Define Q′ ≡ Q_{Ki} ∪ {q}. Assume Kn < m = n^{O(1)}. With probability 1 − 1/n², for every q′ ∈ Q′, the number of sites closer to q′ than its nearest neighbor in R_i is at most κ = O(log n) n/i.

Proof. The set Q′ is fixed with respect to the random choice of R_i. Suppose q′ ∈ Q′ and p ∈ S are such that there are k points of S closer to q′ than p is. The probability that p ∈ R_i, and also that p is the closest site in R_i to q′, is no more than
\( \binom{n-k-1}{i-1} \Big/ \binom{n}{i} \;\le\; \frac{i}{n}\, e^{-k(i-1)/(n-1)}. \)
Since the number of such pairs p and q′ is no more than (Ki+1)(n−k) < mn, the claim follows for k > κ ≡ (3 ln n + ln m)(n−1)/(i−1) = O(log n) n/i.

Proof of Theorem 2. Let q be a random query point. Consider the construction of the data structure when point p_{i+1} is added to make R_{i+1}, so p_{i+1} is possibly added to some A_j lists. Suppose p_j is closest to q in R_i, but p_{i+1} is closer. Then the query procedure should change the candidate closest site from p_j to p_{i+1}, but the procedure is not certain to do so if p_{i+1} is not in A_j. Conversely, if such appropriate entries in the A_j lists are present, for each j with 1 ≤ j ≤ n, then the query for q will be answered correctly. This implies that the probability that the construction fails to produce a data structure that answers a query correctly is no more than the sum over i, for i = 0…n−1, of the probability that a failure occurs when p_{i+1} is added. Thus we seek an upper bound on the probability that, when p_{i+1} is added:

(1) p_{i+1} ←^{<1} q →^1 p_j with respect to R_i, but
(2) there is no point v ∈ Q_{Ki} with p_{i+1} ←^3 v →^1 p_j.

Let Q′ ≡ Q_{Ki} ∪ {q}. For each p_k ∈ R_i, let v_k be the point in Q′ that has p_k as nearest neighbor in R_i, and that has maximum distance to p_k among all such points. If (1) holds, and q ≠ v_j, then (2) fails, by Lemma 1. (Put q′ = v_j, q = q, and p = p_j in the lemma.) To bound the probability that q = v_j for some j, use the condition of the theorem that q is a random element of Q′. The probability that a random element of Q′ is one of the i points v_j is i/|Q′| = i/(Ki+1) < 1/K, and so the probability of (2), given (1), is at most 1/K.

If some q′ ∈ Q′ has more than κ = (3 ln n + ln m)(n−1)/(i−1) sites closer to it than its nearest neighbor in R_i, consider the addition of p_{i+1} a failure; this occurs with probability 1/n², by Lemma 3. Suppose that the addition of p_{i+1} is not a failure for this reason, so that every q′ ∈ Q′ has fewer than κ sites closer to it than its nearest neighbor in R_i. The probability of (1) is then at most κ/n, for q any member of Q′, and so the probability of failure when adding p_{i+1} is at most
\( \frac{1}{n^2} + \frac{\kappa}{n}\cdot\frac{1}{K} = O\!\left(\frac{\log n}{iK}\right), \)
for i > 0; using the trivial bound for small i, and adding this for i up to n, bounds the overall failure probability by O(log² n)/K.

A similar bound holds when γ < 3, for metric spaces that satisfy an additional condition, which can be expressed using the following definition.

Definition (γ-Dominating). For a, b, c ∈ V, say that c γ-dominates b (with respect to a) if B(b, {a}) ⊆ B_γ(c, {a}).

The necessary assumption is:

Definition (γ-Dominator Bounds). Say that a metric space V has a γ-dominator bound if there is a value D_γ such that, for any a ∈ V and Q ⊆ V, there is a set Q̂ ⊆ Q of size no more than D_γ, such that for every b ∈ Q there is c ∈ Q̂ such that c γ-dominates b. In other words, a set C(p, R), as described in Section 3.1, is contained in an approximation C′_γ(p, R) generated by a finite set Q̂.

For any metric space, D₃ = 1. Also, as described in Theorem 6 below, a metric space that has a sphere-packing bound also has a γ-dominator bound. For ℝ^k as an L_p space, the value D_γ is, in general, exponential in the dimension k. However, many "naturally occurring" point sets have structure, and that structure may imply that D_γ is much smaller than the worst-case bound for ℝ^k.

Theorem 4. Suppose metric space (V, d) has a γ-dominator bound, and Q and queries q are such that q is a random element of Q ∪ {q}. Assume Kn ≤ m = n^{O(1)}. Suppose M(S, Q) is built for some value of γ. Then the probability that M(S, Q) fails to return the nearest site in S is O(D_γ log² n)/K.

Proof. The proof follows exactly as for Theorem 2, except for the probability of (2), that is, the probability that no member of Q_{Ki} γ-dominates q. Here the number of members of Q′ needed to γ-dominate all members of Q′ is D_γ i, rather than just i, and so the probability that (2) holds is at most D_γ i/(Ki+1) ≤ D_γ/K.

3.4. Sphere Packing, γ-Dominators, γ-Nearest Neighbors

The mysterious "γ-dominator" bound above is implied by the less mysterious sphere-packing bounds discussed in this section. The sphere-packing bounds also imply some γ-nearest neighbor bounds, which are needed to prove bounds on the query and preprocessing times. While the γ-dominator bounds suggest the interest of considering sites that are γ-nearest to points in Q, the γ-nearest neighbor bounds help show that the number of such γ-nearest sites is not too large. The properties described in this section hold for L_p spaces. It is worth emphasizing that the properties are not used in the data structures, but only in their analysis.
Definition (Sphere Packing). The space (V, d) has a sphere-packing bound if the following holds: for any real number ρ, there is an integer constant S_ρ such that, for all a ∈ V and W ⊆ V, if |W| > S_ρ, and d(w, a) ≤ D for all w ∈ W, for some D, then there are w, w′ ∈ W such that d(w, w′) < D/ρ.

There is a well-known relation between packing and covering, which we need; for completeness, a proof is given here.

Theorem 5. If the metric space (V, d) has a sphere-packing bound, then for any set W ⊆ V contained in a sphere of radius D, there is Ŵ ⊆ W of size S_ρ such that, for all w ∈ W∖Ŵ, d(w, Ŵ) < D/ρ.

Proof. Build sets W_i = {w₁, w₂, …, w_i} point by point, where W_{S_ρ} = Ŵ. Choose w₁ arbitrarily from W, and then, for i ≥ 1, pick for w_{i+1} the w ∈ W that maximizes d(w, W_i). Thus d(p, W_i) ≤ d(w_{i+1}, W_i) for all p ∈ W∖Ŵ, and the distances d(w_{i+1}, W_i) are nonincreasing in i. When i = S_ρ, the sphere-packing bound implies that d(w_{i+1}, W_i) < D/ρ, and so d(p, W_i) < D/ρ for all p ∈ W.
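The proof of Theorem 5 is constructive, and the farthest-point construction it describes is easily coded; a minimal sketch, with `dist` the black-box metric and hypothetical names:

```python
def greedy_cover(W, dist, size):
    """Farthest-point construction from the proof of Theorem 5: returns a
    subset of W of the given size; every point of W is within the last
    (minimal) farthest distance of the returned subset."""
    chosen = [W[0]]                           # w_1, chosen arbitrarily
    d_to = [dist(w, W[0]) for w in W]         # d(w, W_i) for each w in W
    while len(chosen) < min(size, len(W)):
        k = max(range(len(W)), key=lambda t: d_to[t])   # farthest from chosen set
        chosen.append(W[k])
        for t in range(len(W)):               # update distances to the chosen set
            d_to[t] = min(d_to[t], dist(W[t], W[k]))
    return chosen
```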
Theorem 6. If the metric space (V, d) has a sphere-packing bound with constants S_ρ, then the space has a γ-dominator bound with constant
\( D_\gamma \le 1 + \lceil \log(\mu)/\log(1-\mu) \rceil\, S_{1/\mu}, \)
where μ ≡ (γ−1)/(γ+1).

Proof. If γ ≥ 3, we are done, so assume 1 < γ ≤ 3, and so 0 < μ ≤ 1/2. Given γ > 1, a ∈ V, and Q ⊆ V, let q₀ maximize {d(q, a) | q ∈ Q}. For i = 0…k = ⌈log(μ)/log(1−μ)⌉, let r_i ≡ (1−μ)^i d(q₀, a). Let Q_i denote {q ∈ Q | r_{i+1} < d(q, a) ≤ r_i}. Apply Theorem 5 to each Q_i, with ρ = 1/μ, obtaining sets Q̂_i. Then the set Q̂ needed for a γ-dominator bound is the union of the Q̂_i, together with {q₀}, as we next show.

Suppose q ∈ Q_k, so d(q, a) < r_k ≤ μ d(q₀, a). Then, for any z ∈ V with d(q, z) ≤ d(q, a), by the triangle inequality,
d(q₀, z) ≤ d(q₀, a) + d(a, q) + d(q, z) ≤ (1+2μ) d(q₀, a) ≤ γ d(q₀, a),
and so B(q, {a}) ⊆ B_γ(q₀, {a}).

Suppose q ∈ Q_i, so r_{i+1} < d(q, a) ≤ r_i. Then Q̂_i contains a point q′ such that d(q, q′) ≤ μ r_i, and so if z has d(z, q) ≤ d(a, q), then
d(q′, z) ≤ d(q′, q) + d(q, z) ≤ μ r_i + r_i = (1+μ) r_i,
and, since d(q′, a) ≥ r_{i+1} = r_i(1−μ), we have
\( d(q',z) \le \frac{1+\mu}{1-\mu}\, d(q',a) = \gamma\, d(q',a). \)
Hence B(q, {a}) ⊆ B_γ(q′, {a}). Thus, for any q ∈ Q, there is some q′ ∈ Q̂ with B(q, {a}) ⊆ B_γ(q′, {a}), and a γ-dominator bound holds.

We turn now to γ-nearest neighbor bounds, which weakly generalize the long-established nearest neighbor bounds of Euclidean spaces:

Definition (Nearest Neighbor Bounds). Say that metric space (V, d) has a nearest neighbor bound if there is a constant N₁ such that, for all a ∈ V and any W ⊆ V, N₁ bounds the number of b ∈ W such that a is a nearest neighbor of b with respect to W.

Remark. The ℝ^k spaces under L_p norms have nearest neighbor bounds. A construction using a fan of narrow cones about a can be used to prove this, or Lemma 7 below can be applied.

For the purpose of analyzing the algorithms given here, it would be ideal if a point a were a γ-nearest neighbor to a constant number of points of a given set. Unfortunately, this may not be true, even for Euclidean spaces. However, a weaker condition does hold for spaces having a sphere-packing bound, and we use that condition to prove algorithm bounds. The condition is, roughly, that if a point a is a γ-nearest neighbor to many points, then those points must be at a wide range of distances from a.

Definition (ϒ(x, R)). For x ∈ V and R ⊆ V, let ϒ(x, R) denote \( \max_{y\in R} d(x,y)/d(x,R) \), the distance ratio of R as seen from x.

Definition (Neighbor Sets). For x ∈ V, and W ⊆ V, let N_γ(x, W) denote the set of points of W for which x is a γ-nearest neighbor with respect to W. Let n_γ(x, W) ≡ |N_γ(x, W)|. For γ ≥ 1 and given ϒ > 0, let
\( N_{\gamma,\Upsilon} \equiv \max\{\, n_\gamma(x,W) \mid x\in V,\ W\subseteq V,\ \Upsilon(x,W)\le\Upsilon \,\}, \)
if such a value exists. Note that N_{1,ϒ} ≤ N₁ for any ϒ.

Definition (γ-Nearest Neighbor Bounds). Say that a metric space (V, d) has a γ-nearest neighbor bound if it has a nearest neighbor bound, and also N_{γ,ϒ} exists for all ϒ > 0. Such a bound is implied by a sphere-packing bound.

Lemma 7. If a metric space (V, d) has a sphere-packing bound, then it also has a γ-nearest neighbor bound, with N_{γ,ϒ} ≤ S_{2γ} lg ϒ.

Proof. Given x ∈ V and W ⊆ V, for i = 0…⌈lg ϒ(x, W)⌉, let
W_i ≡ {y ∈ W | 2^i d(x, W) ≤ d(x, y) ≤ 2^{i+1} d(x, W)}.
Now using Theorem 5, at most S_{2γ} points of W_i have x as a γ-nearest neighbor, and so at most S_{2γ} lg ϒ(x, W) points of W have x as a γ-nearest neighbor.

The following technical lemma will be useful.

Lemma 8. For x, y ∈ V and R ⊆ V,
\( n_\gamma(x, R\cup\{y\}) \le N_{\gamma,\Upsilon(x,R)} + 1. \)
That is, we can cheat in the quantity ϒ(·,·), and not pay too much.

Proof. Trivial.

3.5. Query Time Analysis

We can now analyze the running time of the search procedure for M(S, Q). First, a definition.

Definition (the Distance Ratio ϒ). Given finite W ⊆ V, let ϒ(W) be the distance ratio
\( \max\{d(x,y) \mid x,y\in W\} \,/\, \min\{d(x,y) \mid x,y\in W,\ x\ne y\}. \)

Of course, ϒ(x, W) ≤ ϒ(W) for any x ∈ W. All our bounds could use max{ϒ(x, W) | x ∈ W} instead of ϒ(W), but the latter has a simpler and more familiar definition.
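For illustration, the distance ratio of a small finite set can be computed directly from the definition:

```python
from itertools import combinations

def distance_ratio(W, dist):
    """Brute-force distance ratio over all pairs of distinct points of W."""
    pairs = [dist(x, y) for x, y in combinations(W, 2)]
    return max(pairs) / min(pairs)
```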
Theorem 9. Suppose metric space (V, d) has a γ-nearest neighbor bound. Suppose Q, S, and {q} are all random subsets of Q ∪ S ∪ {q}. Suppose n = |S| and Kn ≤ m = |Q| = n^{O(1)}. Then the expected work in answering a query for point q using M(S, Q), built with parameter γ, given that the returned answer is correct, is
\( O(N_{\gamma,\Upsilon}\, N_1\, K \ln n), \)
where ϒ ≡ ϒ(S ∪ Q).

Proof. The work done in answering a query is proportional to the number of members of A_j lists that are considered in the search. When adding site p_{i+1} to make R_{i+1}, an entry is made for p_{i+1} that will result in work for a query point q when there is y ∈ R_i and q′ ∈ Q_{Ki} such that the following hold with respect to R_i: (1) y is nearest to q; (2) y is nearest to q′; (3) p_{i+1} is γ-nearest to q′. That is, in the "arrow" notation,
q →^1 y ←^1 q′ →^γ p_{i+1}.
We want to bound the expected number of such configurations.

First observe that this number is the sum, over q″ ∈ Q_{Ki}, of the expected number of such configurations with q′ = q″. That is, the desired number is |Q_{Ki}| = Ki times the expected number of such configurations, for a random choice of q′ from Q_{Ki}. The exchangeability assumptions for S, Q, and {q} allow us to pick Q_{Ki}, R_i, and q in a few steps: choose a random subset Q′ ≡ Q_{Ki} ∪ R_i ∪ {q} ⊆_R Q ∪ S ∪ {q}, then choose R′ ⊆_R Q′ of size i+2, then pick q′ ∈_R R′, and finally q ∈_R R′∖{q′}. The set Q_{Ki} is then (Q′∖R′) ∪ {q′}.

For a given p_{i+1}, the number of q″ ∈ R′ with p_{i+1} as γ-nearest with respect to R′ is N_{γ,ϒ(p_{i+1},R′)}, by Lemma 7, or 1 + N_{γ,ϒ(p_{i+1},R′∖{q})}, by Lemma 8, and so the probability that a random q′ ∈ R′ has p_{i+1} γ-nearest is at most (1 + N_{γ,ϒ(p_{i+1},R′∖{q})})/(i+2). Note that ϒ(p_{i+1}, R′∖{q}) ≤ ϒ(S∪Q), so the probability is at most (1+N_{γ,ϒ})/(i+2). For any given such q′, with y ∈ R_i nearest to q′ in R_i, the number of b ∈ R′∖{q′} that have y as nearest neighbor is at most N₁, since (V, d) has a nearest neighbor bound. The probability that a random q ∈ R′∖{q′} has y nearest is at most N₁/(i+1). The probability that (1)–(3) hold for random q and q′ is thus at most N_{γ,ϒ} N₁/((i+2)(i+1)), and so the expected number of configurations is Ki times this quantity, or O(K/i) times N_{γ,ϒ} N₁. This bounds the work for a random query q, associated with p_{i+1} on the A_j list for site y ∈ R_i. Summing this bound for i+1 from 1 to n yields the result.

3.6. Space Bounds for M(S, Q)

The following lemma is needed for proving space bounds, and for proving bounds on the preprocessing time.

Lemma 10. Suppose (V, d) is a metric space that has a sphere-packing bound, and |V| = n, with distance ratio ϒ ≡ ϒ(V). For R a random subset of V of size i, p ∈ V∖R, and β > 1, the expected number of q ∈ V∖R with p as a β-nearest neighbor in R is
\( O(N_{\beta,\Upsilon})\,n/i = O(S_{2\beta}\log\Upsilon)\,n/i. \)

Proof. The proof is similar to that of Theorem 9. We observe that the desired quantity is n−i times the probability that a random choice q ∈_R V∖R has q →^β p. The random choice of R ⊆_R V and q ∈_R V∖R is equivalent to picking a random R′ ⊆_R V of size i+1, then picking q ∈_R R′, and finally setting R ≡ R′∖{q}. By Lemma 7, the number of x ∈ R′ with p as β-nearest with respect to R′ is at most N_{β,ϒ}. The probability that q ∈_R R′ is one such point is at most N_{β,ϒ}/(i+1). The result follows, multiplying by n−i.

The following lemma will be helpful later.

Lemma 11. For random q ∈ V, and random R ⊆_R V, the expected number of β-nearest neighbors of q with respect to R is
\( O(N_{\beta,\Upsilon}) = O(S_{2\beta}\log\Upsilon). \)

Proof. From Lemma 10, for each p ∈ R the expected number of q ∈ V with p as a β-nearest neighbor is O(N_{β,ϒ} n/i), where i = |R|. Multiplying by i, which is the number of such p, and dividing by n, the number of such q, gives the result. The last equality in the lemma statement is from Lemma 7.

Theorem 12. Suppose Q and S are random subsets of Q ∪ S. When (V, d) has a sphere-packing bound, the expected space used by the data structure M(S, Q) is \( O(S_{2\gamma}\log\Upsilon(S\cup Q))\,Kn \).

Proof. When p_{i+1} is added, it is added to the A_j lists of those p_j for which there is some q ∈ Q_{K(i+1)} that has p_j as nearest neighbor in R_i and p_{i+1} as γ-nearest neighbor in R_{i+1}. Each such q yields at most one entry in some A_j list; the expected number of such entries is bounded by applying the previous lemma to R_i as a random subset of V = R_i ∪ Q_{K(i+1)}. Summing over i = 1…n yields the result.

4. Faster Preprocessing for M(S, Q)

This section gives a preprocessing algorithm for M(S, Q). The algorithm requires \( O(Kn)(\log n)^2(\log\Upsilon(S\cup Q))^2 \) expected time when S is a random subset of S ∪ Q. The problem in the basic construction of M(S, Q) is to maintain, for each member of Q, its γ-nearest neighbors in R_i. As we will see, it will be helpful to maintain, in addition, the γ-nearest neighbors of S in R_i: these will help speed up the computation of nearest neighbors when a site is added to R_i. (Of course, some members of S will be in R_i; for p ∈ R_i, as before, we consider its nearest neighbor to be the closest site in R_i∖{p}.)

Suppose Q and S are random subsets of Q ∪ S. Then R_i ⊆_R Q ∪ S, and so here the basic problem is to maintain the γ-nearest neighbors of a set with respect to a random subset of that set. Thus, under this assumption, it is no loss of generality to solve this maintenance problem for a set S. The general step of this construction is to update the γ-nearest neighbor relation for R_{i+1}, given that random p_{i+1} ∈ S∖R_i has been added to random R_i ⊆ S, yielding R_{i+1}. To save i's, we refer in this section to random R ⊆ S and random p ∈ S∖R.
We assume that the metric space (V, d) has a sphere-packing bound. The first subsection gives some probabilistic lemmas that are used in the analysis, and may help motivate the algorithm that follows. Next follows some geometric notation, and the basic relations that are maintained by the algorithm. The algorithm is given in two main parts, M.1 and M.2, with two auxiliary algorithms then described; finally, the analysis of the running time is completed. Remember that n is the number of elements of S, and ϒ(S) is the distance ratio of S; hereafter, we refer to ϒ(S) simply as ϒ.

The algorithm will maintain a neighborhood relation ⇒^{γ′} between points in R. This relation will be used to find the points of S for which a point p is a γ-nearest neighbor. This information, in turn, will help update the ⇒^{γ′} relation.

4.2. Probabilistic Lemmas

In these lemmas, fix β > 1.

Definition (A_β). Let A_β denote N_{β,ϒ} n/i.

Thus Lemma 10 can be restated as follows:

Lemma 13. The expected number of s ∈ S with p as a β-nearest neighbor in R is O(A_β).

Lemma 14. Let R ⊆_R S of size i and p ∈_R S∖R. The expected number of configurations (p, q, p′) with q ∈ S, p′ ∈ R, and p ←^{<1} q →^β p′ is
\( O(A_{2\beta}) = O(S_{4\beta}(\log\Upsilon))\,n/i. \)
That is, when p is added, we can bound the expected number of β-nearest neighbors of points q ∈ S that have p as the new nearest neighbor.

Proof. The number of such configurations is the sum, over p′ ∈ R, of the expected number involving p′, or i times a bound that holds for any given p′. As in the proof of Theorem 9, the expected number of configurations is n times the expected number involving a random q ∈ S. Thus we consider the expected number of configurations for some fixed p′, and for random q and p. We assume that q ∉ R; the argument when q ∈ R is similar. Under these conditions, q and p are random elements of R′ ≡ R ∪ {q, p}, which we can view as chosen by first picking q from R′, and then picking p from R′∖{q}. Let n₂(q) denote the second nearest neighbor of q in R′. The problem becomes: what is the probability that random q ∈ R′ has d(q, p′) ≤ β d(q, n₂(q)), and that the nearest neighbor of q is picked to be p? The latter probability is 1/(i+1); the former probability is 1/(i+2) times the number of points q′ ∈ R′ that have d(q′, p′) ≤ β d(q′, n₂(q′)). Putting these considerations together, the expected number of configurations of the lemma statement is \( i\,n/((i+1)(i+2)) \) times the size of
{q′ ∈ R′ | d(q′, p′) ≤ β d(q′, n₂(q′))}.
The number of such q′ can be bounded by S_{4β} lg ϒ, with a proof similar to that of Lemma 7: separate the sites of R′ into groups according to their distance from p′, and apply Theorem 5, so that at most S_{4β} sites in a group have a nearest neighbor in the group at a distance larger than β/2 times their distance to p′. Since V is a metric space, all but S_{4β} sites in a group have second nearest neighbors within a distance at most β times their distance to p′. The lemma follows.

4.3. Geometric Notation and Lemmas

Before stating the geometric starting point for the algorithm, some notation will be helpful:

Definition (d(c)). In the remainder of this section, abbreviate d(c, R) by d(c).

Definition (M_a). For a ∈ R, let M_a ≡ max{d(s, a) | s →^1 a, s ∈ S}.

Definition (a ⇒^β b). For a, b ∈ R, let a ⇒^β b denote the condition that d(a, b) ≤ β M_a.

All these definitions are with respect to R, which is left implicit. When considering a site p ∉ R, the sites s ∈ S with s →^{<1} p are those which have p as the nearest neighbor in R ∪ {p}.
For a, b ∈ R, the relation a ⇒^2 b is akin to a and b being Delaunay neighbors: if there is some s ∈ S with a ←^1 s →^1 b, then a and b are Delaunay neighbors, and both a ⇒^2 b and b ⇒^2 a. Note that if a ⇒^β b, then the site s realizing M_a has a ←^1 s →^{β+1} b, since, by the triangle inequality, d(s, b) ≤ d(s, a) + d(a, b) ≤ (1+β) d(s).

The following lemmas will be helpful.

Lemma 15. If x ←^β x′ →^δ y′, then d(x, y′) ≤ (β+δ) d(x′).

Proof. Using the triangle inequality,
d(x, y′) ≤ d(x, x′) + d(x′, y′) ≤ β d(x′) + δ d(x′) = (β+δ) d(x′).

Lemma 16. If x′ →^β y →^δ z for z ∈ R, then x′ →^{β+δ+βδ} z.

Proof. Let a ∈ R have a ←^1 x′. From the previous lemma,
d(y) ≤ d(a, y) ≤ (1+β) d(x′),
and so
d(x′, z) ≤ d(x′, y) + d(y, z) ≤ β d(x′) + δ(1+β) d(x′),
and the result follows.

Lemma 17. The number of a ∈ R with a ⇒^β p is no more than the number of s ∈ S with s →^{β+1} p, whose expectation is O(A_{β+1}).

Proof. For each a ∈ R with a ⇒^β p, as noted there is some s with s →^{β+1} p. Since s →^1 a for only one a ∈ R, the lemma follows.

4.4. Algorithm Structure

Lemma 16 may help motivate the following definition.

Definition (γ′). Fix a parameter α with 1 ≤ α ≤ γ, and let γ′ ≡ 1 + α + γ + αγ.

The algorithm will maintain the nearest neighbor relation s →^1 f for each f ∈ R, and a subset of the relation a ⇒^{γ′} f for each a, f ∈ R. That is, a ⇒^{γ′} f will be known for all a, f ∈ R for which there are s, c ∈ S forming the configurations of Lemmas 18 and 20 below.

4.5. Heaps of Relations

In large part, the algorithm is simply the maintenance of data structures representing the → and ⇒ relations. These will comprise, for each f ∈ R, some sets of sites, each in a heap (priority queue) with the maximum key value on top. The heaps are:

H_f^←: {s ∈ S | s →^1 f}, with key values d(f, s);
H_f^⇒: {a ∈ R | f ⇒^{γ′} a}, with key values −d(a, f);
H_f^⇐: {a ∈ R | a ⇒^{γ′} f}, with key values M_a.

The heap H_f^⇒ stores the a with minimum d(a, f) on top of the heap. Heaps are useful here because a simple procedure allows entries with key values greater than a given value X to be found in constant time for each such entry. The procedure simply checks if the top of the heap has key value greater than X; if so, that value is reported, and recursively the children of the top are checked. Note that the keys of entries in H_f^⇐ are the keys of the tops of the heaps H_a^←, when a ⇒^{γ′} f. (A fine point here is that the key for a in H_f^⇐ is the value of M_a when a is inserted in H_f^⇐; that is, the value of M_a may be allowed to change without updating the key for a immediately. This issue is discussed in Section 4.8.)
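The reporting procedure on heaps is short; a sketch for a binary max-heap in the usual array layout (a hypothetical helper; for H_f^⇒, whose keys are negated distances, "greater than X" corresponds to a distance below a threshold):

```python
def entries_above(heap, x, i=0, out=None):
    """Report all entries of a binary max-heap with key greater than x,
    in O(1) time per reported entry: the recursion descends only below
    positions whose keys exceed x."""
    if out is None:
        out = []
    if i < len(heap) and heap[i] > x:
        out.append(heap[i])
        entries_above(heap, x, 2 * i + 1, out)   # left child
        entries_above(heap, x, 2 * i + 2, out)   # right child
    return out
```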
4.6. Finding γ-Neighbors Using ⇒^{γ′}

Algorithm M.1 (Find the Sites with p as a γ-Nearest Neighbor). Let f ∈ R be the nearest neighbor of p in R. For each a ∈ R with a ⇒^{γ′} f, found using H_f^⇐, examine each s ∈ S with s →^1 a, found using H_a^←, and d(s) ≥ d(a, p)/(1+γ). Check if d(s, p) ≤ γ d(s), so that s →^γ p.

Lemma 18. Algorithm M.1 finds the sites with p as a γ-nearest neighbor in R.

Proof. Suppose that
a ←^1 s →^γ p →^1 f    (1)
holds. The relation in the middle, s →^γ p, must be found using stored information. By Lemma 16, the above implies s →^{2γ+1} f, and so d(a, f) ≤ (2+2γ) d(s), using Lemma 15. Hence d(a, f) ≤ γ′ d(s) ≤ γ′ M_a, and so a ⇒^{γ′} f. Thus, a ⇒^{γ′} f is necessary for s →^γ p. Finally, a ←^1 s →^γ p implies that d(a, p) ≤ (1+γ) d(s), using Lemma 15.

Lemma 19. The expected work by Algorithm M.1 is O(A_{γ′+1}).

Proof. The number of sites a ∈ R inspected by the algorithm is bounded using Lemmas 17 and 13. The s examined all have
d(s, p) ≤ d(s, a) + d(a, p) ≤ (2+γ) d(s),
and so all s examined have s →^{γ+2} p. The result follows.

4.7. Finding ⇒^{γ′} Relations of p

With knowledge of the sites for which p is a γ-nearest neighbor comes the knowledge of those sites for which p is a nearest neighbor. (Also the value of M_p is found.) These sites must have their nearest neighbor sites changed to p; however, it will be essential to retain knowledge of their former nearest neighbors when finding the sites a ∈ R for which p ⇒^{γ′} a or a ⇒^{γ′} p after p is added. (Note that the exposition below retains the → and ⇒ relations with respect to R, before p is added.) Since the relation g ⇒^{γ′} a depends on M_g, and M_g may change when p is added, the heaps for g must be updated for such changes. This issue is discussed in Section 4.8 below.

Algorithm M.2 (Find All a with p ⇒^{γ′} a or a ⇒^{γ′} p). For each g ∈ R and c ∈ S with g ←^1 c →^{<1} p, use H_g^⇒ to find all a with
g ⇒^{γ′} a and d(a, g) ≤ γ′ d(c, g),
and check if d(c, a) ≤ (γ′−1) d(c, p). If so, record p ⇒^{γ′} a. (This may not record all a for which p ⇒^{γ′} a, but it does record all those needed for the correctness condition; heuristically, the fewer a recorded, the better.)

Make the set G of sites g ∈ R with g ←^1 c for some c ∈ S with c →^α p. Using the algorithm of Section 4.9, build a set Ĝ ⊆ G, of size independent of n, with the property that for all c with c →^α p there is ĝ ∈ Ĝ with c →^α ĝ. Now for each ĝ ∈ Ĝ, use H_ĝ^⇐ to find all a with
a ⇒^{γ′} ĝ and M_a ≥ d(p, ĝ)/2α(1+γ),
and check if a ⇒^{γ′} p.

(The test for a ⇒^{γ′} p could use the more stringent condition that there is some s with a ←^1 s →^{γ′−1} p, and still be correct; moreover, all s →^1 a with d(a, s) ≥ d(a, p)/γ′ could be tested for s →^{γ′−1} p within the desired time bounds. This might give a heuristic improvement.)

This algorithm updates the subset of ⇒^{γ′} that is promised to be maintained.

Lemma 20 (Correctness of Algorithm M.2). The above algorithm records p ⇒^{γ′} a whenever there are s, c ∈ S with p ←^{<1} s →^γ c →^α a, and records a ⇒^{γ′} p whenever there are s, c ∈ S with a ←^1 s →^γ c →^α p.

Proof. Case I: p ←^{<1} c →^γ c′ →^α a. Here the site g ∈ R with c →^1 g has g ⇒^{γ′} a known, by induction. Moreover, d(c, a) ≤ (γ′−1) d(c, g) by Lemma 16, and so d(g, a) ≤ γ′ d(c, g) by Lemma 15. Hence the conditions under which a is examined for p ⇒^{γ′} a are necessary for that condition to hold. Also, if d(c, a) ≤ (γ′−1) d(c, p), then p ⇒^{γ′} a, using Lemma 15 and M_p ≥ d(c, p).

Case II: a ←^1 s →^γ c →^α p. By induction, any g ∈ R with c →^α g has a ⇒^{γ′} g recorded. By construction, there is some ĝ ∈ Ĝ examined with c →^α ĝ; also, using Lemma 15,
d(c) ≤ d(c, a) ≤ (1+γ) d(s),
and, using the triangle inequality,
d(ĝ, p) ≤ 2α d(c) ≤ 2α(1+γ) d(s) ≤ 2α(1+γ) M_a.
Hence the conditions under which a is examined for a ⇒^{γ′} p are necessary.

Lemma 21. Algorithm M.2 requires O(A_{2(γ′+1)}) expected time to find all a ∈ R with p ⇒^{γ′} a, and O(A_{γ″} |Ĝ|) to find all a ∈ R with a ⇒^{γ′} p, where γ″ ≡ 2 + γ + 3α(1+γ).

Proof. Each a examined in the first phase has g ←^1 c →^{<1} p for some c and g, and has d(a, g) ≤ γ′ d(c). By the triangle inequality, d(c, a) ≤ (γ′+1) d(c), and so the first claim follows from Lemma 14. Since a ⇒^{γ′} ĝ, d(a, ĝ) ≤ γ′ M_a, and since d(p, ĝ) ≤ 2α(1+γ) M_a, it follows that, for s realizing M_a,
d(s, p) ≤ d(s, a) + d(a, ĝ) + d(ĝ, p) ≤ (1 + γ′ + 2α(1+γ)) d(s, a);
the second claim follows from simplifying this expression, and applying Lemmas 17 and 13.
4.8. Heap Maintenance

As discussed in Section 4.5, the relations a ←^1 s and a ⇒^{γ′} b are represented using heaps for a ∈ R. When p is added, the heaps H_p^• are created, and some heaps H_a^• are updated to reflect relations involving p. First, a bound on the cost of this operation.

Lemma 22. The expected cost of adding the relations s →^1 p, a ⇒^{γ′} p, or p ⇒^{γ′} a is A_{2(γ′+1)} O(log n).

The relations ⇒^{γ′} continue to hold when p is added, except possibly for g ∈ R with g ←^1 c →^{<1} p for some c ∈ S. We next look at what heap maintenance must be done to reflect this. The entry for c in H_g^← can be deleted in O(log n) time. The heaps H_g^⇒ and H_a^⇐ have entries a with g ⇒^{γ′} a, and this relation may become false when p is added. When H_g^⇒ is used in Algorithm M.2, the a examined are close to g, and the analysis of Lemma 21 holds even when g ⇒^{γ′} a is false. Hence, no updating of H_g^⇒ need be done.

Finally, a change in H_a^⇐ in Algorithm M.2 must be considered. The approach we take is to do nothing to update H_a^⇐, but instead to verify relations when they are used. That is, we verify the relation a ⇒^{γ′} f claimed by H_f^⇐ in Algorithm M.1, and the relation a ⇒^{γ′} ĝ claimed by H_ĝ^⇐ in Algorithm M.2. If the key for a in H_ĝ^⇐ is still M_a, then no change in H_ĝ^⇐ is needed. If M_a is currently smaller than that key value, but still M_a ≥ d(a, ĝ)/γ′, then the relation still holds, but the key value must be updated and its heap position changed. If M_a < d(a, ĝ)/γ′, then a is deleted from H_ĝ^⇐. Similar maintenance can be done for H_f^⇐ in Algorithm M.1.

Lemma 23. Verifying and updating the relations in the heaps H_f^⇐ and H_ĝ^⇐ needs O(A_{γ″} |Ĝ| log n) expected time. Here γ″ is defined as in Lemma 21. This ignores the work in deleting entries a for which a ⇒^{γ′} ĝ is no longer true; however, this work is done once for each such relation, and therefore can be charged to the insertion time for a in H_ĝ^⇐.

Proof. This is simply the O(log n) work for each a examined by way of some H_ĝ^⇐. The time to verify H_f^⇐ is dominated by this.

4.9. Finding Ĝ

Next we see that if a sphere-packing bound holds, as discussed in Section 3.4, then the set Ĝ exists, and can be found efficiently. The algorithm is described within the proof of the following lemma.

Lemma 24. Suppose (V, d) has a sphere-packing bound. For p ∈ S and R ⊆ S, let S′ denote the set of c ∈ S with c →^α p with respect to R, where α > 4/3. Then there is a set Ĝ ⊆ R of size O(S_{3α} log_{3α/4} ϒ(p, S′)) such that for every c ∈ S′ there is ĝ ∈ Ĝ with c →^α ĝ.

Proof. Let α̂ ≡ 3α/4, and let
d_l ≡ min{d(c) | c ∈ S′},  d_h ≡ max{d(c) | c ∈ S′}.
Divide the interval (d_l, d_h) into ranges (α̂^k d_l, α̂^{k+1} d_l), for k = 0…⌈log_{3α/4} ϒ(p, S′)⌉. For each such k, consider the set S_k of c ∈ S′ such that d(c) ∈ (α̂^k d_l, α̂^{k+1} d_l]. Construct a packing S′_k ⊆ S_k using Theorem 5, such that d(c, S′_k) ≤ α̂^{k+1} d_l/3 for all c ∈ S_k. Since d(c, p) ≤ α d(c) ≤ α α̂^{k+1} d_l, the size of S′_k need be at most S_{3α}. Now for each c ∈ S_k, there is some c′ ∈ S′_k with
d(c, c′) ≤ α̂^{k+1} d_l/3 ≤ α d(c)/4,
and so if c′ →^1 g′, then
d(c, g′) ≤ d(c, c′) + d(c′, g′) ≤ α d(c)/4 + α̂^{k+1} d_l ≤ α d(c)/4 + 3α d(c)/4 = α d(c).
Then
\( \hat G \equiv \bigcup_k \{\, g' \in R \mid c' \to^1 g',\ c' \in S'_k \,\} \)
satisfies the conditions of the lemma.

Lemma 25. The algorithm for finding Ĝ needs
\( O(|G|\,|\hat G|) = O(A_\alpha\,|\hat G|) \)
expected time.

Proof. This is clear from the algorithm, and the bound on |G| is from the definition of A_α.
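A sketch of the construction in the proof of Lemma 24, under stated assumptions: names are hypothetical, sites are assumed hashable (e.g., coordinate tuples), and the packing step uses a simple greedy sweep, which also yields at most S_{3α} representatives per group, since its representatives are pairwise farther apart than the covering radius.

```python
import math

def find_G_hat(S_prime, R, dist, alpha):
    """Build G-hat for Lemma 24: S_prime holds the sites c with p as an
    alpha-nearest neighbor with respect to R (alpha > 4/3 assumed)."""
    d_R = {c: min(dist(c, g) for g in R) for c in S_prime}       # d(c)
    nn = {c: min(R, key=lambda g: dist(c, g)) for c in S_prime}  # c ->1 g
    d_l = min(d_R.values())
    a_hat = 3 * alpha / 4                      # annulus growth factor > 1
    groups = {}
    for c in S_prime:                          # group S_k by annulus index k
        k = int(math.log(d_R[c] / d_l, a_hat)) if d_R[c] > d_l else 0
        groups.setdefault(k, []).append(c)
    G_hat = set()
    for k, S_k in groups.items():
        radius = a_hat ** (k + 1) * d_l / 3    # covering radius for S'_k
        reps = []                              # greedy packing S'_k
        for c in S_k:
            if all(dist(c, r) > radius for r in reps):
                reps.append(c)
        G_hat.update(nn[c] for c in reps)      # nearest neighbors of the reps
    return G_hat
```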
4.10. Concluding Analysis

Before considering the time complexity, we should note correctness.

Theorem 26. Algorithms M.1 and M.2 find the points for which a site p is γ-nearest in R.

Proof. This follows from Lemmas 18 and 20.

Theorem 27. The all nearest neighbor problem can be solved, for a metric space with a sphere-packing bound, in \( O(n(\log n)^2(\log\Upsilon(S))^2) \) expected time, for parameters α and γ satisfying 4/3 < α ≤ γ.

Proof. The bounds given by Lemmas 19, 21–23, and 25 are dominated by O(A_{max{γ″,2(γ′+1)}} |Ĝ| log n). Using the definition of A_β, we obtain an expected bound on the order of
N_{max{γ″,2(γ′+1)},ϒ} |Ĝ| (n/i) log n
for the cost of adding p to R of size i. The size of Ĝ is O(S_{3α} log_{3α/4} ϒ), from Lemma 24. Using this bound, the hypothesis that α > 4/3 is fixed, and summing over i up to n gives a total expected cost
\( O(n\log^2 n)\,N_{\max\{\gamma'',2(\gamma'+1)\},\Upsilon}\, S_{3\alpha}\log\Upsilon. \)
Finally, from Lemma 7, the total expected cost is
\( O(n\log^2 n)\,\bigl(S_{\max\{\gamma'',2(\gamma'+1)\}}\, S_{3\alpha}\log^2\Upsilon\bigr), \)
or, neglecting constants, \( O(n(\log n)^2(\log\Upsilon)^2) \).

Theorem 28. If Q and S are random subsets of Q ∪ S, and V has a sphere-packing bound, then M(S, Q) can be built in \( O(Kn)(\log n)^2(\log\Upsilon(S\cup Q))^2 \) expected time.

Proof. Apply the algorithm of this section to S ∪ Q_{Kn}, adding the elements of S first. The resulting γ-nearest neighbor information is exactly what is needed for the data structure.

5. Data Structure D(S)

This section describes another approach to the post office problem, in this same setting, but with some differences: the algorithm always returns the correct answer, and does not require Q, but it requires somewhat more space and query time. The algorithm can be analyzed using the exchangeability assumption, where a query point q is a random element of {q} ∪ S. The algorithm uses a divide-and-conquer scheme apparently somewhat like Brin's [B]. In this section the values N_{γ,ϒ(R)} are considered for various subsets R of S. We use the uniform bound N_{γ,ϒ(S)}, abbreviated just as N_γ. Recall from Lemma 7 that if (V, d) obeys a sphere-packing bound, then N_{γ,ϒ(S)} ≤ S_{2γ} lg ϒ(S).

Preprocessing. Given a set of n sites S ⊆ V, the data structure D(S) is built as follows: take a random subset R of S. Recursively build the data structure D(R) for R, and then find, for each p ∈ S∖R, its nearest neighbor in R. Let S_a^1 denote the sites in S∖R that have a ∈ R as the nearest neighbor. For each a ∈ R, build a list L_a of the sites of R, in nondecreasing order of their distance to a. For each c ∈ S_a^1, find the 3-nearest neighbors of c in R as follows: walk down the list L_a, checking for each site b ∈ R whether b is 3-nearest to c in R. Stop this traversal of L_a when d(a, b) > 4 d(c, a). For a ∈ R, let S_a^3 denote the set of sites in S with a as a 3-nearest neighbor. Recursively build D(S_a^3). This completes the preprocessing. The data structure D(S) comprises D(R), the lists L_a for a ∈ R, and the recursively built data structures D(S_a^3) for a ∈ R.

For convenience of analysis, the construction uses random subsets of size r_k ≡ n^{1/2^{k+1}} at recursion depth k, and terminates the recursion at k = l_n ≡ ⌈lg lg n⌉. (The construction of D(S) for the original input is at depth 0; when constructing D(S) at depth k, the constructions of the data structures D(S_a^3) and D(R) are at depth k+1.) The recursion bottoms out by simply storing the sites in a list, at recursion depth l_n, or when the set has fewer than some constant number of members.

Answering Queries. To find the nearest site to a point q, use the data structure D(R) to find the nearest neighbor a of q in R. Next find the 3-nearest neighbors of q in R as in the preprocessing. For each b that is a 3-nearest neighbor to q in R, recursively search D(S_b^3) for the closest site in S_b^3 to q. Return the closest site found by all such searches. The recursion bottoms out by searching the list of sites in linear time.

There are a few correctness conditions to verify.

Theorem 29. The query and preprocessing algorithms find all 3-nearest neighbors of q and of the sites in S∖R. The query procedure returns the closest site in S to q.
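A condensed sketch of the preprocessing and query procedures, under stated assumptions: names are hypothetical, sites are assumed hashable (e.g., coordinate tuples), the sample size is √|T| in place of the fixed schedule r_k = n^{1/2^{k+1}}, and members of R are handled only through D(R).

```python
import math, random

def build_D(S, dist, depth=0, max_depth=None):
    """Sketch of D(S): sample R, recurse on R, and recurse on the sets
    S_b^3 of sites having b as a 3-nearest neighbor in R."""
    if max_depth is None:
        max_depth = max(1, math.ceil(math.log2(math.log2(max(4, len(S))))))
    if depth >= max_depth or len(S) <= 8:
        return {"leaf": list(S)}                 # bottom of the recursion
    R = random.sample(S, max(2, round(math.sqrt(len(S)))))
    L = {a: sorted(R, key=lambda b: dist(a, b)) for a in R}   # lists L_a
    S3 = {a: [] for a in R}
    for c in S:
        if c in L:
            continue                             # members of R: via D(R)
        a = min(R, key=lambda b: dist(c, b))     # nearest neighbor of c in R
        for b in L[a]:                           # walk down L_a
            if dist(a, b) > 4 * dist(c, a):      # stopping rule
                break
            if dist(c, b) <= 3 * dist(c, a):     # b is 3-nearest to c in R
                S3[b].append(c)
    return {"R": R, "DR": build_D(R, dist, depth + 1, max_depth), "L": L,
            "kids": {b: build_D(S3[b], dist, depth + 1, max_depth)
                     for b in R if S3[b]}}

def query_D(node, q, dist):
    """Return the closest site to q stored under node."""
    if "leaf" in node:
        return min(node["leaf"], key=lambda s: dist(q, s))
    a = query_D(node["DR"], q, dist)             # nearest neighbor of q in R
    best = a
    for b in node["L"][a]:                       # 3-nearest neighbors of q in R
        if dist(a, b) > 4 * dist(q, a):
            break
        if dist(q, b) <= 3 * dist(q, a) and b in node["kids"]:
            cand = query_D(node["kids"][b], q, dist)
            if dist(q, cand) < dist(q, best):
                best = cand
    return best
```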
Rather than store L_a as a sorted list, as described above, it is enough to store the entries in a heap. With appropriate heap ordering relations, all members of L_a at a distance closer than a value X can be found in constant time per reported member, using a recursive traversal of the heap, just as for sorted lists. The heaps for sites a ∈ R can be built in constant time per entry, however, unlike sorted lists; this might be a bit faster in practice, and it simplifies the analysis.

Lemma 30. The expected size of a set of sites considered at recursion depth k is
\( O(N_3)^k\, n^{1/2^k}. \)

Proof. Suppose inductively that the expected size of the set W in the parent recursion, at depth k−1, is m = O(N₃)^{k−1} n^{1/2^{k−1}}, as is plainly true for k = 1. One branch of the recursion considers R ⊆ W of size n^{1/2^k}, which satisfies the claim. The other branches of the recursion apply to sets W_a^3, which have expected sizes O(N₃) m/r_{k−1}, using Lemma 10. (Note that the expectation for W_a^3 is with respect to the random choice of R, while the expectation for |W| is from random choices at lesser recursion depth; hence the expectations can be "mixed" in this fashion.) Since r_{k−1} ≡ n^{1/2^k}, the result follows.

Theorem 31. Suppose metric space (V, d) has a γ-nearest neighbor bound, and S ⊆ V is a set of n sites. Suppose q is a random element of {q} ∪ S. Then the expected time to find the closest site in S to q using D(S) is
\( N_5(\lg n)^{O(1)+2\lg N_3}. \)

Proof. Consider the query procedure for a set T at depth k in the recursion. Let Z(k) denote the expected time for the query procedure applied to a set at depth k, so that the expected time overall is Z(0). Let R be the random subset of T used at that step. The set T is a random subset of S, subject to the 3-nearest neighbor conditions used in passing down the recursion. Since q is a random element of {q} ∪ S, and subject to the same 3-nearest neighbor conditions (or else T would not be considered for q), q is a random element of {q} ∪ R. Thus the expected time to search D(R) is Z(k+1). Using the triangle inequality, each site b examined in finding the 3-nearest neighbors of q satisfies
d(b, q) ≤ d(b, a) + d(a, q) ≤ 5 d(a, q).
From Lemma 11, the expected work in examining such b is O(N₅). The expected work done recursively can be expressed as the sum, over b ∈ T, of Z(k+1), times the probability that b is in R and is a 3-nearest neighbor of q. The correctness of this claim requires the observation that q is a random element of {q} ∪ T_b^3. The probability that b is a 3-nearest neighbor of q in R is O(N₃)/|T|, using Lemma 11. The expected query time therefore satisfies
\( Z(k) \le O(N_5) + (1 + O(N_3))\,Z(k+1), \)
for k smaller than the recursion depth l_n and |T| large. From Lemma 30, the expected size of the sets T at the bottom level is O(N₃)^{l_n}, and this bounds the expected cost at the bottom level of recursion. The overall expected cost is
\( O(N_5)\,O(N_3+1)^{l_n}\,O(N_3)^{l_n}, \)
which implies the expected time bound claimed.

Theorem 32. Suppose metric space (V, d) has a γ-nearest neighbor bound, and S ⊆ V is a set of n sites. Then the expected time to build D(S) is
\( O(n)\,N_5(\lg n)^{O(1)+2\lg N_3}. \)

Proof. The general step of the algorithm builds a data structure for a set T at recursion depth k. Let Y(k) denote the expected time needed to build the data structure for such a set T.
Theorem 32. Suppose the metric space (V,d) has a \(\gamma\)-nearest neighbor bound, and \(S\subset V\) is a set of n sites. Then the expected time to build D(S) is \(O(n)\,N_5(\lg n)^{O(1)+2\lg N_3}\).

Proof. The general step of the algorithm builds a data structure for a set T at recursion depth k. Let Y(k) denote the expected time needed to build the data structure for such a set T. The expected cost of building D(R) is Y(k+1); the expected cost of finding the nearest neighbor in R of each site can be bounded by Z(k+1) per site; and the expected cost of finding the 3-nearest neighbors in R of each site is \(O(r_k^2) + O(N_5)|T|\), since each b examined in a list \(L_a\) for a site \(c\in T_a^1\) is a 5-nearest neighbor of c, and the number of such neighbors is bounded using Lemma 11, averaging over all \(c\in T\setminus R\). Construction of the heaps \(L_a\), for \(a\in R\), requires \(O(r_k^2) = O(n^{1/2^k})\) altogether. Finally, the work in building the data structures \(D(T_b^3)\) requires time proportional to the sum, over \(b\in T\), of the probability \(r_k/|T|\) that \(b\in R\subset T\), times the expected work Y(k+1) for \(T_b^3\). Thus the work Y(k) satisfies

\[ Y(k) \le Y(k+1) + Z(k+1)\,|T| + O(N_5)\,|T| + O(n^{1/2^k}) + r_k\,Y(k+1), \]

or, using the bound \(E|T| \le O(N_3)^k n^{1/2^k}\) of Lemma 30,

\[ Y(k) \le \bigl(Z(k+1) + O(N_5)\bigr)\,O(N_3)^k n^{1/2^k} + O(n^{1/2^k}) + \bigl(n^{1/2^{k+1}} + 1\bigr)\,Y(k+1), \]

with \(Y(l_n) \le O(N_3)^{l_n}\), giving the result claimed.

Theorem 33. Suppose the metric space (V,d) has a \(\gamma\)-nearest neighbor bound, and \(S\subset V\) is a set of n sites. Then the expected space for D(S) is \(O(n)(\lg n)^{O(1)+\lg N_3}\).

Proof. The total space required for all the lists \(L_a\) at each depth k is O(n); the total expected space required for the sets at the bottom of the recursion is \(n\,O(N_3)^{l_n}\), with \(l_n \equiv \lceil\lg\lg n\rceil\), and this gives the bound.

5.1. Remarks

The algorithm recursively builds data structures for the sets \(S_b^3\), the sites having b as a 3-nearest neighbor in R, rather than for the sets \(S_b^1\), those having b as nearest neighbor. The query procedure would still be correct if the latter were done, and the space required would be O(n), since the sites would be partitioned at each step. So far, though, it has not been clear how to analyze such an attractive algorithm.

It is a bit artificial to have a fixed schedule of sample sizes, \(r_k \equiv n^{1/2^{k+1}}\); a more natural sample size would simply be the square root of the set size |T|. This seems to be more difficult to analyze, and still seems artificial: why not use a constant sample size, or n/2? However, a smaller sample size would give an even bigger blowup in cost, due to the imperfect nature of the divide-and-conquer scheme used, while a larger sample size runs into the apparent need for \(\Omega(r^2)\) space, to hold the lists \(L_a\) that are used to find 3-nearest neighbors. So while the choice of r is not tuned to be best possible, it seems close to the only alternative with the ideas available.

In \(\Re^k\) with an \(L_p\) norm, the expected query time takes the form \((\lg n)^{O(k)+\lg\lg\Upsilon}\).

5.2. \(\gamma\)-Queries and Inverse Queries

The data structure can be extended in two ways that are sometimes useful. The first extension is to allow finding all \(\gamma\)-nearest neighbors of a query point, not just the nearest neighbor. The other extension is to allow inverse queries, for a point p and a subset \(R\subset S\), that return the sites \(z\in S\setminus R\) for which p is \(\gamma\)-nearest to z in R. Here we only sketch the straightforward extension to build a data structure \(D_\gamma(S)\) that allows \(\gamma\)-nearest neighbor queries: the change is basically to replace finding 3-neighbors by finding \((1+2\gamma)\)-neighbors, to maintain \(S_a^{1+2\gamma}\) instead of \(S_a^3\), to search the lists only while \(d(a,b) \le (2+2\gamma)\,d(q,a)\) instead of with a multiplier of 4, and to replace \(N_5\) in the analysis by \(N_{3+2\gamma}\), and \(N_3\) by \(N_{1+2\gamma}\).
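As an illustration of this change, the list walk from the earlier sketch generalizes as follows; gamma_nearest is a hypothetical helper, and setting gamma = 1 recovers three_nearest above.

```python
def gamma_nearest(q, a, L, d, gamma):
    """(1 + 2*gamma)-nearest neighbors of q in R, given q's nearest
    neighbor a in R.  The stopping rule (2 + 2*gamma) * d(q, a) follows
    from the triangle inequality, just as the multiplier 4 does when
    gamma = 1."""
    out = [a]
    for b in L[a]:
        if d(a, b) > (2 + 2 * gamma) * d(q, a):
            break
        if d(q, b) <= (1 + 2 * gamma) * d(q, a):
            out.append(b)
    return out
```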
Theorem 34. Suppose the metric space (V,d) has a \(\gamma\)-nearest neighbor bound, and \(S\subset V\) is a set of n sites. The expected time to build \(D_\gamma(S)\) is \(O(n)\,A\,(\lg n)^{O(1)+2\lg B}\) as \(n\to\infty\), and the expected query time is \(A\,(\lg n)^{O(1)+2\lg B}\). Here

\[ A = N_{3+2\gamma,\Upsilon}(S) \quad\text{and}\quad B = N_{1+2\gamma,\Upsilon}(S), \]

which are \(O(\log\Upsilon(S))\) as functions of S when (V,d) has a sphere-packing bound.

Given a random \(R\subset S\), it is possible to build a data structure for inverse neighbor queries: given a point p, quickly find all \(q\in S\setminus R\) such that p is a \(\gamma\)-nearest neighbor of q in \(\{p\}\cup R\). One approach is the following. Build the data structure \(D_{1+\gamma}(R)\), and use it to find the nearest neighbor, and all \((1+\gamma)\)-nearest neighbors, in R of each \(q\in S\setminus R\). For each \(c\in R\), find the set \(S_c^{1+\gamma}\) of the q that have c as a \((1+\gamma)\)-nearest neighbor in R, and the analogous set \(S_c^1\). Store these sets in heaps, so that all q in them with d(q,c) greater than some value can be found in constant time per q. With this preprocessing, for a given p, find its \((1+\gamma)\)-nearest neighbors in R, including its nearest neighbor a. Examine all \(q\in S_a^{1+\gamma}\) that have \(d(a,q)\ge d(a,p)/(1+\gamma)\), and check whether they have p as a \(\gamma\)-nearest neighbor. For each \(b\in R\) that is \((1+\gamma)\)-nearest to p in R, examine the set \(S_b^1\) of the q that have b as nearest in R, and check each such q that has \(d(b,q)\ge d(a,p)\) and \(d(b,q)\ge d(b,p)/(1+\gamma)\), to see whether p is a \(\gamma\)-nearest neighbor of q. This completes the query procedure.

Theorem 35. The inverse neighbor procedure finds all q with p \(\gamma\)-nearest in \(\{p\}\cup R\).

Proof. In the procedure, suppose a is nearest to p, b is nearest to q, and p is a \(\gamma\)-nearest neighbor of q. If \(d(a,p)\ge d(b,q)\), then p has b as a \((1+\gamma)\)-nearest neighbor, using the triangle inequality, while if \(d(a,p)\le d(b,q)\), then q has a as a \((1+\gamma)\)-nearest neighbor. Moreover, if \(d(p,q)\le\gamma\,d(b,q)\), then \(d(p,b)\le d(p,q)+d(q,b)\le(1+\gamma)\,d(b,q)\), and similar reasoning applies to \(q\in S_a^{1+\gamma}\) with \(d(p,q)\le\gamma\,d(a,q)\).

Theorem 36. Suppose the metric space (V,d) has a \(\gamma\)-nearest neighbor bound, and \(S\subset V\) is a set of n sites. Then the preprocessing time for the above data structure for inverse \(\gamma\)-nearest neighbor queries is the same as for \((1+\gamma)\)-neighbor queries, and if p is random in \(\{p\}\cup S\), then the expected query time is the time for a \((1+\gamma)\)-neighbor query, plus \(O(N_{1+\gamma,\Upsilon}(S)) + O(N_{2+\gamma,\Upsilon}(S))\,|S|/|R|\).

Proof. Every b examined is a \((1+\gamma)\)-nearest neighbor of p, and the expected number of such b is bounded by Lemma 11. Every q examined has p as a \((2+\gamma)\)-nearest neighbor, and is examined at most twice, once as a member of \(S_b^1\) and once as a member of \(S_a^{1+\gamma}\). The result follows from Lemma 10.
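The inverse-query procedure can be sketched as follows. The function and its inputs are hypothetical conveniences: S1[c] and S1g[c] hold the sets \(S_c^1\) and \(S_c^{1+\gamma}\) sorted by decreasing distance to c (standing in for the heaps), nn[q] is q's precomputed nearest neighbor in R, and brute-force scans of R stand in for the \(D_{1+\gamma}(R)\) searches.

```python
def inverse_query(p, R, S1, S1g, nn, d, gamma):
    """Sketch: return all q in S \\ R with p gamma-nearest to q in
    {p} u R, i.e. d(p, q) <= gamma * d(q, nn[q])."""
    a = min(R, key=lambda c: d(p, c))       # nearest neighbor of p in R
    dap = d(a, p)
    out = set()
    # Candidates q with a as a (1+gamma)-nearest neighbor in R.
    for q in S1g.get(a, []):
        if d(a, q) < dap / (1 + gamma):
            break                           # remaining q are closer to a
        if d(p, q) <= gamma * d(q, nn[q]):  # p is gamma-nearest to q
            out.add(q)
    # Candidates q whose nearest neighbor b in R is (1+gamma)-near to p.
    for b in R:
        if d(p, b) > (1 + gamma) * dap:
            continue                        # b not (1+gamma)-nearest to p
        for q in S1.get(b, []):
            if d(b, q) < max(dap, d(b, p) / (1 + gamma)):
                break
            if d(p, q) <= gamma * d(q, nn[q]):
                out.add(q)
    return out
```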
Conclusions

This paper has shown that nearest neighbor problems can be solved efficiently, even when no information about the distance measure is given explicitly. Versions of some of the algorithms given here have been implemented, with promising results that will be reported elsewhere. It is worth noting that the failure probability and query time analyses of \(M(S,Q)\) do not depend on the triangle inequality, but only on the sphere-packing properties, so there could be a space (V,d) that is not even a metric space, but for which these results hold, or for which the algorithms are effective in practice. While it is satisfying that the preprocessing for \(M(S,Q)\) is nearly linear, the time bound is higher than one would like, and, worse, the space for the construction degrades to \(\Omega(n^2)\) in high dimension. Is there a different construction without this flaw?

Acknowledgments

I would like to thank the anonymous referees for their thoughtful comments.

References

[B] S. Brin. Near neighbor search in large metric spaces. In Proc. 21st Internat. Conf. on Very Large Data Bases, pages 574–584, 1995.
[C1] K. L. Clarkson. Fast algorithms for the all nearest neighbors problem. In Proc. 24th Ann. IEEE Symp. on Foundations of Computer Science, pages 226–232, 1983.
[C2] K. L. Clarkson. A randomized algorithm for closest-point queries. SIAM J. Comput., 17:830–847, 1988.
[C3] K. L. Clarkson. Randomized geometric algorithms. In D.-Z. Du and F. K. Hwang, editors, Computing in Euclidean Geometry, pages 117–162. Lecture Notes Series on Computing, volume 1. World Scientific, Singapore, 1992.
[FC] C. Faloutsos and V. Gaede. Analysis of n-dimensional quadtrees using the Hausdorff fractal dimension. In Proc. 22nd Internat. Conf. on Very Large Data Bases, 1996.
[FS] C. D. Feustel and L. G. Shapiro. The nearest neighbor problem in an abstract metric space. Pattern Recognition Lett., 1:125–128, December 1982.
[MR] R. Motwani and P. Raghavan. Deferred data structuring: query-driven preprocessing for geometric search problems. In Proc. 2nd Ann. ACM Symp. on Computational Geometry, pages 303–312, 1986.
[M] K. Mulmuley. Computational Geometry: An Introduction Through Randomized Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1993.
[U] J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Inform. Process. Lett., 40:175–179, 1991.
[V] P. M. Vaidya. An \(O(n\log n)\) algorithm for the all-nearest-neighbors problem. Discrete Comput. Geom., 4:101–115, 1989.
[Y] P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proc. 4th ACM–SIAM Symp. on Discrete Algorithms, pages 311–321, 1993.

