Nearest Neighbor Queries in Metric Spaces
Discrete Comput Geom
K. L. Clarkson
Bell Laboratories, Lucent Technologies, Murray Hill, NJ 07974, USA
Given a set S of n sites (points), and a distance measure d, the nearest neighbor searching problem is to build a data structure so that given a query point q, the site nearest to q can be found quickly. This paper gives data structures for this problem when the sites and queries are in a metric space. One data structure, D(S), uses a divide-and-conquer recursion. The other data structure, M(S, Q), is somewhat like a skip list. Both are simple and implementable. The data structures are analyzed when the metric space obeys a certain sphere-packing bound, and when the sites and query points are random and have distributions with an exchangeability property. This property implies, for example, that query point q is a random element of S ∪ {q}. Under these conditions, the preprocessing and space bounds for the algorithms are close to linear in n. They depend also on the sphere-packing bound, and on the logarithm of the distance ratio Υ(S) of S, the ratio of the distance between the farthest pair of points in S to the distance between the closest pair. The data structure M(S, Q) requires as input data an additional set Q, taken to be representative of the query points. The resource bounds of M(S, Q) have a dependence on the distance ratio of S ∪ Q. While M(S, Q) can return wrong answers, its failure probability can be bounded, and is decreasing in a parameter K. Here K ≤ |Q|/n is chosen when building M(S, Q). The expected query time for M(S, Q) is O(K log n) log Υ(S ∪ Q), and the resource bounds increase linearly in K. The data structure D(S) has expected O(log n)^{O(1)} query time, for fixed distance ratio. The preprocessing algorithm for M(S, Q) can be used to solve the all nearest neighbor problem for S in O(n(log n)²(log Υ(S))²) expected time.

This paper addresses algorithmic questions related to the post office or nearest neighbor
problem:
Let (V, d) be a metric space, where V is a set and d is a distance measure on V.
Given a set S ⊂ V of n sites (points), build a data structure so that given a query
point q ∈ V, the nearest site to q can be found quickly.
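For concreteness, the naive solution that the data structures below improve on simply scans all sites. A minimal sketch, with illustrative function names and sample points that are not from the paper, treating the distance measure d as a black box:

```python
# Naive nearest neighbor search: treat the distance measure d as a
# black box and scan all n sites.  Illustrative names, not from the paper.

def nearest_neighbor(S, q, d):
    """Return the site in S nearest to q under distance d, in O(|S|) time."""
    return min(S, key=lambda p: d(q, p))

def l2(a, b):
    """Euclidean distance, one possible black-box d."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

sites = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
print(nearest_neighbor(sites, (0.9, 1.2), l2))  # (1.0, 1.0)
```

Any metric, such as edit distance on strings, can be substituted for `l2`; the point of the data structures below is to avoid the O(n) scan.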
Two data structures are given here for this problem. One data structure, M(S, Q), requires
an additional set Q of m points, taken to be representative of typical query points. The
data structure M(S, Q) may fail to return a correct answer, but the failure probability
can be made arbitrarily small. This decrease requires a proportional increase in query
time and data structure space. The other data structure, D(S), always returns a correct
answer and does not need the set Q, but its provable resource bounds are worse.
The data structure M(S, Q) has been implemented, with some preliminary tests. For
example, when the points of S, Q, and q are uniformly distributed in a square, and the
distance measure is Euclidean (ℓ₂), a version of the data structure gives the correct answer
in all of 500 tests. For this version, searching requires about 21 distance evaluations
for |S| = 2000, and the space required is about 8 integers/site. (The data structure
implemented uses K = 1000, γ = 1.2, and uses an "early termination" to a collection
of sites to be searched by brute force.) Note that the algorithm uses the distance measure
as a "black box." With 4000 similarly distributed points in ℜ²⁰, an average search time
of 604 distance evaluations gives a site for which about 0.6 sites are closer, on average,
and at an average distance 2% farther than the true nearest neighbor distance.
The provable bounds require some general conditions on the data, and on the metric
space. The failure probability bounds hold for M(S, Q) when q is a random element of
Q ∪ {q}; we call this requirement on q and Q an exchangeability condition. When Q and
q are each generated by some random process, each with its own probability distribution,
the exchangeability condition means that if we are given Q and q, and then take a random
element of Q ∪ {q}, then that element will have the same probability distribution as q.
The bounds on the query time hold when Q, S, and {q} satisfy the exchangeability
condition that they are each random subsets of Q ∪ S ∪ {q}.
The proofs of the bounds also require that the metric spaces have certain nearest
neighbor bounds, which are implied by sphere-packing bounds; such properties are
described in Section 3.4. In particular, ℜ^k has these properties under L_p metrics. The
constants associated with these properties appear in the bounds for the algorithms, and
are in general exponential in k.
Many of the provable bounds also involve the distance ratio, denoted Υ(T) for a set
T, which is the ratio of the distance between the farthest pair of points in T to the distance
between the closest pair in T. The quantity Υ(S) appears in bounds for D(S), and
the quantity Υ(S ∪ Q) appears in bounds for M(S, Q). The dependence is relatively
mild, generally O(log Υ)^{O(1)}. It is reasonable in practice to assume the rough relation
Υ = n^{O(1)}, implying that "all logs are equal."
The preprocessing for D(S) needs
O(n)(lg n)^{O(lg lg Υ(S))}
expected time, resulting in a data structure that answers queries q with q ∈_R S ∪ {q},
that is, q a random element of S ∪ {q}, in expected O((lg n)^{O(1)+2 lg lg Υ(S)}) time,
and needing O(n(lg n)^{O(1)+2 lg lg Υ(S)}) expected space. (The dependence on the
sphere-packing bounds for the metric space is suppressed here.)
The preprocessing for M(S, Q) needs
O(K n(log n)²)(log Υ(S ∪ Q))²
expected time, yielding a data structure that answers queries in
O(K ln n) lg Υ(S ∪ Q)
expected time, with failure probability O(log² n)/K, and needing space
O(K n log Υ(S ∪ Q)).
Either preprocessing method also solves the all nearest neighbor problem: find, for
each site in S, the nearest other site. Algorithms for the all nearest neighbor problem that
need O(n log n) time have been known since 1983 for ℜ^k [C1], [V], but the algorithms
given here are the first with near-linear bounds that use the distance measure alone; in
particular, they do not use quadtrees or quadtree-like spatial subdivisions as in previous
work.
1.1. Why Metric Spaces?
Recall that if (V, d) is a metric space, then, for all a, b, c ∈ V, d(a, a) = 0, d(a, b) =
d(b, a), and the triangle inequality d(a, c) ≤ d(a, b) + d(b, c) holds.
The approach taken here is to find algorithms that can be applied to general metric
spaces, but that have provable properties for important special cases. There are several
reasons for considering the nearest neighbor problem in this generality. While
closest-point problems have applications in statistics, data compression, information retrieval,
and other areas, many such applications are in a high-dimensional setting, for which
almost all known solutions either give a slowdown over the naive algorithm, or use too
much space. On the other hand, high-dimensional data often has useful structure. It
may lie in a lower-dimensional hyperplane or flat, or manifold, for example. Still, such
structure is not always readily apparent, as when the data lies in a manifold that is not flat,
or has fractal properties [FC]. Hence it is desirable to seek algorithms whose complexity
does not depend strictly on the dimension, or on using most coordinates, but rather on
intrinsic properties of the data.
There are other reasons to consider general metric spaces: some distance measures
are not applied to spaces with real coordinates, such as edit distance on strings. (We
should note, however, that strings do not seem to have a useful sphere-packing bound.)
Moreover, some spaces of sites and queries have coordinates that are restricted to small
sets, so that intuitions or constructions from Euclidean space are not necessarily helpful.
If all coordinates are 0 or 1, then distance evaluations between points are faster on
real machines, using bit-twiddling, than other coordinate-wise operations. An algorithm
that uses the distance measure only as a "black box" will probably not be too
complicated: there are no operations on which such complexity can depend. Finally,
from a theoretical point of view, it seems reasonable to strip the problem down to its
essentials, and find the minimal properties needed for fast algorithms.
Algorithms operating in general metric spaces have been considered for some time
[FS], and there are some more recent contributions as well [B], [U], [Y]. Most of the
proposed data structures are trees, and some are variations on the k-d tree, perhaps the most
important practical method for nearest neighbor searching in ℜ^k, for small dimension k.
The data structure D(S) uses divide-and-conquer in a somewhat similar way; in contrast,
the data structure M(S, Q) is somewhat akin to a skip list.
1.2. Why Q?
The need for the set Q for data structure M(S, Q) contrasts unfavorably with most earlier
work on the post office problem, where there is no such limitation. Also, the analysis
requirement that q have the same distribution as Q and S is restrictive. However, in some
significant applications, such as vector quantization or nearest neighbor classification,
the queries have the same or similar distribution as the sites, and so the set Q is simply
more training data.
In other cases, say for coordinate spaces, the representative queries that are needed
can at least be generated uniformly within a region containing the sites. Also, such a
set Q is available in a "deferred data structuring" setting [MR]. Here no preprocessing
is done, and a single query is answered with the naive algorithm. As more queries are
answered, however, the set Q is built up, and the data structures described here can be
used to speed up subsequent queries, under the assumption that future queries have a
similar distribution to past ones.
2. Outline
After some motivating remarks, the next section describes M(S, Q), and gives failure
probability bounds, first for general metric spaces, and then for spaces that have a certain
γ-dominator bound. Such a bound is implied by sphere-packing bounds, discussed in
Section 3.4, which also discusses γ-nearest neighbor bounds, also implied by
sphere-packing bounds. Such γ-nearest neighbor bounds are used in Section 9, and in the
following subsection addressing space bounds for M(S, Q). A provably fast
preprocessing algorithm for M(S, Q) is discussed in Section 4, including some interesting relations
holding for γ-nearest neighbors of β-nearest neighbors. The data structure D(S) is
discussed in Section 4, which also includes the use of the data structure for inverse queries,
where the sites that have a given point as nearest neighbor are wanted.
3. Data Structure M(S, Q)

3.1. Motivation
The original idea for the data structure goes back to a paper on the Euclidean version of
the problem [C2], and probably further than that. The idea is to use a small subset R ⊂ S
to bound the search: for a given query q, the distance of q to its nearest neighbor in R gives
an upper bound on the distance to its nearest neighbor in S. That is, the nearest site to q
in S is contained within the nearest neighbor ball B(q, R) of q with respect to R.

Definition (B(q, R)). For metric space (V, d) with R ⊂ V, the nearest neighbor ball
of q with respect to R is
{x ∈ V | d(q, x) ≤ d(q, y) for all y ∈ R∖{q}}.
See Fig. 1.
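The membership test behind this definition is straightforward to state as code. A small sketch, with illustrative names, treating d as a black box (the factor `gamma` anticipates the expanded balls used later in this section):

```python
def l2(a, b):
    """Euclidean distance, standing in for the black-box metric d."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def in_ball(x, q, R, d, gamma=1.0):
    """Membership test for the ball: d(q, x) <= gamma * d(q, y) for all
    y in R other than q itself.  With gamma = 1 this is the nearest
    neighbor ball B(q, R) defined above."""
    return all(d(q, x) <= gamma * d(q, y) for y in R if y != q)

R = [(0.0, 0.0), (2.0, 0.0)]
q = (0.6, 0.0)                         # nearest member of R is (0, 0), at distance 0.6
print(in_ball((0.5, 0.3), q, R, l2))   # True: within 0.6 of q
print(in_ball((2.0, 0.0), q, R, l2))   # False: distance 1.4 > 0.6
```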
For a site p ∈ R, let C(p, R) be the union of balls B(q, R) over all (potential) query
points q with p closest in R. Then, for a given query point q, the knowledge that q is in
the Voronoi region Vor(p) of p, that is, that p is closest to q in R, allows the search for
nearest sites to q to be restricted to C(p, R) ∩ S. (See Fig. 2.)
This suggests a recursive construction, building a data structure for each C(p, R) ∩ S in
a similar fashion. This is the idea of "RPO trees" [C2]. Such a construction is predicated
on the hope that each C(p, R) ∩ S is small; this is false in general. However, in the
Euclidean case, when R is a random subset of S, and the Voronoi region Vor(p) is split
up into combinatorially simple regions, then such a region T will have C(T) ∩ S small,
with high probability [C2].
However, the latter construction seems to need a triangulation of the Voronoi diagram
of R, which is hard to extend to arbitrary metric spaces. Here we take a simpler approach,
and approximate C(p, R) with the union
C′(p, R) ≡ ⋃_{q′ ∈ Vor(p) ∩ Q} B(q′, R)
of nearest neighbor balls of members of Q. That is, we approximate the construction
of the region C(p, R) by approximating the Voronoi region of p with Vor(p) ∩ Q. Our
hope is that, for most of the query points q that will be encountered, if q is in Vor(p),
then the set C′(p, R) ∩ S will contain the nearest site to q. (See Fig. 3.)
To compensate for using Vor(p) ∩ Q to find C′(p, R), rather than Vor(p), we expand
each nearest neighbor ball by a given factor γ: we approximate C(p, R) by C′_γ(p, R),
which is the union of the balls B_γ(q, R). Such a ball is the expansion of B(q, R) by the
factor γ.
Definition (B_γ(q, R)). For metric space (V, d) with R ⊂ V, the γ-nearest neighbor ball of
q with respect to R is
{x ∈ V | d(q, x) ≤ γ d(q, y) for all y ∈ R∖{q}}.
Heuristically, this seems like a reasonable way to be more likely to get a correct
answer. We can say more, however, as the following lemma states.
Lemma 1. Suppose measure d satisfies the triangle inequality. If points q and q′ have
p nearest in R, and d(q, p) < d(q′, p), then B(q, R) ⊂ B₃(q′, R).

Proof. Suppose p′ is closer to q than p. Then
d(p′, q′) ≤ d(p′, q) + d(q, p) + d(p, q′) ≤ 3 d(p, q′).
If, for some query point q, the construction of C′₃(p, R) included a point q′ in
Vor(p) ∩ Q with d(q′, p) > d(q, p), then C′₃(p, R) ∩ S contains the nearest site to q.
(See Fig. 4.)
When the query points are distributed similarly to the points of Q, the probability that
no C′₃(p, R) will contain B(q, R) is bounded by |R|/|Q|: roughly, a random member
of Q has this probability of failing to be "covered" in this sense, and the query point will
behave the same way.
3.2. The Data Structure
The data structure M(S, Q) is based on these ideas, but uses a more incremental approach.
First, some definitions.

Definition (Nearest Neighbors). For metric space (V, d), set R ⊂ V, and point q ∈ V, the
nearest neighbor distance of q with respect to R is
d(q, R) ≡ min_{p ∈ R∖{q}} d(q, p).
A point p ∈ R realizing that distance is a nearest neighbor of q in R.

Definition (γ-Nearest Neighbors). Say that p ∈ V is a γ-nearest neighbor of q ∈ V
with respect to R if d(q, p) ≤ γ d(q, R).
That is, p is a γ-nearest neighbor of q with respect to R if and only if p ∈ B_γ(q, R).
It will be convenient at times to use still another notation for proximity.

Definition (q →_γ p). In a metric space (V, d), for q, p ∈ V, and R ⊂ V, let q →_γ p
denote the condition that p is a γ-nearest neighbor of q with respect to R, that is,
d(q, p) ≤ γ d(q, R), and so p ∈ B_γ(q, R). Also write p ←_γ q if and only if q →_γ p.

Armed with all these ways of saying the same thing, here is a succinct definition of
M(S, Q).
Construction of M(S, Q). The data structure M(S, Q) for nearest neighbor searching
is, for each site p_j ∈ S, a list of sites A_j. The data structure has a real parameter γ,
picked before construction.
Let (p₁, p₂, ..., p_n) be a random permutation of S, and let R_i ≡ {p₁, p₂, ..., p_i},
so that R_i is a random subset of S. Shuffle Q with a random permutation as well, so that
Q_j is a random subset of Q of size j, for j = 1 ··· m = |Q|. It will be helpful to define
Q_j ≡ Q for j > m.
Define A_j as
{p_i | i > j, and there exists q ∈ Q_{Ki} with p_j ←₁ q →_γ p_i, with respect to R_{i−1}}.
The sites in A_j are listed in increasing order of index i.
This is the whole definition of M(S, Q).
While it is satisfying that M(S, Q) can be so succinctly defined, a more detailed
discussion may be helpful. Consider the subsets R_i to be built incrementally, by picking
p_i randomly from S∖R_{i−1}, and adding it to R_{i−1}, yielding R_i. (Hereafter the numbering
of the sites will be fixed in this random order.)
Each list A_j starts out empty. When adding site p_i to make R_i, append p_i to A_j if
there is some q ∈ Q_{Ki} with p_j nearest to q in R_{i−1}, and p_i a γ-nearest neighbor of
q. That is, for q ∈ Q, maintain the nearest neighbor to q in R_j, for j = 1 ··· n. When
adding p_i, check each q ∈ Q, and record p_i as nearest where appropriate, and also add
p_i to A_j if q ∈ Q_{Ki} and p_i is γ-nearest to q.
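The incremental construction just described can be transcribed directly; the following is an illustrative sketch using O(nm) distance evaluations, not the paper's faster preprocessing method of Section 4, and all identifiers are invented for the sketch:

```python
import random

def l2(a, b):
    """Euclidean distance, standing in for the black-box metric d."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def build_M(S, Q, d, gamma=1.2, K=1, seed=0):
    """Sketch of the incremental construction: shuffle S and Q, then for
    each new site p_{i+1} and each q in Q_{K(i+1)}, append p_{i+1} to A_j
    when q's nearest site in R_i is p_j and p_{i+1} is a gamma-nearest
    neighbor of q with respect to R_i."""
    rng = random.Random(seed)
    S = list(S); rng.shuffle(S)       # random insertion order p_1, ..., p_n
    Q = list(Q); rng.shuffle(Q)       # Q_j = first j members of the shuffled Q
    A = [[] for _ in S]
    nn = [None] * len(Q)              # nn[t]: index of Q[t]'s nearest site in R_i
    for i, p in enumerate(S):         # p plays the role of p_{i+1}; R_i = S[:i]
        for t in range(min(len(Q), K * (i + 1))):
            q, j = Q[t], nn[t]
            if j is not None and d(q, p) <= gamma * d(q, S[j]):
                if not A[j] or A[j][-1] != i:
                    A[j].append(i)    # p is gamma-nearest to q; p_j is 1-nearest
            if j is None or d(q, p) < d(q, S[j]):
                nn[t] = i             # p becomes q's nearest site
    return S, A

sites = [(0, 0), (1, 0), (0, 1), (5, 5), (3, 1)]
reps = [(0.2, 0.1), (4.0, 4.0), (2.0, 1.0)]
S_perm, A = build_M(sites, reps, l2, gamma=3.0, K=10)
```

By construction every index on A_j exceeds j and the lists are in increasing order, matching the definition above.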
What does this construction mean? Note that, in the terminology of the last subsection,
we add p_i to A_j when p_i ∈ C′_γ(p_j, R_{i−1}), our approximation to the set of sites that may
be nearer to some query point q if q has p_j nearest in R_{i−1}.
Consider the special case where γ = 1, S ⊂ ℜ^k, Q = ℜ^k, K → ∞, and the
distributions of S and Q are uniform in a box. Here the construction puts p_i on A_j just
when p_i takes away part of the Voronoi region of p_j, as witnessed by a member of Q_{Ki};
for large enough K, p_i is a Delaunay neighbor of p_j in R_i.
The Search Procedure for M(S, Q). The search procedure is as follows: given query
point q, start with site p₁ as the candidate closest site to q. Walk down A₁, until a site
p_j closer to q than p₁ is found. Now p_j is the candidate closest site to q; do a similar
walk down A_j. Repeat until the A_k list for some site p_k is searched, and no site on that
list is closer than p_k. Terminate the search and return p_k as the closest site.
This is the search procedure.
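A direct transcription of this walk, under the simple reading that each new candidate's list is scanned from its start (illustrative names; S and A are a hand-built toy instance, not output of the construction):

```python
def l2(a, b):
    """Euclidean distance, standing in for the black-box metric d."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def search_M(S, A, q, d):
    """Walk of the lists A_j described above: start at p_1, scan the current
    candidate's list for a strictly closer site, jump to it, and stop when a
    whole list is scanned without improvement.  The candidate's distance to q
    strictly decreases at every jump, so the walk terminates."""
    k = 0                             # index of the candidate closest site (p_1)
    improved = True
    while improved:
        improved = False
        for i in A[k]:                # walk down A_k in increasing index order
            if d(q, S[i]) < d(q, S[k]):
                k = i                 # p_i becomes the new candidate
                improved = True
                break
    return S[k]

# Hand-built toy instance (sites and lists invented for illustration):
S = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]
A = [[1, 2], [2], []]
print(search_M(S, A, (6.0, 5.0), l2))   # (5.0, 5.0)
```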
Note that, when K = ∞, so that Q_{Ki} = Q for all i, this search returns by construction
the correct closest site for all elements of Q. When the search procedure is applied to
some q ∈ Q, if p_i appears as a candidate closest site to q, then p_i is closest to q in R_i.
The query time and the probability of returning the correct nearest site increase with
K and γ. The average number of members of Q_{Ki} with p_j nearest in R_i is K.
There are two uses of randomness here: for picking R_i, and for picking Q_{Ki}. The use
of random subsets of the sites is similar to its use in some other randomized geometric
algorithms [M], [C3], and helps to speed up the answering of queries. The random subset
Q_{Ki}, on the other hand, serves as a proxy for members of some universe of queries, and
aids correctness; intuitively, any possible query point q will have a member q′ of Q_{Ki}
not too far away, and hence a site nearest to q will be near to q′.
3.3. Failure Probability Analysis
The following bound on the failure probability holds without any restrictions on the
metric space.
Theorem 2. Suppose Q and q are such that q is a random element of Q ∪ {q}. Assume
Kn < m = n^{O(1)}. If M(S, Q) is built with γ = 3, then the probability that it fails to
return a nearest site to q in S is O(log² n)/K.
The proof depends on Lemma 1 and the following lemma.
Lemma 3. Define Q′ ≡ Q_{Ki} ∪ {q}. Assume Kn < m = n^{O(1)}. With probability
1 − 1/n², for every q′ ∈ Q′, the number of sites closer to q′ than its nearest neighbor in
R_i is at most κ = O(log n) n/i.

Proof. The set Q′ is fixed with respect to the random choice of R_i. Suppose q′ ∈ Q′
and p ∈ S are such that there are k points of S closer to q′ than p is. The probability that
p ∈ R_i and also that p is the closest site in R_i to q′ is no more than
binom(n−k−1, i−1) / binom(n, i) ≤ (i/n) e^{−k(i−1)/(n−1)}.
Since the number of such pairs p and q′ is no more than (Ki + 1)(n − k) < mn, the
claim follows for k > κ ≡ (3 ln n + ln m)(n − 1)/(i − 1) = O(log n) n/i.
Proof of Theorem 2. Let q be a random query point. Consider the construction of the
data structure when point p_{i+1} is added to make R_{i+1}, so that p_{i+1} is possibly added to some
A_j lists. Suppose p_j is closest to q in R_i, but p_{i+1} is closer. Then the query procedure
should change the candidate closest site from p_j to p_{i+1}, but the procedure is not certain
to do so if p_{i+1} is not in A_j. Conversely, if such appropriate entries in the A_j lists are
present, for each j with 1 ≤ j ≤ n, then the query for q will be answered correctly.
This implies that the probability that the construction fails to produce a data structure
that answers a query correctly is no more than the sum over i, for i = 0 ··· n − 1, of the
probability that a failure occurs when p_{i+1} is added.
Thus we seek an upper bound on the probability that, when p_{i+1} is added:
(1) with respect to R_i,
p_{i+1} ←_{<1} q →₁ p_j,
but
(2) there is no point v ∈ Q_{Ki} with
p_{i+1} ←₃ v →₁ p_j.
Let Q′ ≡ Q_{Ki} ∪ {q}. For each p_k ∈ R_i, let v_k be the point in Q′ that has p_k as nearest
neighbor in R_i, and that has maximum distance to p_k among all such points. If (1) holds,
and q ≠ v_j, then (2) holds, by Lemma 1. (Put q′ = v_j, q = q, and p = p_j in the lemma.)
To bound the probability that q = v_j for some j, use the condition of the theorem that q
is a random element of Q′. The probability that a random element of Q′ is one of the
i points v_j is
i/|Q′| = i/(Ki + 1) < 1/K,
and so the probability of (2), given (1), is at most 1/K.
If some q′ ∈ Q′ has more than κ = (3 ln n + ln m)(n − 1)/(i − 1) sites closer to it
than its nearest neighbor in R_i, consider the addition of p_{i+1} a failure; this occurs with
probability at most 1/n², by Lemma 3. Suppose that the addition of p_{i+1} is not a failure for this
reason, so that every q′ ∈ Q′ has fewer than κ sites closer to it than its nearest neighbor
in R_i. The probability of (1) is then at most κ/n, for q any member of Q′, and so the probability
of failure when adding p_{i+1} is at most
1/n² + (κ/n)(1/K) = O(log n)/(iK)
for i > 0; using the trivial bound for small i, and adding this for i up to n, bounds the
overall failure probability by O(log² n)/K.
A similar bound holds when γ < 3, for metric spaces that satisfy an additional
condition, which can be expressed using the following definition.

Definition (γ-Dominating). For a, b, c ∈ V, say that c γ-dominates b (with respect
to a) if B(b, {a}) ⊂ B_γ(c, {a}).
The necessary assumption is:

Definition (γ-Dominator Bounds). Say that a metric space V has a γ-dominator bound
if there is a value D_γ such that, for any a ∈ V and Q ⊂ V, there is a set Q̂ ⊂ Q of size
no more than D_γ, such that for every b ∈ Q, there is c ∈ Q̂ such that c γ-dominates b.

In other words, a set C(p, R), as described in Section 3.1, is contained in an
approximation C′_γ(p, R) generated by a finite set Q̂.
For any metric space, D₃ = 1. Also, as described in Theorem 6 below, a metric space
that has a sphere-packing bound also has a γ-dominator bound.
For ℜ^k as an L_p space, the value D_γ is, in general, exponential in the dimension k.
However, many "naturally occurring" point sets have structure, and that structure may
imply that D_γ is much smaller than the worst-case bound for ℜ^k.
Theorem 4. Suppose metric space (V, d) has a γ-dominator bound, and Q and queries
q are such that q is a random element of Q ∪ {q}. Assume Kn ≤ m = n^{O(1)}. Suppose
M(S, Q) is built for some value of γ. Then the probability that M(S, Q) fails to return
the nearest site in S is O(D_γ log² n)/K.

Proof. The proof follows exactly as for Theorem 2, except for the probability of (2),
that is, the probability that no member of Q_{Ki} γ-dominates q. Here the number of members
of Q′ needed to γ-dominate all members of Q′ is D_γ i, rather than just i, and so the
probability that (2) holds is D_γ i/(Ki + 1) ≤ D_γ/K.
3.4. Sphere Packing, γ-Dominators, γ-Nearest

The mysterious "γ-dominator" bound above is implied by the less mysterious
sphere-packing bounds discussed in this section.
The sphere-packing bounds also imply some γ-nearest neighbor bounds, which are
needed to prove bounds on the query and preprocessing times. While the γ-dominator
bounds suggest the interest of considering sites that are γ-nearest to points in Q, the
γ-nearest neighbor bounds help show that the number of such γ-nearest sites is not too
large.
The properties described in this section hold for L_p spaces. It is worth emphasizing
that the properties are not used in the data structures, but only in their analysis.
Definition (Sphere Packing). The space (V, d) has a sphere-packing bound if the
following holds: for any real number ρ, there is an integer constant S_ρ such that for all
a ∈ V and W ⊂ V, if |W| > S_ρ, and d(w, a) ≤ D for all w ∈ W, for some D, then
there are w, w′ ∈ W such that d(w, w′) < D/ρ.
There is a well-known relation between packing and covering, which we need; for
completeness, a proof is given here.

Theorem 5. If the metric space (V, d) has a sphere-packing bound, then for any set
W ⊂ V contained in a sphere of radius D, there is Ŵ ⊂ W of size S_ρ such that, for all
w ∈ W∖Ŵ, d(w, Ŵ) < D/ρ.

Proof. Build sets W_i = {w₁, w₂, ..., w_i} point by point, where W_{S_ρ} = Ŵ. Choose
w₁ arbitrarily from W, and then, for i ≥ 1, pick for w_{i+1} the w ∈ W that
maximizes d(w, W_i). Thus d(p, W_i) ≤ d(w_{i+1}, W_i) for all p ∈ W∖W_i, and the distances
d(w_{i+1}, W_i) are nonincreasing in i. When i = S_ρ, the sphere-packing bound implies
that d(w_{i+1}, W_i) ≤ D/ρ, and so d(p, W_i) ≤ D/ρ for all p ∈ W.
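The greedy choice in this proof (repeatedly add the point farthest from those already chosen) is easy to state as code. A sketch, under the assumption of a black-box distance d, with illustrative names:

```python
def greedy_cover(W, k, d):
    """Farthest-first traversal, as in the proof: w_1 is arbitrary (here W[0]),
    and each subsequent point maximizes its distance to those already chosen.
    After k rounds, every remaining point of W is within the last maximum
    distance of the chosen set."""
    chosen = [W[0]]
    while len(chosen) < k:
        chosen.append(max(W, key=lambda w: min(d(w, c) for c in chosen)))
    return chosen

# One-dimensional example with d = absolute difference:
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
dist = lambda a, b: abs(a[0] - b[0])
print(greedy_cover(points, 2, dist))   # [(0.0,), (10.0,)]
```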
Theorem 6. If the metric space (V, d) has a sphere-packing bound with constant S_ρ,
then the space has a γ-dominator bound with constant
D_γ ≤ 1 + ⌈log(μ)/log(1 − μ)⌉ S_{1/μ},
where μ ≡ (γ − 1)/(γ + 1).
Proof. If γ ≥ 3, we are done, so assume 1 < γ ≤ 3, and so 0 < μ ≤ 1/2.
Given γ > 1, a ∈ V, and Q ⊂ V, let q₀ maximize {d(q, a) | q ∈ Q}. For
i = 0 ··· k = ⌈log(μ)/log(1 − μ)⌉, let r_i ≡ (1 − μ)^i d(q₀, a). Let Q_i denote
{q ∈ Q | r_{i+1} < d(q, a) ≤ r_i}.
Apply the previous theorem to each Q_i, with ρ = 1/μ, obtaining sets Q̂_i. Then the set Q̂
needed for a γ-dominator bound is the union of the Q̂_i, together with {q₀}, as we next
show.
Suppose q ∈ Q_k, so d(q, a) < r_k ≤ μ d(q₀, a). Then, for any z ∈ V with d(q, z) ≤
d(q, a), by the triangle inequality
d(q₀, z) ≤ d(q₀, a) + d(a, q) + d(q, z) ≤ (1 + 2μ) d(q₀, a) ≤ γ d(q₀, a),
and so B(q, {a}) ⊂ B_γ(q₀, {a}).
Suppose q ∈ Q_i, so r_{i+1} < d(q, a) ≤ r_i. Then Q̂_i contains a point q′ such that
d(q, q′) ≤ μ r_i, and so if z has d(z, q) ≤ d(a, q), then
d(q′, z) ≤ d(q′, q) + d(q, z) ≤ μ r_i + r_i = (1 + μ) r_i,
and, since d(q′, a) ≥ r_{i+1} = r_i(1 − μ), we have
d(q′, z) ≤ ((1 + μ)/(1 − μ)) d(q′, a) = γ d(q′, a).
Hence B(q, {a}) ⊂ B_γ(q′, {a}).
Thus, for any q ∈ Q, there is some q′ ∈ Q̂ with B(q, {a}) ⊂ B_γ(q′, {a}), and a
γ-dominator bound holds.
We turn now to γ-nearest neighbor bounds, which weakly generalize the
long-established nearest neighbor bounds of Euclidean spaces:

Definition (Nearest Neighbor Bounds). Say that metric space (V, d) has a nearest
neighbor bound if there is a constant N₁ such that, for all a ∈ V and any W ⊂ V,
N₁ bounds the number of b ∈ W such that a is a nearest neighbor of b with respect to W.

Remark. The ℜ^k spaces under L_p norms have nearest neighbor bounds. A construction
using a fan of narrow cones about a can be used to prove this, or Lemma 7 below can
be applied.
For the purpose of analyzing the algorithms given here, it would be ideal if a point
a were a γ-nearest neighbor to a constant number of points of a given set. Unfortunately,
this may not be true, even for Euclidean spaces. However, a weaker condition does hold
for spaces having a sphere-packing bound, and we use that condition to prove algorithm
bounds. This is, roughly, that if a point a is a γ-nearest neighbor to many points, then
those points must be at a wide range of distances from a.

Definition (ν(x, R)). For x ∈ V and R ⊂ V, let ν(x, R) denote the ratio
max{d(x, y) | y ∈ R}/d(x, R)
of the distance from x to the farthest point of R to the distance from x to the nearest.

Definition (Neighbor Sets). For x ∈ V, and W ⊂ V, let N_γ(x, W) denote the points of
W for which x is a γ-nearest neighbor with respect to W. Let n_γ(x, W) ≡ |N_γ(x, W)|.
For γ ≥ 1 and given ν > 0, let
N_{γ,ν} ≡ max{n_γ(x, W) | x ∈ V, W ⊂ V, ν(x, W) ≤ ν},
if such a value exists.
Note that N_{1,ν} ≤ N₁ for any ν.

Definition (γ-Nearest Neighbor Bounds). Say that a metric space (V, d) has a
γ-nearest neighbor bound if it has a nearest neighbor bound, and also N_{γ,ν} exists for
all ν > 0.
Such a bound is implied by a sphere-packing bound.

Lemma 7. If a metric space (V, d) has a sphere-packing bound, then it also has a
γ-nearest neighbor bound, with N_{γ,ν} ≤ S_{2γ} lg ν.

Proof. Given x ∈ V and W ⊂ V, for i = 0 ··· ⌈lg ν(x, W)⌉, let
W_i ≡ {y ∈ W | 2^i d(x, W) ≤ d(x, y) ≤ 2^{i+1} d(x, W)}.
Now using Theorem 5, at most S_{2γ} points of W_i have x as a γ-nearest neighbor, and so
at most S_{2γ} lg ν(x, W) points of W have x as a γ-nearest neighbor.
The following technical lemma will be useful.

Lemma 8. For x, y ∈ V and R ⊂ V,
n_γ(x, R ∪ {y}) ≤ N_{γ,ν(x,R)} + 1.
That is, we can cheat in the quantity ν(·, ·), and not pay too much.

Proof. Trivial.

3.5. Query Time Analysis
We can now analyze the running time of the search procedure for M(S, Q).
First, a definition.

Definition (the Distance Ratio Υ). Given finite W ⊂ V, let Υ(W) be the distance ratio
max{d(x, y) | x, y ∈ W} / min{d(x, y) | x, y ∈ W, x ≠ y}.

Of course, ν(x, W) ≤ Υ(W) for any x ∈ W. All our bounds could use max{ν(x, W) |
x ∈ W} instead of Υ(W), but the latter has a simpler and more familiar definition.
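For small W, the distance ratio can be computed by brute force over all pairs; an illustrative sketch:

```python
from itertools import combinations

def l2(a, b):
    """Euclidean distance, standing in for the black-box metric d."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def distance_ratio(W, d):
    """Upsilon(W): the farthest pairwise distance divided by the closest
    pairwise distance, by brute force over all pairs (O(|W|^2) evaluations)."""
    dists = [d(x, y) for x, y in combinations(W, 2)]
    return max(dists) / min(dists)

print(distance_ratio([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)], l2))  # 5.0
```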
Theorem 9. Suppose metric space (V, d) has a γ-nearest neighbor bound. Suppose
Q, S, and {q} are all random subsets of Q ∪ S ∪ {q}. Suppose n = |S| and Kn ≤ m =
|Q| = n^{O(1)}. Then the expected work in answering a query for point q using M(S, Q),
built with parameter γ, given that the returned answer is correct, is
O(N_{γ,Υ} N₁ K ln n),
where Υ ≡ Υ(S ∪ Q).
Proof. The work done in answering a query is proportional to the number of members
of A_j lists that are considered in the search. When adding site p_{i+1} to make R_{i+1}, an
entry is made for p_{i+1} that will result in work for a query point q when there is y ∈ R_i
and q′ ∈ Q_{Ki} such that the following hold with respect to R_i:
(1) y is nearest to q;
(2) y is nearest to q′;
(3) p_{i+1} is γ-nearest to q′.
That is, in the "arrow" notation,
q →₁ y ←₁ q′ →_γ p_{i+1}.
We want to bound the expected number of such configurations. First observe that this
number is the sum, over q′′ ∈ Q_{Ki}, of the expected number of such configurations with
q′ = q′′. That is, the desired number is |Q_{Ki}| = Ki times the expected number of such
configurations, for a random choice of q′ from Q_{Ki}.
The exchangeability assumptions for S, Q, and {q} allow us to pick Q_{Ki}, R_i, and q
in a few steps: choose a random subset Q′ ≡ Q_{Ki} ∪ R_i ∪ {q} ⊂_R Q ∪ S ∪ {q}, then
choose R′ ⊂_R Q′ of size i + 2, then pick q′ ∈_R R′, and finally q ∈_R R′∖{q′}. The set
Q_{Ki} is then Q′ ∪ {q′} ∖ R′.
For a given p_{i+1}, the number of q′′ ∈ R′ with p_{i+1} as γ-nearest with respect to R′ is
N_{γ,ν(p_{i+1},R′)}, by Lemma 7, or 1 + N_{γ,ν(p_{i+1},R′∖{q})}, by Lemma 8, and so the probability
that a random q′ ∈ R′ has p_{i+1} γ-nearest is (1 + N_{γ,ν(p_{i+1},R′∖{q})})/(i + 2). Note that
ν(p_{i+1}, R′∖{q}) ≤ Υ(S ∪ Q), so the probability is at most (1 + N_{γ,Υ})/(i + 2).
For any given such q′, with y ∈ R_i nearest to q′ in R_i, the number of b ∈ R′∖{q′}
that have y as nearest neighbor is at most N₁, since (V, d) has a nearest neighbor bound.
The probability that a random q ∈ R′∖{q′} has y nearest is N₁/(i + 1).
The probability that (1)–(3) hold for random q and q′ is thus N_{γ,Υ} N₁/((i + 2)(i + 1)),
and so the expected number of configurations is Ki times this quantity, or O(K/i). This
bounds the work for random query q, associated with p_{i+1} on the A_j list for site y ∈ R_i.
Summing this bound for i + 1 from 1 to n yields the result.
3.6. Space Bounds for M(S, Q)

The following lemma is needed for proving space bounds, and for proving bounds on
the preprocessing time.

Lemma 10. Suppose (V, d) is a metric space that has a sphere-packing bound, and
|V| = n, with distance ratio Υ ≡ Υ(V). For R a random subset of V of size i, p ∈ V∖R,
and β > 1, the expected number of q ∈ V∖R with p as a β-nearest neighbor in R is
O(N_{β,Υ}) n/i = O(S_{2β} log Υ) n/i.
Proof. The proof is similar to that of Theorem 9.
We observe that the desired quantity is n − i times the probability that a random
choice q ∈_R V∖R has q →_β p.
The random choice of R ⊂_R V and q ∈_R V∖R is equivalent to picking random
R′ ⊂_R V of size i + 1, then picking q ∈_R R′, and finally setting R ≡ R′∖{q}.
By Lemma 7, the number of x ∈ R′ with p as β-nearest with respect to R′ is at most
N_{β,Υ}. The probability that q ∈_R R′ is one such point is at most N_{β,Υ}/(i + 1). The result
follows, multiplying by n − i.
The following lemma will be helpful later.

Lemma 11. For random q ∈ V, and random R ⊆_R V, the expected number of β-nearest neighbors of q with respect to R is

    O(N_{β,Υ}) = O(S_{2β} log Υ).

Proof. From Lemma 10, for each p ∈ R the number of q ∈ S with p as a β-nearest neighbor is O(N_{β,Υ}n/i), where i = |R|. Multiplying by i, which is the number of such p, and dividing by n, the number of such q, gives the result. The last equality in the lemma statement is from Lemma 7.
Theorem 12. Suppose Q and S are random subsets of Q ∪ S. When (V,d) has a sphere-packing bound, the expected space used by the data structure M(S,Q) is O(S_{2γ} log Υ(S ∪ Q))Kn.

Proof. When p_{i+1} is added, it is added to the A_j lists of those p_j for which there is some q ∈ Q_{i+1} that has p_j as nearest neighbor in R_i and p_{i+1} as a γ-nearest neighbor in R_{i+1}. Each such q ∈ Q_{i+1} yields at most one entry in some A_j list; the expected number of such entries is bounded by applying the previous lemma to R_i as a random subset of V = R_i ∪ Q_{i+1}. Summing over i = 1 … n yields the result.
4. Faster Preprocessing for M(S,Q)

This section gives a preprocessing algorithm for M(S,Q). The algorithm requires O(Kn(log n)²(log Υ(S ∪ Q))²) expected time when S is a random subset of S ∪ Q.
The problem in the basic construction of M(S,Q) is to maintain, for each member of Q, its γ-nearest neighbors in R_i. As we will see, it will be helpful to maintain, in addition, the γ-nearest neighbors of S in R_i: these will help speed up the computation of nearest neighbors when a site is added to R_i. (Of course, some members of S will be in R_i; for p ∈ R_i, as before, we consider its nearest neighbor to be the closest site in R_i\{p}.) Suppose Q and S are random subsets of Q ∪ S. Then R_i ⊆_R Q ∪ S, and so here the basic problem is to maintain the γ-nearest neighbors of a set with respect to a random subset of that set. Thus under this assumption, it is no loss of generality to solve this maintenance problem for a set S.

The general step of this construction is to update the γ-nearest neighbor relation for R_{i+1}, given that random p_{i+1} ∈ S\R_i has been added to random R_i ⊆ S, yielding R_{i+1}. To save i's, we refer in this section to random R ⊆ S and random p ∈ S\R. We assume that the metric space (V,d) has a sphere-packing bound.
The first subsection gives some probabilistic lemmas that are used in the analysis, and may help motivate the algorithm that follows. Next follows some geometric notation, and the basic relations that are maintained by the algorithm. The algorithm is given in two main parts, M.1 and M.2, with two auxiliary algorithms then described; finally, the analysis of the running time is completed.

Remember that n is the number of elements of S and Υ(S) is the ratio of the distance between the farthest pair of points of S to that between the closest pair. Hereafter, we refer to Υ(S) simply as Υ.

The algorithm will maintain a neighborhood relation ⟹^{γ′} between points in R. This relation will be used to find the points of S for which a point p is a γ-nearest neighbor. This information, in turn, will help update the ⟹^{γ′} relation.
4.2. Probabilistic Lemmas

In these lemmas, fix β > 1.

Definition (A_β). Let A_β denote N_{β,Υ}n/i.

Thus Lemma 10 can be restated as follows:

Lemma 13. The expected number of s ∈ S with p as a β-nearest neighbor in R is O(A_β).

Lemma 14. Let R ⊆_R S of size i and p ∈_R S. The expected number of configurations (p, q, p′) with q ∈ S, p′ ∈ R, and

    p ←^{<1} q →^β p′

is

    O(A_{2β}) = O(S_{4β} log Υ)n/i.

That is, when p is added, we can bound the expected number of β-nearest neighbors of points q ∈ S that have p as the new nearest neighbor.
Proof. The number of such configurations is the sum, over p′ ∈ R, of the expected number involving p′, or i times a bound that holds for any given p′. As in the proof of Theorem 9, the expected number of configurations is n times the expected number involving a random q ∈ S.

Thus we consider the expected number of configurations for some fixed p′, and for random q ∈ S and p ∈ R. We assume that q ∉ R; the argument when q ∈ R is similar.

Under these conditions, q and p are random elements of R′ ≡ R ∪ {q,p}, which we can view as chosen by first picking q from R′, and then picking p from R′\{q}. Let n₂(q) denote the second nearest neighbor of q in R′. The problem becomes: what is the probability that random q ∈ R′ has d(q,p′) ≤ βd(q,n₂(q)), and that the nearest neighbor of q is picked to be p? The latter probability is 1/(i+1); the former probability is 1/(i+2) times the number of points q′ ∈ R′ that have d(q′,p′) ≤ βd(q′,n₂(q′)).

Putting these considerations together, the expected number of configurations of the lemma statement is in/((i+2)(i+1)) times the size of

    {q′ ∈ R′ | d(q′,p′) ≤ βd(q′,n₂(q′))}.

The number of such q′ can be bounded by S_{4β} lg Υ, with a proof similar to that of Lemma 7: separate the sites of R into groups according to their distance from p′, and apply Theorem 5, so that at most S_{4β} sites in a group have a nearest neighbor in the group at a distance larger than β/2 times their distance to p′. Since V is a metric space, all but S_{4β} sites in the group have second nearest neighbors within a distance at most β times their distance to p′. The lemma follows.
4.3. Geometric Notation and Lemmas

Before stating the geometric starting point for the algorithm, some notation will be helpful:

Definition (d(c)). In the remainder of this section, abbreviate d(c,R) by d(c).

Definition (M_a). For a ∈ R, let M_a ≡ max{d(s,a) | s →^1 a, s ∈ S}.

Definition (a ⟹^β b). For a,b ∈ R, let a ⟹^β b denote the condition that d(a,b) ≤ βM_a.

All these definitions are with respect to R, which is left implicit. When considering a site p ∉ R, the sites s ∈ S with s →^{<1} p are those which have p as the nearest neighbor in R ∪ {p}.

For a,b ∈ R, the relation a ⟹^2 b is akin to a and b being Delaunay neighbors: if there is some s ∈ S with a ←^1 s →^1 b, then a and b are Delaunay neighbors and both a ⟹^2 b and b ⟹^2 a.

Note that if a ⟹^β b, then the site s realizing M_a has a ←^1 s →^{β+1} b, since, by the triangle inequality, d(s,b) ≤ d(s,a) + d(a,b) ≤ (1+β)d(s).
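The definitions above can be made concrete on a toy one-dimensional instance. The sketch below is an illustration only, not part of the paper's algorithms, and the point values are hypothetical:

```python
# Toy illustration of d(c), M_a, and a ==>^beta b, for points on the line
# with d(x, y) = |x - y|.  The point values are hypothetical.
S = [0.0, 1.0, 3.0, 4.0, 10.0]   # all sites
R = [0.0, 4.0, 10.0]             # the current subset R

def d(x, y):
    return abs(x - y)

def nearest_in_R(s):
    return min(R, key=lambda a: d(s, a))

# M_a = max{ d(s, a) : s in S, s -->^1 a }, the farthest site outside R
# having a as its nearest neighbor in R (0 if there is no such site).
M = {a: max((d(s, a) for s in S if s not in R and nearest_in_R(s) == a),
            default=0.0)
     for a in R}

def implies(a, b, beta):
    """a ==>^beta b  iff  d(a, b) <= beta * M_a."""
    return d(a, b) <= beta * M[a]
```

Here site 1.0 has 0.0 as nearest in R and site 3.0 has 4.0, so M = {0.0: 1.0, 4.0: 1.0, 10.0: 0.0}; since M_{10.0} = 0, the site 10.0 is related to nothing under ⟹^β.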
The following lemmas will be helpful.

Lemma 15. If x ←^β x′ →^δ y′, then d(x,y′) ≤ (β+δ)d(x′).

Proof. Using the triangle inequality,

    d(x,y′) ≤ d(x,x′) + d(x′,y′) ≤ βd(x′) + δd(x′) = (β+δ)d(x′).

Lemma 16. If x′ →^β y →^δ z for z ∈ R, then x′ →^{β+δ+βδ} z.

Proof. Let a ∈ R have a ←^1 x′. From the previous lemma,

    d(y) ≤ d(a,y) ≤ (1+β)d(x′),

and so

    d(x′,z) ≤ d(x′,y) + d(y,z) ≤ βd(x′) + δ(1+β)d(x′),

and the result follows.
Lemma 17. The number of a ∈ R with a ⟹^β p is no more than the number of s ∈ S with s →^{β+1} p, which is O(A_{β+1}) in expectation.

Proof. For each a ∈ R with a ⟹^β p, as noted there is some s with s →^{β+1} p. Since s →^1 a for only one a ∈ R, the lemma follows.
4.4. Algorithm Structure

Lemma 16 may help motivate the following definition.

Definition (γ′). Fix a parameter α with 1 ≤ α ≤ γ, and let γ′ ≡ 1 + α + γ + αγ.

The algorithm will maintain the nearest neighbor relation s →^1 f for each f ∈ R, and a subset of the relation a ⟹^{γ′} f for each a,f ∈ R. That is, a ⟹^{γ′} f will be known for all a,f ∈ R for which there are s,c ∈ S with
4.5. Heaps of Relations

In large part, the algorithm is simply the maintenance of data structures representing the → and ⟹ relations. These will comprise, for each f ∈ R, some sets of sites of S, each in a heap (priority queue) with the maximum key value on top. The heaps are:

    H_f^←: {s ∈ S | s →^1 f}, with key values d(f,s);
    H_f^⟹: {a ∈ R | f ⟹^{γ′} a}, with key values −d(a,f);
    H_f^⟸: {a ∈ R | a ⟹^{γ′} f}, with key values M_a.

The heap H_f^⟹ stores the a with minimum d(a,f) on top of the heap.

Heaps are useful here because a simple procedure allows entries with key values greater than a given value X to be found in constant time for each such entry. The procedure simply checks if the top of the heap has key value greater than X; if so, that value is reported, and recursively the children of the top are checked.

Note that the keys of entries in H_f^⟸ are the keys of the tops of heaps H_a^←, when a ⟹^{γ′} f. (A fine point here is that the key for a in H_f^⟸ is the value of M_a when a is inserted in H_f^⟸; that is, the value of M_a may be allowed to change without updating the key for a immediately. This issue is discussed in Section 4.8.)
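The constant-time-per-report traversal just described is easy to state concretely. Here is a minimal sketch (not from the paper) for a binary max-heap stored in an array, with the children of node i at indices 2i+1 and 2i+2:

```python
def entries_above(heap, X, i=0, out=None):
    """Report every entry of a binary max-heap (array layout) with key > X.
    The recursion descends into a subtree only when that subtree's root
    exceeds X, so the work is O(1) per reported entry."""
    if out is None:
        out = []
    if i < len(heap) and heap[i] > X:
        out.append(heap[i])
        entries_above(heap, X, 2 * i + 1, out)
        entries_above(heap, X, 2 * i + 2, out)
    return out
```

For keys stored negated, as the −d(a,f) keys of H_f^⟹, the same traversal reports the entries with d(a,f) below a threshold.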
4.6. Finding γ-Neighbors Using ⟹^{γ′}

Algorithm M.1 uses the heaps to find candidate sites: for each site a ∈ R that it examines, it retrieves from H_a^← the sites s with s →^1 a and

    d(s) ≥ d(a,p)/(1+γ),

and checks if d(s,p) ≤ γd(s), so that s →^γ p.
Lemma 18. Algorithm M.1 finds the sites with p as a γ-nearest neighbor in R.

Proof. Suppose that

    a ←^1 s →^γ p →^1 f        (1)

holds. The relation in the middle, s →^γ p, must be found using stored information. By Lemma 16, the above implies s →^{2γ+1} f, and so d(a,f) ≤ (2+2γ)d(s) using Lemma 15. Hence d(a,f) ≤ γ′d(s) ≤ γ′M_a, and so a ⟹^{γ′} f. Thus, a ⟹^{γ′} f is necessary for s →^γ p. Finally, a ←^1 s →^γ p implies that d(a,p) ≤ (1+γ)d(s), using Lemma 15.
Lemma 19. The expected work by Algorithm M.1 is O(A_{γ′+1}).

Proof. The number of sites a ∈ R inspected by the algorithm is bounded using Lemmas 17 and 13. The s examined all have

    d(s,p) ≤ d(s,a) + d(a,p) ≤ (2+γ)d(s),

and so all s examined have s →^{γ+2} p. The result follows.
4.7. Finding ⟹^{γ′} Relations of p

With knowledge of the sites for which p is a γ-nearest neighbor comes the knowledge of those sites for which p is a nearest neighbor. (Also the value of M_p is found.) These sites must have their nearest neighbor sites changed to p; however, it will be essential to retain knowledge of their former nearest neighbors when finding the sites a ∈ R for which p ⟹^{γ′} a or a ⟹^{γ′} p after p is added. (Note that the exposition below retains the → and ⟹ relations with respect to R, before p is added.)

Since the relation g ⟹^{γ′} a depends on M_g, and M_g may change when p is added, the heaps for g must be updated for such changes. This issue is discussed in Section 4.8 below.
Algorithm M.2 (Find All a with p ⟹^{γ′} a or a ⟹^{γ′} p). For each g ∈ R and c ∈ S with g ←^1 c →^{<1} p, use H_g^⟹ to find all a with

    g ⟹^{γ′} a  and  d(a,g) ≤ γ′d(c,g),

and check if d(c,a) ≤ (γ′−1)d(c,p). If so, record p ⟹^{γ′} a. (This may not record all a for which p ⟹^{γ′} a, but it does record all those needed for the correctness condition; heuristically, the fewer a recorded, the better.)
Make the set G of sites g ∈ R with g ←^1 c for some c ∈ S with c →^α p. Using the algorithm of Section 4.9, build a set Ĝ ⊆ G, of size independent of n, with the property that for all c with c →^α p there is ĝ ∈ Ĝ with c →^α ĝ. Now for each ĝ ∈ Ĝ, use H_ĝ^⟸ to find all a with

    a ⟹^{γ′} ĝ  and  M_a ≥ d(p,ĝ)/2α(1+γ),

and check if a ⟹^{γ′} p.

(The test for a ⟹^{γ′} p could use the stringent condition that there is some s with a ←^1 s →^{γ′−1} p, and still be correct; moreover, all s →^1 a with d(a,s) ≥ d(a,p)/γ′ could be tested for s →^{γ′−1} p within the desired time bounds. This might give a heuristic improvement.)

This algorithm updates the subset of ⟹^{γ′} that is promised to be maintained.
Lemma 20 (Correctness of Algorithm M.2). The above algorithm finds all a such that there are s and c with

    a ←^α s ←^γ c →^{<1} p   or   a ←^1 s →^γ c →^α p.

Proof. Case I: a ←^α s ←^γ c →^{<1} p. Here the site g ∈ R with c →^1 g has g ⟹^{γ′} a known, by induction. Moreover, d(c,a) ≤ (γ′−1)d(c,g) by Lemma 16, and so d(g,a) ≤ γ′d(c,g) by Lemma 15. Hence the conditions under which a is examined for p ⟹^{γ′} a are necessary for that condition to hold. Also, if d(c,a) ≤ (γ′−1)d(c,p), then p ⟹^{γ′} a using Lemma 15 and M_p ≥ d(c,p).

Case II: a ←^1 s →^γ c →^α p. By induction, any g ∈ R with c →^α g has a ⟹^{γ′} g recorded. By construction, there is some ĝ ∈ Ĝ examined with c →^α ĝ; also, using Lemma 15,

    d(c) ≤ d(c,a) ≤ (1+γ)d(s),

and using the triangle inequality,

    d(ĝ,p) ≤ 2αd(c) ≤ 2α(1+γ)d(s) ≤ 2α(1+γ)M_a.

Hence the conditions under which a is examined for a ⟹^{γ′} p are necessary.
Lemma 21. Algorithm M.2 requires O(A_{2(γ′+1)}) expected time to find all a ∈ R with p ⟹^{γ′} a, and O(A_{γ″}|Ĝ|) to find all a ∈ R with a ⟹^{γ′} p, where γ″ ≡ 2 + γ + 3α(1+γ).

Proof. The sites a examined in the first part of the algorithm have g ⟹^{γ′} a for some c and g, and have d(a,g) ≤ γ′d(c). By the triangle inequality, d(c,a) ≤ (γ′+1)d(c), and so the first claim follows from Lemma 14.

Since a ⟹^{γ′} ĝ, d(a,ĝ) ≤ γ′M_a, and since d(p,ĝ) ≤ 2α(1+γ)M_a, it follows that, for s realizing M_a,

    d(s,p) ≤ d(s,a) + d(a,ĝ) + d(ĝ,p) ≤ (1 + γ′ + 2α(1+γ))d(s,a),

and the second claim follows from simplifying this expression, and applying Lemmas 17 and 13.
4.8. Heap Maintenance

As discussed in Section 4.5, the relations a ←^1 s and a ⟹^{γ′} b are represented using heaps for a ∈ R. When p is added, the heaps H_p^* are created and some heaps H_a^* are updated to reflect relations involving p. First, a bound on the cost of this operation.

Lemma 22. The expected cost of adding relations s →^1 p, a ⟹^{γ′} p, or p ⟹^{γ′} a is A_{2(γ′+1)}O(log n).

The relations ⟹^{γ′} continue to hold when p is added, except possibly for g ∈ R with g ←^1 c →^{<1} p for some c ∈ S. We next look at what heap maintenance must be done to reflect this.
The entry for c in H_g^← can be deleted in O(log n) time.

The heaps H_g^⟹ and H_a^⟸ have entries a with g ⟹^{γ′} a, and this relation may become false when p is added. When H_g^⟹ is used in Algorithm M.2, the a examined are close to g, and the analysis of Lemma 21 holds even when g ⟹^{γ′} a is false. Hence, no updating of H_g^⟹ need be done.

Finally, a change in H_a^⟸ in Algorithm M.2 must be considered. The approach we take is to do nothing to update H_a^⟸, but instead to verify relations when they are used. That is, we verify the relation a ⟹^{γ′} f claimed by H_f^⟸ in Algorithm M.1, and the relation a ⟹^{γ′} ĝ claimed by H_ĝ^⟸ in Algorithm M.2. If the key for a in H_f^⟸ is still M_a, then no change in H_f^⟸ is needed. If M_a is currently smaller than that key value, but still M_a ≥ d(a,ĝ)/γ′, then the relation still holds, but the key value must be updated and its heap position changed. If M_a < d(a,ĝ)/γ′, then a is deleted from H_ĝ^⟸. Similar maintenance can be done for H_f^⟸ in Algorithm M.1.
Lemma 23. Verifying and updating the relations in heaps H_f^⟸ and H_ĝ^⟸ needs O(A_{γ″}|Ĝ| log n) expected time. Here γ″ is defined as in Lemma 21.

This ignores the work in deleting entries a for which a ⟹^{γ′} ĝ is no longer true. However, this work is done once for each such relation, and therefore can be charged to the insertion time for a in H_ĝ^⟸.

Proof. This is simply the O(log n) work for each a examined via some H_ĝ^⟸. The time to verify H_f^⟸ is dominated by this.
4.9. Finding Ĝ

Next we see that if a sphere-packing bound holds, as discussed in Section 3.4, then the set Ĝ exists, and can be found efficiently. The algorithm is described within the proof of the following lemma.

Lemma 24. Suppose (V,d) has a sphere-packing bound. For p ∈ S and R ⊆ S, let S′ denote the set of c ∈ S with c →^α p with respect to R, where α > 4/3. Let

    d_l ≡ min{d(c) | c ∈ S′},
    d_h ≡ max{d(c) | c ∈ S′}.

Divide the interval (d_l, d_h) into ranges

    (α̂^k d_l, α̂^{k+1} d_l)   for k = 0 … ⌈log_{3α/4} υ(p,S′)⌉,

where α̂ ≡ 3α/4. For each such k, consider the set S_k of c ∈ S′ such that d(c) ∈ (α̂^k d_l, α̂^{k+1} d_l). Construct a packing S′_k ⊆ S_k using Theorem 5, such that d(c,S′_k) ≤ α̂^{k+1}d_l/3 for all c ∈ S_k. Since d(c,p) ≤ αd(c) ≤ αα̂^{k+1}d_l, the size of S′_k need be at most S_{3α}.
Now for each c ∈ S_k, there is some c′ ∈ S′_k with

    d(c,c′) ≤ α̂^{k+1}d_l/3 ≤ αd(c)/4,

and so if c′ →^1 g′, then

    d(c,g′) ≤ d(c,c′) + d(c′,g′) ≤ αd(c)/4 + α̂^{k+1}d_l ≤ αd(c)/4 + 3αd(c)/4 = αd(c).

Then

    Ĝ ≡ ⋃_k {g′ ∈ R | c′ →^1 g′, c′ ∈ S′_k}

satisfies the conditions of the lemma.

Lemma 25. The algorithm for finding Ĝ needs

    O(|G||Ĝ|) = O(A_α|Ĝ|)

expected time.

Proof. This is clear from the algorithm, and the bound on |G| is from the definition of A_α.
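The construction of Lemma 24 can be sketched directly. The sketch below is an illustration, not the paper's code: it takes α̂ = 3α/4, uses a greedy scan as a stand-in for the packing of Theorem 5, and ends by checking the promised property that every c with c →^α p has some ĝ ∈ Ĝ with c →^α ĝ:

```python
import math
import random

def dist(a, b):
    return math.dist(a, b)

def find_Ghat(p, R, S, alpha):
    """Sketch of Lemma 24: build Ghat, a subset of R whose size does not
    grow with n, so that every c in S with c ->^alpha p (with respect to R)
    has some ghat in Ghat with c ->^alpha ghat.  Requires alpha > 4/3."""
    dR = {c: min(dist(c, r) for r in R) for c in S if c not in R}
    Sp = [c for c in dR if dist(c, p) <= alpha * dR[c]]   # S': c ->^alpha p
    if not Sp:
        return set()
    ahat = 3 * alpha / 4                  # ratio of the distance ranges
    dl = min(dR[c] for c in Sp)
    buckets = {}
    for c in Sp:                          # S_k: d(c) in (ahat^k dl, ahat^(k+1) dl]
        k = 0 if dR[c] <= dl else int(math.log(dR[c] / dl) / math.log(ahat))
        buckets.setdefault(k, []).append(c)
    Ghat = set()
    for k, Sk in buckets.items():
        rho = ahat ** (k + 1) * dl / 3    # packing radius for this range
        reps = []                         # greedy stand-in for Theorem 5
        for c in Sk:
            if all(dist(c, c2) > rho for c2 in reps):
                reps.append(c)
        for c2 in reps:                   # g' is the nearest neighbor in R
            Ghat.add(min(R, key=lambda g: dist(c2, g)))
    return Ghat
```

On random low-dimensional inputs, |Ĝ| stays small while the number of c with c →^α p grows with n, which is the point of the lemma.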
4.10. Concluding Analysis

Before considering the time complexity, we should note correctness.

Theorem 26. Algorithms M.1 and M.2 find the points for which a site p is γ-nearest in R.

Proof. This follows from Lemmas 18 and 20.

Theorem 27. The all-nearest-neighbors problem can be solved for a metric space with a sphere-packing bound in

    O(n(log n)²(log Υ)²)

expected time, for parameters α and γ satisfying 4/3 < α ≤ γ.
Proof. The bounds given by Lemmas 19, 21–23, and 25 are dominated by O(A_{max{γ″,2(γ′+1)}}|Ĝ| log n). Using the definition of A_β, we obtain an expected bound on the order of

    N_{max{γ″,2(γ′+1)},Υ}|Ĝ| log n

for the cost of adding p to R of size i. The size of Ĝ is O(S_{3α} log_{3α/4} Υ) from Lemma 24. Using this bound, the hypothesis that α > 4/3 is fixed, and summing over i up to n gives a total expected cost

    O(n log² n)N_{max{γ″,2(γ′+1)},Υ}S_{3α} log Υ.

Finally, from Lemma 7, the total expected cost is

    O(n log² n)(S_{max{γ″,2(γ′+1)}}S_{3α} log² Υ),

or, neglecting constants, O(n(log n)²(log Υ)²).
Theorem 28. If Q and S are random subsets of Q ∪ S, and V has a sphere-packing bound, then M(S,Q) can be built in

    O(Kn(log n)²(log Υ(S ∪ Q))²)

expected time.

Proof. Apply the algorithm of this section to S ∪ Q_{Kn}, adding the elements of S first. The resulting γ-nearest neighbor information is exactly what is needed for the data structure.
5. Data Structure D(S)

This section describes another approach to the post office problem, in this same setting, but with some differences: the algorithm always returns the correct answer, and does not require Q, but requires somewhat more space and query time. The algorithm can be analyzed using the exchangeability assumption, where a query point q is a random element of {q} ∪ S.

The algorithm uses a divide-and-conquer scheme apparently somewhat like Brin's [B].

In this section the values N_{γ,Υ(R)} are considered for various subsets R of S. We use the uniform bound N_{γ,Υ(S)}, abbreviated just as N_γ. Recall Lemma 7, that if (V,d) obeys a sphere-packing bound, then N_{γ,Υ(S)} ≤ S_{2γ} lg Υ(S).
Preprocessing. Given a set of n sites S ⊆ V, the data structure D(S) is built as follows: take a random subset R of S. Recursively build the data structure D(R) for R, and then find, for each p ∈ S\R, its nearest neighbor in R. Let S_a^1 denote the sites in S\R that have a ∈ R as the nearest neighbor. For each a ∈ R, build a list L_a of sites of R, in nondecreasing order of their distance to a. For each c ∈ S_a^1, find the 3-nearest neighbors of c in R as follows: walk down the list L_a, checking for each site b ∈ R if b is 3-nearest to c in R. Stop this traversal of L_a when d(a,b) > 4d(c,a). For a ∈ R, let S_a^3 denote the set of sites in S with a as a 3-nearest neighbor. Recursively build D(S_a^3).

This completes the preprocessing. The data structure D(S) comprises D(R), the lists L_a for a ∈ R, and the recursively built data structures D(S_a^3) for a ∈ R.

For convenience of analysis, the construction uses random subsets of size r_k ≡ n^{1/2^{k+1}} at recursion depth k, and terminates the recursion at k = l_n ≡ ⌈lg lg n⌉. (The construction of D(S) for the original input is at depth 0; when constructing D(S) at depth k, the construction of the data structures D(S_a^3) and D(R) are at depth k+1.) The recursion bottoms out by simply storing the sites in a list, at recursion depth l_n, or when the set has fewer than some constant number of members.
Answering Queries. To find the nearest site to a point q, use the data structure D(R) to find the nearest neighbor a of q in R. Next find the 3-nearest neighbors of q in R as in the preprocessing. For each b that is a 3-nearest neighbor to q in R, recursively search D(S_b^3) for the closest site in S_b^3 to q. Return the closest site found by all such searches. The recursion bottoms out by searching the list of sites in linear time.
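The preprocessing and query just described fit in a few lines. The sketch below is illustrative Python, not the paper's implementation: it brute-forces the 3-nearest-neighbor scans instead of walking the sorted lists L_a, and samples √|T| sites at every level rather than following the fixed schedule r_k; neither change affects correctness, only the time analysis.

```python
import math
import random

CUTOFF = 8   # bottom out on small sets

def dist(a, b):
    return math.dist(a, b)

def build(S):
    """Build D(S): a random sample R, a recursive D(R), and for each a in R
    a recursive structure over S_a^3, the sites with a 3-nearest in R."""
    S = list(S)
    if len(S) <= CUTOFF:
        return ('leaf', S)
    R = random.sample(S, max(2, math.isqrt(len(S))))
    S3 = {a: [] for a in R}
    for c in S:
        if c in R:
            continue
        dmin = min(dist(c, r) for r in R)
        for a in R:
            if dist(c, a) <= 3 * dmin:        # a is 3-nearest to c in R
                S3[a].append(c)
    return ('node', R, build(R), {a: build(S3[a]) for a in R})

def query(node, q):
    """Return the site nearest to q among those stored under node."""
    if node[0] == 'leaf':
        return min(node[1], key=lambda s: dist(q, s), default=None)
    _, R, DR, children = node
    a = query(DR, q)                          # nearest neighbor of q in R
    best = a
    for b in R:
        if dist(q, b) <= 3 * dist(q, a):      # b is 3-nearest to q in R
            c = query(children[b], q)
            if c is not None and dist(q, c) < dist(q, best):
                best = c
    return best
```

Correctness follows the argument behind Theorem 29: if p is the site nearest to q and b is p's nearest neighbor in R, then d(p,b) ≤ d(p,a) ≤ d(p,q) + d(q,a) ≤ 2d(q,a), so d(q,b) ≤ 3d(q,a); hence b is examined and p ∈ S_b^1 ⊆ S_b^3 is found recursively.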
There are a few correctness conditions to verify.

Theorem 29. The query and preprocessing algorithms find all 3-nearest neighbors of q and of sites in S\R. The query procedure returns the closest site in S to q.
Rather than store L_a as a sorted list, as described above, it is enough to store the entries in a heap. With appropriate heap ordering relations, all members of L_a at a distance closer than a value X can be found in constant time per reported member, using a recursive traversal of the heap, just as for sorted lists. Unlike sorted lists, the heaps for sites a ∈ R can be built in constant time per entry; this might be a bit faster in practice, and simplifies the analysis.
Lemma 30. The expected size of a set of sites considered at recursion depth k is

    O(N₃)^k n^{1/2^k}.

Proof. Suppose inductively that the expected size of the set W in the parent recursion, at depth k−1, is m = O(N₃)^{k−1} n^{1/2^{k−1}}, as is plainly true for k = 1. One branch of the recursion considers R ⊆ W of size n^{1/2^k}, which satisfies the claim. Other branches of the recursion apply to sets W_a^3, which have expected sizes O(N₃)m/r_{k−1} using Lemma 10. (Note that the expectation for W_a^3 is with respect to the random choice of R, while the expectation for |W| is from random choices at lesser recursion depth; hence the expectations can be "mixed" in this fashion.) Since r_{k−1} ≡ n^{1/2^k}, the result follows.
Theorem 31. Suppose metric space (V,d) has a γ-nearest neighbor bound, and S ⊆ V is a set of n sites. Suppose q is a random element of {q} ∪ S. Then the expected time to find the closest site in S to q using D(S) is

    N₅(lg n)^{O(1)+2 lg N₃}.

Proof. Consider the query procedure for a set T at depth k in the recursion. Let Z(k) denote the expected time for the query procedure applied to a set at depth k, so that the expected time overall is Z(0). Let R be the random subset of T used at that step. The set T is a random subset of S, subject to the 3-nearest neighbor conditions used in passing down the recursion. Since q is a random element of {q} ∪ S, and subject to the same 3-nearest neighbor conditions (or else T would not be considered for q), q is a random element of {q} ∪ R. Thus the expected time to search D(R) is Z(k+1). Using the triangle inequality, each site b examined in finding 3-nearest neighbors of q satisfies

    d(b,q) ≤ d(b,a) + d(a,q) ≤ 5d(a,q).
From Lemma 11, the expected work in examining such b is O(N₅). The expected work done recursively can be expressed as the sum, over b ∈ T, of Z(k+1), times the probability that b is in R and is a 3-nearest neighbor of q. The correctness of this claim requires the observation that q is a random element of {q} ∪ T_b^3. The probability that b is a 3-nearest neighbor of q in R is O(N₃)/|T|, using Lemma 11. The expected query time therefore satisfies

    Z(k) ≤ O(N₅) + O(N₃ + 1)Z(k+1)

for k smaller than the recursion depth l_n and |T| large. From Lemma 30, the expected size of the sets T at the bottom level is O(N₃)^{l_n}, and this bounds the expected cost at the bottom level of the recursion. The overall expected cost is

    O(N₅)O(N₃ + 1)^{l_n}O(N₃)^{l_n},

which implies the expected time bound claimed.
Theorem 32. Suppose metric space (V,d) has a γ-nearest neighbor bound, and S ⊆ V is a set of n sites. Then the expected time to build D(S) is

    O(n)N₅(lg n)^{O(1)+2 lg N₃}.

Proof. The general step of the algorithm builds a data structure for a set T at recursion depth k. Let Y(k) denote the expected time needed to build the data structure for such a set T. The expected cost of building D(R) is Y(k+1); the expected cost of finding the nearest neighbors in R of each site can be bounded at Z(k+1) per site; the expected cost of finding 3-nearest neighbors in R of each site is O(r_k²) + O(N₅)|T|, since each b examined in a list L_a for site c ∈ T_a^1 is a 5-nearest neighbor to c; the number of such neighbors is bounded using Lemma 11, averaging over all c ∈ T\R. Construction of the heaps L_a, for a ∈ R, requires O(r_k²) = O(n^{1/2^k}) altogether. Finally, the work in building the data structures D(T_b^3) requires time proportional to the sum, over b ∈ T, of the probability r_k/|T| that b ∈ R ⊆ T, times the expected work Y(k+1) for T_b^3. Thus the work Y(k) satisfies

    Y(k) ≤ Y(k+1) + Z(k+1)|T| + O(N₅)|T| + O(n^{1/2^k}) + r_k Y(k+1),

or, using the bound E|T| ≤ O(N₃)^k n^{1/2^k} of Lemma 30,

    Y(k) ≤ (Z(k+1) + O(N₅))O(N₃)^k n^{1/2^k} + O(n^{1/2^k}) + (n^{1/2^{k+1}} + 1)Y(k+1),

with Y(l_n) ≤ O(N₃)^{l_n}, giving the result claimed.
Theorem 33. Suppose metric space (V,d) has a γ-nearest neighbor bound, and S ⊆ V is a set of n sites. Then the expected space for D(S) is

    O(n)(lg n)^{O(1)+lg N₃}.

Proof. The total space required for all lists L_a at each depth k is O(n); the total expected space required for sets at the bottom of the recursion is nO(N₃)^{l_n}, with l_n ≡ ⌈lg lg n⌉, and this gives the bound.

5.1. Remarks
The algorithm recursively builds data structures for the sets S_b^3, the sites having b as a 3-nearest neighbor in R, rather than for S_b^1, those having b as the nearest neighbor. The query procedure would still be correct if the latter procedure were followed, and the space required would be O(n), since the sites would be partitioned at each step. So far, it has not been clear how to analyze such an attractive algorithm.

It is a bit artificial to have a fixed schedule of sample sizes, r_k ≡ n^{1/2^{k+1}}; a more natural sample size would be simply the square root of the set size |T|. This seems to be more difficult to analyze, and still seems artificial: why not use a constant sample size, or n/2? However, a smaller sample size would have an even bigger blowup in cost, due to the imperfect nature of the divide-and-conquer scheme used. A larger sample size runs into the apparent need for Ω(r²) space, to hold the lists L_a that are used to find 3-nearest neighbors. While the choice of r is not tuned to be best possible, it seems close to the only alternative with the ideas available.

In ℝ^k with an L_p norm, the expected query time takes the form

    (lg n)^{O(k)+lg lg Υ}.
5.2. γ-Queries and Inverse Queries

The data structure can be extended in two ways that are sometimes useful. The first extension is to allow finding all γ-nearest neighbors of a query point, not just the nearest neighbors. The other extension is to allow inverse queries, for a point p and subset R ⊆ S, that return the sites z ∈ S\R for which p is γ-nearest to z in R.

Here we only sketch the straightforward extension to build a data structure D_γ(S) that allows γ-nearest neighbor queries; the change is basically to replace finding 3-neighbors by finding (1+2γ)-neighbors, to maintain S_a^{1+2γ} instead of S_a^3, to search lists only when d(a,b) ≤ (2+2γ)d(b,q) instead of with a multiplier of 4, and to replace N₅ in the analysis by N_{3+2γ}, and N₃ by N_{1+2γ}.
Theorem 34. Suppose metric space (V,d) has a γ-nearest neighbor bound, and S ⊆ V is a set of n sites. The expected time to build D_γ(S) is

    O(n)A(lg n)^{O(1)+2 lg B}

as n → ∞, and the expected query time is

    A(lg n)^{O(1)+2 lg B}.

Here

    A = N_{3+2γ,Υ(S)}   and   B = N_{1+2γ,Υ(S)},

which are O(log Υ(S)) as a function of S when (V,d) has a sphere-packing bound.
Given random R ⊆ S, it is possible to build a data structure for inverse neighbor queries: given point p, quickly find all q ∈ S\R such that p is a γ-nearest neighbor to q in {p} ∪ R.

One approach is the following: build the data structure D_{1+γ}(R), and use it to find the nearest neighbor, and all (1+γ)-nearest neighbors, in R of all q ∈ S\R. For each c ∈ R, find the set S_c^{1+γ} of q that have c as a (1+γ)-nearest neighbor in R, and the analogous set S_c^1. Store these sets in heaps, so that all q in them with d(q,c) greater than some value can be found in constant time per q.

With this preprocessing, for given p, find its (1+γ)-nearest neighbors in R, including its nearest neighbor a. Examine all q ∈ S_a^{1+γ} that have d(a,q) ≥ d(a,p)/(1+γ), and check if they have p as γ-nearest. For each b ∈ R that is (1+γ)-nearest to p in R, examine the set S_b^1 of q that have b as nearest in R, and check all such q that have d(b,q) ≥ d(a,p) and d(b,q) ≥ d(b,p)/(1+γ), to see if p is a (1+γ)-nearest neighbor.

This completes the query procedure.
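As a concrete check, the sketch below (illustrative, not the paper's code) implements this preprocessing and query with brute-force scans in place of the D_{1+γ}(R) structure and the per-set heaps, and omits the distance-threshold pruning: the thresholds only prune the scans, and the case analysis of Theorem 35 shows the unpruned scans already cover every answer.

```python
import math
import random

def dist(a, b):
    return math.dist(a, b)

def beta_nearest(x, R, beta):
    """All r in R with d(x, r) <= beta * d(x, R)."""
    dmin = min(dist(x, r) for r in R)
    return [r for r in R if dist(x, r) <= beta * dmin]

def preprocess(S, R, gamma):
    """For each c in R, record S_c^1 (sites in S\\R with c nearest in R)
    and S_c^{1+gamma} (sites with c as a (1+gamma)-nearest neighbor)."""
    S1 = {c: [] for c in R}
    S1g = {c: [] for c in R}
    for q in S:
        if q in R:
            continue
        S1[min(R, key=lambda r: dist(q, r))].append(q)
        for c in beta_nearest(q, R, 1 + gamma):
            S1g[c].append(q)
    return S1, S1g

def inverse_query(p, R, S1, S1g, gamma):
    """All q in S\\R having p as a gamma-nearest neighbor in {p} u R,
    i.e. with d(q, p) <= gamma * d(q, R)."""
    a = min(R, key=lambda r: dist(p, r))      # nearest neighbor of p in R
    candidates = set(S1g[a])                  # q with a (1+gamma)-near to q
    for b in beta_nearest(p, R, 1 + gamma):   # (1+gamma)-neighbors of p
        candidates.update(S1[b])
    return {q for q in candidates
            if dist(q, p) <= gamma * min(dist(q, r) for r in R)}
```

The two scans mirror the case split of Theorem 35: if d(a,p) ≤ d(b,q) then a is a (1+γ)-nearest neighbor of q, so q appears in S_a^{1+γ}; otherwise b is a (1+γ)-nearest neighbor of p, so q appears in S_b^1 for some examined b.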
Theorem 35. The inverse neighbor procedure finds all q with p γ-nearest in {p} ∪ R.

Proof. In the procedure, suppose a is nearest to p, b is nearest to q, and p is a γ-nearest neighbor of q. If d(a,p) ≥ d(b,q), then p has b as (1+γ)-nearest, using the triangle inequality, while if d(a,p) ≤ d(b,q), then q has a as (1+γ)-nearest. Moreover, if d(p,q) ≤ γd(b,q), then d(p,b) ≤ d(p,q) + d(q,b) ≤ (1+γ)d(b,q), and similar reasoning applies to q ∈ S_a^{1+γ} with d(p,q) ≤ γd(a,q).
Theorem 36. Suppose metric space (V,d) has a γ-nearest neighbor bound, and S ⊆ V is a set of n sites. Then the preprocessing time for the above data structure for inverse γ-nearest neighbor queries is the same as for (1+γ)-neighbor queries, and if p is random in {p} ∪ S, then the expected query time is the time for a (1+γ)-neighbor query, plus O(N_{1+γ,Υ(S)}) + O(N_{2+γ,Υ(S)})|S|/|R|.

Proof. Every b examined is a (1+γ)-nearest neighbor of p, and the expected number of such b is bounded by Lemma 11. Every q examined has p as a (2+γ)-nearest neighbor, and is examined at most twice, once as a member of S_b^1 and once as a member of S_a^{1+γ}. The result follows from Lemma 10.
Conclusions

This paper has shown that nearest neighbor problems can be solved efficiently, even when no information about the distance measure is given explicitly. Versions of some of the algorithms given here have been implemented, with promising results that will be reported elsewhere.

It is worth noting that the failure probability analysis and query time analysis of M(S,Q) do not depend on the triangle inequality, but only on the sphere-packing properties; there could be a (V,d) that is not even a metric space, but for which these results hold, or for which the algorithms are effective in practice.

While it is satisfying that the preprocessing for M(S,Q) is nearly linear, the time bound is higher than one would like, and, worse, the space for the construction degrades to Ω(n²) in high dimension. Is there a different construction without this flaw?
Acknowledgments

I would like to thank the anonymous referees for their thoughtful comments.

References

[B] S. Brin. Near neighbor search in large metric spaces. In Proc. 21st Internat. Conf. on Very Large Data Bases, pages 574–584, 1995.
[C1] K. L. Clarkson. Fast algorithms for the all nearest neighbors problem. In Proc. 24th Ann. IEEE Symp. on Foundations of Computer Science, pages 226–232, 1983.
[C2] K. L. Clarkson. A randomized algorithm for closest-point queries. SIAM J. Comput., 17:830–847, 1988.
[C3] K. L. Clarkson. Randomized geometric algorithms. In D.-Z. Du and F. K. Hwang, editors, Computing in Euclidean Geometry, pages 117–162. Lecture Notes Series on Computing, volume 1. World Scientific, Singapore, 1992.
[FC] C. Faloutsos and V. Gaede. Analysis of n-dimensional quadtrees using the Hausdorff fractal dimension. In Proc. 22nd Internat. Conf. on Very Large Data Bases, 1996.
[FS] C. D. Feustel and L. G. Shapiro. The nearest neighbor problem in an abstract metric space. Pattern Recognition Lett., 1:125–128, December 1982.
[MR] R. Motwani and P. Raghavan. Deferred data structuring: query-driven preprocessing for geometric search problems. In Proc. 2nd Ann. ACM Symp. on Computational Geometry, pages 303–312, 1986.
[M] K. Mulmuley. Computational Geometry: An Introduction Through Randomized Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1993.
[U] J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Inform. Process. Lett., 40:175–179, 1991.
[V] P. M. Vaidya. An O(n log n) algorithm for the all-nearest-neighbors problem. Discrete Comput. Geom., 4:101–115, 1989.
[Y] P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proc. 4th ACM–SIAM Symp. on Discrete Algorithms, pages 311–321, 1993.