Nearest Neighbor Search
Definition ::
Given a set of “objects” and their pairwise distances, find objects that are close to each other.
Various nearest neighbor problems ::
- Find the closest pair of points among $n$ given points
- Find all pairs of points that have distance smaller than some threshold - all-pairs similarity search for example - or that have distance in a certain range - range search
- Find the k nearest neighbors of a particular point
- Find the k nearest neighbors of all data points
	- for example, to build the k-nearest-neighbor graph on a set of objects
Applications ::
- Near-duplicate detection - find entries in one or several databases that might be duplicates –> because they are rather close to each other
- A simplistic recommender system : simply recommend the item closest to the one the user just bought
- simple prediction in machine learning systems
One important application within machine learning :
k nearest neighbor classifier ::
- Given pairs $(X_i, Y_i)$ of labeled objects, where the $X_i$ are the objects - say people, molecules, images etc - and the $Y_i$ are the corresponding “class labels” - say male, female, cow, chicken etc. This data is called training data.
- We want to use this data to construct a rule that can predict the class labels for new, previously unseen objects - test data
- k-nearest neighbor classifier is a simple approach to this problem
- Assume we can measure distances between all objects
- Given any new object X, find the k closest training objects
- Use the majority vote among their labels as the label Y for the new data point
![[Pasted image 20230127102846.png]] In the diagram we are given a new point that we ought to label. Since it lies within a field of already-labeled data points, we have to decide how to label it –> we search in a given radius around it and assign the label held by the majority of the training points found there.
Abstract observations for nearest neighbor problems ::
Different flavors of neighbor problems :
- All nearest neighbors vs. a single nearest neighbor
- Few queries vs many queries
	- If the set is fixed and we have to answer many nearest neighbor queries, then we can afford expensive pre-processing
	- If we just want to evaluate the nearest neighbor relationships once - for example, to build a kNN graph - then we don’t want to use expensive preprocessing
- Exact vs. approximate nearest neighbors
- Which metric are we using? –> Euclidean space?
Closest pair of Points ::
Problem description :
![[Pasted image 20230127103406.png]] Given $n$ points in the plane $\mathbb{R}^2$. What is the closest pair of points?
This problem is one of the primitives in Computational Geometry - that is, a basic subproblem that occurs over and over again when solving more complex problems –> checking ranges of nearest points, detecting overlapping regions of points etc.
Bruteforce Solution ::
Compute the distance between all pairs of points and take the minimum –> problem: we have to compute $\binom{n}{2}$ distances, giving an $O(n^2)$ worst-case runtime, not good.
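The brute-force baseline can be sketched in a few lines of Python (function name is illustrative):

```python
import math
from itertools import combinations

def closest_pair_bruteforce(points):
    """O(n^2): compare every pair of points and keep the closest one."""
    return min(combinations(points, 2), key=lambda pq: math.dist(*pq))

pts = [(0, 0), (3, 4), (1, 1), (7, 7)]
print(closest_pair_bruteforce(pts))  # ((0, 0), (1, 1))
```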
Closest pair of points on a line ::
To simplify [[#Closest pair of Points ::]] we can first consider the one-dimensional case. If our points were all on one line, we could simply sort the points in $O(n \log n)$ time and then compute the distances between the neighbors on the line in $O(n)$. This would take $O(n \log n)$ overall. Yet in higher-dimensional spaces this won’t work anymore –> there is no single ordering of the points in a plane.
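The one-dimensional case described above can be sketched as (function name is illustrative):

```python
def closest_pair_on_line(xs):
    """Sort in O(n log n), then scan adjacent neighbors in O(n)."""
    xs = sorted(xs)
    # After sorting, the closest pair must be adjacent in the order.
    return min(zip(xs, xs[1:]), key=lambda pq: pq[1] - pq[0])

print(closest_pair_on_line([9, 1, 5, 4, 11]))  # (4, 5)
```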
Divide and conquer approach ::
Idea ::
- Divide the space into left and right such that there are the same number of points on both sides.
- Then recursively solve the problem on the left side and the right side.
- Either the closest pair of points is on one of the sides. Then we are done and can return and merge.
- Or the closest pair is “across” the boundary. Then we have to do something clever when recombining the solutions
A possibility would be to span a tube around the dividing line whose width is the shortest distance found so far. Within that tube we might find an even better pair, with one point on each side. However, naively comparing all points in the tube is a problem: many points can be stacked in the tube on one side, and we would compare all of them even though same-side pairs cannot beat what the recursion already found - only the cross-boundary pairs are the ones we are actually looking for.
Storing the points at first ::
Let us first discuss how we “store” the points.
- We first sort all points according to their x-coordinates
- We store the list of x-coordinates in this order
- We do the same thing for the y-coordinates
- When separating the points into “left” and “right” we simply generate the four new lists
So separating the whole structure is rather easy. One separation step can easily be done in $O(n)$ time if we just copy the arrays - note that finding the points left and right is easy in constant time, as we split right in the middle, but we need to copy the arrays anyway to construct the new lists.
Solving the problem of smallest pairs that are divided into left/right side
- Assume that the closest pairs on the left and right sides have distances at least $\delta$
- Then overall the closest pair can only be “across” the boundary if there exists a point within a $\delta$-strip to the left and the right of the boundary
Now we put all points that are in the $\delta$-region into a new array $A_\delta$ that is sorted by y-values. ![[Pasted image 20230127111908.png]]
The problem persists that many points stacked along the strip lie within the $\delta$-range and would have to be inspected, although we know same-side pairs cannot be closer than $\delta$ –> if they were, we would have found them in the recursion already and would not need to inspect them again.
We could construct a circle of radius $\delta$ around each of the points in question, yet that is also time-consuming, and intersecting circles is awkward.
==Solution==
Consider a point $p$ within the array $A_\delta$. We need to look at all points that have ==at most== distance $\delta$ from $p$ in y-direction. Note that all pairs of points on the same side have a distance of at least $\delta$ –> otherwise we would have found them already.
How many such points in the “square above $p$” can exist?
![[Pasted image 20230127112241.png]]
Observe that we can place at most 4 points with pairwise distance at least $\delta$ within a square of side length $\delta$.
So the idea is to cover the strip above $p$ with two such squares - one on the left side and one on the right.
All in all we only need to consider at most 7 points above and 7 points below the chosen point. ![[Pasted image 20230127112341.png]]
==To find out whether== $p$ has a neighbor that is closer than $\delta$, we simply compute the distance of $p$ to the 7 points before and after $p$ in the array $A_\delta$.
![[Pasted image 20230127112721.png]]
Note that in $A_\delta$ we do not know whether a point lies on the left or the right side.
- In fact, we only need to look at the 7 points behind $p$ in the array - otherwise we would make every computation twice when going through the list of points.
Algorithm ::
- Given a set $P$ of points in $\mathbb{R}^2$
- We first build the lists $P_x$ and $P_y$ by sorting – once for each dimension!
- These lists are the input to the following closest pair algorithm ::
```
ClosestPair(Px, Py)
    if |P| ≤ 3:                                  # end of recursion
        find closest pair by computing all distances, return it
    # Recursion: split into left and right part
    Construct Px,left, Py,left, Px,right, Py,right
    (l0, l1) = ClosestPair(Px,left, Py,left)     # solve left part
    (r0, r1) = ClosestPair(Px,right, Py,right)   # solve right part
    # Combine solutions:
    δ = min{dist(l0, l1), dist(r0, r1)}
    Build list Aδ
    For each p ∈ Aδ, compute distance to next 7 points in Aδ
    Let δ′ be the minimal distance achieved in this way, for points p0, q0
    if δ′ < δ:
        Return (p0, q0)
    else:
        Return the better of (l0, l1) and (r0, r1)
```
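The pseudocode can be turned into runnable Python roughly as follows - a sketch assuming distinct points, with illustrative names:

```python
import math

def closest_pair(points):
    """Divide-and-conquer closest pair in O(n log n); returns (p, q).

    Pre-sorts by x and by y, splits at the median x, recurses, then scans
    the delta-strip where each point is compared to the next 7 points in
    y-order. Assumes the input points are distinct tuples.
    """
    px = sorted(points)                          # sorted by x
    py = sorted(points, key=lambda p: p[1])      # sorted by y

    def rec(px, py):
        n = len(px)
        if n <= 3:                               # end of recursion: brute force
            return min(((px[i], px[j]) for i in range(n)
                        for j in range(i + 1, n)),
                       key=lambda pq: math.dist(*pq))
        mid = n // 2
        mid_x = px[mid][0]
        left_set = set(px[:mid])
        py_left = [p for p in py if p in left_set]    # keep y-order on each side
        py_right = [p for p in py if p not in left_set]
        best = min(rec(px[:mid], py_left), rec(px[mid:], py_right),
                   key=lambda pq: math.dist(*pq))
        delta = math.dist(*best)
        # Points within delta of the dividing line, already sorted by y.
        strip = [p for p in py if abs(p[0] - mid_x) < delta]
        for i, p in enumerate(strip):
            for q in strip[i + 1:i + 8]:         # next 7 points suffice
                if math.dist(p, q) < math.dist(*best):
                    best = (p, q)
        return best

    return rec(px, py)
```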
Running time ::
- We first need to sort the points to build $P_x$ and $P_y$; this takes $O(n \log n)$ time
- Then we split each part into two equal-sized subproblems
- To combine the two sub-solutions, we need to go through all points and check whether they are in the $\delta$-stripe
- Note that theoretically, all $n$ points could be in the $\delta$-strip
- For all these points we then need to compute a constant (= 7) number of distances
- Together this gives us the recursion $T(n) = 2\,T(n/2) + O(n)$. With the Master theorem, this leads to $O(n \log n)$
Caveats ::
In principle, the algorithm can be generalized to higher dimensions. But if the dimension is large, it becomes increasingly inefficient ::
- the number of points that have pairwise distance $\delta$ and fit in the same little cube grows exponentially with the dimension $d$
- So instead of just looking at 7 points ahead, we need to look ahead much further - eventually we have to look at all points within the array.
Randomized Solution in linear time ::
Instead of systematically traversing the whole set of points by dividing it into uniform halves at each step, we could try a more random approach.
Idea :
- First select $\sqrt{n}$ points at random and compute their pairwise distances by brute force - takes around $O(n)$ time.
- From this set of distances, set $\delta$ to be the smallest of all of them –> only considering the random sample we drew
- Now divide the space into a grid of squares with side length $\delta$ - note that the number of grid squares might be large for an unfortunate point selection –> not guaranteed to be efficient!
- Call a square filled if it contains at least one of the $n$ original points.
- We know that the closest pair has a distance $\leq \delta$. Thus the closest pair of points has to lie in the same or in neighboring squares of our grid!
- To find the closest pair we have to do the following ::
- For every filled square, compute the closest pair among the points in this square by brute forcing
- For every pair of adjacent squares, compute the closest pair among all the points in the two squares by brute forcing too
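The grid idea above can be sketched as follows - a simplified illustration, not Rabin's actual algorithm; a Python dict stands in for the hashing of filled cells, and all names are illustrative:

```python
import math
import random
from collections import defaultdict
from itertools import combinations

def closest_pair_grid(points, seed=0):
    """Randomized grid sketch: sample ~sqrt(n) points, use their closest
    distance as the grid width, then brute-force within each filled cell
    and between adjacent cells. Assumes distinct points."""
    rng = random.Random(seed)
    sample = rng.sample(points, max(2, math.isqrt(len(points))))
    delta = min(math.dist(p, q) for p, q in combinations(sample, 2))

    grid = defaultdict(list)                 # only filled cells are stored
    for p in points:
        grid[(int(p[0] // delta), int(p[1] // delta))].append(p)

    best = None
    for (cx, cy), cell in grid.items():
        # Candidates: points in this cell and its 8 neighboring cells.
        cand = [q for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                for q in grid.get((cx + dx, cy + dy), [])]
        for p in cell:
            for q in cand:
                if p is not q:
                    d = math.dist(p, q)
                    if best is None or d < best[0]:
                        best = (d, p, q)
    return best[1], best[2]
```

Since the true closest pair has distance at most $\delta$, it always lands in the same or adjacent cells, so the result is exact; only the running time is random.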
Analysis ::
- Denote by $n_k$ the number of points in the $k$-th filled square.
- Then the cost to compare points within square $k$ is $O(n_k^2)$
- The cost to compare points in neighboring squares $k$ and $i$ is $O(n_k \cdot n_i)$
- ==One can show that the sum of all these costs is $O(n)$, with high probability! Not guaranteed though!==
The key for this analysis is to analyze the grid width $g$ :
- $g$ must not be too small : then we would need to sample more than $\sqrt{n}$ points in the first step to achieve this grid width, and applying brute force on this sample would be beyond linear time again
- Yet $g$ must not be too large either, because we use brute force inside the squares –> this cost increases with larger squares.
One particular reason why the analysis is so complicated: the distances of the sampled points are not independent of each other. It was later discovered that it is easier to sample distances instead of points, which made the analysis much easier.
As a result, we now have a randomized algorithm with the following properties ::
- The running time of this algorithm is $O(n)$ with high probability
- it always returns the correct result, by construction
Loose ends : probable issues ::
The original paper by Rabin left one open problem :
How can we decide which of the squares are filled, without walking through all of them? Note that there might be many more than $n$ squares, which would spoil the $O(n)$ bound we found earlier.
It was fixed by Fortune and Hopcroft in 1979 using perfect hashing, and further optimized by Dietzfelbinger et al. with the use of almost universal hashing.
Takeaway ::
This algorithm is cool! It is relatively simple and improves the running time from $O(n \log n)$ to $O(n)$.