Locality sensitive hashing

this is another datastructure used for [[111.99_algo_googlePageRank]]


Idea :

Given a set of points with distances. Construct a hash function where all bins consist of cubes in such that::

  1. Points that are close to each other tend to end up in the same bin
  2. points that are far from each other then to end up in different bins

–> a simple implementation could be a simple mesh drawn across the dimension and then all points set into it. Now that may work in two dimensions, yet as soon as we have more dimensions, then we would have an exponential increase of bins to sort in.

We want to use LSH for approximate neighborhood problems for example this following one

(R,c)-neighborhood problem : Given a data set and a query point q: if there exists a point in the data set within distance R from q, return a point with distance at most cR from the query point.

We want to find the closest matching point within the set Radius of our query point.

We now have a given ==input==:

  • a point set in with eculidean distances - where d is high!
  • the parameters R and c of the (R,c)-neighborhood problem we want to solve

==The Hash function== Use a random projection to project the data points to some lower dimension space , where will be of the order

Put a grid of width w on this space, and use the grid cells as hash bins

the approach is a little similar to the previous downscaling of dimensions for [[]]

This scheme will lead to a low but positive probability of succeeding. We will then have to boost the probability by repeating this scheme L times.

Internally, the algorithm will use the following parameters, which we are going to determine in the theoretical analysis