Random Projection trees

This is part of [[111.99_algo_nearestNeighborSearch]]


Idea :: Instead of [[111.99_algo_KDTrees|KD-Trees]], which cut / partition along coordinate axes, Random Projection trees always cut / partition along a randomly chosen direction –> the direction is drawn from the unit circle (Einheitskreis), i.e. the unit sphere in higher dimensions.

![[Pasted image 20230203111414.png]] Above: a partition according to a KD-Tree (left) and according to a random projection tree (right).
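
The split rule can be sketched in a few lines of Python. This is a minimal sketch; the function names and the median cut are choices made for this note, not taken from a specific implementation:

```python
import numpy as np

def random_direction(dim, rng):
    # Draw a direction uniformly at random from the unit sphere
    # ("Einheitskreis" idea in D dimensions): sample a standard
    # Gaussian vector and normalize it to unit length.
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def rp_split(points, rng):
    # One RP-tree split: project all points onto a random direction and cut
    # at the median projection -- in contrast to a KD-tree split, which cuts
    # along a single coordinate axis.
    direction = random_direction(points.shape[1], rng)
    proj = points @ direction
    threshold = np.median(proj)
    return direction, threshold, points[proj <= threshold], points[proj > threshold]
```

The random direction is what makes the cut adapt to data that does not line up with the coordinate axes; a KD-tree split would instead pick one coordinate and cut at a value of that coordinate.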

–> This can be an advantage when the underlying data is effectively “low-dimensional”, i.e. when many data points lie close to a low-dimensional structure. Consider a case where the data points lie along a “one-dimensional line”: a KD-Tree would be very inefficient in that situation ![[Pasted image 20230203111626.png]]

Hence we can conclude that RP trees are probably the better choice in such situations.

Intuitive theoretical statements

Assume that the data points have been drawn from some nice underlying probability measure $\mu$ on $\mathbb{R}^D$. Assume we build the RP tree on this data set. Consider a query point $q$ that also comes from the distribution $\mu$. We now use the “defeatist search” to find its neighbors: route the query to the appropriate leaf and then simply pick the best point in this leaf cell by brute force, ignoring that there might be better points in some of the adjacent cells.

–> Then with high probability, the defeatist search on the tree results in the correct nearest neighbor
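
A minimal sketch of the whole pipeline, assuming the median split from above and using made-up names (`RPTreeNode`, `defeatist_nn`); it is meant to illustrate the defeatist search, not to reproduce the exact construction from the paper:

```python
import numpy as np

class RPTreeNode:
    # One node of a simple random projection tree (illustrative sketch).
    def __init__(self, points, indices, rng, leaf_size=25):
        self.indices = indices              # data point indices falling into this cell
        self.direction = self.threshold = self.left = self.right = None
        if len(indices) > leaf_size:
            direction = rng.normal(size=points.shape[1])
            direction /= np.linalg.norm(direction)
            proj = points[indices] @ direction
            threshold = np.median(proj)
            go_left = proj <= threshold
            if 0 < go_left.sum() < len(indices):      # guard against degenerate splits
                self.direction, self.threshold = direction, threshold
                self.left = RPTreeNode(points, indices[go_left], rng, leaf_size)
                self.right = RPTreeNode(points, indices[~go_left], rng, leaf_size)

    def leaf_for(self, q):
        # Route the query down to its leaf cell.
        if self.direction is None:
            return self.indices
        child = self.left if q @ self.direction <= self.threshold else self.right
        return child.leaf_for(q)

def defeatist_nn(points, tree, q):
    # Defeatist search: brute-force only over the points in q's leaf cell,
    # ignoring possibly better points in adjacent cells.
    cand = tree.leaf_for(q)
    return cand[np.argmin(np.linalg.norm(points[cand] - q, axis=1))]

# Tiny demo on synthetic data; with unstructured high-dimensional data the two
# results may differ -- the guarantee above is about data with low intrinsic dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
tree = RPTreeNode(X, np.arange(len(X)), rng, leaf_size=50)
q = rng.normal(size=20)
print("defeatist NN:", defeatist_nn(X, tree, q),
      "exact NN:", np.argmin(np.linalg.norm(X - q, axis=1)))
```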

Extract from the corresponding paper:

Theorem 7: There is an absolute constant $c_0$ for which the following holds. Suppose $\mu$ is a doubling measure on $\mathbb{R}^D$ of intrinsic dimension $d$. Pick any query $q$ and draw $x_1, \dots, x_n$ independently from $\mu$. Then, with probability at least $1 - \delta$ over the choice of data, the defeatist search on the RP tree returns the correct nearest neighbor of $q$.

Just think of $\mu$ as a probability distribution on $\mathbb{R}^D$, where $D$ is the dimension of your space.

$\delta$ mostly denotes a small allowed failure probability –> an indicator for how often the search may fail.

For the RP tree with $n_0$ points per leaf cell:

We never split the tree so far that there is only a single point within a region –> otherwise the computational complexity would increase and we would have to search in neighboring cells!

$\Pr(\text{RP tree fails to return the } k \text{ nearest neighbors})$ is bounded by a term that decreases with the number of points $n_0$ per leaf region. What we can take away from this term:

If we have a lot of points within a region, then the probability to fail is rather low –> because the defeatist search brute-forces over many candidate points within that region!
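
A small self-contained experiment (synthetic data, made-up helper names) that illustrates this trade-off: the larger the leaf cells (the $n_0$ from above), the more brute-force comparisons per query, but the lower the empirical failure rate of the defeatist search:

```python
import numpy as np

def build(points, idx, rng, leaf_size):
    # Recursively split along random directions at the median projection.
    if len(idx) <= leaf_size:
        return idx                                    # leaf: just the point indices
    d = rng.normal(size=points.shape[1])
    d /= np.linalg.norm(d)
    proj = points[idx] @ d
    t = np.median(proj)
    go_left = proj <= t
    if go_left.all() or not go_left.any():            # degenerate split: stop here
        return idx
    return (d, t, build(points, idx[go_left], rng, leaf_size),
                  build(points, idx[~go_left], rng, leaf_size))

def leaf_cell(node, q):
    # Route the query to its leaf cell (internal nodes are tuples, leaves are index arrays).
    while isinstance(node, tuple):
        d, t, left, right = node
        node = left if q @ d <= t else right
    return node

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 10))
queries = rng.normal(size=(200, 10))
exact = [int(np.argmin(np.linalg.norm(X - q, axis=1))) for q in queries]

for leaf_size in (5, 20, 100, 500):
    tree = build(X, np.arange(len(X)), rng, leaf_size)
    fails = 0
    for q, nn in zip(queries, exact):
        cand = leaf_cell(tree, q)
        best = cand[np.argmin(np.linalg.norm(X[cand] - q, axis=1))]
        fails += int(best != nn)
    print(f"leaf_size={leaf_size:4d}  empirical failure rate = {fails / len(queries):.2f}")
```

The absolute numbers depend heavily on the data; the point is only the trend that the failure rate drops as the leaf cells get larger, while each query has to brute-force over more candidates.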

Summary:

RP trees are a straightforward variant of KD-Trees: instead of cutting along coordinate axes, they cut along randomly chosen directions. Combined with the defeatist search they return the correct nearest neighbor with high probability, and they tend to work better than KD-Trees when the data has low intrinsic dimension.