Random Projection trees

This is part of [[111.99_algo_nearestNeighborSearch]]


Idea :: Instead of [[111.99_algo_KDTrees|KD-Trees]], which cut / partition along coordinate axes, Random Projection trees always cut / partition along a randomly chosen direction –> the direction is drawn from the unit circle (Einheitskreis), i.e. the unit sphere in higher dimensions.

![[Pasted image 20230203111414.png]] Above: a partition according to a KD-Tree (left) and according to a random projection tree (right).
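
The split rule can be sketched in a few lines of Python. This is a minimal sketch; the function names and the median cut are choices made for this note, not taken from a specific implementation:

```python
import numpy as np

def random_direction(dim, rng):
    # Draw a direction uniformly at random from the unit sphere
    # ("Einheitskreis" idea in D dimensions): sample a standard
    # Gaussian vector and normalize it to unit length.
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def rp_split(points, rng):
    # One RP-tree split: project all points onto a random direction and cut
    # at the median projection -- in contrast to a KD-tree split, which cuts
    # along a single coordinate axis.
    direction = random_direction(points.shape[1], rng)
    proj = points @ direction
    threshold = np.median(proj)
    return direction, threshold, points[proj <= threshold], points[proj > threshold]
```

The random direction is what makes the cut adapt to data that does not line up with the coordinate axes; a KD-tree split would instead pick one coordinate and cut at a value of that coordinate.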

–> This can be an advantage when the underlying data is effectively “low-dimensional”, i.e. when many data points lie close to a low-dimensional structure. Consider a case where the data points lie along a “one-dimensional line”: a KD-Tree would be very inefficient in that situation ![[Pasted image 20230203111626.png]]

Hence we can conclude that RP trees are probably the better choice in such situations.

Intuitive theoretical statements

Assume that the data points have been drawn from some nice underlying probability measure $\mu$ on $\mathbb{R}^D$. Assume we build the RP tree on this data set. Consider a query point $q$ that also comes from the distribution $\mu$. We now use the “defeatist search” to find its neighbors: route the query to the appropriate leaf and then simply pick the best point in this leaf cell by brute force, ignoring that there might be better points in some of the adjacent cells.

–> Then with high probability, the defeatist search on the tree results in the correct nearest neighbor
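
A minimal sketch of the whole pipeline, assuming the median split from above and using made-up names (`RPTreeNode`, `defeatist_nn`); it is meant to illustrate the defeatist search, not to reproduce the exact construction from the paper:

```python
import numpy as np

class RPTreeNode:
    # One node of a simple random projection tree (illustrative sketch).
    def __init__(self, points, indices, rng, leaf_size=25):
        self.indices = indices              # data point indices falling into this cell
        self.direction = self.threshold = self.left = self.right = None
        if len(indices) > leaf_size:
            direction = rng.normal(size=points.shape[1])
            direction /= np.linalg.norm(direction)
            proj = points[indices] @ direction
            threshold = np.median(proj)
            go_left = proj <= threshold
            if 0 < go_left.sum() < len(indices):      # guard against degenerate splits
                self.direction, self.threshold = direction, threshold
                self.left = RPTreeNode(points, indices[go_left], rng, leaf_size)
                self.right = RPTreeNode(points, indices[~go_left], rng, leaf_size)

    def leaf_for(self, q):
        # Route the query down to its leaf cell.
        if self.direction is None:
            return self.indices
        child = self.left if q @ self.direction <= self.threshold else self.right
        return child.leaf_for(q)

def defeatist_nn(points, tree, q):
    # Defeatist search: brute-force only over the points in q's leaf cell,
    # ignoring possibly better points in adjacent cells.
    cand = tree.leaf_for(q)
    return cand[np.argmin(np.linalg.norm(points[cand] - q, axis=1))]

# Tiny demo on synthetic data; with unstructured high-dimensional data the two
# results may differ -- the guarantee above is about data with low intrinsic dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
tree = RPTreeNode(X, np.arange(len(X)), rng, leaf_size=50)
q = rng.normal(size=20)
print("defeatist NN:", defeatist_nn(X, tree, q),
      "exact NN:", np.argmin(np.linalg.norm(X - q, axis=1)))
```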

Extract from the corresponding paper:

Theorem 7: There is an absolute constant $c_0$ for which the following holds. Suppose $\mu$ is a doubling measure on $\mathbb{R}^D$ of intrinsic dimension $d$. Pick any query $q$ and draw $x_1, \dots, x_n$ independently from $\mu$. Then, with probability at least $1 - \delta$ over the choice of data, the defeatist search on the RP tree returns the correct nearest neighbor of $q$.

Just think of $\mu$ as a probability distribution on $\mathbb{R}^D$, where $D$ is the dimension of your space.

$\delta$ mostly denotes a small allowed failure probability –> an indicator for how often the search may fail.

For the RP tree with $n_0$ points per leaf cell:

We never split the tree so far that there is only a single point within a region –> otherwise the computational complexity would increase and we would have to search in neighboring cells!

$\Pr(\text{RP tree fails to return the } k \text{ nearest neighbors})$ is bounded by a term that decreases with the number of points $n_0$ per leaf region. What we can take away from this term:

If we have a lot of points within a region, then the probability to fail is rather low –> because the defeatist search brute-forces over many candidate points within that region!
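
A small self-contained experiment (synthetic data, made-up helper names) that illustrates this trade-off: the larger the leaf cells (the $n_0$ from above), the more brute-force comparisons per query, but the lower the empirical failure rate of the defeatist search:

```python
import numpy as np

def build(points, idx, rng, leaf_size):
    # Recursively split along random directions at the median projection.
    if len(idx) <= leaf_size:
        return idx                                    # leaf: just the point indices
    d = rng.normal(size=points.shape[1])
    d /= np.linalg.norm(d)
    proj = points[idx] @ d
    t = np.median(proj)
    go_left = proj <= t
    if go_left.all() or not go_left.any():            # degenerate split: stop here
        return idx
    return (d, t, build(points, idx[go_left], rng, leaf_size),
                  build(points, idx[~go_left], rng, leaf_size))

def leaf_cell(node, q):
    # Route the query to its leaf cell (internal nodes are tuples, leaves are index arrays).
    while isinstance(node, tuple):
        d, t, left, right = node
        node = left if q @ d <= t else right
    return node

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 10))
queries = rng.normal(size=(200, 10))
exact = [int(np.argmin(np.linalg.norm(X - q, axis=1))) for q in queries]

for leaf_size in (5, 20, 100, 500):
    tree = build(X, np.arange(len(X)), rng, leaf_size)
    fails = 0
    for q, nn in zip(queries, exact):
        cand = leaf_cell(tree, q)
        best = cand[np.argmin(np.linalg.norm(X[cand] - q, axis=1))]
        fails += int(best != nn)
    print(f"leaf_size={leaf_size:4d}  empirical failure rate = {fails / len(queries):.2f}")
```

The absolute numbers depend heavily on the data; the point is only the trend that the failure rate drops as the leaf cells get larger, while each query has to brute-force over more candidates.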

Summary:

RP trees are a straightforward variant of KD-Trees: instead of cutting along coordinate axes, they cut along randomly chosen directions. Combined with the defeatist search they return the correct nearest neighbor with high probability, and they tend to work better than KD-Trees when the data has low intrinsic dimension.