KD-Trees

part of [[111.08_algo_datastructures#Trees :: General definitions]]


General idea ::

  • KD-trees were introduced in Bentley's paper "Multidimensional binary search trees used for associative searching", Communications of the ACM (1975)

The first formal average case analysis for nearest neighbor search appeared in:

  • Friedman, J.H. et al.: An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 1977

The name KD-tree was originally an abbreviation of k-dimensional tree, but by now one simply calls them KD-trees, and even speaks of d-dimensional KD-trees.

We strive for efficient access to points in a subregion of the space.

  1. We construct a data structure that recursively splits the space into rectangular regions
  2. Represent these regions by a tree
  3. Points that are close to each other tend to be in neighboring regions

Constructing a KD-Tree ::

The general idea of a KD-tree is ::

We are given a set of points in $\mathbb{R}^d$. Among all coordinate axes, we find the one where the spread of the data is largest, where spread is the distance between the maximal and minimal coordinate in that dimension. We then construct a hyperplane perpendicular to this direction such that the same number of points (±1) lies on either side. We split the data set into two parts according to this hyperplane and call the procedure recursively on the two parts obtained by the split.

We continue splitting the sets into smaller sets until we are left with one point per region. Because each split divides a set into two halves, we can later decide whether a region is located in the left or right child of the uppermost split. Continuing this way, we can map out where each region is located, with the representation being a binary tree.
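A minimal construction sketch of this procedure (my own illustration in Python, not the implementation from the lecture; `build_kdtree` and the `Node` structure are names I made up), splitting along the dimension of largest spread at the median point:

```python
from dataclasses import dataclass
from typing import Optional, List, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    dim: int = -1              # splitting dimension (inner nodes only)
    threshold: float = 0.0     # position of the splitting hyperplane
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    point: Optional[Point] = None   # set only in leaves (one point per region)

def build_kdtree(points: List[Point]) -> Node:
    # leaf: one point per region
    if len(points) == 1:
        return Node(point=points[0])
    d = len(points[0])
    # pick the dimension with the largest spread (max minus min coordinate)
    spreads = [max(p[i] for p in points) - min(p[i] for p in points) for i in range(d)]
    dim = max(range(d), key=lambda i: spreads[i])
    # split at the median so both sides contain the same number of points (+-1)
    pts = sorted(points, key=lambda p: p[dim])
    mid = len(pts) // 2
    threshold = (pts[mid - 1][dim] + pts[mid][dim]) / 2.0
    return Node(dim=dim, threshold=threshold,
                left=build_kdtree(pts[:mid]),
                right=build_kdtree(pts[mid:]))
```

Note that this naive sketch re-sorts the points at every split; the pre-sorting trick described in [[#Implementing efficiently ::]] below avoids exactly that.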

![[Pasted image 20230130142526.png]]

Example ::

![[Pasted image 20230130142641.png]] We are given a set of points in the two-dimensional plane. We split the mass of points into two sets, following the previous definition [[#Constructing a KD-Tree ::]]: we check along which axis the spread of the points is largest, x or y. Here the largest spread is along the x-axis, so we cut the set into two parts with a line perpendicular to x. Next we consider the left part and decide again how to split it best: there the points a, b and g are spread farthest along the y-axis, so we split with a line perpendicular to y. ![[Pasted image 20230130142955.png]]

![[Pasted image 20230130143722.png]]

We continue with this procedure until each point is located in a region of its own. Simultaneously we have created a binary tree that we can now query. We know that finding the region containing a given point in this balanced structure takes $O(\log n)$ steps.

Implementing efficiently ::

In a pre-processing step, sort all points in d different ways: according to their $x_1$-values, according to their $x_2$-values, and so on. We maintain the corresponding indices in arrays. All together this takes $O(d \cdot n \log n)$.

We sort the points according to their coordinates in each dimension. Once we have done that, it becomes easy to decide where to cut a set, because we can gather the necessary information quickly.

Once we have stored all this, it is easy to do the following operations (see the sketch after this list) ::

  • Select the splitting dimension: we have to compute the range of the set of points along all dimensions. For a fixed dimension this is $O(1)$ (first and last entry of the sorted array), yet overall it scales with the dimension, so $O(d)$.
  • Find the splitting point in the selected dimension - just go to the element in the middle of the array
  • Divide the points into “left” and “right” –> we have to update all the arrays every time!
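A hedged sketch of how the pre-sorted index arrays make these operations cheap (illustrative Python of my own, not the notes' implementation; `sorted_idx[j]` is assumed to hold the indices of all points sorted by their coordinate in dimension j):

```python
import numpy as np

def presort(points: np.ndarray) -> list:
    # one index array per dimension, sorting all points by that coordinate: O(d * n log n)
    return [np.argsort(points[:, j]) for j in range(points.shape[1])]

def split_once(points: np.ndarray, sorted_idx: list):
    # 1) splitting dimension: range = last minus first entry of each sorted array, O(d)
    spreads = [points[idx[-1], j] - points[idx[0], j] for j, idx in enumerate(sorted_idx)]
    dim = int(np.argmax(spreads))
    # 2) splitting point: the middle element of the chosen sorted array, O(1)
    idx = sorted_idx[dim]
    threshold = points[idx[len(idx) // 2], dim]
    # 3) divide into "left" and "right": every index array must be filtered, O(d * n)
    left_mask = points[:, dim] < threshold
    left_sorted = [a[left_mask[a]] for a in sorted_idx]
    right_sorted = [a[~left_mask[a]] for a in sorted_idx]
    return dim, threshold, left_sorted, right_sorted
```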

Running time ::

  1. Sorting according to coordinates within all dimensions : $O(d \cdot n \log n)$
  2. For each individual split of a set of $m$ points : $O(d \cdot m)$, so $O(d \cdot n)$ per level of the tree
  3. Number of levels of splitting operations : $O(\log n)$
    1. Intuitively :: note that we build a balanced tree! Recall that the height of a balanced tree is always $O(\log n)$. Overall the cost is within $O(d \cdot n \log n)$ (see the recurrence sketch below).
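As a sanity check (my own sketch, not stated in this form in the notes): each level of the tree costs $O(d \cdot n)$ for selecting dimensions and partitioning the index arrays, and there are $O(\log n)$ levels, so

$$
T(n) = 2\,T\!\left(\tfrac{n}{2}\right) + O(d \cdot n), \qquad T(1) = O(1)
\;\;\Longrightarrow\;\; T(n) = O(d \cdot n \log n),
$$

plus the $O(d \cdot n \log n)$ pre-sorting step, so $O(d \cdot n \log n)$ overall.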

We observe that the scaling also depends on the dimension d, so the construction could become inefficient for high-dimensional data.

Maintaining a KD-tree ::

What if we wanted to add or remove points? We would have to update the tree accordingly –> we did this before! [[111.08_algo_datastructures#Trees :: General definitions]]

When defining these operations naively, the tree might lose its balance –> we need to maintain it! We may tolerate an unbalanced state up to a certain point, but then we need to rebalance.

Would it be cheaper to balance after each modification or only after a given number of modifications?

Variants of construction ::

Instead of strictly cutting between points, we could also allow the splitting line to pass through data points. We could also change how the splitting dimension is determined: instead of the spread of the data, we could use the variance or other criteria.


Nearest neighbor search ::

We are given a set of points in d dimensions. We construct the KD-tree once at the beginning; now we want to answer nearest neighbor queries with the help of the newly constructed tree.

Consider a query point for which we would like to find the nearest neighbour among our already constructed set of points. ![[Pasted image 20230130145706.png]] We traverse from the root down to the region the query point falls into. In this case it is the right side of the fourth split. We know that g is in the same region, but is it the closest point?

We construct a search circle around the query point whose radius is the distance to the current candidate g. We now backtrack from this leaf towards the root, moving up to the parent of our current node. At each step we look at the regions produced by the split that intersect the circle - in this case the cut s5 does, so we go on to look at point d and point e.

We look at d, which is not within the search radius –> we drop it as a possibility.

We look at e, which is within the radius we set.

We check whether the left child's region intersects the search circle ==> no, so we leave it out. We continue with the right child and check again for an intersection ==> it does intersect, so we check whether the new point is better than the previous one.

We had selected g as the best point so far; comparing it with e, we find that e is closer, so we select e as the new candidate solution. However, we could still have a better point in another intersecting region - where h and f are located - so we ought to check those as well. ![[Pasted image 20230130151301.png]] We traverse back through the tree and check at each node whether its splitting line intersects the circle. If a region lies entirely outside the search radius we do not have to consider it; this lets us traverse the tree without computing the distance to all points.
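Putting the whole backtracking procedure together, here is a compact branch-and-bound sketch (again my own illustrative Python; it reuses the hypothetical `Node` structure from the construction sketch above and the Euclidean distance):

```python
import math

def nearest_neighbor(root: Node, query: Point) -> Point:
    best = {"point": None, "dist": math.inf}

    def visit(node: Node):
        # leaf: compare the stored point against the current candidate
        if node.point is not None:
            d = math.dist(query, node.point)
            if d < best["dist"]:
                best["point"], best["dist"] = node.point, d
            return
        # descend first into the side the query point falls into
        near, far = ((node.left, node.right)
                     if query[node.dim] < node.threshold
                     else (node.right, node.left))
        visit(near)
        # branch and bound: only visit the other side if the splitting
        # hyperplane is closer than the current search radius
        if abs(query[node.dim] - node.threshold) < best["dist"]:
            visit(far)

    visit(root)
    return best["point"]
```

Used together with `build_kdtree`, the call `nearest_neighbor(tree, q)` returns the exact nearest neighbor of `q` while skipping all branches whose regions lie outside the current search radius.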

Branch and Bound ::

This is a typical branch-and-bound procedure:

  • In principle we would have to visit the whole tree.
  • But based on the bounds on the solution we collected before - the distance to the current candidates - we can cut off whole branches because we know that the solution cannot be in this branch anymore!
  • In the end we have found the exact solution, but might have needed considerably less effort than if we had searched the whole space.

The worst case for branch-and-bound methods is always having to visit the whole tree without being able to cut off anything. It is straightforward to explicitly construct such examples. In this case we have to compare our query point with all n data points, which costs $O(n \cdot d)$ for the distance computations in d-dimensional space.

Average case running time ::

Worst-case analysis evaluates the worst possible runtime over all possible inputs.

Average-case analysis evaluates the average runtime over all possible inputs. We first have to define a probability distribution over the inputs, otherwise we cannot define an average at all.

We are going to argue: the average case running time to find the nearest neighbor of a query point is $O(\log n)$.

Assumptions to make ::

  • for an average analysis, we need to make an assumption about the distribution of all possible inputs.
  • In this case we assume that all data points and the query point are drawn i.i.d – independent from each other – from some underlying probability distribution - such as uniform distribution or alternatives.

To do all the maths, we need to be more specific about the distribution to ensure that this distribution is not completely degenerate. Say, it has a continuous density function. We skip details

It is important that the rectangles have roughly the same size. Because this is difficult to guarantee, it would be beneficial to at least know the size of each rectangle – so we could skip large ones, for example.

Analysing average case running time ::

![[Pasted image 20230130152727.png]] How many distance comparisons do we have to make on average? As many as there are cells intersecting the search circle.

Intuitively, the number of cells intersecting the search circle does not grow with n; it is a constant that just depends on the geometry of the underlying space - things such as the dimension, the distribution of points, the distance function used, etc.

Our formal argument could be :

Compute the average size of the search circle, that is, the average distance between two points drawn from the given distribution - standard, but a bit technical if one has not seen such an argument before; here we exploit the assumption on the probability distribution.

![[Pasted image 20230130153056.png]] We can circumscribe a square around the search circle

  • Compute the average volume of this square - easy, side length is given already
  • Compute average size of a cell in the tree - similar as above
  • Then estimate how many cells fit into the circle, on average - we assume that all cells are squares, not arbitrary rectangles

Doing all the maths leads to the following result: on average there are at most $c_d$ such intersections, where $c_d$ is the ratio between the volume of a unit cube and a unit ball in $\mathbb{R}^d$. In particular, this number does not depend on n, but it grows exponentially with d.
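A rough back-of-the-envelope version of this counting argument (my own sketch under the assumptions above; $r$ denotes the average radius of the search circle): under a roughly uniform density, both the search ball and an average leaf cell contain $O(1)$ of the $n$ points, so both have volume of order $1/n$, and

$$
\mathbb{E}\big[\#\text{cells intersected}\big] \;\lesssim\;
\frac{\operatorname{vol}(\text{cube circumscribing the ball})}{\operatorname{vol}(\text{average cell})}
\;\approx\; \frac{(2r)^d}{V_d\, r^d} \;=\; \frac{2^d}{V_d} \;=\; c_d,
\qquad V_d = \frac{\pi^{d/2}}{\Gamma\!\left(\tfrac{d}{2}+1\right)},
$$

which is independent of $n$ but grows exponentially with $d$.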


Other applications of KD Trees :

Describe a query point by the data points in its cell (a small sketch follows this list):

  1. build the tree until there is a constant number c of points in each cell
  2. Classification: query point gets the majority label in the cell
  3. Regression : query point gets assigned the mean response value in the cell
  4. Vector quantization : query point gets replaced by a representative member of the cell - or its mean
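A minimal illustration of the classification and regression variants (my own Python sketch; it assumes you have already collected the labeled training points that fall into the query point's cell):

```python
from collections import Counter
from statistics import mean

def classify_in_cell(cell_labels: list) -> str:
    # classification: majority label among the (at most c) training points in the cell
    return Counter(cell_labels).most_common(1)[0][0]

def regress_in_cell(cell_responses: list) -> float:
    # regression: mean response value of the training points in the cell
    return mean(cell_responses)

# e.g. the patient example below: a cell with 3 healthy and 2 ill training patients
print(classify_in_cell(["healthy", "healthy", "ill", "healthy", "ill"]))  # -> "healthy"
```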

Another application is Machine Learning ::

Say we have a data set of patients who either have cancer or not. For each patient we have several features describing their blood pressure at time X, their blood sugar, and many more.

Now we want a general method that, given a new patient, decides whether they will also potentially develop cancer at some point.

We have a KD-tree already - everything is split into regions, and we have circle and cross patients - healthy and ill. We now insert our new patient and let the region it falls into decide its label. Say the region contains 3 healthy and 2 ill people; the majority in this region then decides whether the new patient is categorized as healthy or ill.

Variations of KD-Trees ::

With normal KD-trees

  • we can control the number of points in each rectangle – optimally uniform separation
  • Yet we cannot control the actual volume of a rectangle

Now we want to control either the one or the other property – it is close to impossible to optimize for both.

One such approach might be Quad Trees

Quad Trees :

In a 2-dimensional space, keep splitting rectangles into four cells. Each cell has a maximum capacity. When this capacity is exceeded, the cell gets split further. Build a tree that follows this spatial decomposition.

–> here we control the size of the rectangles, but not the number of points in each rectangle
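A small insertion sketch for such a quadtree (illustrative Python of my own; `CAPACITY` and the `QuadNode` cell structure are assumed names, each cell being an axis-aligned square with a fixed point capacity):

```python
from dataclasses import dataclass, field

CAPACITY = 4  # assumed maximum number of points per cell

@dataclass
class QuadNode:
    x: float            # lower-left corner of the cell
    y: float
    size: float         # side length of the (square) cell
    points: list = field(default_factory=list)
    children: list = field(default_factory=list)  # empty, or exactly four sub-cells

    def insert(self, p):
        if self.children:                       # already split: pass the point down
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > CAPACITY:         # capacity exceeded: split into four equal cells
            h = self.size / 2
            self.children = [QuadNode(self.x + dx, self.y + dy, h)
                             for dx in (0, h) for dy in (0, h)]
            for q in self.points:
                self._child_for(q).insert(q)
            self.points = []

    def _child_for(self, p):
        h = self.size / 2
        ix = 1 if p[0] >= self.x + h else 0     # which quadrant the point falls into
        iy = 1 if p[1] >= self.y + h else 0
        return self.children[2 * ix + iy]
```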

Cover Trees

Comparison trees - by von Luxburg :

At each splitting step: randomly select two data points from your data set and use the hyperplane between the two points for splitting. ![[Pasted image 20230130154812.png]] Advantage: this works in a comparison-based framework where we do not know the distances between points, but can only compare distances - e.g. object i is closer to object j than to object k.
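A sketch of a single comparison-based split (my own Python illustration; the oracle `closer(x, a, b)` is a hypothetical interface answering "is x closer to a than to b?" without revealing any distance values):

```python
import random

def comparison_split(points, closer):
    # pick two random pivot points from the data set
    a, b = random.sample(points, 2)
    # the hyperplane between a and b separates the points closer to a from those closer to b,
    # and deciding the side needs only one comparison per point
    left = [p for p in points if closer(p, a, b)]
    right = [p for p in points if not closer(p, a, b)]
    return (a, b), left, right
```

Recursing on the two returned halves builds the comparison tree; if actual coordinates were available, `closer` could simply be `lambda x, a, b: math.dist(x, a) < math.dist(x, b)`.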