A kD tree (otherwise known as a kd-tree, or
k-Dimension tree)
is a data structure that is used for storing coordinates so
nearest-neighbour searches or range-searches are quick.
It is a binary-search tree
that is specialised for coordinate searching, and is useful
for answering questions such as which pub is closest to my
current location?
The two main differences between a kD tree and a
bst are
In the following sections we will look at how to build a kD tree, how to perform a nearest-neighbour search using kD tree and the Python source-code that implements the algorithms for building and searching a kD tree.
In this section we will introduce the algorithm for crating a kD tree in 2D space. In theory, a kD tree can be used with coordinates in any space (2D, 3D, 4D...), but I will restrict myself to 2D space as it is simpler. (Technically, a kD tree in 2D space is called a 2D-tree, but I find the name confusing.)
The algorithm for building a 2-dimensional kD tree is a
divide and conquer
algorithm, similar
to the algorithm for building a binary-search tree.
The following pseudocode, which is based
on the pseudocode by
Marner,
shows the algorithm to build a kD tree.
It differs from the algorithm to build a binary-search tree
because
tuple function build_kd_tree(int depth, set points): if points contains only one point: return that point as a leaf. if depth is even: Calculate the median x-value. Create a set of points (pointsLeft) that have x-values less than the median. Create a set of points (pointsRight) that have x-values greater than or equal to the median. else: Calculate the median y-value. Create a set of points (pointsLeft) that have y-values less than the median. Create a set of points (pointsRight) that have y-values greater than or equal to the median. treeLeft = build_kd_tree(depth + 1, pointsLeft) treeRight = build_kd_tree(depth + 1, pointsRight) return(median, treeLeft, treeRight)
The following section will illustrate how a kD tree can be
built using the points
[(6, 1), (5, 5), (9, 6), (3, 6), (4, 9), (4, 0),
(7, 9), (2, 9)] as an example.
This example will create a balance tree with all the leaves at
the same level.
In reality, a kD tree is often not balanced, internal nodes
may have a leaf and another internal node as children, or
internal nodes may only have a single child (which is an
internal node).
However, examples should never show reality.
The kD tree after the first step.
The kD tree algorithm begins by
sorting the list by the x-values
resulting the list
[(2, 9), (3, 6), (4, 9), (4, 0), (5, 5), (6, 1), (7, 9),
(9, 6)].
The list is then split in two based on the
median x-value.
In this case the median x-value is 5, so the two lists are
[(2, 9), (3, 6), (4, 9), (4, 0)] and
[(5, 5), (6, 1), (7, 9), (9, 6)].
We can now create the root node of the tree, which has the value
5 (the median x-value).
All coordinates that have an x-value less than 5 will be found
in the left-hand subtree, while all coordinates that have an
x-value greater to or equal with 5 will be found down the
right-hand subtree.
You may wish to think of the process of sorting and dividing the points as being similar to drawing a line along x=5 in the plot of the points. Everything to the left of the line goes into the left-hand subtree, and everything else goes into the right-hand subtree.
The next step of the algorithm is to sort the sublists
by the y-values.
The sorted left-hand sublist (which contains all
coordinates with x-values less than 5) looks like
[(4, 0), (3, 6), (2, 9), (4, 9)].
The list is then split in two based on the
median y-value.
For the left-hand sublist, the median value is 9, so the two
sublists are
[(4, 0), (3, 6)] and
[(2, 9), (4, 9)].
You can think of this as drawing a line along y=9 (up to x=5) in
the plot of the points.
Everything below the line goes into the left-hand subtree, while
all other points go into the right-hand subtree.
A similar process can be followed with the sublist that
contains all x-values greater-than and equal to 5, resulting in
a median y-value of 6 and two sublists of
[(6, 1), (5, 5)] and
[(9, 6), (7, 9)].
The final step for our example is to sort the four sublists by the x-values, and then split the four sublists in two ending up with our eight leaf-nodes. The resulting kD tree has three levels. The first divides the coordinates by the x-values, the second by the y-values, and the third by the x-values again.
Calculating the median value for a list has a very subtle
problem.
Consider the points [(2, 3), (2, 4), (4, 3)].
In this list are no points with an x-value less than the median
x-value (2).
Similarly there is no point with a y-value less than the median
y-value (3).
Because of this, we cannot partition the list in two using
either the x-values or y-values, so
the algorithm for building
the kD tree would recurse forever.
Greg Ewing points out that this problem can be solved by checking to see if the median x-value or y-value is larger than the smallest x-value or y-value respectively. If the median is not larger, add one to the median and carry on as normal.
A nearest-neighbour search is used to find which point in a set
is closest to another point.
It is used to answer such as which pub is closest to my
current location?
The following is the pseudocode for a nearest-neighbour search for a two-dimensional kD tree. It is quite similar to a search of a binary-search tree. The main difference is that we have to keep track of the depth in the tree, otherwise we will not know if we should compare the x-value of the point or the y-value of the point.
tuple function search_kd_tree(int depth, tuple tree, tuple point): if tree is a single point: return the point meanValue = the value of the root node of tree if depth is even: comparisonValue = the x-value of the point else: comparisonValue = the y-value of the point if comparisonValue < meanValue : subtree = left-hand subtree of tree else: subtree = right-hand subtree of tree return search_kd_tree(depth + 1, subtree, point)
The kD tree built in the earlier example.
For an example search, we will search
the kD tree we built in the
earlier example
for the point nearest to (5, 7).
First we compare the x-value of the point we are searching for with value of the root of the tree (5). The x-value is compared because the root of the tree is at depth zero, which we are treating as even. As the x-value of the point is equal to to the value of the root, we descend into the right-hand of the subtree.
Next we compare the y-value of the point we are searching for (7) with the value of the current node (6). As the y-value is greater we descend into the right-hand subtree of the current node.
In the final step we descend into the left subtree of the node
as the x-value of the point (5) is less than
the value of the current node (9).
As the subtree is a leaf we return the value of the point
(7, 9) which is the nearest neighbour to the point
we are searching for (5, 7).
Jason Smith pointed out that we do not have to keep track of the current depth in the tree when performing the search. Instead, we only compare the median with the first of the two search-coordinates, and then swap them during each recursion as follows.
tuple function search_kd_tree(int scalarA, int scalarB, tuple tree): if tree is a single point: return the point meanValue = the value of the root node of tree if scalarA < meanValue: subtree = left-hand subtree of tree else: subtree = right-hand subtree of tree return search_kd_tree(scalarB, scalarA, subtree)
The above algorithm is faster than the one above because it reduces the number of if-statements. It also uses less memory as we are no-longer keeping track of the current depth in the tree.
This software is licensed under the CC-GNU GPL.
The kD tree source code (10K Python code) is based on the algorithms for building the kD tree and performing a nearest-neighbour search. The code uses the true median to partition the points into two subsets, rather than estimating the median by adding the first and last values together and dividing by two. Using the true median can produce better-balanced trees so searching is faster, but calculating the true median requires that the list of points is constantly sorted and re-sorted, which is quite slow.
Knuth, Donald, The Art of Computer Programming, Volume Two, Third Edition, Addison-Wesley, Reading, Massachusetts, 1997.
Marner, Jacob, Computational Geometry, Web Page, May 2000.
Sedgewik, Robert, Algorithms, Second Edition, Addison-Wesley, Reading, Massachussetts, 1988.