kD Trees

A kD tree (otherwise known as a kd-tree, or k-dimensional tree) is a data structure used for storing coordinates so that nearest-neighbour searches or range searches are quick. It is a binary-search tree specialised for searching by coordinates, and is useful for answering questions such as which pub is closest to my current location? The two main differences between a kD tree and a binary-search tree are

  1. Values are always stored in leaves, and
  2. What an internal node represents depends on the depth of the node in the tree.

In the following sections we will look at how to build a kD tree, how to perform a nearest-neighbour search using a kD tree, and the Python source code that implements the algorithms for building and searching a kD tree.

Building a 2-Dimensional kD Tree

In this section we will introduce the algorithm for creating a kD tree in 2D space. In theory, a kD tree can be used with coordinates in any space (2D, 3D, 4D...), but I will restrict myself to 2D space as it is simpler. (Technically, a kD tree in 2D space is called a 2D-tree, but I find the name confusing.)

The algorithm for building a 2-dimensional kD tree is a divide and conquer algorithm, similar to the algorithm for building a binary-search tree. The following pseudocode, which is based on the pseudocode by Marner, shows the algorithm to build a kD tree. It differs from the algorithm to build a binary-search tree because either the x-value or the y-value is used to split the set of points in two, depending on the current depth in the tree.

tuple function build_kd_tree(int depth, set points):
    if points contains only one point:
        return that point as a leaf.

    if depth is even:
        Calculate the median x-value.
        Create a set of points (pointsLeft) that have x-values less than
            the median.
        Create a set of points (pointsRight) that have x-values greater
            than or equal to the median.
    else:
        Calculate the median y-value.
        Create a set of points (pointsLeft) that have y-values less than
            the median.
        Create a set of points (pointsRight) that have y-values greater
            than or equal to the median.
    
    treeLeft = build_kd_tree(depth + 1, pointsLeft)
    treeRight = build_kd_tree(depth + 1, pointsRight)

    return (median, treeLeft, treeRight)
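As a concrete illustration, the pseudocode above translates into Python roughly as follows. This is a minimal sketch, not the downloadable source code: the tuple representation of nodes is taken from the pseudocode, and the sketch assumes every split leaves both sides non-empty (see "A Problem With the Median" below).

```python
def build_kd_tree(points, depth=0):
    """Build a 2D kD tree.  A leaf is a plain (x, y) point; an
    internal node is a (median, treeLeft, treeRight) tuple."""
    if len(points) == 1:
        return points[0]  # a single point becomes a leaf

    axis = depth % 2  # even depths split on x, odd depths on y
    values = sorted(p[axis] for p in points)
    median = values[len(values) // 2]  # the upper median

    # Points below the median go left; the rest go right.
    points_left = [p for p in points if p[axis] < median]
    points_right = [p for p in points if p[axis] >= median]

    return (median,
            build_kd_tree(points_left, depth + 1),
            build_kd_tree(points_right, depth + 1))
```

Note that the sketch takes the upper median, which is why the worked example below splits the x-values [2, 3, 4, 4, 5, 6, 7, 9] at 5.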

Example

The points used in the kD tree example.

The following section will illustrate how a kD tree can be built using the points [(6, 1), (5, 5), (9, 6), (3, 6), (4, 9), (4, 0), (7, 9), (2, 9)] as an example. This example will create a balanced tree with all the leaves at the same level. In reality, a kD tree is often not balanced: internal nodes may have a leaf and another internal node as children, or an internal node may have only a single child (which is an internal node). However, examples should never show reality.

Step One

The kD tree after the first step.

The kD tree algorithm begins by sorting the list by the x-values, resulting in the list [(2, 9), (3, 6), (4, 9), (4, 0), (5, 5), (6, 1), (7, 9), (9, 6)].

The list is then split in two based on the median x-value. In this case the median x-value is 5, so the two lists are [(2, 9), (3, 6), (4, 9), (4, 0)] and [(5, 5), (6, 1), (7, 9), (9, 6)]. We can now create the root node of the tree, which has the value 5 (the median x-value). All coordinates that have an x-value less than 5 will be found in the left-hand subtree, while all coordinates that have an x-value greater than or equal to 5 will be found in the right-hand subtree.

You may wish to think of the process of sorting and dividing the points as being similar to drawing a line along x=5 in the plot of the points. Everything to the left of the line goes into the left-hand subtree, and everything else goes into the right-hand subtree.

Step Two

The kD tree after the second step.

The next step of the algorithm is to sort the sublists by the y-values. The sorted left-hand sublist (which contains all coordinates with x-values less than 5) looks like [(4, 0), (3, 6), (2, 9), (4, 9)]. The list is then split in two based on the median y-value. For the left-hand sublist, the median value is 9, so the two sublists are [(4, 0), (3, 6)] and [(2, 9), (4, 9)]. You can think of this as drawing a line along y=9 (up to x=5) in the plot of the points. Everything below the line goes into the left-hand subtree, while all other points go into the right-hand subtree.

A similar process can be followed with the sublist that contains all points with x-values greater than or equal to 5, resulting in a median y-value of 6 and two sublists of [(6, 1), (5, 5)] and [(9, 6), (7, 9)].

Step Three

The final kD tree.

The final step for our example is to sort the four sublists by the x-values, and then split each of the four sublists in two, ending up with our eight leaf nodes. The resulting kD tree has three levels. The first divides the coordinates by the x-values, the second by the y-values, and the third by the x-values again.
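Written out with the (median, treeLeft, treeRight) tuples from the pseudocode, the finished tree from this example looks like this:

```python
# The final kD tree for the eight example points: the root splits on
# x = 5, the second level on the y-medians (9 and 6), and the third
# level on x again.
tree = (5,
        (9,
         (4, (3, 6), (4, 0)),
         (4, (2, 9), (4, 9))),
        (6,
         (6, (5, 5), (6, 1)),
         (9, (7, 9), (9, 6))))
```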

A Problem With the Median

Calculating the median value for a list has a very subtle problem. Consider the points [(2, 3), (2, 4), (4, 3)]. In this list there are no points with an x-value less than the median x-value (2). Similarly, there is no point with a y-value less than the median y-value (3). Because of this, we cannot partition the list in two using either the x-values or the y-values, so the algorithm for building the kD tree would recurse forever.

Greg Ewing points out that this problem can be solved by checking to see if the median x-value or y-value is larger than the smallest x-value or y-value respectively. If the median is not larger, add one to the median and carry on as normal.
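In Python, the fix is a small adjustment to the median calculation. This is a sketch; the helper name split_median is mine, not from the original code.

```python
def split_median(points, axis):
    """Return a median for the given axis (0 = x, 1 = y) that is
    guaranteed to put at least one point on each side."""
    values = sorted(p[axis] for p in points)
    median = values[len(values) // 2]  # the upper median
    if median == values[0]:  # no value is smaller than the median...
        median += 1          # ...so add one to it, per Ewing's fix
    return median
```

For the problem points [(2, 3), (2, 4), (4, 3)] this yields an x-median of 3, which partitions the list into [(2, 3), (2, 4)] and [(4, 3)].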

Nearest-Neighbour Search

A nearest-neighbour search is used to find which point in a set is closest to another point. It is used to answer questions such as which pub is closest to my current location?

The following is the pseudocode for a nearest-neighbour search for a two-dimensional kD tree. It is quite similar to a search of a binary-search tree. The main difference is that we have to keep track of the depth in the tree, otherwise we will not know if we should compare the x-value of the point or the y-value of the point.

tuple function search_kd_tree(int depth, tuple tree, tuple point):
    if tree is a single point:
        return the point

    medianValue = the value of the root node of tree

    if depth is even:
        comparisonValue = the x-value of the point
    else:
        comparisonValue = the y-value of the point

    if comparisonValue < medianValue:
        subtree = left-hand subtree of tree
    else:
        subtree = right-hand subtree of tree

    return search_kd_tree(depth + 1, subtree, point)
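In Python, using the same tuple representation as the build sketch, the search looks roughly like this. Distinguishing leaves (two-element points) from internal nodes (three-element tuples) by length is a convenience of this sketch.

```python
def search_kd_tree(tree, point, depth=0):
    """Descend the kD tree towards the given point, comparing the
    x-value at even depths and the y-value at odd depths."""
    if len(tree) == 2:  # a two-element tuple is a leaf point
        return tree

    median_value = tree[0]
    comparison_value = point[depth % 2]  # x when depth is even, else y

    if comparison_value < median_value:
        subtree = tree[1]  # left-hand subtree
    else:
        subtree = tree[2]  # right-hand subtree

    return search_kd_tree(subtree, point, depth + 1)
```

Searching the tree from the earlier example for (5, 7) descends right, right, then left, arriving at the leaf (7, 9), which matches the walkthrough below.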
    

Example

The kD tree built in the earlier example.

For an example search, we will search the kD tree we built in the earlier example for the point nearest to (5, 7).

Step One

The search of the kD tree after the first step.

First we compare the x-value of the point we are searching for (5) with the value of the root of the tree (5). The x-value is compared because the root of the tree is at depth zero, which we are treating as even. As the x-value of the point is equal to the value of the root, we descend into the right-hand subtree.

Step Two

The search of the kD tree after the second step.

Next we compare the y-value of the point we are searching for (7) with the value of the current node (6). As the y-value is greater we descend into the right-hand subtree of the current node.

Step Three

The search of the kD tree after the third step.

In the final step we descend into the left-hand subtree of the node, as the x-value of the point (5) is less than the value of the current node (9). As the subtree is a leaf, we return its value, (7, 9), as the nearest neighbour to the point we are searching for, (5, 7). (Strictly speaking, this single descent returns the leaf in the same region as the query point, which is a good candidate but not always the true nearest neighbour; here (5, 5) is in fact slightly closer to (5, 7).)

An Optimised Search Algorithm

Jason Smith pointed out that we do not have to keep track of the current depth in the tree when performing the search. Instead, we compare the median only with the first of the two search coordinates, and swap them on each recursive call, as follows.

tuple function search_kd_tree(int scalarA, int scalarB, tuple tree):
    if tree is a single point:
        return the point

    medianValue = the value of the root node of tree

    if scalarA < medianValue:
        subtree = left-hand subtree of tree
    else:
        subtree = right-hand subtree of tree

    return search_kd_tree(scalarB, scalarA, subtree)

The above algorithm is faster than the previous one because it reduces the number of if-statements. It also uses less memory, as we no longer keep track of the current depth in the tree.
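A Python sketch of the depth-free search, again over the tuple representation used earlier; the only state passed down is the pair of coordinates, swapped on every call.

```python
def search_kd_tree(scalar_a, scalar_b, tree):
    """Depth-free kD tree search: compare scalar_a against the
    median, then swap the two coordinates for the next level down,
    so the comparison alternates between x and y automatically."""
    if len(tree) == 2:  # a two-element tuple is a leaf point
        return tree

    median_value = tree[0]
    subtree = tree[1] if scalar_a < median_value else tree[2]

    return search_kd_tree(scalar_b, scalar_a, subtree)
```

The initial call passes the x-coordinate first, e.g. search_kd_tree(5, 7, tree) for the point (5, 7).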

Source Code

This software is licensed under the CC-GNU GPL.

The kD tree source code (10K of Python) implements the algorithms above for building the kD tree and performing a nearest-neighbour search. The code uses the true median to partition the points into two subsets, rather than estimating the median by adding the first and last values together and dividing by two. Using the true median produces better-balanced trees, so searching is faster, but calculating the true median requires the list of points to be repeatedly sorted and re-sorted, which is quite slow.

Bibliography

Knuth, Donald, The Art of Computer Programming, Volume Two, Third Edition, Addison-Wesley, Reading, Massachusetts, 1997.

Marner, Jacob, Computational Geometry, Web Page, May 2000.

Sedgewick, Robert, Algorithms, Second Edition, Addison-Wesley, Reading, Massachusetts, 1988.