Implementing the Pseudo Priority R-Tree (PR-Tree)A Toy Implementation for Calculating Nearest Neighbour on Points in the X-Y PlaneFinal project report for UVic's CSC 520 graduate course, "Analysis of Algorithms." April 18th, 2011 AbstractIn this report we develop a 2-dimensional implementation of Arge et Al.'s Priority R-Tree to index a collection of points in the X-Y plane. We query the index with a randomly chosen new point to discover its nearest neighbour. We consider this a toy implementation since the PR-Tree (and R-Tree's in general) are designed to index axis-aligned rectangles (with potential overlap) in higher-dimension. However, we believe the implementation presented here provides a good starting point for programmers interested in understanding and exploring this intriguing data structure. 1. Introduction and BackgroundTypical algorithms presented in undergraduate computer science curriculums often assume all data fits in memory. Of course this assumption fails in many real world systems. Very large data sets exist that are orders of magnitude larger than a single computer's memory, and the majority of data in these cases must reside on slower storage media, such as solid-state-flash, hard disk, optical disk, tape drive, or network drives. The performance disparities between main-memory and secondary media, and in some cases, even physical characteristics (e.g. platter rotation and geometry in a hard-disk; sequential nature of tape) motivate the need for specialized algorithms and data structures that best suit each medium.
Just as the B-Tree stores proximate values within the same block on disk, the R-Tree stores proximate shapes together. The R-Tree encodes spatial "nearness" via a single concept: the axis-aligned bounding rectangle.
The R-Tree makes no guarantees about the efficacy of the bounding. The only guarantee is that each node in the tree will contain no more than B rectangles, where B is the block-size, (number of rectangles stored in each leaf on disk). This overflow constraint is again similar to the B-Tree, where each node can contain no more than B values, but an important difference between the two data structures also emerges. With a B-Tree, the remedy for an overfull node is evident: split the node into two nodes through the median value. Whereas with the R-Tree, there is no obvious remedy. How should one best partition B+1 smaller rectangles into two larger bounding rectangles such that the partition minimizes future disk accesses? The PR-Tree, by considering each rectangle as a 4-dimensional point (xmin,ymin,xmax,ymax) in a kd-tree, is able to achieve a splitting optimum that guarantees good asymptotic performance (similar to the kd-tree) for subsequent accesses. Unfortunately this kd-tree technique only works during the bulk-loading phase, and is unsuitable for maintaining a dynamic index. Thus there is no support for adhoc additions and deletions after the initial construction of the PR-Tree. 2. Problem Definition
2.1 Nearest RectangleWe employ an heuristic for finding the closest point: the closest rectangle to our query q = (x,y) probably contains the closest point. Since an R-Tree can store only axis-aligned rectangles, calculating the distance between the query and all rectangles at the current level of the tree is straight-forward, as shown in Figures 3 and 4. Initially we consider the axis-aligned possibilities {(a,y), (c,y), (x,b), (x,d)} since these will be closer if they lie on the rectangle. If none of these points lay on the rectangle, as shown in Figure 3, we look at the corners. The code listing in Figure 5 shows the approach implemented in Java.
public class Rectangle { int a, b, c, d; public boolean containsPoint(long x, long y) { return a <= x && x <= b && c <= y && y <= d; } public final long squaredDistanceTo(long x, long y) { if (containsPoint(x,y)) { return 0; } long aSquared = (x-a)*(x-a); long bSquared = (y-b)*(y-b); long cSquared = (x-c)*(x-c); long dSquared = (y-d)*(y-d); // Try perpendicular edges: (x,d), (x,b), (a,y), (c,y). long min = Long.MAX_VALUE; if (containsPoint(x,d)) { min = Math.min(min, dSquared); } if (containsPoint(x,b)) { min = Math.min(min, bSquared); } if (containsPoint(a,y)) { min = Math.min(min, aSquared); } if (containsPoint(c,y)) { min = Math.min(min, cSquared); } if (min != Long.MAX_VALUE) { return min; } // short circuit // Try rectangle corners. min = Math.min(min, aSquared+bSquared); min = Math.min(min, aSquared+dSquared); min = Math.min(min, cSquared+bSquared); min = Math.min(min, cSquared+dSquared); return min; } }
Figure 5: A Java implementation to calculate the squared distance between a query point q = (x,y)
and a rectangle.
2.2 R-Tree Best-First SearchOur search algorithm, for traversing our PR-Tree, explores neighbouring rectangles from closest to farthest. This is a standard "best-first" greedy search based on the heuristic that the closest rectangle probably contains the closest point. For each rectangle explored, its contained points are loaded from disk, and a sequential scan is performed on these to determine the nearest point to the query point inside the rectangle. Throughout the traversal a 'best candidate' is maintained. We use the distance between the query point and the 'best known' candidate to prune our search tree at each level: should any unexplored rectangles lie outside this distance, we omit them from our search.Recall that in an R-Tree each leaf resides on disk and contains at most B points. Since B is relatively small compared to our data set, the sequential scans applied inside each leaf node have negligible cost compared to the cost of a disk access. More efficient algorithms to scan the points inside the leaves are unnecessary; in this way our R-Tree search is similar to a standard B-Tree search (B-Tree leaves usually store sorted lists of values, but this is unnecessary). Figures 6.1 through 6.4, below, illustrate a best-first search on the final level of an example R-Tree.
3. Implementation3.1 Bulk-LoadTo build the pseudo PR-Tree in the first place, the initial algorithm presented in [6] is straight-forward. As in [6], we employ a top-down approach. The lowest levels of the tree may contain sub-trees with fewer than 4 leaf-nodes; the search procedure should guard against this possibility. Algorithm to build Pseudo PR-Tree on n points in the x-y plane:
3.2 QueryOur toy implementation, a pseudo PR-Tree to index points on the x-y plane, differs slightly from the general R-Tree presented earlier in this report. In a proper R-Tree all leaves reside on the final level, whereas in the pseudo PR-Tree, each level contains 2d leaf nodes (d=2 in our case), and two sub-trees. Since the sub-trees are tightly enclosed by bounding rectangles, the best-first greedy search outlined before is still applicable, with one complication: only the leaf nodes perform sequential scans; the sub-trees employ recursive scans. At each level of the tree, a query may fall into one of two cases:
A Java implementation of our PR-Tree search algorithm is given, in Figure 9. public class PRTree implements Rectangle { /** * Returns a Comparator object that orders rectangles * based on their distance from the supplied (x,y) point. * * @param x coordinate to measure nearness from. * @param y coordinate to measure nearness from. * @return A Comparator that returns 1, 0, or -1 based on how two supplied * rectangles compare to each other. */ private static Comparator<Rectangle> closenessComparator( final int x, final int y ) { return new Comparator<Rectangle>() { public int compare(Rectangle r1, Rectangle r2) { if (r1 == r2) { return 0; } // null is always far away (bigger) if (r1 == null) { return 1; } if (r2 == null) { return -1; } long a = r1.squaredDistanceTo(x,y); long b = r2.squaredDistanceTo(x,y); return a > b ? 1 : a == b ? 0 : -1; } }; } /* Here is the PR-Tree implementation: Two sub-trees and four priority leaves. */ private Leaf[] priorityLeaves = new Leaf[4]; private PRTree2D prTree1; private PRTree2D prTree2; public Rectangle queryNearest(int x, int y) { Comparator<Rectangle> c = closenessComparator(x, y); return queryNearest(c, x, y); } public Rectangle queryNearest(Comparator c, int x, int y) { Rectangle[] currentLevel = { priorityLeaves[0], priorityLeaves[1], priorityLeaves[2], priorityLeaves[3], prTree1, prTree2 }; // Apply our heuristic: the closest rectangle probably // contains the closest point. Arrays.sort(currentLevel, c); Rectangle bestCandidate = null; long bestSquaredDistance = Long.MAX_VALUE; for (Rectangle r : currentLevel) { // Notice the pruning based on bestSquaredDistance. if (r != null && r.squaredDistanceTo(x,y) < bestSquaredDistance) { // This is a recursive call when (r instanceof PRTree), // otherwise it's a sequential scan of the leaf points. Rectangle candidate = r.queryNearest(c, x, y); // Should we update our bestCandidate? long d = candidate.squaredDistanceTo(x,y); if (d < bestSquaredDistance) { bestCandidate = candidate; bestSquaredDistance = d; } } } return bestCandidate; }
Figure 9: A Java implementation of a best-first greedy search to find the closest point to a query point q = (x,y)
in a PR-Tree.
4. Evaluation4.1 Correctness
4.2 PerformancePerformance numbers come from running a pure-memory implementation (no disk accesses) on an Intel Core i3 2.26ghz laptop with 8GB of ram. The geographic space was limited to a 300,000 by 300,000 2-d array of possible locations for each run. The times reported represent the best observed of 5 runs. We ran the Oracle Java 6 profiler to locate any performance bottlenecks in our implementation, and the majority of execution time was spent partitioning the data, especially the partitioning into halves. We used a partitioning algorithm based on heap-sort (averaging over 8 comparisons per item in our largest run), whereas an algorithm based on Floyd-Rivest's SELECT would likely speed up our implementation significantly (literature suggests it usually requires less than 2-comparisons per item).
|