One generalization of binary trees is the k-d tree, which stores k-dimensional data. Every internal node of a k-d tree indicates the dimension (1, 2, ...) and the value in that dimension that it discriminates by. An internal node has two children: one stores data that are less than or equal to that value in that dimension, and the other stores data that are greater. For example, if the node discriminates on dimension 2 (the y coordinate) with value 10.7, then one child is for data with y value less than or equal to 10.7, and the other child is for data with y value greater than 10.7. Each leaf node represents a bucket containing no more than b elements of k-dimensional data. All data are found in the leaves.
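As a small illustration of the dimension-2 example above, the following gst sketch decides which child a point belongs to. The variable names, the sample point, and the choice to call the less-than-or-equal child "left" are assumptions made for illustration; they are not part of the required design.

```smalltalk
"Hypothetical names; the sample point is made up."
| dimension value point side |
dimension := 2.                 "discriminate on the second coordinate (y)"
value := 10.7.                  "discriminating value stored in the node"
point := #(3.0 10.7 8.2).       "a made-up 3-dimensional point"

"Less than or equal goes to one child; greater goes to the other."
side := (point at: dimension) <= value
    ifTrue: ['left']
    ifFalse: ['right'].
side displayNl.
```

Because the point's y value equals the discriminating value, it falls on the less-than-or-equal side.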
For this assignment, k is 3; that is, we will only be dealing with three-dimensional data. Dimension numbers are therefore 1, 2, and 3.
There are several strategies for building k-d trees. The preprocessing method (1) accumulates all the data in an array, (2) finds the best dimension to discriminate on, namely, the one with the widest range, (3) finds the best value of that dimension to discriminate on, namely, the median value in that dimension, (4) separates the data into two subarrays based on that discriminant, and (5) recurses on the subarrays. Recursion terminates when an array has size b or smaller. One can also devise online methods that add to existing trees.
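Steps 2 through 4 of the preprocessing method can be sketched on a handful of points as follows. This is only a sketch under stated assumptions: each point is represented as an Array of three numbers, the six points are invented, all variable names are hypothetical, and for an even number of values the lower of the two middle values is taken as the median.

```smalltalk
"A sketch of steps 2-4 on six made-up points; all names are hypothetical."
| points widest bestDim sorted split left right |
points := OrderedCollection new.
points
    add: #(1 20 5); add: #(2 40 6); add: #(3 10 7);
    add: #(4 30 5); add: #(5 60 6); add: #(6 50 7).

"Step 2: find the dimension whose values span the widest range."
widest := -1.
1 to: 3 do: [:dim |
    | lo hi |
    lo := points first at: dim.
    hi := lo.
    points do: [:p |
        lo := lo min: (p at: dim).
        hi := hi max: (p at: dim)].
    hi - lo > widest
        ifTrue: [widest := hi - lo. bestDim := dim]].

"Step 3: take the median in that dimension (the lower one for even sizes)."
sorted := (points collect: [:p | p at: bestDim]) asSortedCollection asArray.
split := sorted at: (sorted size + 1) // 2.

"Step 4: separate the points into two groups around the discriminant."
left := points select: [:p | (p at: bestDim) <= split].
right := points reject: [:p | (p at: bestDim) <= split].

bestDim printNl.
split printNl.
left size printNl.
right size printNl.
```

For these points the widest range is in dimension 2, the discriminant is 30, and each side receives three points. Note that if many points share the discriminating value, one side can end up empty, so a recursive version needs to guard against a split that does not shrink the array.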
Write a program in Smalltalk that (1) reads a list of 1000 3-dimensional data values, (2) builds a k-d tree with those values, with b set to 10, using the preprocessing method (you may use the mean instead of the median to find the best value to discriminate on), and (3) reads an additional 10 3-dimensional data values, called probes, and for each probe, lists all the data values stored in the tree in the bucket where the probe would be found if it were in the tree.
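Finding the bucket for a probe amounts to walking down from the root, comparing one coordinate at each internal node. The sketch below does this on a tiny hand-built tree. The representation is an assumption chosen for brevity: an internal node is an Array of (dimension, discriminant, left, right) and a bucket is an OrderedCollection of points. The tree contents and all names are invented.

```smalltalk
"Hand-built toy tree; the node layout is an assumption for illustration."
| leafA leafB leafC inner root node probe |
leafA := OrderedCollection new.
leafA add: #(1 2 3); add: #(4 5 9).
leafB := OrderedCollection new.
leafB add: #(8 3 1).
leafC := OrderedCollection new.
leafC add: #(2 20 7); add: #(6 30 4).

"Internal node: Array of (dimension, discriminant, left, right)."
inner := Array with: 1 with: 5 with: leafA with: leafB.
root := Array with: 2 with: 10 with: inner with: leafC.

"Descend from the root until a bucket (an OrderedCollection) is reached."
probe := #(7 4 0).
node := root.
[node class == Array] whileTrue: [
    node := (probe at: (node at: 1)) <= (node at: 2)
        ifTrue: [node at: 3]
        ifFalse: [node at: 4]].

"List every point in the bucket where the probe would be stored."
node do: [:p | p printNl].
```

Here the probe has y value 4, which is not greater than 10, so the search goes to the inner node; its x value 7 is greater than 5, so it ends in the bucket holding the single point with coordinates 8, 3, 1.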
Test your program both on your own data and on the data in
http://www.cs.uky.edu/~raphael/courses/CS450/asg.smalltalk.data.
You don't have to construct your program the way I did, but to give you a start, I describe my program here.
Two methods added to Array. One computes the range of values; the other averages the values.
A class DataStore able to hold up to 1000 points. It uses three arrays, each of size 1000. It has a method to initialize from a file, a method (a "putter") to insert a point, a method (a "getter") to report the value of a given point, a method to report the number of points, and a method to return a quadruple (dimension, discriminant, left, right) that subdivides the DataStore instance into two new instances.
A class KDTree with a putter method to insert all the data from a DataStore and a method to print the contents of the bucket determined by searching for a point.
A main program that fills a DataStore with the first 1000 points from the file, constructs a KDTree from that DataStore, then repeatedly reads points and probes for them.
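The two Array helpers mentioned in the description above (one for the range, one for the average) might look roughly like this. The selectors valueRange and valueMean are hypothetical, since the description does not give the actual method names, and the code assumes a non-empty array of numbers.

```smalltalk
!Array methodsFor: 'kd-tree helpers'!

valueRange
    "Hypothetical helper: answer the largest element minus the smallest.
     Assumes the receiver is non-empty."
    | lo hi |
    lo := self first.
    hi := self first.
    self do: [:each |
        lo := lo min: each.
        hi := hi max: each].
    ^hi - lo
!

valueMean
    "Hypothetical helper: answer the arithmetic mean.
     Integer elements may yield a Fraction."
    ^(self inject: 0 into: [:sum :each | sum + each]) / self size
! !

#(3 9 4 10) valueRange printNl.
#(3 9 4 10) valueMean printNl.
!
```

For the sample array the range is 7 and the mean is the Fraction 13/2. Sending asFloat to the result is one way to get a floating-point discriminant if that is preferred.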
The Smalltalk you will use is called gst. It is available in the
Multilab and the CSLab.
You will need code to open a file, read tokens, and convert them to integers.
This code shows you how to read the first ten space-delimited
tokens from a file, convert them to integers, and print them.
|myTokens|
myTokens := (TokenStream new)
    setStream: (FileStream open: 'asg.smalltalk.data' mode: 'r').
10 timesRepeat: [myTokens next asInteger printNl].
!
This assignment is due at the start of class time on the day indicated
in the syllabus. See the syllabus for the late policy. Submit the assignment
by email to raphael@cs.uky.edu.