Instance-based learning
In this project, you will implement an efficient instance-based learner.
• Read (or skim) Section 2.2 of the book.
• Implement a k-nn instance-based learner. To help you get started, here is some inefficient code to
find the k-nearest neighbors of a point. (Disclaimer: this code has not been tested. You may need to
debug it.)
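#include <algorithm>
#include <utility>
#include <vector>
using std::pair;
using std::vector;

// Orders (index, distance) pairs by ascending distance.
bool my_comparator(const pair<size_t,double>& a, const pair<size_t,double>& b)
{
    return a.second < b.second;
}

// Vec, Matrix, and compute_distance are assumed to come from the course code.
void find_neighbors_by_brute_force(const Vec& point,
    const Matrix& data, size_t k, vector<size_t>& outIndexes)
{
    // Measure the distance from the query point to every row.
    vector< pair<size_t,double> > index_dist;
    for(size_t i = 0; i < data.rows(); i++)
    {
        double dist = compute_distance(point, data[i]);
        index_dist.push_back( pair<size_t,double>(i, dist) );
    }
    std::sort(index_dist.begin(), index_dist.end(), my_comparator);

    // Never ask for more neighbors than there are rows.
    k = std::min(k, index_dist.size());
    outIndexes.resize(k);
    for(size_t i = 0; i < k; i++)
        outIndexes[i] = index_dist[i].first;
}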
For continuous values, use Mahalanobis distance.
For categorical values, use Hamming distance.
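As a starting point, the two distance measures might be sketched as follows. This is only a sketch under stated assumptions: the function names are hypothetical, categorical values are assumed to be stored as integer codes, and the continuous case uses plain Euclidean distance, which is the special case of Mahalanobis distance with an identity covariance matrix. A full Mahalanobis distance would also need the inverse covariance of the data, and how to combine the two measures for mixed data is left open here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hamming distance: the number of attributes whose categories differ.
// Assumption: categories are stored as integer codes.
std::size_t hamming_distance(const std::vector<int>& a, const std::vector<int>& b)
{
    assert(a.size() == b.size());
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < a.size(); i++)
        if (a[i] != b[i])
            mismatches++;
    return mismatches;
}

// Euclidean distance, i.e. Mahalanobis distance with an identity
// covariance matrix (an assumption made to keep the sketch short).
double euclidean_distance(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); i++)
    {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}
```

With the category a coded as 0 and b as 1, hamming_distance({0,0,0,0,0}, {0,1,0,0,0}) counts one mismatch.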
• Please make sure your instance-based learner passes the following unit tests:
First, train on this data,
@relation knn_train_features
@attribute x real
@attribute y real
@data
0.9,3
2,2.01
4,2
2,2.5
0,0
3,0
1,5
0,2
@relation knn_train_labels
@attribute class {a,b}
@data
a
a
a
b
a
a
a
a
then test on this data using k=3 and a weight exponent of 1 (linear weighting):
@relation knn_test_features
@attribute x real
@attribute y real
@data
2,3
2,4
0,0
@relation knn_test_labels
@attribute class {a,b}
@data
b
a
a
It should correctly classify all three points. If not, try doing it by hand on a piece of graph paper.
Then, see where your code does something different.
Here is another test. First, train on this data, (the last two columns are labels),
@relation knn_train_features
@attribute f1 {a,b}
@attribute f2 {a,b}
@attribute f3 {a,b}
@attribute f4 {a,b}
@attribute f5 {a,b}
@data
a,b,a,a,a
a,a,a,b,b
b,b,a,b,a
b,b,b,b,b
b,a,a,b,b
a,a,a,b,a
@relation knn_train_labels
@attribute l1 real
@attribute l2 real
@data
2,3
2.5,2.5
37,14
29,19
71,11
3,2
then test on this data with k=3,
@relation knn_test_features
@attribute f1 {a,b}
@attribute f2 {a,b}
@attribute f3 {a,b}
@attribute f4 {a,b}
@attribute f5 {a,b}
@data
a,a,a,a,a
The 3-nearest neighbors should be {0,1,5}. The distances should be {1,2,1}. Each neighbor's label is weighted by 1/distance, so the predicted label vector should be {(1/1*2+1/2*2.5+1/1*3)/(1/1+1/2+1/1) = 2.5, (1/1*3+1/2*2.5+1/1*2)/(1/1+1/2+1/1) = 2.5}.
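The weighted average in this test can be sketched as below. The function name is hypothetical, and the small floor on the denominator is an assumption added so that a neighbor at distance 0 (as happens in the first test) does not cause a division by zero; you may prefer to handle that case differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Distance-weighted average of the neighbors' labels.
// Each weight is 1 / distance^exponent; an exponent of 1 gives linear weighting.
// Assumption: the 1e-12 floor guards against a zero distance.
double weighted_average(const std::vector<double>& distances,
                        const std::vector<double>& labels,
                        double exponent)
{
    assert(distances.size() == labels.size() && !labels.empty());
    double weightedSum = 0.0;
    double weightTotal = 0.0;
    for (std::size_t i = 0; i < labels.size(); i++)
    {
        double w = 1.0 / std::max(std::pow(distances[i], exponent), 1e-12);
        weightedSum += w * labels[i];
        weightTotal += w;
    }
    return weightedSum / weightTotal;
}
```

Calling weighted_average({1,2,1}, {2,2.5,3}, 1.0) reproduces the first predicted label, 2.5. For a categorical label, the same weights could be summed per class instead of averaged.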
• Here is some medical data.
Train your k-nn instance-based learner using the training features and labels.
Classify the test features.
Report the number of misclassifications with 3 neighbors using linear weighting.
(The correct answer is greater than 0 and less than 100.)
A good way to make sure you do it right is to not even load the test labels until after you have made all of your predictions.
• Implement a kd-tree that can be used to efficiently find the k-nearest neighbors of a vector.
Each leaf node in your tree should store the indexes (not the actual vectors) that it encloses.
Here are some example data structures that might be helpful for getting started:
// Base class for all kd-tree nodes.
class KdNode
{
public:
    virtual ~KdNode() {}
    virtual bool isLeaf() = 0;
};

// An interior node splits its points on one attribute.
class KdNodeInterior : public KdNode
{
public:
    size_t column;            // attribute to split on
    double value;             // split threshold
    KdNode* less_than;        // points with point[column] < value
    KdNode* greater_or_equal; // points with point[column] >= value

    virtual ~KdNodeInterior()
    {
        delete less_than;
        delete greater_or_equal;
    }
    virtual bool isLeaf() { return false; }
};

// A leaf node stores the indexes of the points it encloses.
class KdNodeLeaf : public KdNode
{
public:
    vector<size_t> m_pointIndexes;
    virtual ~KdNodeLeaf() {}
    virtual bool isLeaf() { return true; }
};

class KdTree
{
public:
    Matrix m_points;
    KdNode* m_pRoot;

    KdTree(Matrix& points)
    {
        m_points.copy(points);
        vector<size_t> indexes;
        for(size_t i = 0; i < points.rows(); i++)
            indexes.push_back(i);
        m_pRoot = buildKdTree(indexes);
    }

    ~KdTree()
    {
        delete m_pRoot;
    }

    // Returns the root of the subtree built from the given point indexes.
    KdNode* buildKdTree(vector<size_t>& indexes);
    void findNeighbors(size_t k, const Vec& point,
        vector<size_t>& outNeighborIndexes);
};
Here is some pseudocode for building a kd-tree:
    Let P be a set of point indexes.
    if P contains fewer than 8 (or so) point indexes, then
        return a leaf node containing all of P.
    else
        Measure the mean and deviation in each dimension.
        Let i be the dimension with the largest deviation.
        Divide P into two sets by dividing at the mean in attribute i.
        Recursively build the two child nodes.
        return an interior node that joins those two child nodes.
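The build steps above might look roughly like this in C++. This is a sketch, not the required design: it uses a single hypothetical Node struct and a vector of vectors in place of the course's Matrix, it compares variances rather than deviations (the ordering is the same), and it adds one assumption that is not in the pseudocode, namely that a split which leaves one side empty (for example, when all points are identical) falls back to a leaf so that the recursion always terminates.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

typedef std::vector<std::vector<double> > Points; // stand-in for Matrix

struct Node
{
    std::size_t column = 0;                  // interior: attribute to split on
    double value = 0.0;                      // interior: split threshold
    std::unique_ptr<Node> less_than, greater_or_equal;
    std::vector<std::size_t> pointIndexes;   // leaf: enclosed point indexes
    bool isLeaf() const { return !less_than; }
};

std::unique_ptr<Node> build(const Points& pts, const std::vector<std::size_t>& P)
{
    std::unique_ptr<Node> node(new Node());
    if (P.size() < 8) { node->pointIndexes = P; return node; }

    // Find the dimension with the largest variance, and its mean.
    std::size_t dims = pts[P[0]].size(), best = 0;
    double bestVar = -1.0, bestMean = 0.0;
    for (std::size_t d = 0; d < dims; d++)
    {
        double mean = 0.0, var = 0.0;
        for (std::size_t i : P) mean += pts[i][d];
        mean /= P.size();
        for (std::size_t i : P) var += (pts[i][d] - mean) * (pts[i][d] - mean);
        if (var > bestVar) { bestVar = var; best = d; bestMean = mean; }
    }

    // Divide P at the mean of that dimension.
    std::vector<std::size_t> lo, hi;
    for (std::size_t i : P)
        (pts[i][best] < bestMean ? lo : hi).push_back(i);

    // Assumption: a degenerate split becomes a leaf instead of recursing forever.
    if (lo.empty() || hi.empty()) { node->pointIndexes = P; return node; }

    node->column = best;
    node->value = bestMean;
    node->less_than = build(pts, lo);
    node->greater_or_equal = build(pts, hi);
    return node;
}

// Counts the point indexes stored under a node (handy for sanity checks).
std::size_t countPoints(const Node& n)
{
    if (n.isLeaf()) return n.pointIndexes.size();
    return countPoints(*n.less_than) + countPoints(*n.greater_or_equal);
}
```

A quick sanity check is that countPoints on the root equals the number of indexes you passed in, since every index should end up in exactly one leaf.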
Here is some pseudocode for finding neighbors using a kd-tree:
    Let p be a point for which we want to find the k-nearest neighbors.
    Let s be an initially empty priority queue of point indexes, sorted by distance to p.
    Let q be an initially empty priority queue of KdNodes, sorted by minimum distance to p.
    Add the root node of the kd-tree to q.
    while q has at least one node in it, do:
        pop n from q
        if n is a leaf node, then:
            add all the points in n to s
        else
            if s holds at least k points and the distance from p to n is greater than the distance from p to the k-th neighbor in s, then
                break
            add both child nodes of n to q
    return the k-nearest points in s
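One possible way to implement the queue s from the search pseudocode is a bounded max-heap that keeps only the k closest candidates. This is a variation, not what the pseudocode literally says (it adds every leaf point and trims at the end), and the class name KBest and its methods are hypothetical. The kthDistance method supplies the value that the pruning test compares against.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Keeps the k closest (distance, index) pairs seen so far.
class KBest
{
public:
    explicit KBest(std::size_t k) : m_k(k) { assert(k > 0); }

    // Offers a candidate; it is kept only if it is among the k closest so far.
    void offer(double dist, std::size_t index)
    {
        if (m_heap.size() < m_k)
            m_heap.push(std::make_pair(dist, index));
        else if (dist < m_heap.top().first)
        {
            m_heap.pop(); // evict the current farthest candidate
            m_heap.push(std::make_pair(dist, index));
        }
    }

    // Distance to the k-th nearest candidate, or infinity while fewer
    // than k candidates are held (so nothing gets pruned too early).
    double kthDistance() const
    {
        if (m_heap.size() < m_k)
            return std::numeric_limits<double>::infinity();
        return m_heap.top().first;
    }

    // Returns the held indexes ordered nearest first, and empties the set.
    std::vector<std::size_t> drain()
    {
        std::vector<std::size_t> out(m_heap.size());
        for (std::size_t i = out.size(); i > 0; i--)
        {
            out[i - 1] = m_heap.top().second;
            m_heap.pop();
        }
        return out;
    }

private:
    std::size_t m_k;
    // std::priority_queue is a max-heap, so top() is the farthest candidate.
    std::priority_queue<std::pair<double, std::size_t> > m_heap;
};
```

In the search loop, the break condition would then read: stop when the minimum distance from p to the popped node exceeds kthDistance().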
A good way to debug your kd-tree is to fill an n-by-2 matrix with random values. Next, find the
smallest value for n where your kd-tree gives different results from the brute-force method.
Finally, plot the random points on a piece of graph paper, then step through your code and see where it
does something wrong. After you get it working, crank up the value of n, until you have convinced
yourself that your kd-tree is bulletproof.
• Measure and report how much speedup your kd-tree gives relative to the brute-force approach.
For Java, System.nanoTime() is good for measuring time. For C++, here is some code:
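#ifdef WINDOWS
#include <windows.h>
#else
#include <sys/time.h>
#endif
#include <cstddef>
// Returns a timestamp in seconds. Subtract two calls to measure elapsed time.
double seconds()
{
#ifdef WINDOWS
    // GetTickCount() reports milliseconds.
    return (double)GetTickCount() * 1e-3;
#else
    // tv_usec holds microseconds.
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1e-6);
#endif
}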
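If your compiler supports C++11, a portable alternative is std::chrono. The sketch below is one way to do it; the function names seconds_steady and speedup are hypothetical. It uses steady_clock, which never jumps backward, so differences between two readings are safe to use as elapsed time.

```cpp
#include <cassert>
#include <chrono>

// Seconds elapsed on a monotonic clock since an arbitrary starting point.
// Only differences between two calls are meaningful.
double seconds_steady()
{
    using namespace std::chrono;
    return duration<double>(steady_clock::now().time_since_epoch()).count();
}

// Ratio of brute-force time to kd-tree time; a value above 1 means
// the kd-tree was faster.
double speedup(double bruteSeconds, double kdSeconds)
{
    assert(kdSeconds > 0.0);
    return bruteSeconds / kdSeconds;
}
```

To measure, read seconds_steady() before and after each approach over the same set of queries, then pass the two elapsed times to speedup. Timing many queries rather than one helps keep the kd-tree time well above the clock's resolution.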
• I don't want to wait to see your code do it the slow brute-force way,
so please turn in your code in a state that only uses the kd-tree.
(Just print the measured speedup as hardcoded output.)
Also, please keep your output simple.
(I don't need to see all your debugging spew.)
Then, submit a zip file containing your code and the data it needs to work.
(Do not submit any generated binary files.)
Make sure your zip file contains a file named build.bash (or build.bat) that builds your code.
Q and A
• How does one measure the distance between a point and a KdNode?
Consider this figure:
The kd-tree represented in this figure contains 3 interior nodes (A, B, and C) and four leaf nodes (E, F, G, and H).
Suppose we want to find the distance between p and H.
In order to find H, we must traverse the tree from A -> B -> C -> H.
Let v be a vector of size 2 (because there are two dimensions in this example). It stores the squared distance in each dimension.
Initialize v with zeros, such that v=[0, 0].
The distance from p to A is sqrt(0 + 0). This is correct because A encloses all points.
When we visit B, we update v to be [d1*d1, 0]. The distance from p to B is sqrt(d1*d1 + 0*0).
When we visit C, we update v to be [d1*d1, d2*d2]. The distance from p to C is sqrt(d1*d1 + d2*d2).
When we visit H, we update v to be [d3*d3, d2*d2]. The distance from p to H is sqrt(d3*d3 + d2*d2). (Since A and C both divide in the same dimension, we overwrite the old value in that dimension.)
The distance from p to E should be 0. The distance from p to F should be d1.
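The bookkeeping described above can be sketched as two small functions. The names child_offsets and min_distance are hypothetical, and the sketch assumes each interior node splits on one column at one value, as in the example data structures. When p lies on the same side of the split as the child being entered, v is passed down unchanged; otherwise the entry for that column is overwritten with the squared distance to the split.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the squared per-dimension offsets for one child of an interior node.
// v is the parent's offset vector, column and value describe the parent's
// split, and childIsLessThan says which child is being entered.
std::vector<double> child_offsets(const std::vector<double>& v,
                                  const std::vector<double>& p,
                                  std::size_t column, double value,
                                  bool childIsLessThan)
{
    assert(v.size() == p.size() && column < p.size());
    std::vector<double> out(v);
    bool pIsLessThan = p[column] < value;
    if (pIsLessThan != childIsLessThan)
    {
        // p is on the other side of the split, so the child can be no closer
        // than the split in this dimension. Overwrite any older value.
        double d = p[column] - value;
        out[column] = d * d;
    }
    return out;
}

// Minimum distance from p to the region a node encloses.
double min_distance(const std::vector<double>& v)
{
    double sum = 0.0;
    for (double x : v)
        sum += x;
    return std::sqrt(sum);
}
```

For example, with p = (0, 0) and a root that splits column 0 at 3, the greater_or_equal child gets v = [9, 0] and a minimum distance of 3, while the less_than child keeps v = [0, 0].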
• How does one measure the deviation of a categorical attribute?
Deviation is a measurement for continuous attributes.
A similar metric for categorical attributes is normalized entropy:
function computeEntropy(v)
    Let k be the number of categories.
    Let n be the number of samples.
    double h = 0.0;
    for i from 0 to k-1
        int c = 0;
        for j from 0 to n-1
            if v[j] == i
                c++;
        if c > 0
            double f = (double)c / n;
            h -= f * log(f) / log(k);
    return h;
Normalized entropy is not really the same thing as deviation, but it has several similar properties.
It always falls between 0 and 1, and larger values indicate that splitting on this attribute will provide more benefit for the kd-tree.
• Wouldn't it be like comparing apples to oranges if we compare deviations of normalized attributes against normalized entropies?
Yes. However, if you choose to divide on a suboptimal attribute, that will only make the kd-tree slower.
It should not make the kd-tree yield incorrect results.
• When the kd-tree divides on a categorical attribute, do we need more than two child nodes?
No. For categorical attributes, I would compare for equality, not inequality.
Example: If value == "potato", then go to left child, else go to right child.
