K Nearest Neighbor Classification Implementation and Tutorial

Requirements

To follow this article and tutorial you will need to refer to the source code of the implementation of K Nearest Neighbors. I have implemented this algorithm and made it publicly available on GitHub. You can get a copy of the code here: https://github.com/mhs321/MLModels. The K Nearest Neighbor model is in the NearestNeighbour.py file contained in the MLModels package.

In addition, it is ideal to use Jupyter Notebook or IPython, and to have a recent version of NumPy installed in your environment.

What is K Nearest Neighbors?

K Nearest Neighbors is a non-parametric model that can be used for classification or regression. In this article a classification model was implemented; however, the idea can be extended to regression problems.

K Nearest Neighbors works by first remembering all the features and outputs of the training data. Then, when you want to predict the class of an unseen event, the model looks at the k closest neighbors to the location of the unseen event and chooses the class that occurs most frequently within that neighborhood.
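
As a rough illustration, here is a minimal NumPy sketch of the prediction step for a single query point. The variable names are my own and are not part of the library; this is only meant to show the mechanism described above.

import numpy as np

# toy training set: 6 points with 2 features each, and their class labels
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

query = np.array([4, 5])  # the unseen event we want to classify
k = 3

distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))  # Euclidean distance to every training point
nearest = np.argsort(distances)[:k]                        # indices of the k closest neighbors
prediction = np.bincount(y_train[nearest]).argmax()        # majority vote among those neighbors
print(prediction)                                          # 1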

How to measure distance?

In our day-to-day lives we usually measure distance using the Euclidean distance, also known as the L2 norm. It is calculated using the equation below:

$$\|x\|_2 = \left(|x_1|^2+|x_2|^2+\cdots+|x_n|^2\right)^\frac{1}{2}\tag{1}$$

The formula squares every component of x, sums the squares, and finally takes the square root of the sum to obtain the Euclidean length of the vector. Note that the x on the left side of the equals sign represents a vector with n elements.
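
For example, for the vector x = (3, 4), equation (1) gives (3² + 4²)^(1/2) = 5. The same calculation in NumPy, as a quick check rather than part of the library:

import numpy as np

x = np.array([3, 4])
print(np.sqrt((np.abs(x) ** 2).sum()))  # 5.0, following equation (1)
print(np.linalg.norm(x, ord=2))         # 5.0, NumPy's built-in L2 norm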

In mathematics, and in the K Nearest Neighbor model, there are various ways to measure distance. These measures are used in Lp spaces and are named p-norms or Lp-norms. The general formula for the p-norm is:

$$\|x\|_p = \left(|x_1|^p+|x_2|^p+\cdots+|x_n|^p\right)^\frac{1}{p}\tag{2}$$

Now that we know the various ways to measure distance, you will need to choose which one to use when running the algorithm. When predicting, the algorithm measures the distance of the unseen element to every example in the training data. It then selects the K nearest neighbors, finds the class that occurs most commonly among them, and classifies the unseen event as that majority class.
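
To see how the choice of p changes the result, here is a small sketch of a general Lp distance function, i.e. equation (2) applied to the difference of two vectors. The function name is my own:

import numpy as np

def lp_distance(a, b, p):
    # Lp distance between two vectors, following equation (2)
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

a = np.array([1.0, 1.0])
b = np.array([4.0, 5.0])
print(lp_distance(a, b, 1))  # 7.0  (L1 / Manhattan distance)
print(lp_distance(a, b, 2))  # 5.0  (L2 / Euclidean distance)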

Using the NearestNeighbour Model

If you have not done so already, get a copy of the source code from GitHub and start a Jupyter Notebook / IPython session.

For this tutorial I will use the CIFAR10 dataset. You can read more about the data and download it from here: https://www.cs.toronto.edu/~kriz/cifar.html.

For this example you will need to import the following packages:

import numpy as np
import MLModels
from MLModels.NearestNeighbour import NearestNeighbor
import data_utils as utils

The data_utils module includes some useful utility functions to load the CIFAR10 data into a ready-to-use format for this model. You can load the data via the following command:

xtr, ytr, xte, yte = utils.load_CIFAR10('data/cifar-10-batches-py/')

I will assume that you extracted the data into the data/cifar-10-batches-py/ directory.
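
It is worth sanity-checking the shapes after loading. Assuming the loader returns the standard CIFAR10 split as arrays of 32 x 32 images with 3 colour channels, you should see something like:

print(xtr.shape, ytr.shape)  # (50000, 32, 32, 3) (50000,)
print(xte.shape, yte.shape)  # (10000, 32, 32, 3) (10000,)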

Currently the xtr and xte variables are 4-dimensional numpy arrays. In order to use the NearestNeighbor model, each image needs to be flattened into a single row vector, giving a 2-D array with one row per image.

Xtr_rows = xtr.reshape(xtr.shape[0], 32 * 32 * 3) # Xtr_rows becomes 50000 x 3072
Xte_rows = xte.reshape(xte.shape[0], 32 * 32 * 3) # Xte_rows becomes 10000 x 3072

The next step is optional, but the KNN model is very slow at prediction time because it needs to calculate the distance from each unseen event to every example in the training data, so prediction time grows with the size of both the test and training sets. In the following step I randomly select 1,000 examples from the test dataset to be used for prediction.

idx = np.random.choice(Xte_rows.shape[0], size=1000, replace=False) # sample 1000 distinct test indices
xte_sample = Xte_rows[idx, :]
yte_sample = yte[idx]

Next we will instantiate the Nearest Neighbor model and train the model using the training dataset.

nn = NearestNeighbor()
nn.train(Xtr_rows, ytr)

Next we will predict the classes of the test data using the default parameters of the predict method:

y_pred = nn.predict(xte_sample)

The default behaviour of predict is to classify each test example with the class of its single closest neighbour, using the L1 norm distance. You can adjust this by passing the additional parameters n (which Lp distance to use) and k (how many of the closest neighbors to select).
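
For example, to classify using the five closest neighbors under the L2 norm, the call would look something like this (assuming predict accepts n and k as keyword arguments, as described above):

y_pred_k5 = nn.predict(xte_sample, n=2, k=5) # L2 distance, majority vote over 5 neighbors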

Next we can check the accuracy with the following command:

print(nn.accuracy(y_pred, yte_sample))

For the random 1,000 images I used when I ran the code, I got an accuracy of 44% using the nearest neighbor and the L1 distance.

This is not great, but it is not bad either: there are 10 possible classes, so random guessing would on average achieve an accuracy of about 10%.

Next I tried using the L2 norm to see if that would change things:

y_pred = nn.predict(xte_sample, 2) # n=2 selects the L2 (Euclidean) distance

This returned an accuracy of 40%, a decline from the previous attempt.

How to improve?

What I hope became evident from the last section is that there are parameters in the KNN model that need to be tuned to achieve optimal results. The obvious one is the k parameter, which controls how many neighbors are used when classifying. The smaller k is, the more flexible the model becomes, but the higher its variance (overfitting). The larger k is, the lower the variance, but the higher the bias, which corresponds to underfitting the data.

In the end, a trade-off will need to be made between bias and variance.
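
A simple way to tune k is to hold out part of the training data as a validation set and compare a few values. Below is a minimal sketch, assuming the train/predict/accuracy interface used above and that predict accepts n and k as keyword arguments; note that each prediction pass over 49,000 training rows is slow, so you may want to subsample further.

# hold out the last 1,000 training rows as a validation set
Xval_rows, yval = Xtr_rows[49000:], ytr[49000:]

nn = NearestNeighbor()
nn.train(Xtr_rows[:49000], ytr[:49000]) # train on the remaining 49,000 rows

for k in [1, 3, 5, 10, 20]:
    y_val_pred = nn.predict(Xval_rows, n=1, k=k) # L1 distance, k neighbors
    print(k, nn.accuracy(y_val_pred, yval))      # pick the k with the best validation accuracy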

Conclusion

This concludes the tutorial on the KNN model. Have a play around and see what values of k and n yield good results for you. Please leave your findings in the comments below.
