
# Chapter 2: K-Nearest Neighbours

Let’s build a K-Nearest Neighbours model from scratch.

First, we will define a generic `KNN` class. Its constructor takes three parameters:

• The number of neighbours being used to make predictions
• The distance measure we want to use
• Whether or not we want to use weighted distances
```python
import numpy as np

from utils.distances import euclidean


class KNN:

    def __init__(self, k, distance=euclidean, weighted=False):
        self.k = k  # Number of neighbours used to make predictions
        self.distance = distance  # Distance measure to use
        self.weighted = weighted  # Whether or not to use weighted distances
```
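The `utils.distances` module is not shown in this chapter. As a rough sketch of what it is assumed to provide, the `euclidean` helper could be implemented as follows (the function name matches the import above; the implementation is a guess):

```python
import numpy as np

def euclidean(a, b):
    # Euclidean distance: square root of the sum of squared differences
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))
```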

Now we will define the `fit` function, which describes how to train the model. For a K-Nearest Neighbours model, training is trivial: all we need to do is store the training instances as the model’s parameters.

```python
    def fit(self, X, y):
        self.X_ = X
        self.y_ = y
        return self
```

Similarly, we can build an `update` function which will update the state of the model as more data points are provided for training. Training a model by feeding it data in a stream-like fashion is often referred to as online learning. Not all models allow for computationally efficient online learning, but K-Nearest Neighbours does.

```python
    def update(self, X, y):
        # Append the new batch to the stored training data
        self.X_ = np.concatenate((self.X_, X))
        self.y_ = np.concatenate((self.y_, y))
        return self
```
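The `update` step is nothing more than stacking the new batch under the stored one. A toy check with made-up arrays illustrates what `np.concatenate` does here:

```python
import numpy as np

# Initial training batch (what fit would store)
X1, y1 = np.array([[1.0], [2.0]]), np.array([0, 0])
# A later batch arriving from the stream
X2, y2 = np.array([[3.0]]), np.array([1])

X = np.concatenate((X1, X2))  # Stacks rows: shape (3, 1)
y = np.concatenate((y1, y2))  # Appends targets: [0, 0, 1]
```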

In order to make predictions, we also need to create a `predict` function. For a K-Nearest Neighbours model, a prediction is made in two steps:

• Find the K nearest neighbours by computing the distance from each stored instance to the data point we want to predict
• Given these neighbours and their distances, compute the predicted output
```python
    def predict(self, X):
        predictions = []

        for x in X:
            neighbours, distances = self._get_neighbours(x)
            prediction = self._vote(neighbours, distances)
            predictions.append(prediction)

        return np.array(predictions)
```

Retrieving the neighbours can be done by computing the distance between the data point and every instance stored in the model’s state. Once these distances are known, the K instances with the shortest distances to the example are returned.

```python
    def _get_neighbours(self, x):
        # Distance from x to every stored training instance
        distances = np.array([self.distance(x, x_) for x_ in self.X_])
        # Indices of the k smallest distances
        indices = np.argsort(distances)[:self.k]
        return self.y_[indices], distances[indices]
```
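The `argsort`-based retrieval can be sketched in isolation with a tiny made-up training set (the arrays below are illustrative, not from the chapter):

```python
import numpy as np

X_train = np.array([[0.0], [1.0], [5.0], [6.0]])
y_train = np.array([0, 0, 1, 1])
x = np.array([0.5])

# Euclidean distance from x to every stored instance
distances = np.array([np.sqrt(np.sum((x - x_) ** 2)) for x_ in X_train])
# Indices of the 2 nearest neighbours (k=2)
indices = np.argsort(distances)[:2]
neighbours, dists = y_train[indices], distances[indices]
# Both nearest instances belong to class 0, at distance 0.5 each
```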

In case we would like to use weighted distances, we need to compute the weights. By default, these weights are all set to 1, so that every instance counts equally. To weigh the instances, closer neighbours are typically favoured by giving them a weight equal to 1 divided by their distance.

If any neighbour has a distance of 0, we can’t divide by zero, so the zero-distance neighbours keep a weight of 1 and all other weights are set to 0. This is also how scikit-learn deals with this edge case, according to its source code.

```python
    def _get_weights(self, distances):
        weights = np.ones_like(distances, dtype=float)

        if self.weighted:
            if np.any(distances == 0):
                # Only zero-distance neighbours keep their weight of 1
                weights[distances != 0] = 0
            else:
                # Closer neighbours get larger weights
                weights /= distances

        return weights
```
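To see the two branches of the weighting logic in action, here is the same logic restated as a standalone function (a sketch that mirrors `_get_weights`, with `weighted` passed explicitly rather than read from the model):

```python
import numpy as np

def get_weights(distances, weighted=True):
    weights = np.ones_like(distances, dtype=float)
    if weighted:
        if np.any(distances == 0):
            # Zero-distance neighbours keep weight 1, everything else gets 0
            weights[distances != 0] = 0.0
        else:
            # Inverse-distance weighting
            weights /= distances
    return weights
```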

The only function we have yet to define is the `_vote` method that is called in `predict`. Depending on its implementation, K-Nearest Neighbours can be used for regression, classification, or even as a meta-learner.

# KNN for Regression

In order to use K-Nearest Neighbours for regression, the `_vote` method is defined as the average of the neighbours’ targets. In case weighting is used, it returns the weighted average, favouring closer instances.

```python
class KNN_Regressor(KNN):

    def _vote(self, targets, distances):
        weights = self._get_weights(distances)
        # Weighted average of the neighbours' targets
        return np.sum(weights * targets) / np.sum(weights)
```
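The weighted-average vote can be checked by hand on a few made-up neighbours (illustrative numbers, not from the chapter):

```python
import numpy as np

targets = np.array([10.0, 20.0, 40.0])
distances = np.array([1.0, 2.0, 4.0])

weights = 1.0 / distances  # [1.0, 0.5, 0.25]
prediction = np.sum(weights * targets) / np.sum(weights)
# Numerator: 10 + 10 + 10 = 30; denominator: 1.75; prediction ≈ 17.14
```

The closest neighbour pulls the prediction well below the plain average of 23.33, which is exactly the intended effect of inverse-distance weighting.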

# KNN for Classification

In the classification case, the `_vote` method uses a majority voting scheme. If weighting is used, each neighbour has a different impact on the prediction.

```python
class KNN_Classifier(KNN):

    def _vote(self, classes, distances):
        weights = self._get_weights(distances)

        prediction = None
        max_weighted_frequency = 0

        # Pick the class with the largest total weight among the neighbours
        for c in np.unique(classes):
            weighted_frequency = np.sum(weights[classes == c])

            if weighted_frequency > max_weighted_frequency:
                prediction = c
                max_weighted_frequency = weighted_frequency

        return prediction
```
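The weighted majority vote can likewise be sketched on its own with made-up neighbours. Note how a single close neighbour can outvote two more distant ones:

```python
import numpy as np

classes = np.array(['a', 'a', 'b'])
weights = np.array([0.2, 0.3, 0.6])  # 'b' is the closest neighbour

prediction, max_weighted_frequency = None, 0.0
for c in np.unique(classes):
    weighted_frequency = np.sum(weights[classes == c])
    if weighted_frequency > max_weighted_frequency:
        prediction, max_weighted_frequency = c, weighted_frequency
# 'a' totals 0.5, 'b' totals 0.6, so 'b' wins despite having fewer votes
```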