Chapter 3: K-Medians

As we mentioned before, K-Means is only guaranteed to converge using Euclidean distances. Therefore, it is highly discouraged to use other distance measures.

If other distance measures must be used, the algorithm needs to be slightly tweaked. One example is K-Medians. K-Medians is a variant of the K-Means algorithm which allows one to use Manhattan distance instead of Euclidean.

import numpy as np
import copy

from unsupervised.kmeans import KMeans
from utils.distances import manhattan

class KMedians(KMeans):

    def _distance(self, a, b):

        return manhattan(a, b)

In order to ensure convergence, the way centroids are defined is slightly modified. In K-Medians, a centroid is defined as the median point in a cluster. This means that the centroid, in each dimension, is composed of the median value of all points in that cluster.

    def _compute_centroids(self, X, labels):

        centroids = []

        for i in range(self.k):

            centroid = np.array([np.median(dim) for dim in X[labels == i].T])

        return np.array(centroids)

Thankfully, because K-Medians and K-Means share so much of their logic, this is all that is required to implement K-Medians!