Chapter 6: Evaluating Clustering

Machine learning is about getting better at a task, so we need to define what it means to be good. In the case of clustering, this is not always straightforward. There are two main evaluation settings, depending on whether the desired clustering is known. If it is known, we can measure how close our clustering comes to it; if it is not, we have to measure how consistent the clusters are internally.

Silhouette Index

The Silhouette index is a widely used metric for measuring how consistent a clustering is. For each data point, two quantities are computed: the intra-cluster distance (the mean distance to the other points in its own cluster) and the nearest-cluster distance (the mean distance to the points of the closest cluster it does not belong to). Intuitively, the index measures how well each data point fits within its own cluster versus how well it would fit in the nearest cluster it is not part of. The score lies between -1 and 1, and a score close to 1 is desired. A negative score typically means that the clustering could be improved by moving certain data points to their neighbouring cluster.
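For one data point i, the standard silhouette value is s(i) = (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster. A small worked example on a toy one-dimensional dataset (the points and labels are made up for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy data: two well-separated 1-D clusters (values chosen for illustration).
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Silhouette of the first point (index 0, which belongs to cluster 0).
x = X[0:1]
a = cdist(x, X[1:3], 'euclidean').mean()  # mean distance to own cluster: (1 + 2) / 2 = 1.5
b = cdist(x, X[3:6], 'euclidean').mean()  # mean distance to the other cluster: (10 + 11 + 12) / 3 = 11.0
s = (b - a) / max(a, b)
print(s)  # close to 1: the point sits firmly in its own cluster
```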

import numpy as np
from scipy.spatial.distance import cdist

def silhouette_index(X, y, distance):

    def intra_cluster_distance(sample_id):
        # Mean distance between this point and the other members of its own cluster.
        other_sample_ids = np.array([i for i in range(len(X)) if i != sample_id])
        cluster_ids = y[other_sample_ids]
        cluster_id = y[sample_id]
        x = X[sample_id]

        cluster = X[other_sample_ids][cluster_ids == cluster_id]

        if len(cluster) == 0:
            # The point is alone in its cluster.
            return 0
        else:
            distances = cdist([x], cluster, distance)
            return np.mean(distances)

    def nearest_cluster_distance(sample_id):
        # Mean distance to the closest cluster the point does not belong to.
        x = X[sample_id]
        dist = float('inf')

        for other_cluster_id in set(y[y != y[sample_id]]):
            cluster = X[y == other_cluster_id]
            new_dist = np.mean(cdist([x], cluster, distance))
            dist = min(dist, new_dist)

        return dist

    s = []

    for i in range(len(X)):
        a = intra_cluster_distance(i)
        b = nearest_cluster_distance(i)
        s.append((b - a) / max(a, b))

    return np.mean(s)

Rand Index

If there exists some gold-standard clustering, one can measure how close a computed clustering comes to it. The Rand index is a measure that achieves just that, and it is closely related to the accuracy measure for classification: it is the fraction of pairs of points on which the two labelings agree. There also exists a chance-corrected version, the Adjusted Rand index, which accounts for the agreement expected from random labelings.
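To see what is being counted, here is a small pair-counting example (the labels are invented for illustration). Every unordered pair of points either agrees between the two labelings (same cluster in both, or different clusters in both) or disagrees, and the Rand index is the fraction of agreeing pairs:

```python
from itertools import combinations

# Hypothetical gold-standard labels and predicted cluster labels.
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 2]

agree = 0
pairs = list(combinations(range(len(y_true)), 2))
for i, j in pairs:
    same_true = y_true[i] == y_true[j]
    same_pred = y_pred[i] == y_pred[j]
    agree += same_true == same_pred  # pair treated consistently by both labelings

print(agree / len(pairs))  # 5 of the 6 pairs agree: only the pair (2, 3) is split by y_pred
```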

from itertools import combinations

def rand_index(y_pred, y_true):

    # a: pairs in the same cluster in both labelings
    # b: pairs in different clusters in both labelings
    # c, d: pairs on which the two labelings disagree
    a = b = c = d = 0

    for i, j in combinations(range(len(y_pred)), 2):

        a += y_pred[i] == y_pred[j] and y_true[i] == y_true[j]
        b += y_pred[i] != y_pred[j] and y_true[i] != y_true[j]
        c += y_pred[i] == y_pred[j] and y_true[i] != y_true[j]
        d += y_pred[i] != y_pred[j] and y_true[i] == y_true[j]

    return (a + b) / (a + b + c + d)
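The chance-corrected Adjusted Rand index can be computed from a contingency table of the two labelings. This is a minimal sketch of the standard formula, not an optimized implementation; it assumes at least one of the labelings has more than one cluster (otherwise the denominator is zero):

```python
from collections import Counter

def comb2(n):
    # Number of unordered pairs among n items.
    return n * (n - 1) // 2

def adjusted_rand_index(y_pred, y_true):
    n = len(y_pred)
    # Contingency table: how many points fall in each (predicted, true) cluster pair.
    contingency = Counter(zip(y_pred, y_true))
    pred_sizes = Counter(y_pred)
    true_sizes = Counter(y_true)

    index = sum(comb2(c) for c in contingency.values())
    sum_pred = sum(comb2(c) for c in pred_sizes.values())
    sum_true = sum(comb2(c) for c in true_sizes.values())

    expected = sum_pred * sum_true / comb2(n)   # agreement expected by chance
    max_index = (sum_pred + sum_true) / 2

    return (index - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 1]))  # identical labelings score 1.0
```

Note that relabeling the clusters (e.g. swapping labels 0 and 1 in one labeling) leaves the score unchanged, since only the grouping of points matters.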

Custom Metrics

What makes a clustering good is sometimes difficult to express mathematically. But when it can be expressed, writing a custom metric is possible. The metric should ultimately represent what it means for your clustering to be good, whatever that may mean in your application.
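As a hypothetical illustration (the metric, its name, and its weighting are all invented for this example), suppose an application cares about compact clusters of roughly equal size. A custom metric might combine the two concerns into a single score:

```python
import numpy as np

def balanced_compactness(X, y, alpha=0.5):
    # Hypothetical custom metric: mean distance to the cluster centroid
    # (lower is better), plus a penalty when cluster sizes are imbalanced.
    # alpha is an invented weight trading off the two terms.
    labels = np.unique(y)
    compactness = np.mean([
        np.linalg.norm(X[y == k] - X[y == k].mean(axis=0), axis=1).mean()
        for k in labels
    ])
    sizes = np.array([np.sum(y == k) for k in labels])
    imbalance = sizes.std() / sizes.mean()  # 0 when all clusters have equal size
    return compactness + alpha * imbalance

X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])
print(balanced_compactness(X, y))  # equal-sized clusters: no imbalance penalty
```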