Chapter 6: Summary

Pros

  • The number of clusters can be chosen explicitly
  • Scales well to large datasets
  • Easy to implement
  • Easy to interpret

Cons

  • The number of clusters has to be chosen in advance
  • Biased towards spherical clusters of similar size
  • An appropriate distance measure has to be picked
  • Convergence depends on initialization

What data does K-Means perform best on?

It is interesting to see that the distance measure used by K-Means, K-Medians, or K-Medoids defines the shape that the clusters will tend to adopt. Indeed, because K-Means minimizes the \(L_2\)-norm, the clusters will tend to be circular, whereas K-Medians, as it minimizes the \(L_1\)-norm, tends to produce diamond-shaped (“square”) clusters, mirroring the shape of each norm’s unit ball.

Because of this, it is advised to use K-Means, or its counterparts, on data whose clusters are roughly convex and match the geometry of the chosen distance measure.
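To make this concrete, here is a minimal sketch (NumPy only; the two centers and the random data are hypothetical) that assigns points to the nearest of two fixed centers under both norms. The assignments differ near the boundary precisely because the \(L_2\) and \(L_1\) unit balls have different shapes:

```python
import numpy as np

# Two fixed centers, deliberately not axis-aligned so that the L1 and L2
# decision boundaries differ (values are hypothetical).
centers = np.array([[0.0, 0.0], [4.0, 3.0]])
points = np.random.default_rng(0).uniform(-2.0, 6.0, size=(1000, 2))

diffs = points[:, None, :] - centers[None, :, :]  # shape (1000, 2, 2)

# K-Means-style assignment: nearest center under the L2 (Euclidean) norm.
labels_l2 = np.linalg.norm(diffs, ord=2, axis=2).argmin(axis=1)

# K-Medians-style assignment: nearest center under the L1 (Manhattan) norm.
labels_l1 = np.linalg.norm(diffs, ord=1, axis=2).argmin(axis=1)

# Points near the boundary switch sides between the two norms, which is
# what gives L2 clusters their round shape and L1 clusters their
# diamond-like ("square") shape.
print((labels_l2 != labels_l1).sum(), "points assigned differently")
```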

What is the best value for \(K\)?

As explained in Chapter 5, the best value of \(K\) can be viewed as the one beyond which adding another cluster no longer reduces the inertia substantially. Since the inertia keeps decreasing as \(K\) increases, one must look at the rate at which it decreases rather than at its absolute value. In some applications, qualitative checks also have to be made to ensure that the chosen value of \(K\) is indeed sensible.
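A minimal sketch of this “elbow” inspection, assuming scikit-learn is available and using synthetic blob data (purely illustrative), fits K-Means for a range of values of \(K\) and prints the resulting inertia:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# Inertia always drops as K grows; the "elbow" is where the drop levels off.
for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"K={k}: inertia={inertia:.1f}")
```

On data like this, the inertia should fall sharply up to \(K = 3\) and only marginally afterwards, identifying three as the elbow.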

How to deal with large amounts of data?

As K-Means relies on distance computations, it can become computationally intensive quite quickly. There exist ways to cope with this. One is to use more advanced data structures, such as K-D trees, to reduce the number of pairwise distances that need to be computed at each step. Another is to use a variant of K-Means called Mini-Batch K-Means: it works exactly like K-Means, except that each update is computed on only a subset of the data, a batch.
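As a sketch, assuming scikit-learn is available, its MiniBatchKMeans estimator exposes a batch_size parameter controlling how many samples each update touches (the data below is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# A large synthetic dataset (illustrative only).
X = np.random.default_rng(0).normal(size=(100_000, 10))

# Each update step only touches `batch_size` samples, keeping the
# per-step cost bounded regardless of the dataset size.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0)
labels = mbk.fit_predict(X)
print(labels[:10])
```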

Can K-Means be used in combination with other models?

Yes. K-Means can be used as a meta-learner. Instead of training a single classifier, for instance, the data can first be clustered, and a separate classifier can then be trained on each cluster. This can be viewed as a form of dimensionality reduction: because K-Means groups similar points together, the effective dimensionality within each cluster should be lower than that of the whole dataset, and lower-dimensional data is typically easier to deal with. A sketch of this pattern follows.
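Here is a minimal sketch, assuming scikit-learn is available; the synthetic dataset and the choice of logistic regression as the per-cluster classifier are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic labeled data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Step 1: cluster the inputs, ignoring the labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Step 2: train one classifier per cluster (this assumes every cluster
# contains examples of more than one class).
classifiers = {
    c: LogisticRegression(max_iter=1000).fit(X[km.labels_ == c],
                                             y[km.labels_ == c])
    for c in range(3)
}

# Prediction routes each point to the classifier of its nearest cluster.
def predict(x):
    c = km.predict(x.reshape(1, -1))[0]
    return classifiers[c].predict(x.reshape(1, -1))[0]

print(predict(X[0]), "vs true label", y[0])
```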