# Chapter 6: Summary

# Pros

- Number of clusters can be defined
- Scales well
- Easy to implement
- Easy to interpret

# Cons

- Number of clusters has to be defined
- Is biased towards spherical clusters of the same size
- Appropriate distance measure has to be picked
- Convergence is dependent on initialization

# What data does K-Means perform best on?

It is interesting to see that the distance measure used by K-Means, K-Medians, or K-Medoids determines the shape that the clusters will tend to adopt. Indeed, because K-Means minimizes the squared \(L_2\)-norm, its clusters tend to be circular, whereas K-Medians, which minimizes the \(L_1\)-norm, tends to produce “square” clusters.

Because of this, it is advised to use K-Means, or its counterparts, on data whose clusters have shapes similar to those the chosen distance measure favors.
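A minimal sketch of why the norm matters: the same query point can be assigned to different centers depending on whether distances are measured with the \(L_2\)-norm (K-Means) or the \(L_1\)-norm (K-Medians). The two centers below are made up purely for illustration.

```python
import numpy as np

p = np.array([0.0, 0.0])                 # query point
centers = np.array([[5.0, 0.0],          # center A
                    [3.0, 3.0]])         # center B

l2 = np.linalg.norm(centers - p, ord=2, axis=1)   # K-Means distance
l1 = np.linalg.norm(centers - p, ord=1, axis=1)   # K-Medians distance

print(np.argmin(l2))  # 1: B is closer under L2 (5.0 vs ~4.24)
print(np.argmin(l1))  # 0: A is closer under L1 (5.0 vs 6.0)
```

The disagreement comes from the unit balls of the two norms: \(L_2\) balls are circles, \(L_1\) balls are diamonds, and cluster boundaries inherit those shapes.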

# What is the best value for \(K\)?

As explained in chapter 5, the best value of \(K\) can be viewed as the number of clusters that decreases the inertia the most. Since the inertia always decreases as \(K\) increases, one must look at the rate at which it decreases and pick the point beyond which further increases yield only marginal gains. In some applications, qualitative measures also have to be taken to ensure that the chosen value of \(K\) is indeed best.
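This selection procedure can be sketched as follows. The data, the plain NumPy K-Means implementation, and the restart count are all illustrative assumptions; the point is only that inertia drops sharply up to the true number of clusters (here three synthetic blobs) and only marginally afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: three well-separated 2-D blobs, so the "true" K is 3.
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(0, 0), (5, 0), (0, 5)]])

def kmeans_inertia(X, k, n_iter=50, n_restarts=5, seed=0):
    """Run plain K-Means several times and return the best (lowest) inertia."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_restarts):
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
            centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        best = min(best, ((X - centers[labels]) ** 2).sum())
    return best

inertias = {k: kmeans_inertia(X, k) for k in range(1, 7)}
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 7)}
# The drop up to K=3 is large; drops beyond K=3 are marginal,
# so the best K for this data is 3.
```

Plotting `inertias` against \(K\) would show the characteristic bend at \(K = 3\).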

# How to deal with large amounts of data?

As K-Means relies on distance computations, it can become computationally intensive quite quickly. There exist ways to cope with this. One is to use more advanced data structures, such as k-d trees, to reduce the number of pairwise distances that need to be computed at each step. Another is to use a variant of K-Means called *Mini-Batch K-Means*. It works exactly like K-Means, but at each step the update is made only on a subset of the data, a batch.
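The batch-wise update can be sketched as below. The synthetic data, the farthest-point initialization, and the per-center learning rate \(\eta = 1/\text{count}\) are illustrative choices, not part of the chapter's definition; the key idea is that each step touches only `batch_size` points instead of the full dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two well-separated blobs of 500 points each.
X = np.vstack([rng.normal(c, 0.3, size=(500, 2)) for c in [(0, 0), (6, 6)]])

def minibatch_kmeans(X, k, batch_size=32, n_steps=300, seed=0):
    rng = np.random.default_rng(seed)
    # Farthest-point initialization keeps the sketch deterministic.
    centers = np.empty((k, X.shape[1]))
    centers[0] = X[0]
    for j in range(1, k):
        d = ((X[:, None] - centers[:j]) ** 2).sum(-1).min(axis=1)
        centers[j] = X[np.argmax(d)]
    counts = np.zeros(k)  # how many points each center has absorbed so far
    for _ in range(n_steps):
        # Each update only looks at a small random batch, not all of X.
        batch = X[rng.choice(len(X), batch_size, replace=False)]
        labels = np.argmin(((batch[:, None] - centers) ** 2).sum(-1), axis=1)
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]              # per-center learning rate
            centers[j] = (1 - eta) * centers[j] + eta * x
    return centers

centers = minibatch_kmeans(X, k=2)
```

Each center ends up near its blob's mean even though no single step ever saw the whole dataset.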

# Can K-Means be used in combination with other models?

Yes. K-Means can be used as a meta-learner. Instead of training a single classifier on the whole dataset, for instance, the data can first be clustered, and then a separate classifier can be trained on each cluster. This can be viewed as a form of dimensionality reduction: because K-Means groups similar points, the intrinsic dimensionality of each cluster should be smaller than that of the whole dataset, and lower-dimensional data is typically easier to deal with.
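The cluster-then-classify idea can be sketched as follows. The synthetic data, the hard-coded cluster centers (standing in for the output of any K-Means run), and the nearest-class-mean classifier are all assumptions made to keep the example short; any per-cluster classifier would fit the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two clusters; within each, two classes split along a
# different axis, so no single global nearest-class-mean rule works well.
def blob(center, offset, label, n=60):
    pts = rng.normal(center, 0.2, size=(n, 2)) + offset
    return pts, np.full(n, label)

Xa0, ya0 = blob((0, 0), (0.0, -0.6), 0)
Xa1, ya1 = blob((0, 0), (0.0,  0.6), 1)
Xb0, yb0 = blob((6, 6), (-0.6, 0.0), 0)
Xb1, yb1 = blob((6, 6), ( 0.6, 0.0), 1)
X = np.vstack([Xa0, Xa1, Xb0, Xb1])
y = np.concatenate([ya0, ya1, yb0, yb1])

# Step 1: cluster. The centers are hard-coded here, standing in for the
# centroids any K-Means run on X would find.
cluster_centers = np.array([[0.0, 0.0], [6.0, 6.0]])
cluster_of = lambda P: np.argmin(((P[:, None] - cluster_centers) ** 2).sum(-1), axis=1)

# Step 2: fit one simple nearest-class-mean classifier per cluster.
c = cluster_of(X)
models = {}
for j in range(len(cluster_centers)):
    Xj, yj = X[c == j], y[c == j]
    models[j] = {cls: Xj[yj == cls].mean(0) for cls in np.unique(yj)}

def predict(P):
    out = np.empty(len(P), dtype=int)
    for i, (p, j) in enumerate(zip(P, cluster_of(P))):
        means = models[j]  # route the point to its cluster's classifier
        out[i] = min(means, key=lambda cls: ((p - means[cls]) ** 2).sum())
    return out

accuracy = (predict(X) == y).mean()
```

At prediction time, a point is first routed to its nearest cluster and then classified by that cluster's local model.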