# Chapter 6: Summary

# Pros

- Number of clusters can be defined
- Scales well
- Easy to implement
- Easy to interpret

# Cons

- Number of clusters has to be defined
- Is biased towards spherical clusters of the same size
- Appropriate distance measure has to be picked
- Convergence is dependent on initialization

# What data does K-Means perform best on?

It is interesting to see that the distance measure used by K-Means, K-Medians, or K-Medoids determines the shape that the clusters will tend to adopt. Indeed, because K-Means minimizes the squared \(L_2\)-norm, its clusters tend to be circular, whereas K-Medians, which minimizes the \(L_1\)-norm, tends to produce “square” clusters.

Because of this, it is advised to use K-Means, or its counterparts, on data whose clusters have shapes similar to those the chosen distance measure favors.
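A minimal sketch of why the norm matters: the same query point can be assigned to different centers depending on whether distances are measured with the \(L_2\)-norm (K-Means) or the \(L_1\)-norm (K-Medians). The two centers below are made up purely for illustration.

```python
import numpy as np

p = np.array([0.0, 0.0])                 # query point
centers = np.array([[5.0, 0.0],          # center A
                    [3.0, 3.0]])         # center B

l2 = np.linalg.norm(centers - p, ord=2, axis=1)   # K-Means distance
l1 = np.linalg.norm(centers - p, ord=1, axis=1)   # K-Medians distance

print(np.argmin(l2))  # 1: B is closer under L2 (5.0 vs ~4.24)
print(np.argmin(l1))  # 0: A is closer under L1 (5.0 vs 6.0)
```

The disagreement comes from the unit balls of the two norms: \(L_2\) balls are circles, \(L_1\) balls are diamonds, and cluster boundaries inherit those shapes.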

# What is the best value for \(K\)?

As explained in chapter 5, the best value of \(K\) can be viewed as the number of clusters that decreases the inertia the most. Since the inertia always decreases as \(K\) increases, one must look at the rate at which it decreases and pick the point beyond which further increases yield only marginal gains. In some applications, qualitative measures also have to be taken to ensure that the chosen value of \(K\) is indeed best.
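This selection procedure can be sketched as follows. The data, the plain NumPy K-Means implementation, and the restart count are all illustrative assumptions; the point is only that inertia drops sharply up to the true number of clusters (here three synthetic blobs) and only marginally afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: three well-separated 2-D blobs, so the "true" K is 3.
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(0, 0), (5, 0), (0, 5)]])

def kmeans_inertia(X, k, n_iter=50, n_restarts=5, seed=0):
    """Run plain K-Means several times and return the best (lowest) inertia."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_restarts):
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
            centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        best = min(best, ((X - centers[labels]) ** 2).sum())
    return best

inertias = {k: kmeans_inertia(X, k) for k in range(1, 7)}
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 7)}
# The drop up to K=3 is large; drops beyond K=3 are marginal,
# so the best K for this data is 3.
```

Plotting `inertias` against \(K\) would show the characteristic bend at \(K = 3\).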

# How to deal with large amounts of data?

As K-Means relies on distance computations, it can become computationally intensive quite quickly. There exist ways to cope with this. One is to use more advanced data structures, such as k-d trees, to reduce the number of pairwise distances that need to be computed at each step. Another is to use a variant of K-Means called *Mini-Batch K-Means*. It works exactly like K-Means, but at each step the update is made only on a subset of the data, a batch.
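The batch-wise update can be sketched as below. The synthetic data, the farthest-point initialization, and the per-center learning rate \(\eta = 1/\text{count}\) are illustrative choices, not part of the chapter's definition; the key idea is that each step touches only `batch_size` points instead of the full dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two well-separated blobs of 500 points each.
X = np.vstack([rng.normal(c, 0.3, size=(500, 2)) for c in [(0, 0), (6, 6)]])

def minibatch_kmeans(X, k, batch_size=32, n_steps=300, seed=0):
    rng = np.random.default_rng(seed)
    # Farthest-point initialization keeps the sketch deterministic.
    centers = np.empty((k, X.shape[1]))
    centers[0] = X[0]
    for j in range(1, k):
        d = ((X[:, None] - centers[:j]) ** 2).sum(-1).min(axis=1)
        centers[j] = X[np.argmax(d)]
    counts = np.zeros(k)  # how many points each center has absorbed so far
    for _ in range(n_steps):
        # Each update only looks at a small random batch, not all of X.
        batch = X[rng.choice(len(X), batch_size, replace=False)]
        labels = np.argmin(((batch[:, None] - centers) ** 2).sum(-1), axis=1)
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]              # per-center learning rate
            centers[j] = (1 - eta) * centers[j] + eta * x
    return centers

centers = minibatch_kmeans(X, k=2)
```

Each center ends up near its blob's mean even though no single step ever saw the whole dataset.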

# Can K-Means be used in combination with other models?

Yes. K-Means can be used as a meta-learner. Instead of training a single classifier on the whole dataset, for instance, the data can first be clustered, and then a separate classifier can be trained on each cluster. This can be viewed as a form of dimensionality reduction: because K-Means groups similar points, the intrinsic dimensionality of each cluster should be smaller than that of the whole dataset, and lower-dimensional data is typically easier to deal with.
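The cluster-then-classify idea can be sketched as follows. The synthetic data, the hard-coded cluster centers (standing in for the output of any K-Means run), and the nearest-class-mean classifier are all assumptions made to keep the example short; any per-cluster classifier would fit the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two clusters; within each, two classes split along a
# different axis, so no single global nearest-class-mean rule works well.
def blob(center, offset, label, n=60):
    pts = rng.normal(center, 0.2, size=(n, 2)) + offset
    return pts, np.full(n, label)

Xa0, ya0 = blob((0, 0), (0.0, -0.6), 0)
Xa1, ya1 = blob((0, 0), (0.0,  0.6), 1)
Xb0, yb0 = blob((6, 6), (-0.6, 0.0), 0)
Xb1, yb1 = blob((6, 6), ( 0.6, 0.0), 1)
X = np.vstack([Xa0, Xa1, Xb0, Xb1])
y = np.concatenate([ya0, ya1, yb0, yb1])

# Step 1: cluster. The centers are hard-coded here, standing in for the
# centroids any K-Means run on X would find.
cluster_centers = np.array([[0.0, 0.0], [6.0, 6.0]])
cluster_of = lambda P: np.argmin(((P[:, None] - cluster_centers) ** 2).sum(-1), axis=1)

# Step 2: fit one simple nearest-class-mean classifier per cluster.
c = cluster_of(X)
models = {}
for j in range(len(cluster_centers)):
    Xj, yj = X[c == j], y[c == j]
    models[j] = {cls: Xj[yj == cls].mean(0) for cls in np.unique(yj)}

def predict(P):
    out = np.empty(len(P), dtype=int)
    for i, (p, j) in enumerate(zip(P, cluster_of(P))):
        means = models[j]  # route the point to its cluster's classifier
        out[i] = min(means, key=lambda cls: ((p - means[cls]) ** 2).sum())
    return out

accuracy = (predict(X) == y).mean()
```

At prediction time, a point is first routed to its nearest cluster and then classified by that cluster's local model.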