Chapter 6: Clustering

Clustering is an unsupervised task where a model is trying to group instances into separate clusters. It can be seen as an unsupervised classification task. It is unsupervised because, in regular classification, we have a gold standard that we can learn from. With clustering, we don’t know in which cluster instances should be, let alone how many clusters there are. The goal of the model is to infer some form of structure on the data given the data alone.

More formally, a clustering problem can be defined as learning a function \(f\) that will map input variables \(X = x_0, x_1,\dots, x_{m-1}, x_{m}\) to a cluster label \(y\) such that \(f(x) = y\).

Clustering can be used for a large array of tasks. Below, we list a few practical examples where it could come in handy.

Customer segmentation

Input variables: Age, number of visits last year, the average amount purchased

Age Visits Average amount purchased
40 100 86.3
20 47 32.5
82 130 54.3
19 27 15.6

This could be useful to understand better the various groups of customers that you are catering to. And also to perhaps create specific marketing campaigns for each group.

Topic Detection

Input variables: Amount of times a word appear in a book

Wand Killer Crime Spell Ghost
63 12 64 146 85
14 86 97 24 35
98 34 54 189 276
324 34 65 123 65

Being able to group books into certain topics could be helpful if you want to start an online bookstore and want to know what labels to use to categorize books.

Tumor Detection

Input variables: Intensity of pixels in a 100x100 photo

(0,0) (0,1) (99,97) (99,98) (0.99,0.99)
0.81 0.72 0.41 0.55 0.51
0.23 0.12 0.07 0.92 0.88
0.54 0.48 0 0.31 0.34
0.71 0.79 0.37 0.81 0.87

This could be useful to help doctors identify important parts of a scan for instance. By automatically identifying specific groups in high resolution pictures, it could assist doctors and make sure that they don’t miss any details of a medical scan.