Chapter 6: Clustering
Clustering is an unsupervised task where a model is trying to group instances into separate clusters. It can be seen as an unsupervised classification task. It is unsupervised because, in regular classification, we have a gold standard that we can learn from. With clustering, we don’t know in which cluster instances should be, let alone how many clusters there are. The goal of the model is to infer some form of structure on the data given the data alone.
More formally, a clustering problem can be defined as learning a function \(f\) that will map input variables \(X = x_0, x_1,\dots, x_{m-1}, x_{m}\) to a cluster label \(y\) such that \(f(x) = y\).
Clustering can be used for a large array of tasks. Below, we list a few practical examples where it could come in handy.
Customer segmentation
Input variables: Age, number of visits last year, the average amount purchased
Age | Visits | Average amount purchased |
---|---|---|
40 | 100 | 86.3 |
20 | 47 | 32.5 |
82 | 130 | 54.3 |
… | … | … |
19 | 27 | 15.6 |
This could be useful to understand better the various groups of customers that you are catering to. And also to perhaps create specific marketing campaigns for each group.
Topic Detection
Input variables: Amount of times a word appear in a book
Wand | Killer | … | Crime | Spell | Ghost |
---|---|---|---|---|---|
63 | 12 | … | 64 | 146 | 85 |
14 | 86 | … | 97 | 24 | 35 |
98 | 34 | … | 54 | 189 | 276 |
… | … | … | … | … | … |
324 | 34 | … | 65 | 123 | 65 |
Being able to group books into certain topics could be helpful if you want to start an online bookstore and want to know what labels to use to categorize books.
Tumor Detection
Input variables: Intensity of pixels in a 100x100 photo
(0,0) | (0,1) | … | (99,97) | (99,98) | (0.99,0.99) |
---|---|---|---|---|---|
0.81 | 0.72 | … | 0.41 | 0.55 | 0.51 |
0.23 | 0.12 | … | 0.07 | 0.92 | 0.88 |
0.54 | 0.48 | … | 0 | 0.31 | 0.34 |
… | … | … | … | … | … |
0.71 | 0.79 | … | 0.37 | 0.81 | 0.87 |
This could be useful to help doctors identify important parts of a scan for instance. By automatically identifying specific groups in high resolution pictures, it could assist doctors and make sure that they don’t miss any details of a medical scan.