Chapter 5: Classification

Classification is a supervised task where a model maps input to a discrete output, often referred to as the class. Classification is a specific case of regression. More formally, a classification problem can be defined as learning a function \(f\) that will map input variables \(X = x_0, x_1,\dots, x_{m-1}, x_{m}\) to a discrete target variable \(y\) such that \(f(x) = y\).

So for instance, let’s say that we have the following data:

Variable 1 Variable 2 Variable 3 Variable 4 Target Variable
1 2 3 4 Yes
2 3 4 5 No
3 4 5 6 Yes
2000 2001 2002 2003 No

We would want to learn some function such that \(f(1,2,3,4) = \text{Yes}\) and \(f(2,3,4,5) = \text{No}\) and so on.

Classification is often seen as finding decision boundaries in data. Finding a boundary which separates the multiple classes.

Classification can be used for a large array of tasks. Below, we list a few practical examples where it could come in handy.

Detecting spam e-mails

Input variables: Number of times certain words appear in the e-mail
Target variable: Whether an e-mail is a spam

Spam Mom Loan Hello Spam
4 0 3 0 Yes
0 1 0 1 No
0 3 1 2 No
5 0 0 1 Yes

This could be useful to create a spam filter.

Facial recognition

Input variables: Intensity of pixels in a 100x100 photo
Target variable: Person’s name

(0,0) (0,1) (99,98) (99,99) Person
0.81 0.72 0.41 0.55 Valentin
0.23 0.12 0.07 0.92 Jack
0.54 0.48 0 0.31 Robin
0.71 0.79 0.37 0.81 Lisa


Predict whether a team will win a basketball game

Input variables: Win percentage, win percentage of the opponent, rebound per game, rebound per game by the opponent
Target variable: Likelihood of defaulting

Win % Opponent Win % RPG Opponent RPG Will win the game
0.65 0.33 45.8 42.2 Yes
0.54 0.47 37.6 44.3 No
0.28 0.77 38.1 48.7 No
0.38 0.43 37.8 36.9 Yes

This could be used by a team to know when they could rest their star players.