Chapter 1: Introduction

Naive Bayes is a classification technique that has been studied since the 1950s. It was first applied to text categorization: predicting whether a text belongs to one category or another (spam or not spam, English or German, for instance). It is still used to this day as a benchmark for this class of tasks, and it can of course be applied to any classification task.

Naive Bayes classifiers are generative models. This means that they will try to learn how instances of a certain class are generated by looking at what they inherently have in common. Let’s illustrate this with an example. Let’s say that we are given a dataset containing information about weight, height, and sex of people.

By looking at this data, we could infer that males seem to be taller and heavier than females. Therefore, if we had to generate new male data points, we would be more likely to generate people measuring 1m80 and weighing 85kg rather than people measuring 1m60 and weighing 60kg.

Therefore, if we encounter a person measuring 1m92 and weighing 102kg, we would classify them as male, because we would be more likely to generate such a person as a male than as a female.
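This generative view can be sketched in a few lines of Python. The dataset below is entirely made up for illustration; the point is that we estimate per-class statistics (mean and standard deviation of height and weight) and could then sample plausible new points for that class.

```python
import random

# A tiny, made-up dataset: (height_cm, weight_kg, sex).
# The numbers are illustrative only, not real measurements.
data = [
    (180, 85, "male"), (175, 80, "male"), (183, 90, "male"),
    (160, 60, "female"), (165, 58, "female"), (158, 55, "female"),
]

def class_stats(rows, label):
    """Mean and standard deviation of height and weight for one class."""
    heights = [h for h, w, s in rows if s == label]
    weights = [w for h, w, s in rows if s == label]
    def mean_std(xs):
        m = sum(xs) / len(xs)
        var = sum((x - m) ** 2 for x in xs) / len(xs)
        return m, var ** 0.5
    return mean_std(heights), mean_std(weights)

(h_mean, h_std), (w_mean, w_std) = class_stats(data, "male")

# "Generating" a new male data point: sample near the class averages.
new_height = random.gauss(h_mean, h_std)
new_weight = random.gauss(w_mean, w_std)
```

With this toy data, the male class averages come out around 179cm and 85kg, so sampled "male" points will cluster there rather than around 160cm and 60kg.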

Bayes’ Theorem

To represent this concept mathematically, we make use of Bayes’ Theorem, which is expressed by the following formula:

\(P(\text{Class} \mid \text{Features}) = \frac{P(\text{Class}) P(\text{Features} \mid \text{Class})}{P(\text{Features})}\)

The theorem gives us a way to calculate the probability of a class (male or female) given some features (height and weight). Therefore, to make a prediction, we will need to calculate the probability for both classes and pick the class with the highest one.
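The decision rule described above can be sketched directly. The prior and likelihood values below are hypothetical numbers chosen for illustration, not quantities estimated from data:

```python
# Hypothetical probabilities, for illustration only.
prior = {"male": 0.5, "female": 0.5}              # P(class)
likelihood = {"male": 0.02, "female": 0.001}      # P(features | class), assumed

# The evidence P(features) is the same quantity summed over both classes.
evidence = sum(prior[c] * likelihood[c] for c in prior)

# Bayes' Theorem, applied once per class.
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}

# Predict the class with the highest posterior probability.
prediction = max(posterior, key=posterior.get)
```

Note that the posteriors sum to 1 by construction: the evidence acts as a normalizing constant, which is why (as we will see below) it can be dropped when we only care about which class scores highest.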

If we are trying to classify a person measuring 1m80 and weighing 80kg, Bayes’ Formula takes the following form:

\(P(\text{Gender} \mid \text{Height}=180, \text{Weight}=80) = \frac{P(\text{Gender}) \space P(\text{Height}=180, \text{Weight}=80 \mid \text{Gender})}{P(\text{Height}=180, \text{Weight}=80)}\)

In statistical terms, this can be read as:

\(\text{Posterior} = \frac{\text{Prior} \times \text{Likelihood}}{\text{Evidence}}\)

The prior represents the belief that an instance is of a certain class “prior” to knowing its features. This is equivalent to guessing whether someone is male or female without knowing anything about them.

The likelihood represents how “likely” certain features are to appear given a class. For instance, it is more “likely” that a man is 1m90 rather than 1m50.

The evidence represents how probable the observed features, which we use as “evidence” to infer the class, are overall, regardless of class. For instance, observing a person measuring 1m80 and weighing 80kg is much more probable than observing one measuring 2m15 and weighing 45kg.

The posterior probability represents the belief that an instance belongs to a class “after” taking the features into account. For instance, how likely it is that someone is male, “after” we’ve considered that they measure 1m80 and weigh 80kg.

To simplify the formula, we can ignore the evidence: it does not depend on the class, so for a given instance it is the same for every class and does not change which class scores highest. This means that we are now calculating the joint probability, which is proportional to the posterior probability. We can also make the strong assumption that the features are independent of one another given the class, which is why these classifiers are called “naive”. What does this mean in practice? It means that our formula now looks like this:

\(P(\text{Gender} \mid \text{Height}=180, \text{Weight}=80) \propto P(\text{Gender}) \space P(\text{Height}=180 \mid \text{Gender}) \space P(\text{Weight}=80 \mid \text{Gender})\)

The strong assumption of independence simplified the likelihood by assuming that height and weight are independent of one another given the class. It means that now, to calculate the posterior, and ultimately classify an instance, we only need to calculate these per-feature probabilities.
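Putting the pieces together, the factorized formula can be sketched as a small Gaussian Naive Bayes classifier. The per-class means, standard deviations, and priors below are made-up values chosen to match the running example, not parameters estimated from real data:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution; a common choice for continuous features."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Illustrative per-class parameters: (mean, std) for each feature, plus the prior.
params = {
    "male":   {"height": (178, 7), "weight": (84, 10), "prior": 0.5},
    "female": {"height": (163, 6), "weight": (59, 8),  "prior": 0.5},
}

def unnormalized_posterior(cls, height, weight):
    p = params[cls]
    # P(class) * P(height | class) * P(weight | class) -- the naive factorization.
    return (p["prior"]
            * gaussian_pdf(height, *p["height"])
            * gaussian_pdf(weight, *p["weight"]))

# Classify a person measuring 1m80 and weighing 80kg.
scores = {c: unnormalized_posterior(c, 180, 80) for c in params}
prediction = max(scores, key=scores.get)
```

Because the evidence was dropped, the scores are not probabilities, but their ranking is the same as the posterior's, so taking the maximum still yields the predicted class.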