Chapter 2: Common Guiding Principles

Regardless of the task at hand, there are some guiding principles for building good models. Call them beliefs, rules of thumb, or assumptions; whatever the label, they deeply shape the field of machine learning.


The No Free Lunch Theorem


The No Free Lunch Theorem states that, when their performance is averaged over all possible tasks, all models are equivalent. This suggests that there is no “master model” which will always perform well on every task. The No Free Lunch Theorem is what makes machine learning exciting: even if we have solved a similar problem before, we have no guarantee that the model we used back then will still perform well on our new problem. This forces machine learning researchers and practitioners to constantly come up with new ways to make models work for each task they encounter.
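To make this concrete, here is a minimal sketch of the idea in Python. The two synthetic tasks and the two models below are arbitrary illustrative choices (not from the theorem itself): one task has a mostly linear structure buried in noisy features, the other has a nonlinear, ring-shaped class boundary. Neither model is best on both.

```python
# Sketch: no single model wins on every task.
# Datasets and models are arbitrary choices for illustration.
import numpy as np
from sklearn.datasets import make_classification, make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Task A: a linear signal hidden among many irrelevant, noisy features.
X_a, y_a = make_classification(n_samples=200, n_features=20, n_informative=2,
                               n_redundant=0, random_state=0)
# Task B: one class forms a ring around the other (not linearly separable).
X_b, y_b = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    acc_a = cross_val_score(model, X_a, y_a, cv=5).mean()
    acc_b = cross_val_score(model, X_b, y_b, cv=5).mean()
    print(f"{name}: task A accuracy {acc_a:.2f}, task B accuracy {acc_b:.2f}")

# Typically the linear model copes better with the noisy features of task A,
# while k-NN captures the ring shape of task B: which model looks "best"
# depends entirely on the task at hand.
```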

Data Skeptic has a great blog post and podcast episode about the No Free Lunch Theorem.

Occam’s Razor


Occam’s Razor, also called the law of parsimony, is a core principle, or belief, in machine learning. It suggests that, when two models perform similarly, the simpler one is the better model. In practice, this means we should always strive to create models that are as simple as possible. The assumption behind it is that the more moving parts a model has, the more likely some of them are to be wrong.

Occam’s razor tells us that there is no need to overcomplicate things, and this is a great lesson to keep in mind. Sometimes, finding an elegant solution might be much more worthwhile than trying to teach an easy task to an overly complex model.
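One way to apply this as a concrete model-selection rule is sketched below: among models whose validation scores are roughly tied, keep the simplest one. The data, the range of polynomial degrees, and the tolerance are all arbitrary choices made for this illustration.

```python
# Sketch of Occam's Razor as a selection rule: among models that score
# about the same, prefer the simplest. All constants here are illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.5, size=100)  # quadratic signal + noise

# Score polynomial models of increasing complexity with cross-validation.
scores = {}
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores[degree] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

best = max(scores.values())
tolerance = 0.01  # how close to the best score still counts as "similar"
chosen = min(d for d, s in scores.items() if s >= best - tolerance)
print(f"best score {best:.3f}, chosen degree {chosen}")
```

With a quadratic signal, degrees 2 through 10 tend to score almost identically; the razor simply tells us to keep degree 2 rather than any of the needlessly complex alternatives.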


The Bias-Variance Dilemma


According to Occam’s Razor, a model shouldn’t be overcomplicated, but it shouldn’t be overly simplistic either. A model needs to be general enough to be flexible and robust to noise, but it also needs to be precise and specific enough to be accurate. This is the dilemma: improving one tends to hurt the other, so we need to find the right balance between the two.

More formally, let’s assume that any observed data point is a combination of its true value and some noise. A model with high bias assumes that much of what it encounters in the data is noise and will therefore try to ignore it. Such models are biased against the data they see, which generally means they will be too simplistic. On the other hand, a model with high variance assumes that the data it observes is mostly composed of its true value: it attributes the variation it sees in the data to the underlying signal rather than to noise. These models generally tend to be overcomplicated, as they will also try to model the noise in the data.
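The following is a minimal sketch of that dilemma, assuming data of the form y = f(x) + noise with f(x) = sin(x). The polynomial degrees 1, 4, and 15 are arbitrary stand-ins for “too simple” (high bias), “about right”, and “too complex” (high variance); the exact numbers will vary from run to run.

```python
# Sketch of the bias-variance dilemma with under- and over-fitting polynomials.
# The data-generating function and the degrees are illustrative assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def sample(n):
    # Observed value = true value sin(x) + noise.
    x = rng.uniform(0, 2 * np.pi, size=(n, 1))
    return x, np.sin(x).ravel() + rng.normal(scale=0.3, size=n)

X_train, y_train = sample(30)
X_test, y_test = sample(200)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")

# Degree 1 underfits (high bias): both errors are high because the straight
# line ignores the curvature of the signal. Degree 15 overfits (high variance):
# the training error is tiny, but the test error grows because the model also
# fits the noise. A middle degree usually strikes the best balance.
```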