Chapter 7: Summary


Advantages

  • Can fit high-dimensional functions
  • Can be solved directly or online
  • Learning can be tweaked through the loss function
  • Weights can be interpreted


Disadvantages

  • Can overfit
  • Many hyperparameters to tune
  • Weights are not guaranteed to be interpretable
  • Sensitive to noise

Can we interpret the weights?

The learned weights of a Linear Regression model can be interpreted, but this should be done with caution. The weights were learned because they minimize the loss function, not because they are necessarily interpretable. It is possible, however, to modify the loss function so that it constrains the range in which the weights are expected to fall, which might help interpretability.
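As one possible illustration of constraining the weights through the loss function, the sketch below adds an L2 (ridge) penalty to the squared error and solves the problem directly. The function name `fit_ridge`, the `penalty` parameter, and the synthetic data are assumptions made for this example and do not come from the learning unit.

```python
import numpy as np


def fit_ridge(X, y, penalty):
    """Minimize ||Xw - y||^2 + penalty * ||w||^2 in closed form.

    Hypothetical helper for illustration; penalty=0 gives plain least squares.
    """
    n_features = X.shape[1]
    A = X.T @ X + penalty * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)


# Synthetic data (assumed for the example): three features, known weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

w_plain = fit_ridge(X, y, penalty=0.0)    # unconstrained weights
w_ridge = fit_ridge(X, y, penalty=100.0)  # weights pulled toward zero

print("no penalty:  ", w_plain)
print("with penalty:", w_ridge)
```

The penalized weights have a smaller norm than the unpenalized ones, which is one way the loss function can restrict the range the weights take. Whether the restricted weights are easier to interpret still depends on the problem.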

How can LASSO be used for feature selection?

LASSO can be used for feature selection because it shrinks the weights associated with some features to exactly zero as the regularization strength increases. By looking at the order in which it does so, one can infer a ranking of “importance” for the features. Note, however, that a feature whose weight is shrunk to zero may still carry valuable information. It just means that it does not carry information that the other features do not already provide.
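The ranking idea can be sketched with a small coordinate-descent LASSO solver run at several regularization strengths. The helper names (`soft_threshold`, `fit_lasso`), the `alpha` values, and the synthetic data are assumptions chosen for this example. The solver is a minimal version of a standard technique, not necessarily the one used in the learning unit.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero, returning exactly zero inside the threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def fit_lasso(X, y, alpha, n_sweeps=200):
    """Minimize (1 / (2n)) * ||Xw - y||^2 + alpha * ||w||_1 by coordinate descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with feature j's current contribution removed.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            w[j] = soft_threshold(rho, alpha) / z
    return w


# Synthetic data (assumed): feature 0 matters most, feature 2 not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, 1.0, 0.0]) + 0.1 * rng.normal(size=200)

alphas = [0.5, 2.0, 5.0]
weights = np.array([fit_lasso(X, y, alpha) for alpha in alphas])

# A feature that stays nonzero at more alpha values is ranked as more important.
survival_counts = (weights != 0).sum(axis=0)
ranking = np.argsort(-survival_counts)
print("features from most to least important:", ranking)
```

In this setup the third feature is dropped first because it contributes nothing to the target. With correlated features the order can be much less clear-cut, which is the caveat raised above.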

How to deal with large amounts of data?

As explained in the Gradient Descent chapter, gradient descent can be used to deal with large amounts of data. It is possible to use a batched version of the algorithm in which the data is fed in a stream-like fashion. This enables online learning and makes Linear Regression a scalable algorithm.
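A minimal sketch of the batched, stream-like variant is shown here. The generator `stream_batches`, the function `update_weights`, the batch size, and the learning rate are all assumptions made for illustration. In a real setting the batches would come from a file or a network source rather than from an in-memory array.

```python
import numpy as np


def stream_batches(X, y, batch_size):
    """Yield consecutive mini-batches, mimicking data that arrives as a stream."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


def update_weights(w, X_batch, y_batch, learning_rate):
    """Take one gradient step on the mean squared error of a single batch."""
    error = X_batch @ w - y_batch
    gradient = 2.0 * X_batch.T @ error / len(y_batch)
    return w - learning_rate * gradient


# Synthetic data (assumed for the example).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=10_000)

# Only one batch is needed at a time, so the full data set never
# has to be processed in a single step.
w = np.zeros(3)
for X_batch, y_batch in stream_batches(X, y, batch_size=32):
    w = update_weights(w, X_batch, y_batch, learning_rate=0.05)

print("estimated weights:", w)
```

Because each update uses a single batch, the same loop can keep running as new data arrives, which is what makes the approach suitable for online learning.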

It is important to note that many variations of the gradient descent algorithm exist, all trying to deal with different pitfalls of the simple version introduced in this learning unit.