Chapter 3: Summary


Pros

  • Can fit high-dimensional functions
  • Can be trained online
  • Learning can be tweaked through the loss function
  • Class probabilities can be inferred


Cons

  • Tends to overfit
  • Many hyperparameters to tune
  • Class probabilities are not always good estimates
  • Sensitive to noise

How to deal with more than 2 classes?

When there are more than 2 classes, the one-vs-all (also called one-vs-rest) approach is typically used. For each class, the target is recoded as 1 for instances of that class and 0 for all others, and a separate binary model is trained on the recoded target. Then, for each instance, the class whose model outputs the highest probability is picked. Because one model is trained per class, the training and prediction cost grows linearly with the number of classes.
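
The one-vs-all procedure can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the helper names (`train_binary`, `train_one_vs_all`, `predict_one_vs_all`), the learning rate, the number of epochs, and the toy data are all assumptions made up for this example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_binary(X, y, lr=0.5, epochs=1000):
    """Fit one binary logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def train_one_vs_all(X, labels):
    """Train one binary model per class: target is 1 for that class, 0 otherwise."""
    models = {}
    for c in sorted(set(labels)):
        y = [1.0 if label == c else 0.0 for label in labels]
        models[c] = train_binary(X, y)
    return models

def predict_one_vs_all(models, x):
    """Pick the class whose binary model outputs the highest probability."""
    def prob(c):
        w, b = models[c]
        return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(models, key=prob)

# Toy data (made up): three well-separated clusters in two dimensions.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
     [3.0, 0.0], [3.2, 0.2], [2.9, 0.1],
     [0.0, 3.0], [0.1, 3.2], [0.3, 2.9]]
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

models = train_one_vs_all(X, labels)
predictions = [predict_one_vs_all(models, x) for x in X]
```

Note that `train_one_vs_all` calls `train_binary` once per class, which is where the linear growth in cost comes from.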

How can LASSO be used for feature selection?

LASSO can be used for feature selection because its L1 penalty shrinks the weights of some features to exactly zero. By looking at the order in which the weights reach zero as the regularization strength increases, one can infer a ranking of “importance” for the features. However, it should be noted that a feature whose weight is shrunk to zero doesn’t necessarily lack valuable information. It only means that it doesn’t carry information beyond what the other features already provide.
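
To make the ranking idea concrete, here is a small sketch that fits a LASSO linear regression at increasing regularization strengths and records which weights are zero. It uses cyclic coordinate descent with soft-thresholding, which is one common way to solve LASSO and not necessarily the solver used elsewhere in this course. The function names, the `alpha` values, and the toy data (orthogonal ±1 features, chosen so the result can be checked by hand) are assumptions for illustration.

```python
from itertools import product

def soft_threshold(rho, alpha):
    # Shrink rho towards zero by alpha; values within [-alpha, alpha] become exactly 0.
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, sweeps=100):
    """Minimise (1/2n) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            # Correlation of feature j with the residual that leaves feature j out.
            rho = sum(
                X[i][j] * (y[i] - sum(w[k] * X[i][k] for k in range(d) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w

# Toy data (made up): all 8 combinations of three +/-1 features.
# The target depends strongly on feature 0, weakly on feature 1, not at all on feature 2.
X = [list(row) for row in product([-1.0, 1.0], repeat=3)]
y = [3.0 * x0 + 0.5 * x1 for x0, x1, _ in X]

# Weights at increasing regularization strength.
path = {alpha: lasso_coordinate_descent(X, y, alpha) for alpha in (0.1, 1.0, 5.0)}
for alpha, w in path.items():
    print(alpha, w)
```

On this toy problem the irrelevant feature 2 is zero at every strength, the weak feature 1 is dropped once `alpha` exceeds 0.5, and the strong feature 0 survives until `alpha` reaches 3. That order of elimination is the importance ranking described above.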

How to deal with large amounts of data?

As explained in the Gradient Descent chapter, gradient descent can be used to deal with large amounts of data. It is possible to use a mini-batch version of the algorithm in which the data is fed in a stream-like fashion, one batch at a time, so the full dataset never has to fit in memory. This enables online learning and makes Logistic Regression a scalable algorithm.
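
The streaming idea can be sketched as follows. The generator `stream_batches` is a hypothetical stand-in for a real data source (a file read in chunks, a message queue, and so on), and the batch size, learning rate, and synthetic labelling rule are assumptions chosen only to keep the example small.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def stream_batches(n_batches, batch_size, rng):
    """Hypothetical data stream: yields one (X, y) mini-batch at a time."""
    for _ in range(n_batches):
        X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(batch_size)]
        y = [1.0 if x0 + x1 > 0 else 0.0 for x0, x1 in X]
        yield X, y

def sgd_step(w, b, X, y, lr):
    """One mini-batch gradient step on the logistic loss; returns the updated (w, b)."""
    n = len(X)
    grad_w, grad_b = [0.0] * len(w), 0.0
    for xi, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
        grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
        grad_b += err
    w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return w, b - lr * grad_b / n

rng = random.Random(0)
w, b = [0.0, 0.0], 0.0

# Only the current batch is held in memory; the model is updated as data arrives.
for X_batch, y_batch in stream_batches(500, 32, rng):
    w, b = sgd_step(w, b, X_batch, y_batch, lr=0.5)

# Evaluate on fresh samples drawn from the same stream.
X_test, y_test = next(stream_batches(1, 1000, rng))
hits = sum(
    (sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5) == (yi == 1.0)
    for xi, yi in zip(X_test, y_test)
)
accuracy = hits / len(X_test)
```

Because each update touches only one batch, the same loop works whether the stream holds a few thousand instances or far more than would fit in memory.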

It is important to note that there exist many variations of the gradient descent algorithm, all of which try to address different pitfalls of the simple version that was introduced in this learning unit.