Chapter 4: LASSO regularization

One of Machine Learning’s big challenges is avoiding overfitting. An overfit model captures a function that is more complex than the one it is supposed to approximate. This is often due to noise in the data, which tricks the model into “thinking” that the problem is more complex than it really is. It also happens when the model has more capacity than the problem requires.

With linear regression, it is possible to fit arbitrarily high-degree polynomial functions, but that doesn’t always mean it is a good idea. Let’s take a look at the example below.

The data comes from a one-dimensional function, a simple “line”. However, the model was asked to fit a degree-5 polynomial. Whilst it seems to pass through the regular points correctly, it is evident that the fit has been pulled away by the two noisy points.
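For concreteness, here is a minimal sketch of how such a situation can be reproduced with NumPy; the data values and the two corrupted indices are arbitrary choices, not taken from the original example.

import numpy as np

# Ten points from a straight line, with two of them corrupted by noise.
x = np.linspace(0, 1, 10)
y = 2.0 * x + 1.0
y[3] += 1.5
y[7] -= 1.5

# A degree-5 polynomial has enough freedom to chase the two noisy points,
# while a degree-1 fit stays close to the true line.
overfit_coefficients = np.polyfit(x, y, deg=5)
line_coefficients = np.polyfit(x, y, deg=1)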

One way to deal with this is to use regularization. Regularization is the process of adding constraints on the weights of a model. In practice, this is done by adding a penalty term to the loss function, and its derivative to the loss gradient.

    def _loss_function(self, X, y):

        # Mean squared error between the targets and the model's predictions.
        prediction_loss = lambda weights: np.mean(0.5 * (y - self.predict(X, weights)) ** 2)
        # Penalty term supplied by the regularizer (e.g. the LASSO class below).
        regularization_loss = lambda weights: self.regularizer(weights)

        return lambda weights: prediction_loss(weights) + regularization_loss(weights)

    def _loss_gradient(self, X, y):

        # Prepend a constant column of ones so the intercept is learned as a regular weight.
        features = add_dummy_feature(X) if self.fit_intercept else X

        # Gradient of the mean squared error term with respect to the weights.
        prediction_loss_gradient = lambda weights: (self.predict(X, weights) - y).dot(features) / len(features)
        # (Sub)gradient of the penalty term.
        regularization_loss_gradient = lambda weights: self.regularizer.gradient(weights)

        return lambda weights: prediction_loss_gradient(weights) + regularization_loss_gradient(weights)
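These two builders are meant to be consumed by a training loop. Below is a minimal sketch of such a loop, assuming the same class as the methods above; the fit method and the learning_rate and n_iterations parameters are illustrative names, not part of the original code.

    def fit(self, X, y, learning_rate=0.01, n_iterations=1000):

        features = add_dummy_feature(X) if self.fit_intercept else X

        loss = self._loss_function(X, y)
        gradient = self._loss_gradient(X, y)

        # Plain batch gradient descent on the regularized loss.
        self.weights = np.zeros(features.shape[1])
        for _ in range(n_iterations):
            self.weights -= learning_rate * gradient(self.weights)

        # The scalar loss(self.weights) can be logged here to monitor convergence.
        return self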

The ultimate goal is to “discourage” the model from being more complex than necessary. One way to define this constraint is LASSO. LASSO stands for Least Absolute Shrinkage and Selection Operator. It regularizes the model by adding the sum of the absolute values of the weights, scaled by a factor lambda, to the loss function.

import numpy as np


class LASSO:

    def __init__(self, _lambda):
        self._lambda = _lambda

    def __call__(self, theta):

        # Penalty: lambda times the sum of the absolute weights.
        return self._lambda * np.sum(np.abs(theta))

    def gradient(self, theta):

        # |theta| is not differentiable at 0; np.sign(0) == 0 is a
        # standard subgradient choice there.
        return self._lambda * np.sign(theta)
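To get a feel for the penalty and its (sub)gradient, here is what the class returns for an arbitrary weight vector; the numbers are purely illustrative.

regularizer = LASSO(_lambda=0.1)
theta = np.array([0.0, 3.0, -2.0])

regularizer(theta)           # 0.1 * (0 + 3 + 2) = 0.5
regularizer.gradient(theta)  # array([ 0. ,  0.1, -0.1])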

The strength of the regularization is controlled by a hyperparameter typically referred to as lambda. If lambda equals 0, there is no regularization; otherwise, the larger the value of lambda, the more the regularization term will impact the loss function.
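A quick way to see this effect is to evaluate the penalty of the LASSO class above on a fixed weight vector for increasing values of lambda (the values are arbitrary):

theta = np.array([1.0, -4.0])  # sum of absolute weights is 5.0

for _lambda in (0.0, 0.1, 1.0, 10.0):
    print(_lambda, LASSO(_lambda)(theta))

# 0.0  0.0   <- no regularization
# 0.1  0.5
# 1.0  5.0
# 10.0 50.0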

Using LASSO encourages the fitted function to remain as simple as possible, as long as this doesn’t hurt the prediction loss too much. Because the absolute-value penalty can drive individual weights to exactly zero, LASSO also performs a form of feature selection, and the function fitted by the model tends to be simpler, which in general is a good thing.
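As a sanity check of this claim, here is a sketch using scikit-learn’s own Lasso implementation on degree-5 polynomial features of noisy linear data; the data and the alpha value (scikit-learn’s name for lambda) are arbitrary choices.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Noisy samples from a straight line.
x = np.linspace(0, 1, 30).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0 + rng.normal(scale=0.1, size=30)

# Fit a degree-5 polynomial with an L1 penalty on the coefficients.
model = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False),
    Lasso(alpha=0.01, max_iter=10_000),
)
model.fit(x, y)

# The higher-degree coefficients typically end up at or very near zero,
# leaving something close to the underlying line.
print(model.named_steps["lasso"].coef_)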