Chapter 3: Gradient Descent for Linear Regression

As it was mentioned in the previous chapter, whilst Linear Regression can be sometimes solved analytically, it doesn’t always work in practice. Gradient descent is an iterative optimizer which works quite well in practice. The core principle is that it uses the gradient of the loss function in order to know in which direction the weights should go.

For Linear Regression, we can define the loss function and its gradient as follows:

    
    def _loss_function(self, X, y):

        prediction_loss = lambda weights: np.mean((y - self.predict(X, weights)) ** 2) * 0.5

        return lambda weights: prediction_loss(weights)

    def _loss_gradient(self, X, y):

        features = add_dummy_feature(X) if self.fit_intercept is True else X

        prediction_loss_gradient = lambda weights: (self.predict(X, weights) - y).dot(features) / len(features)

        return lambda weights: prediction_loss_gradient(weights)

In this case, the loss is identical to the Mean-Squared Error, and therefore, it represents how close the predictions that the model makes are to the correct values, penalizing large errors proportionally more. Being able to calculate the loss and its gradient means that we can know in which direction we should update our weights in order to reduce the loss, which is the ultimate goal.

Thankfully, it can also be shown that the Linear Regression loss is convex, which means that it will look like a bowl and will only have one minimum.

Gradient descent is an iterative algorithm which works as follows:

  • Initialize some weights
  • Calculate the gradient of the loss for these weights
  • Update each weight in the direction of the gradient
  • Continue until convergence

In practice, convergence happens when the loss stops decreasing, or when a set number of iterative steps have been computed.

Using gradient descent instead of the analytical solution means that we need to slightly modify the code. The fit function will now be split in two. First, the main fit function will simply call the new _fit function in order to simulate each iteration of the gradient descent optimization.

    def fit(self, X, y):

        self.coef_ = self._initialize_weights(X)

        for weights, loss in self._fit(X, y):

            self.coef_ = weights
            self.loss_ = loss

        return self

Whilst the optimization itself will take place in the new _fit method.

    def _fit(self, X, y):

        weights = self.coef_

        loss_function = self._loss_function(X, y)
        loss_gradient = self._loss_gradient(X, y)

        old_loss = float('inf')
        new_loss = loss_function(weights)

        while old_loss - new_loss > self.tol:

            yield weights, new_loss

            old_loss = new_loss
            weights =  weights - self.learning_rate * loss_gradient(weights)
            new_loss = loss_function(weights)