Chapter 2: Linear Regression

Building a Linear Regression model is straightforward, although there are many refinements that can be added to it. We will start with the simplest version.

import numpy as np

from utils.preprocessing import add_dummy_feature


class LinearRegression:

    def __init__(self,
                 fit_intercept=True):

        self.fit_intercept = fit_intercept

The fit_intercept argument defines whether or not we want to use an intercept. If no intercept is used, the data is assumed to already be centred. Mathematically, the intercept is the constant offset added to the linear combination of the features: it is \(b\) in \(y = mx + b\).

Making predictions with Linear Regression is simple. Once the weights are known, we take the dot product between them and the features of the data points whose targets we’d like to predict.

If there is an intercept, we need to account for it in the features: the intercept is simply an extra feature that is always equal to 1, regardless of the data point.
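
The add_dummy_feature helper is imported from utils.preprocessing and is not listed in this chapter. As a rough sketch, assuming it simply prepends a column of ones to the feature matrix (mirroring the behaviour of scikit-learn’s former helper of the same name), it could look like this:

import numpy as np


def add_dummy_feature(X, value=1.0):
    # Prepend a constant column so the intercept can be learned as an ordinary weight.
    X = np.asarray(X)
    ones = np.full((X.shape[0], 1), value)
    return np.hstack([ones, X])

With that helper in mind, the predict method reads: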

    def predict(self, X, weights=None):

        # Prepend the constant intercept feature when the model uses one.
        if self.fit_intercept:
            X = add_dummy_feature(X)

        # Default to the weights learned by fit().
        if weights is None:
            weights = self.coef_

        return X.dot(weights)

Now that we have covered what Linear Regression needs to learn, we’ll focus on how it learns it. The goal of the model is to find the set of weights that minimizes a loss function. This loss is typically defined as the Mean Squared Error of the model’s predictions.
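
Written out, the loss implemented below is \(\mathcal{L}(w) = \frac{1}{2n}\sum_{i=1}^{n}(y_i - x_i^T w)^2\), where \(n\) is the number of data points and \(x_i\) is the feature vector of the \(i\)-th point. The factor of one half is a common convention: it cancels out when taking the gradient and does not change which weights minimize the loss.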

    def _loss_function(self, X, y):

        # Half Mean Squared Error, expressed as a function of the weights.
        def prediction_loss(weights):
            return 0.5 * np.mean((y - self.predict(X, weights)) ** 2)

        return prediction_loss

Now, we could just try a lot of weights and keep whichever work best, but there are smarter ways to do this. Thankfully, the loss above can be minimized in closed form: setting its gradient to zero and rearranging gives the weights directly, \(w = (X^TX)^{-1}X^{T}y\).
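
In more detail: the gradient of the loss with respect to the weights is \(\nabla_w \mathcal{L}(w) = -\frac{1}{n} X^T (y - Xw)\). Setting it to zero gives the normal equations, \(X^T X w = X^T y\), and multiplying both sides by \((X^T X)^{-1}\), assuming that inverse exists, yields the expression above.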

    def fit(self, X, y):

        # Prepend the constant intercept feature when the model uses one.
        if self.fit_intercept:
            X = add_dummy_feature(X)

        # Closed-form least-squares solution: w = (X^T X)^-1 X^T y.
        self.coef_ = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

        return self.coef_
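
As a quick sanity check, here is how the class might be used on a small, made-up dataset. The numbers are illustrative only, and the intercept shows up first in coef_ because the dummy feature is prepended as the first column:

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # generated by y = 2x + 1

model = LinearRegression(fit_intercept=True)
model.fit(X, y)

print(model.coef_)       # roughly [1., 2.]: the intercept, then the slope
print(model.predict(X))  # roughly [3., 5., 7., 9.]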

As elegant as this technique is, it doesn’t always work well in practice. It requires inverting a matrix, which is impossible when \(X^TX\) is singular and numerically unstable when it is ill-conditioned. It also doesn’t permit online learning, as the solution has to be recomputed from scratch every time a new data point is added to the data.
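
As an aside, a common way to make the closed-form fit more robust is to avoid the explicit inverse and let NumPy solve the least-squares problem directly. The variation below is only a sketch of that idea, not the approach taken in this chapter:

    def fit(self, X, y):

        if self.fit_intercept:
            X = add_dummy_feature(X)

        # np.linalg.lstsq minimizes ||X w - y||^2 using an SVD, so it does not
        # require X^T X to be invertible and is numerically more stable.
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)

        return self.coef_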

In the next chapter, we’ll discuss gradient descent, a fundamental optimization technique which addresses the pitfalls of solving Linear Regression analytically.