Chapter 5: Evaluating Classification

Machine learning is all about getting better and better at a task. Therefore, we need to define what it means to beĀ good. In the case of classification, it is better to predict as many target variables accurately as possible. But how do we represent that? There exist several performance metrics to evaluate regression.


The accuracy is a straight-forward method which tells us how many classes were correctly predicted overall. It gives a clear idea of how well a model is performing when classes are balanced. Indeed, if classes aren’t balanced, let’s say 95% class 1 and 5% class 2, we could always predict class 1 and already have an accuracy of 95%. But we might not want to never predict class 2.

def accuracy(predicted_target, target):
    return np.mean(target == predicted_target)


To remedy the flaws of simple accuracy, there exist metrics like recall. The recall for each class is defined as the number of instance of that class that were correctly predicted (true positives), divided by the total number of instances of that class (true positives and false negatives).

In other words, if a classifier only predicts class 1 and never class 2, it will have a recall of 1 for class 1 but a recall of 0 for class 2.

Recall is a metric that is calculated for each class. In order to provide a single metric for the classification problem as a whole, we typically compute a weighted average based on how many times each class appears.

def recall(predicted_target, target):

    recall_per_class = {}
    classes, counts = np.unique(target, return_counts=True)

    for c in classes:
        recall_per_class = np.sum((target == c) & (predicted_target == c)) / np.sum(target == c)

    weighted_recall = np.sum([recall_per_class * counts / len(target) for c in classes])

    return weighted_recall, recall_per_class


Another metric that remedies the flaws of simple accuracy is precision. The precision represents how accurate a model is at predicting each class. In other words, if a classifier only recalls 1 instance of class 1 but predicts it correctly, it will have a precision of 1, whereas if it predicts half of class 1 correctly, its precision will be 0.5.

Precision is a metric that is calculated for each class in a classification problem. In order to provide a single metric, we typically compute the weighted average of the precision of each class, based on how many times each class appears.

def precision(predicted_target, target):

    precision_per_class = {}
    classes, counts = np.unique(target, return_counts=True)

    for c in classes:
        precision_per_class = np.sum((target == c) & (predicted_target == c)) / np.sum(predicted_target == c)

    weighted_precision = np.sum([precision_per_class * counts / len(target) for c in classes])

    return weighted_precision, precision_per_class

F-Beta Score

A good model should have a high recall and a high precision. In some cases, one matters more than the other. The \(F_\beta\) measure is defined as the weighted harmonic mean of the recall and the precision and therefore provides a single metric combining the two. The \(\beta\) parameter influences the weight given to the precision. If it is set to 1, precision and recall are weighted equally. If it is less than 1, recall is favoured. If it is more than 1, precision is favoured. Typically, \(\beta\) is set to 1 and the measure is then called the \(F_1\) score.

def fbeta_score(predicted_target, target, beta=1.0):

    classes, counts = np.unique(target, return_counts=True)
    fbeta_score_per_class = {c:0 for c in classes}

    # Computes the recall and precision of these classes
    _, p = precision(predicted_target, target)
    _, r = recall(predicted_target, target)

    # Computes the F-beta score as the harmonic mean between precision and recall
    for c in classes:
        if beta**2 * p + r == 0:
            fbeta_score_per_class = 0 # if precision and recall are 0, then f-beta should also be zero
            fbeta_score_per_class = (1 + beta**2) * (p * r) / (beta**2 * p + r)

    weighted_fbeta_score = np.sum([fbeta_score_per_class * counts / len(target) for c in classes])

    return weighted_fbeta_score, fbeta_score_per_class

Custom Metrics

A simple example would be to use weighted versions of the aforementioned metrics. By doing this, you would loosely make it more important to perform well for certain classes than others. It could also be possible to have a fully custom metric based on a custom error function. Perhaps, your application entails that it is much worse to confuse class 1 with class 2 than it is to confuse it with class 3 for instance.

The metric should ultimately represent what it means for your classification to be good, whatever it may mean in your application.