Chapter 4: Bernoulli Naïve Bayes

Another common distribution is the Bernoulli distribution. It is used with binary features, where each feature is either present (1) or absent (0). Its advantage is that it models both the presence and the absence of a feature. If we used Bernoulli Naïve Bayes for a spam filter, for instance, the fact that certain words are missing from an e-mail would affect the outcome, not just the words that appear in it.

import numpy as np

from supervised.nb_classifier import NBClassifier


class BernoulliNB(NBClassifier):

    def _pdf(self, x, p):

        return (1.0 - x) * (1.0 - p) + x * p
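
To see what _pdf computes, here is a minimal standalone sketch of the same formula. The function name bernoulli_pmf and the value of p_word are assumptions made for illustration only; they are not part of the class above.

```python
def bernoulli_pmf(x, p):
    """Probability of observing x (0 or 1) when the feature equals 1 with probability p."""
    # For x == 1 the first term vanishes and we get p;
    # for x == 0 the second term vanishes and we get 1 - p.
    return (1.0 - x) * (1.0 - p) + x * p


p_word = 0.8  # hypothetical: the word appears in 80% of the e-mails

present = bernoulli_pmf(1, p_word)  # probability that the word is present
absent = bernoulli_pmf(0, p_word)   # probability that the word is absent
```

Because the formula returns 1 - p when x is 0, an absent word contributes to the result just as much as a present one.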

Fitting

A Bernoulli distribution has a single parameter p, the probability that the feature equals 1. For each feature, p is estimated from the number of samples in which the feature occurs.

    def _fit_evidence(self, X):

        feature_probas = []

        for feature in X.T:  # take the transpose to iterate through the features instead of the samples

            feature_probas.append(dict(count=np.sum(feature == 1),
                                       n=len(feature)))

        return np.array(feature_probas)
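
As a rough illustration of the estimate these counts lead to, the sketch below computes count / n for every feature of a small, made-up binary matrix. It uses a vectorised sum instead of the loop above; the data and variable names are assumptions for the example.

```python
import numpy as np

# Hypothetical data: 4 samples (rows) and 3 binary features (columns).
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 1]])

counts = np.sum(X == 1, axis=0)  # how often each feature equals 1
n = X.shape[0]                   # number of observed samples
p_hat = counts / n               # estimated Bernoulli parameter per feature
```

Storing count and n separately, rather than only p_hat, is what makes it possible to fold in new samples later without revisiting the old ones.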

We also keep track of the number of instances that were observed, which will be useful if we need to update the model. Fitting the likelihood is then straightforward: it amounts to fitting the evidence on the samples of each class separately.

    def _fit_likelihood(self, X, y):

        likelihood_ = []

        for c in self.classes_:

            samples = X[y == c]  # only keep samples of class c

            likelihood_.append(self._fit_evidence(samples))

        return likelihood_

Getting

Once the model is trained, we use the stored counts to compute the evidence and the likelihood of a sample. Under the naïve independence assumption, each is the product of the per-feature probabilities, which we obtain by reusing the _pdf that was defined at the beginning.

    def _get_evidence(self, sample):

        evidence = 1.0

        for i, feature in enumerate(sample):

            count = self.evidence_[i]["count"]
            n = self.evidence_[i]["n"]

            evidence *= self._pdf(x=feature, p=count / n)

        return evidence

    def _get_likelihood(self, sample, c):

        likelihood = 1.0

        for i, feature in enumerate(sample):

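            # assumes c is the position of the class in self.classes_,
            # i.e. the order in which _fit_likelihood stored the statistics
            count = self.likelihood_[c][i]["count"]
            n = self.likelihood_[c][i]["n"]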

            likelihood *= self._pdf(x=feature, p=count / n)

        return likelihood
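
The product over features can be checked by hand on a toy example. The sketch below is standalone and makes several assumptions: the statistics are invented, they are laid out as one list of count/n dictionaries per class, and the helper names are hypothetical.

```python
def bernoulli_pmf(x, p):
    # Same formula as _pdf: p if x == 1, 1 - p if x == 0.
    return (1.0 - x) * (1.0 - p) + x * p


# Hypothetical per-class statistics for 2 features.
likelihood_stats = [
    [dict(count=3, n=4), dict(count=1, n=4)],  # class 0: p = 0.75, 0.25
    [dict(count=1, n=4), dict(count=2, n=4)],  # class 1: p = 0.25, 0.50
]


def class_likelihood(sample, stats):
    """Multiply the per-feature probabilities of one sample for one class."""
    result = 1.0
    for x, s in zip(sample, stats):
        result *= bernoulli_pmf(x, s["count"] / s["n"])
    return result


sample = [1, 0]  # first feature present, second feature absent
scores = [class_likelihood(sample, stats) for stats in likelihood_stats]
```

For class 0 the sample scores 0.75 * 0.75, and for class 1 it scores 0.25 * 0.5, so this sample is far more likely under class 0. Note that the absent second feature still changes both scores.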

Updating

Updating the model means that, given new data, the feature counts and the number of observed samples have to be incremented.

    def _update_evidence(self, X):

        for i, feature in enumerate(X.T):  # iterate through the features instead of the samples

            self.evidence_[i]["count"] += np.sum(feature == 1)
            self.evidence_[i]["n"] += len(feature)

        return self.evidence_

    def _update_likelihood(self, X, y):

        for k, c in enumerate(self.classes_):  # k is the position of class c in likelihood_

            samples = X[y == c]  # only keep samples of class c

            for i, feature in enumerate(samples.T):  # iterate through the features instead of the samples

                self.likelihood_[k][i]["count"] += np.sum(feature == 1)
                self.likelihood_[k][i]["n"] += len(feature)

        return self.likelihood_
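
A useful sanity check for this kind of update is that fitting on old data and then updating with new data should give the same counts as fitting on everything at once. The sketch below illustrates that with standalone helper functions; fit_counts and update_counts are hypothetical names that mirror the methods above, and the data is made up.

```python
import numpy as np


def fit_counts(X):
    """Count, per feature, how often it equals 1 and how many samples were seen."""
    return [dict(count=int(np.sum(f == 1)), n=len(f)) for f in X.T]


def update_counts(stats, X):
    """Add the counts of new samples to existing per-feature statistics."""
    for s, f in zip(stats, X.T):
        s["count"] += int(np.sum(f == 1))
        s["n"] += len(f)
    return stats


X_old = np.array([[1, 0],
                  [1, 1]])
X_new = np.array([[0, 1],
                  [1, 1],
                  [0, 0]])

incremental = update_counts(fit_counts(X_old), X_new)
from_scratch = fit_counts(np.vstack([X_old, X_new]))
```

Both approaches end with the same count and n for every feature, which is why keeping raw counts is enough to support incremental learning here.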