Chapter 1: What is Machine Learning?

The term “Machine Learning” is often considered to have been coined by Arthur Samuel in 1959. Samuel was a pioneer in the field of artificial intelligence and is famous for having written a checkers-playing program, one of the first self-learning programs in the world. He came up with the following definition for Machine Learning:

“Machine Learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.” 

The most important part of the definition is what he calls “being explicitly programmed”. Indeed, one could say that machine learning is implicit programming whereas traditional computer science is explicit programming. Let’s say that we would like to build a spam detection system for an e-mail client. The goal of such a system would be to correctly identify spam whilst never filtering out e-mails that are not spam.

Let’s assume that our input is an e-mail object containing information about the content of the e-mail, the sender, the time it was sent, and so on.

class Email:
    def __init__(self, content, sender, **kwargs):
        self.content = content
        self.sender = sender

spams = [Email(content="Hey, this is spam, give me your money", sender="scammer@aol.com"),
         Email(content="Hey, give me your money please!", sender="spammer@aol.com")]

hams = [Email(content="Hey, this is mom", sender="mum@aol.com"),
        Email(content="Hey, give me your address please!", sender="dad@aol.com")]

We would like to build some function that tells us whether such an e-mail is indeed spam.

Explicit programming – The traditional computer science approach

To explicitly program a computer to detect spam e-mails, we would write code that would detail each decision the computer would have to make when being fed an e-mail as input. This could include analyzing the sender, whether the e-mail contains certain words, and so on.

A simplistic explicit program could, for instance, check whether the string “this is spam” is present in the e-mail.

class SpamFilter:

    def isSpam(self, email):
        # Rule 1: flag e-mails that contain a known spam phrase.
        return "this is spam" in email.content

But spammers may become smarter and stop using such phrases. We could therefore also create a list of blocked senders and check whether the e-mail was sent by one of them.

class SpamFilter:

    def isSpam(self, email):
        # Rule 2: also flag e-mails sent from a known spam address.
        blocked_senders = ["scammer@aol.com", "spammer@aol.com"]
        return "this is spam" in email.content or email.sender in blocked_senders

And so on. We could come up with as many rules as we’d like, and end up with a rather elaborate and complex program.
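To see why the list of rules keeps growing, here is a minimal sketch of a hand-written filter with one phrase rule and one sender rule, applied to two e-mails. The second e-mail (`new_spam`) and its sender address are hypothetical and only serve as an illustration, and the phrase rule is assumed to look for "this is spam".

```python
class Email:
    def __init__(self, content, sender):
        self.content = content
        self.sender = sender


class SpamFilter:
    # Hand-written rules: one phrase rule and one sender rule.
    def isSpam(self, email):
        blocked_senders = ["scammer@aol.com", "spammer@aol.com"]
        return "this is spam" in email.content or email.sender in blocked_senders


spam_filter = SpamFilter()

known_spam = Email(content="Hey, this is spam, give me your money", sender="scammer@aol.com")
# Hypothetical e-mail: a new sender and new wording, neither covered by a rule.
new_spam = Email(content="Send me your money today", sender="trickster@aol.com")

print(spam_filter.isSpam(known_spam))  # True
print(spam_filter.isSpam(new_spam))    # False
```

Catching the second e-mail would mean opening the code again and adding yet another rule by hand.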

Implicit programming – The machine learning approach

Implicitly programming a computer to detect spam e-mails is not completely different from explicitly programming it. We would still need to write a program that filters spam according to a set of rules. The real difference lies in how the rules are created. In the previous example, we had to write out the rules explicitly: we had to create the list of blocked senders and decide which phrase marks an e-mail as spam. In machine learning, we let the program learn these rules on its own.

So what does that look like? Our `isSpam` function would look very similar to before.

class SpamFilter:

    def isSpam(self, email):
        
        blacklisted_word_found = any([word in email.content for word in self.blacklisted_words])
        sent_by_blocked_sender = email.sender in self.blocked_senders
        
        return blacklisted_word_found or sent_by_blocked_sender

As you may have noticed, we did not define blacklisted_words or blocked_senders inside the isSpam function. Instead, we will learn what should be inside these variables in a function that we’ll call fit.

We will do that by giving it e-mails that we have already categorized as spam or ham (that is, legitimate mail). We will define blocked_senders as the set of all senders who ever sent a spam e-mail, and blacklisted_words as the set of words that appeared in at least one spam e-mail but in no ham e-mail.

class SpamFilter:

    def isSpam(self, email):
        
        blacklisted_word_found = any([word in email.content for word in self.blacklisted_words])
        sent_by_blocked_sender = email.sender in self.blocked_senders
        
        return blacklisted_word_found or sent_by_blocked_sender
    
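    def fit(self, spams, hams):
        # Learn the rules from e-mails that were already labelled as spam or ham.
        self.blocked_senders = set()
        words_found_in_spams = set()
        words_found_in_hams = set()

        for email in hams:
            for word in email.content.split():
                # Strip surrounding punctuation so that "spam," is counted as "spam".
                words_found_in_hams.add(word.strip(".,!?"))

        for email in spams:
            # Every sender of a known spam e-mail gets blocked.
            self.blocked_senders.add(email.sender)

            for word in email.content.split():
                words_found_in_spams.add(word.strip(".,!?"))

        # Blacklist the words that occur in spam but never in ham.
        self.blacklisted_words = words_found_in_spams.difference(words_found_in_hams)
        # A punctuation-only token strips to "", which is a substring of every e-mail.
        self.blacklisted_words.discard("")

        return self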

Given the e-mails that we defined above, this piece of code classifies them the same way as the explicit example. It will learn that spam and money are two words that should be blacklisted and that scammer@aol.com and spammer@aol.com should be blocked. Its advantage is that the code doesn’t have to be modified whenever a rule has to change: calling fit again on updated examples is enough.
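As a rough illustration of that last point, the sketch below fits the filter, prints what it learned, and then re-fits it after a new spam e-mail has been labelled. The classes are a condensed restatement of the ones above, under the assumption that surrounding punctuation is stripped from each word before comparison. The lottery e-mail and its sender are hypothetical.

```python
class Email:
    def __init__(self, content, sender):
        self.content = content
        self.sender = sender


class SpamFilter:
    def isSpam(self, email):
        blacklisted_word_found = any(word in email.content for word in self.blacklisted_words)
        sent_by_blocked_sender = email.sender in self.blocked_senders
        return blacklisted_word_found or sent_by_blocked_sender

    def fit(self, spams, hams):
        # Assumption: words are compared after stripping surrounding punctuation.
        def words(emails):
            found = {w.strip(".,!?") for e in emails for w in e.content.split()}
            return found - {""}  # drop tokens that consisted only of punctuation

        self.blocked_senders = {email.sender for email in spams}
        self.blacklisted_words = words(spams) - words(hams)
        return self


spams = [Email(content="Hey, this is spam, give me your money", sender="scammer@aol.com"),
         Email(content="Hey, give me your money please!", sender="spammer@aol.com")]

hams = [Email(content="Hey, this is mom", sender="mum@aol.com"),
        Email(content="Hey, give me your address please!", sender="dad@aol.com")]

spam_filter = SpamFilter().fit(spams, hams)
print(sorted(spam_filter.blacklisted_words))  # ['money', 'spam']
print(sorted(spam_filter.blocked_senders))    # ['scammer@aol.com', 'spammer@aol.com']

# Hypothetical new spam that none of the learned rules cover yet.
new_spam = Email(content="You won the lottery, claim your prize", sender="lotto@aol.com")
print(spam_filter.isSpam(new_spam))  # False

# Once the e-mail has been labelled as spam, re-fitting updates the rules.
spam_filter.fit(spams + [new_spam], hams)
print(spam_filter.isSpam(new_spam))  # True
```

Only the training data changed between the two calls to isSpam; the code of the filter stayed the same.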