Chapter 1: What is Machine Learning?
“Machine Learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.”
The most important part of this definition is the phrase “being explicitly programmed”. Indeed, one could say that machine learning is implicit programming, whereas traditional computer science is explicit programming.
Let’s say that we would like to build a spam detection system for an e-mail client. The goal of such a system would be to correctly identify spam whilst never filtering out e-mails that are not spam (often called “ham”).
Let’s assume that our input is an e-mail object containing information about the content of the e-mail, the sender, the time it was sent, and so on.
    class Email:
        def __init__(self, content, sender, **kwargs):
            self.content = content
            self.sender = sender


    spams = [
        Email(content="Hey, this is spam, give me your money", sender="scammer@aol.com"),
        Email(content="Hey, give me your money please!", sender="spammer@aol.com"),
    ]
    hams = [
        Email(content="Hey, this is mom", sender="mum@aol.com"),
        Email(content="Hey, give me your address please!", sender="dad@aol.com"),
    ]
We would like to build some function that tells us whether such an e-mail is indeed spam.
Explicit programming – The traditional computer science approach
To explicitly program a computer to detect spam e-mails, we would write code that would detail each decision the computer would have to make when being fed an e-mail as input. This could include analyzing the sender, whether the e-mail contains certain words, and so on.
A simplistic explicit program could, for instance, check whether the string “this is a spam” is present in the e-mail.
    class SpamFilter:
        def isSpam(self, email):
            return "this is a spam" in email.content
But spammers may become smarter and stop using such phrases. Therefore, we could also create a list of blocked senders and check whether the e-mail was sent by one of them.
    class SpamFilter:
        def isSpam(self, email):
            blocked_senders = ["scammer@aol.com", "spammer@aol.com"]
            return "this is a spam" in email.content or email.sender in blocked_senders
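To see how these hand-written rules behave, the sketch below runs the filter on the four sample e-mails from earlier. It repeats the `Email` and `SpamFilter` definitions so that it runs on its own. The names `flags`, `new_spam` and `missed`, and the address `new-scammer@aol.com`, are made up for this illustration.

```python
class Email:
    def __init__(self, content, sender, **kwargs):
        self.content = content
        self.sender = sender


class SpamFilter:
    def isSpam(self, email):
        # Both rules are hard-coded: changing either means editing this method.
        blocked_senders = ["scammer@aol.com", "spammer@aol.com"]
        return "this is a spam" in email.content or email.sender in blocked_senders


spams = [
    Email(content="Hey, this is spam, give me your money", sender="scammer@aol.com"),
    Email(content="Hey, give me your money please!", sender="spammer@aol.com"),
]
hams = [
    Email(content="Hey, this is mom", sender="mum@aol.com"),
    Email(content="Hey, give me your address please!", sender="dad@aol.com"),
]

spam_filter = SpamFilter()
flags = [spam_filter.isSpam(email) for email in spams + hams]
print(flags)  # [True, True, False, False]

# Hypothetical case: the same spam text sent from an address we have not blocked.
new_spam = Email(content="Hey, this is spam, give me your money",
                 sender="new-scammer@aol.com")
missed = spam_filter.isSpam(new_spam)
print(missed)  # False: neither hard-coded rule matches
```

On this data only the sender rule fires: the first spam e-mail says "this is spam", not "this is a spam", so the phrase check never matches. As soon as the spammer switches address, the filter misses the e-mail until someone edits the code.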
And so on. We could come up with as many rules as we’d like, and end up with a rather elaborate and complex program.
Implicit programming – The machine learning approach
Implicitly programming a computer to detect spam e-mails is not completely different from explicitly programming it. We would still need to write a program that is able to filter spam given a set of rules. The real difference lies in how the rules are created. In the previous example, we had to explicitly write out what the rules were. We had to create the list of blocked senders and define that “this is a spam” is a string that is indicative of spam e-mails. In machine learning, we let the program learn these rules on its own.
So what does that look like? Our `isSpam` function would look very similar to before.
    class SpamFilter:
        def isSpam(self, email):
            blacklisted_word_found = any(word in email.content for word in self.blacklisted_words)
            sent_by_blocked_sender = email.sender in self.blocked_senders
            return blacklisted_word_found or sent_by_blocked_sender
As you may have noticed, we did not define `blacklisted_words` or `blocked_senders` inside the `isSpam` function. Instead, we will learn what should be inside these variables in a function that we’ll call `fit`.
We will do that by giving it e-mails that we have already categorized as spam or ham. We will define `blocked_senders` as the set of all senders who have ever sent a spam e-mail, and `blacklisted_words` as the set of words that appeared in a spam e-mail and not in a ham e-mail.
    class SpamFilter:
        def isSpam(self, email):
            blacklisted_word_found = any(word in email.content for word in self.blacklisted_words)
            sent_by_blocked_sender = email.sender in self.blocked_senders
            return blacklisted_word_found or sent_by_blocked_sender

        def fit(self, spams, hams):
            self.blocked_senders = set()
            words_found_in_spams = set()
            words_found_in_hams = set()
            # Collect every whitespace-separated word seen in a ham e-mail.
            for email in hams:
                for word in email.content.split():
                    words_found_in_hams.add(word)
            # Block every spam sender and collect the words used in spam.
            for email in spams:
                self.blocked_senders.add(email.sender)
                for word in email.content.split():
                    words_found_in_spams.add(word)
            # Blacklist only the words that never appeared in a ham e-mail.
            self.blacklisted_words = words_found_in_spams.difference(words_found_in_hams)
            return self
On the e-mails that we defined above, this piece of code classifies every message the same way as the explicit example. It will learn that `money` and `spam,` should be blacklisted (the comma stays attached because `split()` only splits on whitespace) and that scammer@aol.com and spammer@aol.com should be blocked. But it does have the advantage that the code doesn’t have to be modified whenever a rule has to be changed: we only need to call `fit` again with new examples.
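As a minimal sketch of that advantage, the example below fits the filter on the sample e-mails, inspects what it learned, and then refits it after a new spam e-mail has been labeled. The definitions are repeated so that the block runs on its own. The new e-mail, its sender `lottery@aol.com`, and the variable names `learned_words`, `learned_senders`, `before` and `after` are assumptions made for this illustration.

```python
class Email:
    def __init__(self, content, sender, **kwargs):
        self.content = content
        self.sender = sender


class SpamFilter:
    def isSpam(self, email):
        blacklisted_word_found = any(word in email.content for word in self.blacklisted_words)
        sent_by_blocked_sender = email.sender in self.blocked_senders
        return blacklisted_word_found or sent_by_blocked_sender

    def fit(self, spams, hams):
        self.blocked_senders = set()
        words_found_in_spams = set()
        words_found_in_hams = set()
        for email in hams:
            words_found_in_hams.update(email.content.split())
        for email in spams:
            self.blocked_senders.add(email.sender)
            words_found_in_spams.update(email.content.split())
        self.blacklisted_words = words_found_in_spams.difference(words_found_in_hams)
        return self


spams = [
    Email(content="Hey, this is spam, give me your money", sender="scammer@aol.com"),
    Email(content="Hey, give me your money please!", sender="spammer@aol.com"),
]
hams = [
    Email(content="Hey, this is mom", sender="mum@aol.com"),
    Email(content="Hey, give me your address please!", sender="dad@aol.com"),
]

spam_filter = SpamFilter().fit(spams, hams)
learned_words = sorted(spam_filter.blacklisted_words)
learned_senders = sorted(spam_filter.blocked_senders)
print(learned_words)    # ['money', 'spam,']
print(learned_senders)  # ['scammer@aol.com', 'spammer@aol.com']

# Hypothetical new spam that matches none of the learned rules.
new_spam = Email(content="Claim free prize today", sender="lottery@aol.com")
before = spam_filter.isSpam(new_spam)
print(before)  # False

# Once the e-mail has been labeled as spam, we refit; no code changes needed.
spam_filter.fit(spams + [new_spam], hams)
after = spam_filter.isSpam(new_spam)
print(after)  # True
```

The rules changed between the two calls to `isSpam`, but the source code did not: only the labeled examples passed to `fit` are different.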