How can we extend Bayes spam filtering to multiple keywords?



In the lesson on spam filtering using Bayes’ theorem, we’re asked to determine whether an email should be classified as spam given that it contains the word “enhancement”. In real-world scenarios, we can’t make a good decision based on only one word. How can we expand this approach to multiple words?


Recall that in P(A|B), A and B are just events, so either one can be more complex than the presence of a single word. To make this Bayes spam filter more realistic, we can look at all of the words w1, w2, ... , wn in the email and consider

P(spam | w1, w2, ... , wn)

the probability that the email is spam given all the words in the email under consideration. It may look more involved, but the probabilities are not much harder to compute; B is now the event that the email contains all of w1, w2, ..., wn:

P(w1, w2, ... , wn) is the probability that an arbitrary email contains all of these words
P(w1, w2, ... , wn | spam) is the probability that a spam email contains all of these words
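
With these pieces, Bayes' theorem has the same shape as in the single-word case: P(spam | w1, w2, ... , wn) = P(w1, w2, ... , wn | spam) * P(spam) / P(w1, w2, ... , wn). As a rough sketch of how this could be computed, the Python snippet below estimates each probability by counting emails. The corpus `EMAILS` and the function names are made up for illustration and are not from the lesson.

```python
from fractions import Fraction

# Hypothetical toy corpus (not from the lesson): (words in the email, is_spam)
EMAILS = [
    ({"enhancement", "free", "offer"}, True),
    ({"enhancement", "free"}, True),
    ({"free", "meeting"}, True),
    ({"enhancement", "free", "report"}, False),
    ({"meeting", "report"}, False),
    ({"lunch", "meeting"}, False),
    ({"report", "offer"}, False),
    ({"lunch"}, False),
]


def contains_all(email_words, words):
    """True if the email contains every word in `words`."""
    return set(words) <= email_words


def p_spam_given_words(words, emails=EMAILS):
    """Estimate P(spam | w1, ..., wn) with Bayes' theorem from email counts."""
    spam_emails = [email_words for email_words, is_spam in emails if is_spam]

    # P(spam): fraction of all emails that are spam
    p_spam = Fraction(len(spam_emails), len(emails))
    # P(w1, ..., wn): fraction of all emails containing every word
    p_words = Fraction(
        sum(contains_all(email_words, words) for email_words, _ in emails),
        len(emails),
    )
    # P(w1, ..., wn | spam): fraction of spam emails containing every word
    p_words_given_spam = Fraction(
        sum(contains_all(email_words, words) for email_words in spam_emails),
        len(spam_emails),
    )

    if p_words == 0:
        raise ValueError("no email in the corpus contains all of these words")
    return p_words_given_spam * p_spam / p_words


print(p_spam_given_words(["free"]))                 # 3/4
print(p_spam_given_words(["enhancement", "free"]))  # 2/3
```

In this toy data, three of the four emails containing "free" are spam, and two of the three emails containing both "enhancement" and "free" are spam, which matches the printed values. Counting exact combinations like this needs a lot of data once there are many words, so treat it as an illustration of the formula rather than a practical filter.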