Bayesian Email Filtering

Ever since I read Jim Daniel’s article on SitePoint regarding Bayesian spam filtering, I’ve been wanting to get my hands on it. The article concerns a product called Spamnix, which is currently only available for Qualcomm Eudora. He listed a few suggestions for Outlook (and/or Outlook Express) users, but nothing that looked too promising.

Well, yesterday I took another look at Mozilla’s Thunderbird mail client. Thunderbird is meant as a running mate for Firebird. Turns out, it has resident Bayesian spam filtering. These Mozilla people think of everything, don’t they?

Bayesian filtering is an adaptive form of spam filtering: it learns from the messages you mark as junk or not junk. This means there’s a training period of about two weeks before it starts to get really accurate. However, since everyone’s email habits are different (spam is in the eye of the beholder), it’s the best solution I’ve seen so far.

Rev. Thomas Bayes pioneered the math involved in Bayesian filters way back in the 1700s. It has a lot to do with probabilities. Who’d have thought it would be applied to technologies this guy probably never dreamed about? You can learn more about the technique behind Bayesian filtering in an article called “A Plan for Spam”, by Paul Graham.
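For the curious, here is a rough sketch of the idea in Python. It is a bare-bones naive Bayes classifier, not Thunderbird’s actual code and not exactly the formula from Graham’s article. The class name `TinyBayesFilter`, the tokenizer, and the add-one smoothing are my own assumptions for illustration. You train it by showing it messages labeled as spam or not, and it combines per-word evidence into a single spam probability.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word-like tokens (a simplistic, assumed tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class TinyBayesFilter:
    """Hypothetical minimal naive Bayes spam filter, for illustration only."""

    def __init__(self):
        self.spam_counts = Counter()  # number of spam messages containing each token
        self.ham_counts = Counter()   # number of good messages containing each token
        self.spam_total = 0
        self.ham_total = 0

    def train(self, text, is_spam):
        # Count each token once per message.
        tokens = set(tokenize(text))
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_total += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_total += 1

    def spam_probability(self, text):
        # Start from smoothed prior odds, then add each token's log likelihood ratio.
        # Treating tokens as independent is the "naive" assumption.
        log_odds = math.log((self.spam_total + 1) / (self.ham_total + 1))
        for tok in set(tokenize(text)):
            p_tok_spam = (self.spam_counts[tok] + 1) / (self.spam_total + 2)
            p_tok_ham = (self.ham_counts[tok] + 1) / (self.ham_total + 2)
            log_odds += math.log(p_tok_spam / p_tok_ham)
        # Convert log-odds back to a probability without overflowing exp().
        if log_odds >= 0:
            return 1.0 / (1.0 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1.0 + e)
```

With only a handful of training messages the scores are crude, which matches the “training period” described above: the word counts need real mail before they mean much.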

Anyway, I’ve been playing with Thunderbird now for a couple of days. Just downloaded it at work. It’s an all right little mail program. Every bit as good as Outlook Express, but a bit featureless when it comes to saving attachments. Specifically, it seems to lack drag-and-drop, and for some odd reason you can’t reply to attached emails. So, it looks like I’m not gonna be able to use it for my work email (since my manager sends me thousands of attached support requests every day). The junk mail filter is already catching some spam after only a few hours of training. It’s pretty dumb though, which is to be expected. I’ll let you know if it gets smarter.

  • Doesn’t SpamAssassin do Bayesian filtering?

    If you’re into Bayesian stuff, then take Machine Learning. You’ll learn all about it. It’s actually one of the more straightforward machine learning techniques, although it requires a lot of data.

    Bayes’ Law can be found at MathWorld. Along with some independence assumptions, this is the basis of Bayesian learning, the idea behind Bayesian mail filtering.

  • Joey

    Yeah, SpamAssassin does use Bayes filtering. It also has a bunch of its own rules, sort of a hybrid approach. It is server side though, so I’m not sure how you are supposed to train the filter (unless it comes pre-trained?).

    FellowSites accounts have access to SpamAssassin, but I honestly haven’t played with it much (or rather, I’ve never played with it). Maybe I should turn it on and try it out, eh? [Smiley Face]

  • You can set up SpamAssassin to work locally if you can get your email client to pipe your email through an arbitrary program as it receives it. SpamAssassin will then add the X-Spam-Flag header to your email, which you can then filter on.

  • Jamie

    I don’t remember having this problem before Thunderbird version 0.6, but Thunderbird no longer has these issues. I have successfully dragged and dropped an Excel spreadsheet from the e-mail client to the desktop, and replied to and forwarded mail with said attachment. I’ve used Thunderbird as my work e-mail client for a year with no problems. I too would like to try out the Bayesian filtering, but I do not receive spam at the work address. :)

  • I have a simple proposal. I have noticed that most spam messages play tricks like the one described in the article by Paul Graham (“c0ck”). I think the biggest measurable difference between spam and regular mail is the percentage of misspelled words. If you ran a spell checker over all your incoming messages, you would find that normal messages have less than 20% misspelled words (most likely due to jargon) and spam would most likely have more than 50% misspelled words. This would separate the two easily and efficiently, because spammers have to use misspelled words to avoid spam blockers. And if a false positive did get through, the moron had 50% or more misspelled words, so do you really care what the moron had to say anyways?

  • Sorry it took me so long to moderate your comment, Jay. I think you may be onto something.
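
The spell-check proposal in the comments above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are made up, the `dictionary` is whatever set of known words you supply (no real spell checker is used), and the 50% cutoff is simply the figure suggested in the comment, not a measured value.

```python
import re


def misspelled_fraction(text, dictionary):
    """Return the fraction of word tokens not found in `dictionary`.

    `dictionary` is an assumed set of lowercase known-good words.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Ignore tokens with no letters at all (plain numbers, stray apostrophes).
    words = [t for t in tokens if re.search(r"[a-z]", t)]
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)


def looks_like_spam(text, dictionary, threshold=0.5):
    """Flag a message when more than `threshold` of its words are unknown."""
    return misspelled_fraction(text, dictionary) > threshold
```

In practice the dictionary would need to include names and workplace jargon, or ordinary mail would creep toward the cutoff too.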