Filtering Spam – Part I

Judging by the contents of the emails I receive on a daily basis, I can fairly safely say that the world is full of people who don’t bother to look for technical vulnerabilities to hack my computer, but simply try to trick me into compromising my personal files or banking information for their own financial gain. It’s clearly beyond my reach to personally stop those naughty tricksters, but there is at least one thing I can do to filter out those attempts: deal with one of their favorite attack vectors, email.

Scanning email for malicious activity is essentially a losing battle. Attackers have the initiative because, for the sake of communication, unknown threats are more often than not passed on as if they were legitimate. Attacks of this type, directed at the person reading the email, are generally referred to as phishing attacks. What makes them special is that they depend on social engineering rather than malware in the first stage of the attack. Filtering that rubble out of your daily email can be tiresome, just as it is with unsolicited commercial bulk email, or spam. As it turns out, phishing can be filtered in more or less the same way as spam, provided you use a properly trained spam filter.

This post is the first part of a series about setting up a spam filter on a multi-user email server, and it focuses on training the filter. In the next part I’ll discuss how to use this to create a server that can receive email from the vast majority of email servers around the globe.

It has been a while since Bayesian filtering was introduced to email. Using a Bayesian classifier for a random population of email users poses a few challenges. First, in my personal experience, given a diverse group of people using email, you should not assume that everybody in your email domain has the same idea of what spam is. Individuals have preferences, and those must be respected. It’s simply not up to the system administration department to decide for anybody in their user group where to shop or what (not) to buy, unless a company policy dictates otherwise.

Second, Bayesian filters need to be trained constantly and consistently in order to work properly. That requires email messages, and a lot of them too. Email is private, so I don’t have that luxury and cannot do the training on the users’ behalf. Users are therefore required to train their own classifier, either by bringing their own messages or simply by training it over time with whatever arrives. Looking for a solution to this a long time ago, I found bogofilter.

Bogofilter is a Bayesian classifier with modifications. If you’ve read about anti-spam solutions, you must surely have at least dipped your toes into some of the mathematical background of this type of filtering (for example in the CRM114 manual, page 159), and I’m not touching that with a 10-foot pole 😉 And although bogofilter may not be the easiest or most feature-rich solution, it works pretty much OK. Here’s how:

Although Bayesian filtering can be done from an email client, I’m using a client-server setup here, with bogofilter running on the server. In that case, Bayesian filtering with bogofilter has essentially two aspects: filtering and training. If you want every user on your server to be able to train a personal filter, each of them needs a personal configuration of the filter, next to a personal email directory to store received emails. Most systems store these in the user’s home directory. Users can interact with bogofilter from their email client over the IMAP protocol, provided the server uses a per-user Maildir directory layout. The latter is a requirement, as it is what enables them to train their own classifier.

Setting up a server with dovecot to provide basic IMAP services on a Linux/BSD system is not that difficult. Don’t worry if this all sounds a bit vague, I’ll get to it in more detail in the next part of this series. Once you have that up and running, the following shell commands will add directories to the Maildir directory of each user, assuming you have root access to the system and every user has a home directory in /home that contains the Maildir directory:

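umask 077
for i in /home/*/Maildir
do
        home=${i%/Maildir}
        # Find the account whose home directory (field 6 of /etc/passwd) matches
        user=$(awk -F: -v h="$home" '$6 == h {print $1; exit}' /etc/passwd)
        [ -n "$user" ] || continue
        for j in "$i/.Train-as-Spam/cur" "$i/.Train-as-Ham/cur" "$i/.Spam/cur" "$i/.Ham/cur"
        do
                mkdir -p "$j"
                # mkdir -p also creates the parent folder as root, so hand over both
                chown -R "$user" "${j%/cur}"
        done
done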

This creates the following directories for each user:

  • Train-as-Spam
    Emails placed in this directory will be regarded as spam and moved to the Spam directory after training.
  • Train-as-Ham
    Emails placed in this directory will be regarded as valid messages or ham and moved to the Ham directory after training.
  • Spam
    This directory contains the collection of spam email that was used to train the classifier.
  • Ham
    This directory contains the collection of valid email or ham that was used to train the classifier.
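
If a user does have shell access, the same folders can be created without root. The following is a minimal sketch under that assumption. The function name make_train_folders is my own invention, and creating new and tmp next to cur follows the general Maildir layout (tmp, new, cur) rather than anything the commands above do.

```shell
#!/bin/sh
# make_train_folders MAILDIR
# Hypothetical helper: creates the four training folders under MAILDIR,
# each with the cur, new and tmp subdirectories of the Maildir layout.
make_train_folders() {
    maildir=$1
    (
        # The subshell keeps the umask change from leaking into the caller
        umask 077
        for folder in Train-as-Spam Train-as-Ham Spam Ham; do
            for sub in cur new tmp; do
                mkdir -p "$maildir/.$folder/$sub" || exit 1
            done
        done
    )
}

# Usage, run as the user who owns the mailbox:
#   make_train_folders "$HOME/Maildir"
```

Because the user runs this, no chown is needed afterwards: everything is created with the right owner from the start.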

In short: training the filter can be done from the user’s email client and can be as simple as dragging an email to one of the “Train-as” folders. There’s no need for a user to log on to the mail server or use command-line voodoo; this can all be prepared for them in advance. As soon as the message has been used for training, it is moved to either “Spam” or “Ham”. A script that does this periodically, from a crontab, is below.

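#!/bin/sh
# Author: Jacco van Buuren <jacco /at/ bjvb /dot/ nl>
# Descr: Train bogofilter as a user without access to the commandline
# Name this script and place it in /home/bin
# Run this from cron:
# */5 * * * * /home/bin/ spam && /home/bin/ ham

# Mail folders. These paths are an assumption: they match the folders
# created earlier under ~/Maildir. Adjust them to your own layout.
MAILDIR="$HOME/Maildir"
TRAIN_SPAM_DIR="$MAILDIR/.Train-as-Spam/cur"
TRAIN_HAM_DIR="$MAILDIR/.Train-as-Ham/cur"
SPAM_DIR="$MAILDIR/.Spam/cur"
HAM_DIR="$MAILDIR/.Ham/cur"

umask 077

[ ! -d "$TRAIN_SPAM_DIR" ] && mkdir -p "$TRAIN_SPAM_DIR"
[ ! -d "$TRAIN_HAM_DIR" ] && mkdir -p "$TRAIN_HAM_DIR"
[ ! -d "$SPAM_DIR" ] && mkdir -p "$SPAM_DIR"
[ ! -d "$HAM_DIR" ] && mkdir -p "$HAM_DIR"

case "$1" in
spam|SPAM)
        # Take the file list once, so a message that arrives while
        # bogofilter runs is not moved without having been trained on.
        set -- "$TRAIN_SPAM_DIR"/*
        [ -e "$1" ] || exit 0
        # -N unregisters the message as ham, -s registers it as spam
        /usr/local/bin/bogofilter -Ns -B "$@"
        mv "$@" "$SPAM_DIR"
        ;;
ham|HAM)
        set -- "$TRAIN_HAM_DIR"/*
        [ -e "$1" ] || exit 0
        # -S unregisters the message as spam, -n registers it as ham
        /usr/local/bin/bogofilter -Sn -B "$@"
        mv "$@" "$HAM_DIR"
        ;;
*)
        echo "$1 -- Not implemented. Use HAM or SPAM. Aborted" >&2
        exit 1
        ;;
esac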
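
A variation on the training step inside the case branches, as a sketch only. Handing every file to bogofilter on one command line can run into the argument-length limit when there is a large backlog, so this version trains message by message and only archives a message once the training command has succeeded. The function name train_each is hypothetical. The sketch also assumes that the training command exits with status 0 on success; I have not verified that against bogofilter’s exit codes when it is registering messages, so check the man page before relying on it. The usage lines rely on bogofilter’s -I option, which reads the message from a file.

```shell
#!/bin/sh
# train_each SRC DST CMD [ARGS...]
# Hypothetical helper: runs CMD ARGS <message> for every file in SRC and
# moves the message to DST only if CMD exits with status 0.
train_each() {
    src=$1
    dst=$2
    shift 2
    failed=0
    for msg in "$src"/*; do
        # Skips the unexpanded pattern when SRC is empty
        [ -f "$msg" ] || continue
        if "$@" "$msg"; then
            mv "$msg" "$dst"/
        else
            echo "training failed, left in place: $msg" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Usage (paths as assumed in the training script):
#   train_each "$HOME/Maildir/.Train-as-Spam/cur" "$HOME/Maildir/.Spam/cur" \
#       /usr/local/bin/bogofilter -Ns -I
#   train_each "$HOME/Maildir/.Train-as-Ham/cur" "$HOME/Maildir/.Ham/cur" \
#       /usr/local/bin/bogofilter -Sn -I
```

A message that fails stays in its “Train-as” folder and is simply retried on the next run. The rest of the training script stays as it is.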

The training script runs from cron, so each user needs a crontab with an entry like this one:

*/5 * * * * /home/bin/ spam && /home/bin/ ham

…which assumes you’ve stored the script in /home/bin and named it ‘’. Don’t forget executable rights (chmod +x /home/bin/
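
Since the users are not supposed to need the command line, somebody has to put that entry into every crontab. A minimal sketch of how that could be done without adding the same line twice follows. The helper name merge_cron_line is hypothetical, and the usage lines assume a crontab command that reads the new table from standard input when given “-”, which is common but worth checking on your system.

```shell
#!/bin/sh
# merge_cron_line LINE
# Hypothetical helper: reads an existing crontab on stdin and prints it,
# with LINE appended unless an identical line is already present.
merge_cron_line() {
    line=$1
    existing=$(cat)
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    # -F: fixed string, -x: whole line, -q: quiet; "--" guards the leading "*"
    if ! printf '%s\n' "$existing" | grep -Fxq -- "$line"; then
        printf '%s\n' "$line"
    fi
}

# Usage, as the user in question, where $entry holds the crontab line
# shown earlier with the real script name filled in:
#   crontab -l 2>/dev/null | merge_cron_line "$entry" | crontab -
```

Running it a second time leaves the crontab unchanged, so it is safe to repeat for users who already have the entry.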


Bayesian filtering only works when properly trained, and this is where things get kind of fuzzy: in order to filter out spam, you first need to receive spam so you can train your filter with it. Sure, you can find samples online, and some will likely match most of whatever spam you may receive, but the best filters are the ones you’ve trained with your personal spam and ham. Phishing, and especially spear phishing, is the most difficult case, because spear phishing is directed not at an audience but at just one person: you. Spear phishing can get very personal, and reportedly con artists have been using phone calls to further entice their victims. So even when the phishing email ends up properly in the spam box, don’t be too surprised if you get a call from someone claiming to have sent you an email with an offer you actually should refuse 😉

As you might have noticed, the filter in this post is pretty much useless so far: it is not yet acting as a filter, because there is no email arriving yet. So, in the next part of this series: actual use of the trained classifier as an email filter!
