Tuesday, July 08, 2008

Fighting spam with open-source tools

Being on many publicly accessible mailings lists exposes one's email address to spammers. So it happens that I receive a hefty amount of spam, i.e. 400 to 500 messages per day. Believe me when I say that pressing the "del" key for 15 minutes is not a good way to start a day.

Over the years I have tried several methods of varying complexity in order to cope with the scourge of spam.

We host our own email server. It uses Postfix. Postfix supports three strategies for filtering content. Initially, the "before queue, external" strategy seemed the most attractive to me. However, most of the examples on the web are based on the "after queue, external" strategy. So I decided to follow recipes found on the web for integrating SpamAssassin and postfix. Actually, the basic integration example found on the spamassassin wiki is an excellent start. Once you've got the basic version working, a simple variation thereof gives you the ability to quarantine messages. If you are not satisfied by just following recepies but would like to understand how the various pieces fit together, then you should read "Fighting Spam and SpamAssasin and Postfix".

Being essentially a rule engine, SpamAssassin can be integrated with other spam-fighting tools such as DCC and Razor. Given that we use Gentoo as our Linux distribution, for us this was easy as issuing the following two commands as root:
emerge mail-filter/razor
emerge mail-filter/dcc

I then had to change the following two lines in the /etc/spamassassin/local.cf file, from:
use_razor2  0
use_dcc 0
to:
use_razor2  1
use_dcc 1
SpamAssassin assigns a score for each message it filters. SpamAssassin was configured to consider messages with scores 6.0 or above as spam. Such messages have their subject lines modified to contain "**** SPAM ****" as a prefix. SpamAssassin, or any spam-figting tool, may incorrectly identify a legitimate message as spam. Such occurrences are called false positives. They must be avoided, unless you don't mind loosing legitimate correspondence.

To reduce the probability of false positives, we adopted a two pronged strategy. Messages high spam scores, 9.0 and above, are quarantined in a special directory accessible only to the system administrator. They are not delivered to the mailbox of users. However, messages scoring "low", between 6.0 and 9.0, are still delivered to the user, with "**** SPAM ****" added to the subject line. With most email clients, e.g. Thunbderbird, it is rather easy to create a filtering rule which automatically moves messages with a subject line containing "**** SPAM ****" to a special SPAM folder. We thus give the user an opportunity to double check low-scoring spam messages before deleting them.

With this configuration and in the last 24 hours, SpamAssassin has quarantined 700 messages as spam (scores of 9.0 or above) for our whole site. My own various mailboxes received 50 low-scoring spam messages. Since such messages are filtered automatically, I am not distracted by them. Only 30 messages reached my mailbox in pristine form, of which 20 were legitimate and 10 were spam. This is such as huge improvement over the 20 spam to one legitimate message ratio we had previously.

For the next couple of weeks I will be looking at the high-scoring quarantined messages to check whether legitimate messages were mistakenly identified as spam (false positives). I am happy to report that there were no false positives in the 700 high-scoring messages nor in the 50 low-scoring messages.

No comments: