How to Increase SpamAssassin Accuracy

Run a recent version

Regular updates of SpamAssassin 3.2.x rules stopped in 2008. Accuracy depends on more recent rules. Upgrade to 3.3.0 or newer.

Run sa-update daily

SpamAssassin packages often set this up already. If yours does not, run sa-update from cron once a day to pick up the latest SpamAssassin rules, which are regenerated every day.
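
The daily run can be wrapped in a small script so that spamd is restarted only when new rules actually arrived. The following is a minimal sketch, not a packaged solution. It relies on the exit codes documented in the sa-update man page (0 means rules were updated, 1 means nothing new, anything else means a problem). The function names and the SA_UPDATE and RESTART_CMD variables are hypothetical, introduced here for illustration only.

```shell
# Map an sa-update exit code to the action the wrapper should take.
sa_update_action() {
    case "$1" in
        0) echo "restart" ;;  # new rules installed, so spamd must reload them
        1) echo "none" ;;     # no fresh updates today
        *) echo "error" ;;    # lint failure or download problem
    esac
}

# Run sa-update once and act on the result.
# SA_UPDATE and RESTART_CMD are hypothetical overrides; RESTART_CMD should be
# set to whatever restarts spamd on your system (it defaults to a no-op).
run_daily_update() {
    status=0
    ${SA_UPDATE:-sa-update} || status=$?
    action=$(sa_update_action "$status")
    case "$action" in
        restart) ${RESTART_CMD:-true} ;;
        error)   echo "sa-update reported a problem (exit $status)" >&2
                 return 1 ;;
    esac
}
```

A script that defines these functions and then calls run_daily_update could be scheduled with a crontab line such as "30 3 * * * /usr/local/sbin/sa-update-daily", where the script path is only an example.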

Enable network rules

Network rules are enabled by default. Disabling them (including DNS rules) causes SpamAssassin to be wrong on about 5 times as many emails. Network tests may have been disabled by running spamassassin or spamd with the command-line argument -L or --local. DNS rules may have been disabled with "dns_available no" in local.cf. Run a local caching DNS server for efficiency.

As of 2011-03-24, without network tests, SpamAssassin is wrong 5.35 times as often on non-spam, and 4.25 times as often on spam.
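
Both ways of disabling network rules can be checked for mechanically. The sketch below is one possible approach, with hypothetical function names. The first function looks for an active "dns_available no" line in a configuration file, and the second looks for -L or --local in a list of command-line arguments. It does not detect a short flag bundled with others in one word.

```shell
# Succeed if the given config file has an uncommented "dns_available no" line.
dns_disabled_in() {
    grep -Eq '^[[:space:]]*dns_available[[:space:]]+no([[:space:]]|$)' "$1"
}

# Succeed if the given arguments contain -L or --local as a separate word.
local_only_args() {
    for arg in "$@"; do
        case "$arg" in
            -L|--local) return 0 ;;
        esac
    done
    return 1
}
```

For example, "dns_disabled_in /etc/mail/spamassassin/local.cf" checks a commonly used location for local.cf. The path is an assumption and may differ on your system.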

Install Pyzor and Razor

These are two helper applications with useful (network) rules. If they're installed correctly, the debug output of SpamAssassin will include:

Apr 14 16:24:37.315 [4709] dbg: plugin: loading Mail::SpamAssassin::Plugin::Pyzor from @INC
Apr 14 16:24:37.318 [4709] dbg: pyzor: network tests on, attempting Pyzor
Apr 14 16:24:37.318 [4709] dbg: plugin: loading Mail::SpamAssassin::Plugin::Razor2 from @INC
Apr 14 16:24:37.381 [4709] dbg: razor2: razor2 is available, version 2.84
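
To avoid reading the whole debug log by eye, the two key lines can be searched for. This is a rough sketch with a hypothetical function name. It matches the message text shown above, which may differ between SpamAssassin versions.

```shell
# Read SpamAssassin debug output on stdin and report whether the
# Pyzor and Razor2 lines quoted above were seen.
check_helpers() {
    log=$(cat)
    case "$log" in
        *"pyzor: network tests on, attempting Pyzor"*) echo "pyzor: ok" ;;
        *) echo "pyzor: not seen" ;;
    esac
    case "$log" in
        *"razor2: razor2 is available"*) echo "razor2: ok" ;;
        *) echo "razor2: not seen" ;;
    esac
}
```

Debug output goes to standard error, so a typical invocation would be "spamassassin -D < sample.eml 2>&1 >/dev/null | check_helpers", where sample.eml stands for any test message you have at hand.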

Verify AWL and the Bayesian classifier aren't poisoned

The AutoWhitelist (AWL) and the Bayesian classifier, when trained automatically, can be trained incorrectly and then score email wrongly. Verify that they are providing useful scores (the AWL and BAYES_* tests): positive scores for spam and negative scores for ham. They can be disabled with:

use_auto_whitelist 0
use_bayes 0

To only disable automatic training of the Bayesian classifier:

bayes_auto_learn 0

Remove any SARE rules

SARE rules have not been updated in years, and are therefore actively harmful.

Enable Sought rules

SoughtRules is a custom rule set generated from spam 4 times a day by a SpamAssassin developer.

Run SPF at your MTA

SPF is intended to operate on the envelope sender (SMTP protocol MAIL FROM) which is not available in a standard way by the time the email gets to SpamAssassin. The solution is to run SPF at your MTA (Message Transfer Agent, such as Postfix, Exim, Qmail, Sendmail, etc.). This is, of course, dependent on what software you're using, but it should insert a Received-SPF: header for use by SpamAssassin. If you do not run SPF at your MTA, you really should set ignore_received_spf_header 1 so you don't end up honoring headers inserted by spammers.

An option for Postfix: https://launchpad.net/postfix-policyd-spf-perl/
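
Once SPF is running at the MTA, it is worth confirming that delivered mail really carries the header. The following is a small sketch with a hypothetical function name. It inspects only the header block of a message read from standard input and assumes plain LF line endings.

```shell
# Succeed if the message on stdin has a Received-SPF header.
has_received_spf() {
    # sed stops at the first blank line, so the message body is never searched.
    sed '/^$/q' | grep -qi '^Received-SPF:'
}
```

Running "has_received_spf < message.eml && echo present" against a recently delivered message (message.eml is a placeholder) shows whether the MTA is inserting the header.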

Use sa-learn to manually train the Bayesian classifier

If it's worth the time to increase the accuracy of filtration of your own personal email, you can manually sort it into ham and spam folders, and then use sa-learn to train the classifier on them. This can work for a group too, provided the group is well trained (for example, not to classify mailing lists they've subscribed to but lost interest in as spam).
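
Training from two hand-sorted folders might look like the sketch below. The function name and the SA_LEARN override are hypothetical. The --ham and --spam options are standard sa-learn options, and the folder paths are supplied by the caller.

```shell
# Train the Bayesian classifier from a ham folder and a spam folder.
# SA_LEARN is a hypothetical override, e.g. SA_LEARN=echo for a dry run.
train_bayes() {
    ham_dir=$1
    spam_dir=$2
    ${SA_LEARN:-sa-learn} --ham "$ham_dir" || return 1
    ${SA_LEARN:-sa-learn} --spam "$spam_dir" || return 1
}
```

With Maildir storage a call could be "train_bayes ~/Maildir/cur ~/Maildir/.Spam/cur", though the folder names depend on your mail client. For mbox files, sa-learn also needs its --mbox option, which this sketch does not pass.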

Pick a useful threshold

The default threshold is 5, and the scores of all of the tests are calculated for that threshold. A higher threshold results in fewer emails being considered spam, which reduces false positives but increases false negatives. Reducing the threshold below 5 is not recommended. This is configured with:

required_score 5
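
The effect of moving the threshold can be estimated before changing it by counting how many already-scored messages would reach a candidate value. This is an illustrative sketch with a hypothetical function name. It assumes you have extracted one score per line from your own mail, and it treats a message as spam when its score is at or above the threshold.

```shell
# Count how many scores (one per line on stdin) reach the given threshold.
count_spam() {
    awk -v t="$1" '$1 + 0 >= t + 0 { n++ } END { print n + 0 }'
}
```

For example, feeding the made-up scores 1.2, 5.0, 6.5 and 9.9 to "count_spam 5" counts three messages as spam, while "count_spam 8" counts only one.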

Filtration at your MTA

While outside the scope of SpamAssassin, it is common to do some configuration at your MTA to reject invalid mail.

(For Postfix, some settings you might want to look into: reject_non_fqdn_hostname, reject_non_fqdn_sender, reject_invalid_hostname, reject_unknown_client, reject_unknown_sender_domain, reject_unauth_destination, check_sender_access, check_helo_access, check_sender_mx_access.)

Mass-Check

Participating in NightlyMassChecks means that the scores for many of the SpamAssassin tests take your own emails into account, which is likely to increase your accuracy. You do not have to upload your emails unless you want to; only the test hit rates are submitted.
