Bayes Introduction

The Bayesian database support in SpamAssassin tries to identify spam by looking at what are called tokens: words or short pieces of text that are commonly found in spam or ham. If I've handed 100 messages containing the phrase penis enlargement to sa-learn and told it that they are all spam, then when the 101st message comes in with the same phrase, the Bayesian code is pretty sure that the new message is spam and raises its spam score.
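To make the idea concrete, here is a toy sketch of how a single token's "spamminess" could be estimated from learned counts. This is a deliberate simplification for illustration, not SpamAssassin's actual formula, and the function name token_spamminess is made up for this example.

```shell
# Toy estimate of how "spammy" one token is, from learned counts.
# NOT SpamAssassin's real calculation; token_spamminess is a hypothetical helper.
#   $1 = spam messages containing the token   $2 = total spam messages learned
#   $3 = ham messages containing the token    $4 = total ham messages learned
token_spamminess() {
    awk -v s="$1" -v ns="$2" -v h="$3" -v nh="$4" 'BEGIN {
        ps = (ns > 0) ? s / ns : 0      # fraction of spam containing the token
        ph = (nh > 0) ? h / nh : 0      # fraction of ham containing the token
        if (ps + ph == 0) {             # token never seen: no evidence either way
            print "0.500"
            exit
        }
        printf "%.3f\n", ps / (ps + ph)
    }'
}

# Example: a token seen in 100 of 100 spams and 0 of 100 hams
# token_spamminess 100 100 0 100
```

A value near 1 means the token has mostly been seen in spam, a value near 0 means mostly ham, and 0.5 means it tells you nothing.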

If you're having trouble with Bayes, see BayesFaq for help.

Things to remember

To train SpamAssassin, you get a mailbox full of messages that you know are spam and use the sa-learn program to pull out the tokens and remember them for later:

sa-learn --showdots --mbox --spam spam-file

Then you get a folder full of messages you're sure are ham and teach Bayes about those:

sa-learn --showdots --mbox --ham ham-file
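After training, it can be handy to check how many messages Bayes has actually learned, since the classifier needs 200 spams and 200 hams before it scores anything. The sketch below reads the output of sa-learn --dump magic and pulls out the nspam and nham counters. The helper name bayes_ready is made up for this example, and it assumes the counters appear as the third column on lines ending in "non-token data: nspam" and "non-token data: nham"; check the output of your own version before relying on it.

```shell
# Hypothetical helper: reads "sa-learn --dump magic" output on stdin and
# reports whether both counters have reached the 200-message minimum.
# Assumes the count is the third whitespace-separated field on each line.
bayes_ready() {
    awk '
        /non-token data: nspam$/ { nspam = $3 }
        /non-token data: nham$/  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            # exit status 0 only if both counters are at least 200
            exit (nspam >= 200 && nham >= 200) ? 0 : 1
        }
    '
}

# Usage (needs a working SpamAssassin installation):
#   sa-learn --dump magic | bayes_ready && echo "Bayes has enough mail to start scoring"
```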

The Bayesian classifier only scores new messages once it has learned at least 200 known spams and 200 known hams.
If SpamAssassin fails to identify a spam, teach it so it can do better next time. Run the message through sa-learn --spam and it will be more likely to be correctly identified as spam next time. Likewise, if SA puts a ham in your spam folder, run that message through sa-learn --ham ham-folder.
It's OK to feed emails with SpamAssassin markup into the sa-learn command: sa-learn will ignore any standard SpamAssassin headers, and if the original email has been encapsulated into an attachment, it will decapsulate the email. In other words, sa-learn will undo any changes that SpamAssassin has made before learning the spam/ham character of the email.
If you or any upstream service has added any additional headers to the emails which may mislead Bayes, those should probably be removed before feeding the email to sa-learn. Alternatively, use the bayes_ignore_header setting in your local.cf (as detailed in the man page for Mail::SpamAssassin::Conf).
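If you go the route of removing a misleading header by hand, a small filter along these lines could do it. This is only a sketch: strip_header is a made-up helper, X-Upstream-Spamfilter is a placeholder header name, and it handles a single message on stdin (not a whole mbox file).

```shell
# Hypothetical helper: drop one header (and its folded continuation lines)
# from a single message read on stdin. The body is passed through untouched.
#   $1 = header name to remove, matched case-insensitively
strip_header() {
    awk -v name="$1" '
        BEGIN { prefix = tolower(name) ":"; in_headers = 1; skipping = 0 }
        in_headers && /^$/ { in_headers = 0 }    # first blank line ends the headers
        in_headers {
            if (/^[ \t]/) {
                if (skipping) next               # continuation of a removed header
            } else {
                skipping = (index(tolower($0), prefix) == 1)
            }
            if (skipping) next
        }
        { print }
    '
}

# Usage, assuming sa-learn reads the message from stdin when no file is given:
#   strip_header X-Upstream-Spamfilter < missed-spam.eml | sa-learn --spam
```

If the same header shows up on all your mail, the bayes_ignore_header setting is likely the less fiddly option, since it doesn't require touching each message.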
