TODO: this page needs to talk about sa-learn, and should take the text from section 5 in the FAQ: http://spamassassin.taint.org/faq/index.cgi?req=index

Introduction

The Bayesian database support in SpamAssassin tries to identify spam by looking at what are called tokens: short phrases that are commonly found in spam or in ham (non-spam mail). If I've handed 100 messages containing the phrase penis enlargement to sa-learn and told it that they are all spam, then when the 101st message comes in with the same phrase, the Bayesian code is pretty sure that the new message is spam and raises that message's spam score.
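To make the intuition concrete, here is a toy sketch of how a single token's history could be turned into a spam estimate. It is an assumption made for illustration, not SpamAssassin's actual scoring code: the function name token_spam_probability is hypothetical, and the real classifier combines many tokens with a more careful formula.

```shell
#!/bin/sh
# Toy illustration only: this is NOT SpamAssassin's real algorithm.
# token_spam_probability is a hypothetical helper for this page.
#
# Usage: token_spam_probability SPAM_HITS HAM_HITS
#   SPAM_HITS  number of learned spam messages containing the token
#   HAM_HITS   number of learned ham messages containing the token
# Prints the naive estimate SPAM_HITS / (SPAM_HITS + HAM_HITS).
token_spam_probability() {
    awk -v s="$1" -v h="$2" 'BEGIN {
        # A token that was never seen gives no evidence either way.
        if (s + h == 0) { print "0.50"; exit }
        printf "%.2f\n", s / (s + h)
    }'
}
```

With the example above, a phrase seen in 100 spam messages and no ham messages gives token_spam_probability 100 0, which prints 1.00, while a phrase seen equally often in both gives 0.50 and so says little about the new message.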

Things to remember.

To train SpamAssassin, take a mailbox full of messages that you know are spam and use the sa-learn program to pull out the tokens and remember them for later: sa-learn --showdots --mbox --spam mbox-file. Then take a mailbox full of messages you're sure are ham and teach Bayes about those: sa-learn --showdots --mbox --ham mbox-file.
The Bayesian classifier can only score new messages once it has learned at least 200 known spam messages and 200 known ham messages.
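The two points above can be sketched as a small script. This is a minimal sketch under stated assumptions: SA_LEARN, train_bayes and bayes_ready are hypothetical names introduced here, and bayes_ready assumes that sa-learn --dump magic prints lines ending in "non-token data: nspam" and "non-token data: nham" with the message count in the third column. Check the output of your own installation before relying on it.

```shell
#!/bin/sh
# Sketch of a training run. SA_LEARN, train_bayes and bayes_ready are
# hypothetical names used only for this example.

# Command used to run sa-learn; override it if sa-learn is not on PATH.
SA_LEARN="${SA_LEARN:-sa-learn}"

# Usage: train_bayes SPAM_MBOX HAM_MBOX
# Learns the spam mailbox first, then the ham mailbox, and stops
# early if the spam run fails.
train_bayes() {
    "$SA_LEARN" --showdots --mbox --spam "$1" || return 1
    "$SA_LEARN" --showdots --mbox --ham "$2"
}

# Reads `sa-learn --dump magic` output on stdin and succeeds only when
# both counts reach the 200-message minimum described above.
# ASSUMPTION: the count is the third whitespace-separated field.
bayes_ready() {
    awk '
        /non-token data: nspam/ { nspam = $3 }
        /non-token data: nham/  { nham = $3 }
        END {
            if (nspam >= 200 && nham >= 200) exit 0
            exit 1
        }
    '
}
```

A typical session would run train_bayes spam-mbox ham-mbox and then "$SA_LEARN" --dump magic | bayes_ready && echo "Bayes is active". Keeping the check separate from the training step means it can be rerun at any time without learning anything new.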

Child pages

BayesInSpamAssassin
