...

  • Do not train Bayes on different mail streams or public spam corpora. These methods will mislead Bayes into believing certain tokens are spammy or hammy when they are not.
  • To train SpamAssassin, collect a mailbox full of messages that you know are spam and use the sa-learn program to extract their tokens and remember them for later:
    sa-learn --showdots --mbox --spam spam-file
    Then you get a mailbox full of messages you're sure are ham and teach Bayes about those:
    sa-learn --showdots --mbox --ham ham-file
    It is important to do both.
  • The Bayesian classifier will only score new messages once it has learned at least 200 known spam messages and 200 known ham messages.
  • If SpamAssassin fails to identify a spam, teach it so it can do better next time: run the message through sa-learn with the --spam option, and similar spam will be more likely to be caught. Likewise, if SA puts a ham in your spam folder, run that message through sa-learn with the --ham option.
  • It's OK to feed emails with SpamAssassin markup into the sa-learn command. sa-learn ignores any standard SpamAssassin headers, and if the original email has been encapsulated into an attachment, it decapsulates the email. In other words, sa-learn undoes any changes SpamAssassin has made before learning the spam/ham character of the email.
  • If you or any upstream service has added any additional headers to the emails which may mislead Bayes, those should probably be removed before feeding the email to sa-learn. Alternatively, use the bayes_ignore_header setting in your local.cf (as detailed in the man page for Mail::SpamAssassin::Conf).
  • An example of a ham-file could be ~/mail/saved-messages, or wherever your email client saves messages. Make sure all spam is deleted before using sa-learn on a ham-file.
    Similar to the training example above, for a maildir format mailbox, the commands should be altered as shown below.
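Before the maildir variant, here is a rough way to check whether Bayes has reached the 200-spam and 200-ham minimum mentioned above. It is a minimal sketch that assumes sa-learn --dump magic prints lines ending in "non-token data: nspam" and "non-token data: nham" with the count in the third column; the helper names bayes_count and bayes_ready are hypothetical and not part of SpamAssassin.

```shell
#!/bin/sh
# Sketch only: check whether the Bayes database holds enough training data.
# Assumption: "sa-learn --dump magic" output has the count in column 3 and
# the label (nspam or nham) as the last word of the line.

# bayes_count KIND: read dump output on stdin, print the count for KIND.
bayes_count() {
    awk -v kind="$1" '
        /non-token data/ && $NF == kind { print $3; found = 1 }
        END { if (!found) print 0 }'
}

# bayes_ready NSPAM NHAM [MIN]: succeed if both counts reach MIN (default 200).
bayes_ready() {
    [ "$1" -ge "${3:-200}" ] && [ "$2" -ge "${3:-200}" ]
}

# Only query the real database when sa-learn is installed.
if command -v sa-learn >/dev/null 2>&1; then
    dump=$(sa-learn --dump magic)
    nspam=$(printf '%s\n' "$dump" | bayes_count nspam)
    nham=$(printf '%s\n' "$dump" | bayes_count nham)
    if bayes_ready "$nspam" "$nham"; then
        echo "Bayes is ready: $nspam spam, $nham ham learned"
    else
        echo "Bayes needs more training: $nspam spam, $nham ham learned"
    fi
fi
```

If the output format on your version differs, adjust the awk pattern rather than trusting the counts.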

For a mailbox you're sure contains only spam messages,

...

  1. Obviously, change the options at the very beginning.

  2. The -z option to rsync compresses the data during transfer, so there is no need to gzip the file first. rsync also transfers only the parts of the file that have changed; it does not re-upload the whole file every time.

...

If you have "maildir" mailboxes, running spamassassin -r once per message can be tedious when you have a large number of spam messages. You can use this report_spam.pl script to run it for you. The script is written in Perl. Save it to the machine that runs SpamAssassin and run it as report_spam.pl your_spam_directory. Each message in your_spam_directory will then be learned by Bayes and reported to the checksum services.
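For readers who want to see the idea without the Perl script, the loop can be sketched in plain shell. This is not report_spam.pl, only a hypothetical equivalent: the function name for_each_message is made up for illustration, and it assumes the usual maildir layout with messages stored as files under cur/ and new/.

```shell
#!/bin/sh
# Sketch only: feed every message in a maildir to a command on stdin.
# Assumption: messages are regular files under DIR/cur and DIR/new.

# for_each_message DIR CMD...: run CMD once per message, with the message on
# standard input. Prints the number of messages CMD handled successfully.
for_each_message() {
    dir=$1
    shift
    count=0
    for msg in "$dir"/cur/* "$dir"/new/*; do
        [ -f "$msg" ] || continue          # skip unmatched globs and subdirectories
        if "$@" < "$msg" > /dev/null; then
            count=$((count + 1))
        else
            echo "failed: $msg" >&2
        fi
    done
    echo "$count"
}

# When called with a directory argument, report each message as spam.
if [ "$#" -gt 0 ]; then
    for_each_message "$1" spamassassin -r
fi
```

Saved under a name of your choosing and run with your_spam_directory as its argument, it prints how many messages were passed to spamassassin -r successfully.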

(KurtYoder)

CategoryBayes