Differences between revisions 12 and 13
Revision 12 as of 2009-09-20 23:16:20
Size: 2390
Editor: localhost
Comment: converted to 1.6 markup
Revision 13 as of 2012-10-30 16:55:01
Size: 3252
Editor: JohnHardin
Comment: Improvements/clarifications

Setting up Site-Wide Bayesian Filtering

In local.cf, tell SpamAssassin where to find the Bayesian database files:

bayes_path /var/spamassassin/bayes_db/bayes
bayes_file_mode 0777

Note that the argument to bayes_path is a combination of a directory (/var/spamassassin/bayes_db/) and a filename prefix (bayes).
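As a rough illustration of that split, the standard dirname and basename utilities separate the example path into its two parts. The variable names here are made up for the sketch and are not SpamAssassin settings.

```shell
#!/bin/sh
# Illustration only: split the example bayes_path value into its parts.
# bayes_path, bayes_dir and bayes_prefix are shell variables invented for
# this sketch; SpamAssassin itself only reads the bayes_path config line.
bayes_path=/var/spamassassin/bayes_db/bayes

bayes_dir=$(dirname "$bayes_path")       # the directory holding the database
bayes_prefix=$(basename "$bayes_path")   # the filename prefix

echo "directory: $bayes_dir"
echo "prefix:    $bayes_prefix"
```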

This tells the system that the Bayesian filter database files will be /var/spamassassin/bayes_db/bayes_msgcount, _seen and _toks. Feel free to move the database wherever you want. Please note this directory needs to be RWX for all users that SpamAssassin will be executed as, or R-X if autolearning and automatic expiry are disabled; many use world RWX to simplify this, but this is insecure and not recommended. The directory also shouldn't contain any files other than your bayes database. If it contains any other files that start with "bayes_" (or whatever other filename prefix you specified) it can break the database locking mechanisms SpamAssassin uses.

Now start feeding the Bayesian filter spam and ham messages.

sa-learn --spam --showdots --dir /path/to/directory/full/of/spam/msgs
sa-learn --ham --showdots --dir /path/to/directory/full/of/ham/msgs

Do not simply use your inbox to train Bayes! The mailboxes of ham and spam messages used for training should be hand-verified, and should be kept after the initial training in case retraining is ever needed to correct problems with Bayes. It is safe to run sa-learn against the same mailbox multiple times, as a given message will only be learned once (unless its classification as ham or spam has changed).
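Since the verified corpora are kept around for retraining, it may help to wrap the two sa-learn commands above in a small script. The following is only a sketch: the train_bayes function and the SA_LEARN variable are hypothetical names, and the sa-learn options are the same ones shown above.

```shell
#!/bin/sh
# Hypothetical wrapper around the two sa-learn commands shown above.
# SA_LEARN is an assumption of this sketch: override it (for example with
# "echo") to see what would be run without touching the Bayes database.
SA_LEARN="${SA_LEARN:-sa-learn}"

train_bayes() {
    spam_dir=$1
    ham_dir=$2
    # Refuse to run if either hand-verified corpus directory is missing.
    for d in "$spam_dir" "$ham_dir"; do
        if [ ! -d "$d" ]; then
            echo "missing corpus directory: $d" >&2
            return 1
        fi
    done
    "$SA_LEARN" --spam --showdots --dir "$spam_dir" || return 1
    "$SA_LEARN" --ham --showdots --dir "$ham_dir" || return 1
}
```

Calling train_bayes with the spam directory first and the ham directory second runs both steps in order. Rerunning it against the same corpora should be harmless for the reason given above.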

See SiteWideBayesFeedback for more tips on getting an entire site to feed back spam and ham messages into the Bayesian filter.

If you're running spamd, restart it so that it will re-read local.cf and enable the Bayes filter:

/etc/init.d/spamassassin restart
-or-
service spamassassin restart

Your method of restarting spamd may differ, but the above is typical. If you're using any MTA integration that invokes SpamAssassin through its Perl API (e.g. Amavis, MailScanner or mimedefang), that process will need to be restarted or told to reload its configuration, as it is effectively its own spamd.

Restarting spamd/Amavis/MailScanner/mimedefang is not needed after maintenance training or a background expiry, just when you enable or disable bayes.

You may experience difficulties with file permissions. Make sure you chmod any existing bayes files so they are readable and writable by the group your SpamAssassin users belong to (or by world, if that is the approach you're taking).

If you are going to use group rights instead of world RWX, there are some additional issues you will need to consider. If you use spamd and mail gets scanned on behalf of "root", spamd will use "nobody" as its effective user for bayes database access, so consider this user when planning your group memberships. Also, be aware that the files are deleted and recreated by whatever user happens to be running spamassassin when an expiration is due. If you are not using world RWX, this means the files will lose any group ownership you may have set unless you make the directory setgid.
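One way the group-based setup might look is sketched below. The setup_bayes_dir function is a hypothetical helper, the group is whichever group your SpamAssassin users share, and mode 2770 is an assumption chosen for the sketch (rwx for owner and group, plus setgid), so adjust it to your site.

```shell
#!/bin/sh
# Sketch: prepare a Bayes directory for group-based access.
# setup_bayes_dir is a hypothetical helper, not part of SpamAssassin.
#   $1 = database directory, $2 = shared group,
#   $3 = filename prefix (defaults to "bayes")
setup_bayes_dir() {
    dir=$1
    group=$2
    prefix=${3:-bayes}
    mkdir -p "$dir" || return 1
    # Set group ownership first; chgrp may clear the setgid bit.
    chgrp "$group" "$dir" || return 1
    # rwx for owner and group, nothing for others. The leading 2 is setgid,
    # so files recreated during expiry inherit the directory's group.
    chmod 2770 "$dir" || return 1
    # Bring any existing database files in line with the directory.
    for f in "$dir/$prefix"_*; do
        [ -e "$f" ] || continue
        chgrp "$group" "$f" || return 1
        chmod g+rw "$f" || return 1
    done
}
```

For the paths used on this page, that would be something like setup_bayes_dir /var/spamassassin/bayes_db followed by your group name, run as a user allowed to change the directory's group.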

See Mail::SpamAssassin::Conf(3) for details.


CategoryBayes

SiteWideBayesSetup (last edited 2012-10-30 16:55:01 by JohnHardin)