How can my users feed back mail for the Bayesian learner?

If you want to set up site-wide use of Bayesian classification, you should set up a way for your users to send in misclassified mail to be "learned" from.

A good way is to set up a mailbox where users can send verified spam, or verified non-spam, for the learner to learn from.

One issue here is that you will need all the headers of those messages for the learner to work effectively, including the Received headers. A few mail user agents strip off those headers. For users using Outlook 2000, get them to use

  (double-click to open the mail in its own window)
      -> Actions
      -> Resend This Message

and bounce it to the mailbox address. Then run a cron job periodically to learn all the mail in that mailbox as spam (or non-spam).
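The cron job described above could look roughly like the following sketch. It is not part of the original page: it assumes the feedback mailboxes are mbox-format files, that `sa-learn` is on the PATH, and that truncating the mailbox after learning is acceptable. The function name `learn_mailbox` and the `SA_LEARN` override variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical cron helper: learn a feedback mailbox, then empty it.
# Assumptions (not from the original page): mbox-format mailbox files
# and sa-learn available on PATH.

SA_LEARN="${SA_LEARN:-sa-learn}"   # overridable, e.g. for a dry run

# Usage: learn_mailbox spam|ham /path/to/mbox
learn_mailbox() {
    kind="$1"
    mbox="$2"

    case "$kind" in
        spam|ham) ;;
        *) echo "usage: learn_mailbox spam|ham MBOX" >&2; return 2 ;;
    esac

    # Nothing to do for a missing or empty mailbox.
    [ -s "$mbox" ] || return 0

    # Crude sanity check: if no Received headers are present at all, the
    # messages were probably sent by an MUA that stripped them, so skip.
    if ! grep -q '^Received:' "$mbox"; then
        echo "skipping $mbox: no Received headers found" >&2
        return 1
    fi

    # --mbox tells sa-learn the file holds several messages in mbox format.
    "$SA_LEARN" "--$kind" --mbox "$mbox" || return 1

    # Truncate only after a successful run so nothing is learned twice.
    : > "$mbox"
}
```

A script that defines this function and then calls, for example, `learn_mailbox spam /var/mail/is_spam` and `learn_mailbox ham /var/mail/not_spam` (both paths are assumptions) could be run from cron every few hours. Mail delivered between the learning step and the truncation would be lost, so a production version would want to lock or rotate the mailbox first.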

For MUAs (like Netscape/Mozilla) that do a good job of keeping the original headers intact, (almost) all you need to do is forward the email to the feedback account and strip off the headers added by the forward. This can be done by calling a filter from the ~/.procmailrc file of the learner accounts. (I apologize for putting these scripts in the Wiki, but I have no publicly accessible location to post them. If someone who has that capability could replace them with links, I'd appreciate it.)

I call spamc from /etc/procmailrc, but I make sure that it doesn't filter mail addressed to is_spam and not_spam.

/etc/procmailrc

    # Don't filter mail to is_spam and not_spam
    #
    #       Since we are running sitewide, it could cause a serious bottleneck if
    #       we were to use a lockfile here.  instead, we limit spamd to 20 child
    #       processes in /etc/sysconfig/spamassassin
    #
    #:0fw: spamassassin.lock
    :0fw
    * !^To.*spam@mycompany.com
    * < 256000
    | spamc

~is_spam/.procmailrc

    # filter spam feedback
    :0fw: bayes_fixup.lock
    * < 256000
    | /usr/local/adm/bin/bayes_fixup.pl

bayes_fixup.pl is:

#!/usr/bin/perl
#
#       This filter is designed to pull off the forwarding headers for mail
#       forwarded to is_spam or not_spam from an MUA that includes all
#       headers.  ( as opposed to Outlook, which does not include all
#       headers, and thus must be resent instead of forwarded. )
#
#       In a forwarded message from Netscape/Mozilla, you will have:
#
#               From ...
#               ...
#               From: (matches envelope from)
#               ...
#               one or more blank lines
#               -------- Original Message --------
#               From: (a date code for the forwarding MUA)
#               The original Headers
#
#       You will not have:
#               Sender:
#
#       Not sure if the Netscape stuff is valid for HTML mode.
#
#       Brian R. Jones  01/30/04 scumpuppy_@_earthlink_._net
#
use strict;

my ($count,$endheader,$sender,$unknown);
my $fwdmarker = "-------- Original Message --------";

my @message = <STDIN>;

#
#       Determine if sender is Outlook, Netscape/Mozilla or unknown.
#       If Netscape, set a marker for the end of the headers that are added
#       by the forwarding.
#
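for( $count = 0, $endheader = 0, $sender = 0, $unknown = 0; $count < @message; $count++ ) {
        $_ = $message[$count];
        /^Sender:/o and last;     # It's a resent message from Outlook, skip
        /^\s*$/o and do {         # end of headers marked with one or more
                $endheader = 1;   # blank lines
                next;
        };
        next unless $endheader;
        /^$fwdmarker/o or $unknown = 1;
        last;
}
# Ran off the end of the input without finding any body line after the
# headers: there is no forward marker, so treat the message as unknown.
$unknown = 1 if $endheader && $count >= @message;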
#
#       If it's Netscape, delete the forwarding header, and clean up the
#       original. I'm also converting the 'From:' to the 'Envelope From'
#       which may not be legitimate.  It may be better to use the forward
#       header 'Envelope From'.  Unfortunately, there is no way to capture
#       the original 'Envelope From'.  :(
#
if ( $endheader && ! $unknown ) {           # forwarded from known mailer
        splice(@message, 0 , ++$count);
        $message[0] =~ s/^From:/From/;
        for ( @message ) {                  # Stupid Netscape collapses continuation lines,
                                            # so we need to put `em back in case sa-learn
                                            # doesn't understand `em.
                /^[\w\-]+:/     and next;   # Valid header
                /^\t/           and next;   # Valid Continuation line
                /^From/         and next;   # Newly created Envelope From
                /^\s*$/         and last;   # End of Headers
                $_ = "\t" . $_;             # Malformed continuation line. Add tab.
        }
} elsif ( $unknown ) {          # unknown, toss it.
        exit 1;
}
print @message;

Another option, and one that's easier for most users to use, is to set up two public IMAP folders on your IMAP server, one for MissedSpam, one for NotSpam.

Then ask your users to move messages that SpamAssassin misses into the MissedSpam folder, and move messages that SpamAssassin marked incorrectly as spam into the NotSpam folder.

You can then run sa-learn from a cron job over those folders to update the Bayesian databases.
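As a rough sketch of such a cron job, the function below learns from both folders and then syncs the Bayes database once. It assumes, beyond what this page states, that the public folders are stored on disk as Maildir directories named `MissedSpam` and `NotSpam` under a common base directory; the actual layout depends on your IMAP server. The function name `learn_folders` and the `SA_LEARN` override variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical cron helper: learn from the two public IMAP folders.
# Assumption (not from the original page): each folder is a Maildir at
# BASE_DIR/<FolderName>, with messages users moved in sitting in cur/.

SA_LEARN="${SA_LEARN:-sa-learn}"   # overridable, e.g. for a dry run

# Usage: learn_folders BASE_DIR
learn_folders() {
    base="$1"
    rc=0

    for pair in "MissedSpam:spam" "NotSpam:ham"; do
        folder="${pair%%:*}"    # part before the colon: folder name
        kind="${pair##*:}"      # part after the colon: spam or ham
        dir="$base/$folder/cur"

        # Skip folders that do not exist on this server.
        [ -d "$dir" ] || continue

        # --no-sync defers the journal sync until both folders are done.
        "$SA_LEARN" --no-sync "--$kind" "$dir" || rc=1
    done

    # One sync at the end instead of one per folder.
    "$SA_LEARN" --sync || rc=1
    return "$rc"
}
```

sa-learn keeps track of messages it has already learned, so leaving messages in the folders between runs should only cost time, not accuracy. You may still want to expire old messages from the folders so each run stays short.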

How to set up site-wide aliases on Postfix where ham and spam can be sent for learning

The cookbook is available at http://jousset.org/pub/sa-postfix.en.html; it works fine for Postfix (http://www.postfix.org/). NB: Don't call your aliases spam and ham unless you want spammers to flood the ham box.
