Comment: [Original edit by JustinMason] duh, it was right the first time. time to stop working I think

...

To clean a spam corpus of FalseNegatives, first do a mass-check. You will wind up with a 'spam.log' and a 'ham.log' file. Run these commands to get a list of the 200 lowest-scoring spams, create an mbox file with just those messages, then open that mbox in the "mutt" mail client:

...

No Format
grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fnsfps
./remove-ids-from-mclog id.fnsfps < spam.log > spam.log.new
mv spam.log.new spam.log

You can also remove the offending files, or the messages in the source mailboxes, directly. How you do this depends on the format you use to store messages (Maildirs, mboxes, and so on). Maildirs are easiest, since you can just delete the files named in the 'id.fnsfps' file.
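For the Maildir case, the deletion could be scripted roughly as follows. This is a minimal sketch, not part of the mass-check tools: the function name `remove_listed_files` is hypothetical, and it assumes every line of the ID file is the plain path of one message file, which may not hold for mbox-based corpora.

```shell
# Delete every file listed, one path per line, in an ID file.
# Assumption: each ID is a plain file path (Maildir-style corpus).
remove_listed_files() {
    idfile=$1
    while IFS= read -r path; do
        [ -n "$path" ] || continue          # skip blank lines
        if [ -f "$path" ]; then
            rm -f -- "$path"                # remove the message file
        else
            echo "skipping (not a file): $path" >&2
        fi
    done < "$idfile"
}
```

It would be called as `remove_listed_files id.fnsfps`. Checking the list by eye before running it is a sensible precaution, since the deletion cannot be undone.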

Cleaning the ham corpus of FalsePositives is similar, but reverses a few things. Here are the commands to do that:

...

Delete the messages that are good, usable ham, leaving only spams, hams that include bits of spam, virus blowback, bounces, or whatever other undesirable messages you want to get rid of. Quit and save.

No Format
grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fpsfns
./remove-ids-from-mclog id.fpsfns < ham.log > ham.log.new
mv ham.log.new ham.log

...

Rules that are useful for spotting FPs (or spam discussions!) in the ham corpus:

  • BAYES_99: once a mass-check completes, it's worth grepping the ham.log for BAYES_99 and checking what mails it hits.
  • any of the other top-listed rules in the HitFrequencies report, especially network tests such as the SURBL rules
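The rule checks above can be sketched as a small shell helper. It is only an illustration: the function name `hits_for_rule` is hypothetical, and it assumes each non-comment log line has the layout "result, score, message ID, comma-separated rule names, ..." with no spaces inside the message ID.

```shell
# Print the message IDs in a mass-check log that hit a given rule.
# Assumption: fields are result, score, message ID, comma-separated rules, ...
hits_for_rule() {
    rule=$1
    log=$2
    awk -v rule="$rule" '
        /^#/ { next }                       # skip comment lines
        {
            n = split($4, rules, ",")       # split the rule list on commas
            for (i = 1; i <= n; i++)
                if (rules[i] == rule) { print $3; break }
        }
    ' "$log"
}
```

For example, `hits_for_rule BAYES_99 ham.log` would list the ham messages that BAYES_99 hit. Matching whole rule names, rather than using a plain grep, avoids also matching longer rule names that merely contain the same string.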

...