...
To clean a spam corpus of FalseNegatives FalsePositives, first do a mass-check. You will wind up with 'spam.log' and 'ham.log' files. Run these commands to get a list of the 200 lowest-scoring spams, create an mbox file containing just those messages, then open that mbox in the "mutt" mail client:
...

    grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fnsfps
    ./remove-ids-from-mclog id.fnsfps < spam.log > spam.log.new
    mv spam.log.new spam.log

You can also remove the offending files, or the messages in the source mailboxes, directly. How you do this depends on the format you use to store messages (Maildirs, mboxes, etc.). Maildirs are easiest, since you can just delete the files named in the 'id.fnsfps' file.
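For the Maildir case, the deletion can be sketched as a small shell function. This is only an illustration: it assumes each line of the id file is the path of one message file, and the function name `remove_listed_files` is made up for this example. It skips anything that is not a regular file rather than failing.

```shell
# Sketch: delete every message file named in an id file.
# Assumption: one Maildir message path per line (not verified against your setup).
remove_listed_files() {
    idfile=$1
    while IFS= read -r path; do
        # Ignore empty lines.
        [ -n "$path" ] || continue
        if [ -f "$path" ]; then
            rm -- "$path"
        else
            echo "skipping (not a regular file): $path" >&2
        fi
    done < "$idfile"
}
```

It would be called as `remove_listed_files id.fnsfps`. Take a backup of the corpus first, since the deletion is not reversible.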
Cleaning the ham corpus of FalsePositives FalseNegatives is similar, but reverses a few things. Here are the commands to do that:
...
Delete the messages that are good, usable ham, leaving only spams, hams that include bits of spam, virus blowback, bounces, or whatever other undesirable messages you want to get rid of. Quit and save.

    grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fpsfns
    ./remove-ids-from-mclog id.fpsfns < ham.log > ham.log.new
    mv ham.log.new ham.log

...
Rules that are useful for spotting FPsFNs (or spam discussions!) in the ham corpus:
- BAYES_99: once a mass-check completes, it's worth grepping ham.log for BAYES_99 and checking which mails it hits.
- any of the other top-listed rules in the HitFrequencies report, especially network tests such as the SURBL rules
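A plain `grep BAYES_99 ham.log` also matches longer rule names that merely start with the same text. The sketch below lists only the messages hit by exactly the named rule. It assumes each log line has the form `result score id RULE1,RULE2,... metadata`, with comment lines starting with `#`. Check this against your own log before relying on it. The function name `rule_hits` is made up for this example.

```shell
# Sketch: print the ids of messages in a mass-check log that hit a given rule.
# Assumption: field 3 is the message id, field 4 the comma-separated rule list.
rule_hits() {
    rule=$1
    log=$2
    awk -v rule="$rule" '
        /^#/ { next }                      # skip comment lines
        {
            n = split($4, tests, ",")
            for (i = 1; i <= n; i++)
                if (tests[i] == rule) {    # exact rule-name match only
                    print $3
                    break
                }
        }' "$log"
}
```

For example, `rule_hits BAYES_99 ham.log` prints one message id per line, ready for manual review.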
...