Cleaning a Mail Corpus

Here are a few methods for dealing with common forms of corpus pollution: messages in a mail corpus that aren't suitable for use in a MassCheck.

What A Corpus Needs To Look Like

SpamAssassin relies on corpus data to generate optimal scores. This is the policy used by all corpora accepted by the SpamAssassin project (moved here from 'masses/CORPUS_POLICY'):

Cleaning Out False Positives

To clean a spam corpus of FalsePositives, first do a mass-check. You will wind up with a 'spam.log' and a 'ham.log' file. Run these commands to get a list of the 200 lowest-scoring spams, create an mbox file with just those messages, then open that mbox in the "mutt" mail client:

cd /path/to/your/spamassassin/masses
# the score is the second field of each log line (the first is the Y/. flag)
sort -n -k 2 spam.log | head -200 > id.low
./mboxget < id.low > mbox
mutt -f mbox

(You could use another mail client if you prefer; it's just a standard UNIX-format mbox file.)
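The selection step above can also be wrapped in a small function so the count is easy to vary. This is only a sketch: the name lowest_scoring is hypothetical and not part of the masses tools, and it assumes each result line has the form "<Y or .> <score> <message id> <rules...>" with comment lines starting with '#'.

```shell
# lowest_scoring LOG [COUNT]
# Hypothetical helper, not part of the masses directory.
# Assumes the score is the second whitespace-separated field and that
# lines starting with '#' are comments that should be skipped.
lowest_scoring() {
    grep -v '^#' "$1" | sort -n -k 2,2 | head -n "${2:-200}"
}
```

For example, "lowest_scoring spam.log 50 > id.low" would keep only the 50 lowest-scoring entries.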

Now delete every message that really is spam, keeping only the false positives (and any bounces, virus blowback, or other kinds of undesirable messages). Quit and save the mbox. It now contains only the 'bad' messages.

You can then take that mbox file, grep out the original MassCheck message id strings, and remove those lines from the 'spam.log' file:

grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fps
./remove-ids-from-mclog id.fps < spam.log > spam.log.new
mv spam.log.new spam.log
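Before the final 'mv' above overwrites 'spam.log', it can be worth comparing how many lines were actually dropped with how many IDs were requested. A minimal sketch, assuming one log line per message; the function name check_removed is hypothetical and not part of the masses tools:

```shell
# check_removed OLD_LOG NEW_LOG ID_FILE
# Hypothetical helper: reports how many lines disappeared from the log
# and how many IDs were listed, so the two numbers can be compared by eye.
check_removed() {
    old=$(( $(wc -l < "$1") ))   # arithmetic expansion strips any padding from wc
    new=$(( $(wc -l < "$2") ))
    ids=$(( $(wc -l < "$3") ))
    echo "removed $((old - new)) of $ids requested"
}
```

Running "check_removed spam.log spam.log.new id.fps" between the second and third commands gives a quick hint if the two numbers differ, for instance because an ID did not match any log line.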

You can also remove the offending files or messages from the source mailboxes directly. How you do this depends on the format used to store the messages (Maildirs, mboxes, and so on). Maildirs are easiest, since you can just delete the files named in the 'id.fps' file.
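For the Maildir case, the deletion can be sketched as follows. This assumes every line of 'id.fps' is the path of a single message file, which will not hold for mbox-style IDs. The function name delete_listed and its --force argument are hypothetical; by default it only prints what it would remove.

```shell
# delete_listed ID_FILE [--force]
# Hypothetical helper: treats each line of ID_FILE as the path of one
# message file. Without --force it only prints what it would remove.
delete_listed() {
    while IFS= read -r path; do
        [ -f "$path" ] || continue      # skip blank lines and missing files
        if [ "${2:-}" = "--force" ]; then
            rm -- "$path"
        else
            echo "would remove: $path"
        fi
    done < "$1"
}
```

Run "delete_listed id.fps" first to review the list, then "delete_listed id.fps --force" once it looks right.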

Rules that are useful for spotting FPs in the spam corpus:

Cleaning Out False Negatives

Cleaning the ham corpus of FalseNegatives is similar, but reverses a few things. Here are the commands to do that:

cd /path/to/your/spamassassin/masses
# the score is the second field of each log line (the first is the Y/. flag)
sort -rn -k 2 ham.log | head -200 > id.hi
./mboxget < id.hi > mbox
mutt -f mbox

Delete the messages that are good, usable ham, leaving only spams, hams that include bits of spam, virus blowback, bounces, or whatever other undesirable messages you want to get rid of. Quit and save.

grep X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fns
./remove-ids-from-mclog id.fns < ham.log > ham.log.new
mv ham.log.new ham.log

Repeat if necessary.

Rules that are useful for spotting FNs (or spam discussions!) in the ham corpus:

Corrupt Messages

Corrupt messages crop up occasionally: some MUAs have a tendency to mangle mail messages or folders, making them unsuitable for use with MassCheck. SpamAssassin includes a few rules that can help identify corrupt messages.
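Once a suitable rule is known, the messages that hit it can be pulled out of a mass-check log. The following is only a sketch: the function name ids_hitting_rule is hypothetical, and it assumes each result line has the form "<Y or .> <score> <message id> <comma-separated rules> ...", that message IDs contain no whitespace, and that comment lines start with '#'.

```shell
# ids_hitting_rule LOG RULE
# Hypothetical helper: prints the message id (third field) of every
# result line whose comma-separated rule list (fourth field) contains
# RULE as an exact name, not as a substring.
ids_hitting_rule() {
    awk -v rule="$2" '
        /^#/ { next }                   # skip comment lines
        {
            n = split($4, hits, ",")
            for (i = 1; i <= n; i++)
                if (hits[i] == rule) { print $3; break }
        }
    ' "$1"
}
```

For example, "ids_hitting_rule ham.log SOME_RULE" (where SOME_RULE stands in for a real rule name) lists the matching message IDs for review.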