Here are a few methods for dealing with common forms of corpus pollution: messages in a mail corpus that aren't suitable for use in a MassCheck.
SpamAssassin relies on corpus data to generate optimal scores. This is the policy used by all corpora accepted by the SpamAssassin project (moved here from 'masses/CORPUS_POLICY'):
To clean a spam corpus of FalsePositives, first do a mass-check. You will wind up with a 'spam.log' and a 'ham.log' file. Run these commands to get a list of the 200 lowest-scoring spams, create an mbox file with just those messages, then open that mbox in the "mutt" mail client:
    cd /path/to/your/spamassassin/masses
    sort -n -k 2 spam.log | head -200 > id.low
    ./mboxget < id.low > mbox
    mutt -f mbox
(You could use another mail client if you want; it's just a standard UNIX-format mbox file.)
Now, delete all messages that really are spams, and not false positives (or bounces, or virus blowback, or other kinds of undesirable messages). Quit and save the mbox. It now contains only the 'bad' messages.
You can then take that mbox file, grep out the original MassCheck message id strings, and remove those lines from the 'spam.log' file:
    grep -a X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fps
    ./remove-ids-from-mclog id.fps < spam.log > spam.log.new
    mv spam.log.new spam.log
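Before overwriting 'spam.log', it may be worth checking how many log lines the ID list would actually touch. The sketch below is not part of the masses tools: `count_matching_ids` is a hypothetical helper, and it assumes that the message ID is the third whitespace-separated field of each log line and that the ID file holds one ID per line.

```shell
# Hypothetical helper, not part of SpamAssassin's masses/ tools.
# Assumption: the message ID is the third whitespace-separated field of each
# mass-check log line, and the ID file holds exactly one ID per line.
count_matching_ids() {
    ids_file=$1
    log_file=$2
    awk '
        FILENAME == ARGV[1] { wanted[$0] = 1; next }   # first file: remember IDs
        ($3 in wanted)      { hits++ }                  # second file: count matches
        END                 { print hits + 0 }
    ' "$ids_file" "$log_file"
}
```

Running `count_matching_ids id.fps spam.log` should print the same number as `wc -l < id.fps` if every ID was found in the log; a smaller number suggests that some IDs did not match and deserve a second look.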
You can also remove the offending files, or messages from the source mailboxes, directly. (This is advisable, as you'll probably wind up mass-checking them again at some point.) How you do this depends on the format you use to store messages: Maildirs, mboxes, etc. (Maildirs are easiest, since you can just delete the files named in the 'id.fps' file.)
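For the Maildir case, the deletion could be scripted roughly as follows. This is a minimal sketch: `remove_listed_files` is a hypothetical helper, and it assumes that every line of the ID file is the path of a single message file exactly as mass-check recorded it.

```shell
# Hypothetical helper. Assumption: each line of the ID file is the path of one
# Maildir message file, exactly as recorded by mass-check.
remove_listed_files() {
    ids_file=$1
    while IFS= read -r path; do
        if [ -f "$path" ]; then
            rm -- "$path"
        else
            # Report anything unexpected instead of failing silently.
            printf 'skipped (not a regular file): %s\n' "$path" >&2
        fi
    done < "$ids_file"
}
```

It would be called as `remove_listed_files id.fps`. Keeping a backup of the corpus before running it is prudent, since the deletion cannot be undone.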
Rules that are useful for spotting FPs in the spam corpus:
See also 'Corrupt Messages' below for other stuff to clear out.
Here's a command line to grep a log for a rule name, and generate an mbox of the results, then open it in "mutt":
    grep 'ALL_TRUSTED' ham.log > grepped.log
    ./mboxget < grepped.log > mbox
    mutt -f mbox
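A plain grep matches the rule name anywhere on the line, so it also picks up rules whose names merely contain the string, or paths that happen to include it. If that matters, an exact match could be sketched as below. `lines_hitting_rule` is a hypothetical helper, and it assumes that the fourth whitespace-separated field of each log line is the comma-separated list of rules hit.

```shell
# Hypothetical helper. Assumption: field 4 of each mass-check log line is the
# comma-separated list of rules that fired for that message.
lines_hitting_rule() {
    rule=$1
    log_file=$2
    awk -v rule="$rule" '
        /^#/ { next }                       # skip comment lines, if any
        {
            n = split($4, hits, ",")
            for (i = 1; i <= n; i++)
                if (hits[i] == rule) { print; break }
        }
    ' "$log_file"
}
```

For example, `lines_hitting_rule ALL_TRUSTED ham.log > grepped.log` would produce a file that can be fed to mboxget in the same way as the grep output.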
Cleaning the ham corpus of FalseNegatives is similar, but reverses a few things. Here are the commands to do that:
    cd /path/to/your/spamassassin/masses
    sort -rn -k 2 ham.log | head -200 > id.hi
    ./mboxget < id.hi > mbox
    mutt -f mbox
Delete the messages that are good, usable ham, leaving only spams, hams that include bits of spam, virus blowback, bounces, or whatever other undesirable messages you want to get rid of. Quit and save.
    grep -a X-Mass-Check-Id mbox | sed -e 's/^X-Mass-Check-Id: //' > id.fns
    ./remove-ids-from-mclog id.fns < ham.log > ham.log.new
    mv ham.log.new ham.log
Repeat, if necessary...
Rules that are useful for spotting FNs (or spam discussions!) in the ham corpus:
See also 'Corrupt Messages' below for other stuff to clear out.
To make corpus cleaning easier next time, you can save a list of high-scoring emails that weren't spam, so that they are skipped automatically. When viewing emails as above, each one has an "X-Mass-Check-Id:" header listing the file it came from; use this to remove the lines for any email that was actually spam from the id.hi file. Then extract the file names from id.hi with awk and save them as a backup copy:
    awk '{print $3}' < id.hi > ~/sa/id.hi.good
Next time, run:
    sort -rn -k 2 ham.log | fgrep -vf ~/sa/id.hi.good | head -n 200 > id.hi
    ./mboxget < id.hi > mbox
    mutt -f mbox
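Writing the skip list with a plain redirect replaces whatever was saved on earlier runs. If the list should grow over time instead, the newly reviewed IDs could be merged into it. The following is a sketch only: `merge_good_ids` is a hypothetical helper, and it assumes, like the awk command above, that the file name is the third field of each line.

```shell
# Hypothetical helper. Assumption: the message ID is the third field of each
# line in the reviewed id.hi file, as in the awk command above.
merge_good_ids() {
    reviewed=$1      # e.g. id.hi, after removing the lines for real spam
    good_list=$2     # e.g. ~/sa/id.hi.good
    touch "$good_list"
    # Combine old and new IDs, drop duplicates, then replace the list.
    awk '{ print $3 }' "$reviewed" | sort -u - "$good_list" > "$good_list.tmp" &&
        mv "$good_list.tmp" "$good_list"
}
```

It would be used as `merge_good_ids id.hi ~/sa/id.hi.good` in place of the redirect.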
Occasionally, corrupt messages will crop up: some MUAs have a tendency to mess up mail messages or folders, making them unsuitable for use with MassCheck. SpamAssassin includes a few rules that can help identify corrupt messages.
DSPAM is a well-known standalone Bayesian tool; you can cross-check your corpus quickly and easily with it.
It doesn't seem to be maintained anymore; this is probably the best version: https://github.com/ensc/dspam (download the master branch). If you are not comfortable compiling things, you will need to find a package.
Example of how to build and install it in your home directory:
    unzip master.zip && cd dspam-master
    # autoconf/automake/gcc stuff obviously needed
    ./autogen.sh
    ./configure --prefix=$HOME/dspam --disable-trusted-user-security --disable-syslog
    make && make install
This assumes your corpus is in Maildir format (file per message).
Make sure PATH includes $HOME/dspam/bin if installed there.
You can experiment with different learning methods. It's probably best to feed all manually verified messages first with --source=corpus. It's not an exact science, so mixing methods might come up with different FPs/FNs.
Learn the corpus (method 1):
    # Always clear old data first
    rm -rf $HOME/dspam/var
    # This will learn the folders with --source=error
    dspam_train $LOGNAME /path/to/spam /path/to/ham
Learn the corpus (method 2):
    # Clear old data unless you are learning some additional corpus
    rm -rf $HOME/dspam/var
    # Feed your folders with --source=corpus
    find /path/to/spam -type f | while read -r f; do
      dspam --user $LOGNAME --source=corpus --class=spam < "$f"
    done
    find /path/to/ham -type f | while read -r f; do
      dspam --user $LOGNAME --source=corpus --class=innocent < "$f"
    done
Check the corpus:
    #!/bin/bash
    find /path/to/spam -type f | while read -r f; do
      RESULT=$(dspam --user $LOGNAME --classify < "$f")
      # Tune confidence >= 0.6 check if needed
      if [[ "$RESULT" =~ (result=\"Innocent\".*confidence=(1|0\.[6-9].)) ]]; then
        echo "$f ${BASH_REMATCH[1]}"
      fi
    done
    find /path/to/ham -type f | while read -r f; do
      RESULT=$(dspam --user $LOGNAME --classify < "$f")
      # Tune confidence >= 0.6 check if needed
      if [[ "$RESULT" =~ (result=\"Spam\".*confidence=(1|0\.[6-9].)) ]]; then
        echo "$f ${BASH_REMATCH[1]}"
      fi
    done
It will output a list of messages to check. If a message is indeed in the wrong place, move it to the correct folder.
    /path/to/spam/message123 result="Innocent"; class="Innocent"; probability=0.0000; confidence=0.73
    /path/to/ham/message234 result="Spam"; class="Spam"; probability=0.0005; confidence=0.61
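The check script encodes its 0.6 threshold inside a regular expression, which is awkward to tune. One alternative is to compare the confidence value numerically on the saved output. This is a sketch under assumptions: `filter_by_confidence` is a hypothetical helper, and it assumes that each input line contains a "confidence=<number>" field as in the sample output above.

```shell
# Hypothetical helper. Assumption: each input line contains a
# "confidence=<number>" field, as in the sample output above.
filter_by_confidence() {
    min=$1
    awk -v min="$min" '
        match($0, /confidence=[0-9.]+/) {
            # "confidence=" is 11 characters long; the rest is the number.
            conf = substr($0, RSTART + 11, RLENGTH - 11) + 0
            if (conf >= min + 0) print
        }
    '
}
```

If the report were saved to a file, `filter_by_confidence 0.7 < report.txt` would keep only the entries with a confidence of at least 0.7 (the file name is just an example).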
If you move stuff around a lot, do a new learn and check.
If it keeps misreporting certain messages, you can script a whitelist method to ignore those files.
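One possible shape for such a whitelist is a filter over the report. This is a minimal sketch: `drop_whitelisted` is a hypothetical helper, and it assumes that the whitelist file holds one message path per line and that each report line begins with the message path followed by a space (so paths containing spaces would need different handling).

```shell
# Hypothetical helper. Assumptions: the whitelist file holds one message path
# per line, and each report line starts with the message path followed by a
# space. Paths containing spaces are not handled.
drop_whitelisted() {
    whitelist=$1
    awk '
        FILENAME == ARGV[1] { skip[$0] = 1; next }   # load whitelisted paths
        !($1 in skip)                                # print everything else
    ' "$whitelist" -
}
```

The output of the check script would be piped through it, for example `... | drop_whitelisted ~/sa/dspam.whitelist`, where the whitelist path is only an example.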