Comment: [Original edit by HenrikKrohns]

...

To make corpus cleaning easier next time, you can save a list of emails that scored high but weren't spam, so they are skipped automatically. When viewing emails as above, each one has an "X-Mass-Check-Id:" header which lists the file it came from; use it to remove any email that was actually spam from the id.hi file. Then grab the file names out of id.hi with awk and save them as the list of known-good messages:

No Format

awk '{print $3}' < id.hi > ~/sa/id.hi.good
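
If you repeat the cleaning over several runs, overwriting the list each time would drop the entries saved earlier. A minimal sketch for merging instead, assuming the third whitespace-separated field of id.hi is the file name, as in the command above. The function name merge_good_ids is hypothetical and not part of mass-check:

```shell
#!/bin/bash
# merge_good_ids LOG LIST
# Add the file names (field 3) from LOG to LIST, keeping LIST sorted and unique.
merge_good_ids() {
  local log=$1 list=$2 tmp
  tmp=$(mktemp) || return 1
  touch "$list"   # make sure LIST exists on the first run
  # NF >= 3 skips short lines, so no empty pattern ends up in LIST
  # (an empty line in a fgrep -f pattern file would match every line).
  awk 'NF >= 3 {print $3}' "$log" | sort -u - "$list" > "$tmp" && mv "$tmp" "$list"
}

# Usage (paths are examples):
#   merge_good_ids id.hi ~/sa/id.hi.good
```

The temporary file is there because the list is both an input and the output of the sort.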

Next time, run:

No Format
sort -rn -k 2 ham.log | fgrep -vf ~/sa/id.hi.good | head -n 200 > id.hi
./mboxget < id.hi > mbox
mutt -f mbox

...

  • MISSING_HEADERS: if a message doesn't have all the normal headers, such as From, To, and Subject, this will fire. Be sure to hand-verify any ham and spam messages that hit this to ensure that they're formatted correctly (in RFC-2822 format).
  • NO_HEADERS_MESSAGE (or a combo of MISSING_HEADERS,MISSING_DATE,MISSING_SUBJECT in versions < 3.2.0): generally means you've got a message without most of the important RFC-822 headers (often errors generated by MUAs/MDAs).
  • EMPTY_MESSAGE: generally zero-length files, especially if accompanied by NO_RECEIVED.
  • MISSING_HB_SEP: This is another danger sign, typically indicating that a header line has had a newline inserted incorrectly somehow, or an mbox "From" line has been inserted between RFC-822 headers.
  • ANY_BOUNCE_MESSAGE: this indicates that the mail was a bounce message, a C/R challenge, or a "virus warning" from a broken scanner. These should be removed from both the ham and spam corpora, in general.
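
The rules listed above can be searched for directly in the mass-check logs. The following is a rough sketch, not an official tool: it assumes that field 3 of each log line is the message path (as in the awk command earlier), that field 4 is a comma-separated list of the rules that hit, and that comment lines start with "#". Check these assumptions against your own ham.log and spam.log before relying on it. The function name list_rule_hits is hypothetical:

```shell
#!/bin/bash
# list_rule_hits LOG...
# Print "RULE path" for every log entry that hit one of the warning-sign rules.
# Assumed layout: field 3 = message path, field 4 = comma-separated rule names.
list_rule_hits() {
  awk -v rules='MISSING_HEADERS,NO_HEADERS_MESSAGE,EMPTY_MESSAGE,MISSING_HB_SEP,ANY_BOUNCE_MESSAGE' '
    BEGIN { n = split(rules, r, ","); for (i = 1; i <= n; i++) want[r[i]] = 1 }
    /^#/ { next }                       # skip comment lines
    {
      m = split($4, hit, ",")
      for (i = 1; i <= m; i++)
        if (hit[i] in want) print hit[i], $3
    }
  ' "$@"
}

# Usage:
#   list_rule_hits ham.log spam.log | sort
```

Sorting the output groups the messages by rule, which makes it easier to review one kind of problem at a time.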

Other Corpus Cleaning Methods

DSPAM

DSPAM is a well-known standalone Bayesian tool. You can use it to cross-check your corpus quickly and easily.

It doesn't seem to be maintained anymore; this is probably the best version: https://github.com/ensc/dspam (download the master). If you are not comfortable compiling things, you will need to find a package instead.

Example of how to build it and install it in your home directory:

No Format

unzip master.zip && cd dspam-master
# autoconf/automake/gcc stuff obviously needed
./autogen.sh
./configure --prefix=$HOME/dspam --disable-trusted-user-security --disable-syslog
make && make install

The examples below assume your corpus is in Maildir format (one file per message).

Make sure PATH includes $HOME/dspam/bin if installed there.

You can experiment with different learning methods. It's probably best to feed all manually verified messages first with --source=corpus. It's not an exact science, so mixing methods might come up with different FPs/FNs.

Learn the corpus (method 1):

No Format

# Always clear old data first
rm -rf $HOME/dspam/var
# This will learn the folders with --source=error
dspam_train $LOGNAME /path/to/spam /path/to/ham

Learn the corpus (method 2):

No Format

# Clear old data unless you are learning some additional corpus
rm -rf $HOME/dspam/var
# Feed your folders with --source=corpus
find /path/to/spam -type f | while read -r f; do
  dspam --user $LOGNAME --source=corpus --class=spam < "$f"
done
find /path/to/ham -type f | while read -r f; do
  dspam --user $LOGNAME --source=corpus --class=innocent < "$f"
done

Check the corpus:

No Format

#!/bin/bash
find /path/to/spam -type f | while read -r f; do
  RESULT=$(dspam --user $LOGNAME --classify < "$f")
  # Tune confidence >= 0.6 check if needed
  if [[ "$RESULT" =~ (result=\"Innocent\".*confidence=(1|0\.[6-9].)) ]]; then
    echo "$f ${BASH_REMATCH[1]}"
  fi
done
find /path/to/ham -type f | while read -r f; do
  RESULT=$(dspam --user $LOGNAME --classify < "$f")
  # Tune confidence >= 0.6 check if needed
  if [[ "$RESULT" =~ (result=\"Spam\".*confidence=(1|0\.[6-9].)) ]]; then
    echo "$f ${BASH_REMATCH[1]}"
  fi
done

It will output a list of messages to check. If a message really is in the wrong place, move it to the correct folder.

No Format

/path/to/spam/message123 result="Innocent"; class="Innocent"; probability=0.0000; confidence=0.73
/path/to/ham/message234 result="Spam"; class="Spam"; probability=0.0005; confidence=0.61
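
The confidence threshold in the check script is encoded in a regular expression, which is awkward to tune. As an alternative, here is a minimal sketch that compares the confidence numerically. It only assumes that the classify output consists of key=value fields separated by ";", as in the sample output above. The function name flag_misfit and its arguments are hypothetical:

```shell
#!/bin/bash
# flag_misfit RESULT WRONG_CLASS [MIN_CONFIDENCE]
# Exit 0 when RESULT reports result=WRONG_CLASS with confidence >= MIN_CONFIDENCE
# (default 0.6), exit 1 otherwise.
flag_misfit() {
  local result=$1 wrong=$2 min=${3:-0.6}
  printf '%s\n' "$result" | awk -v wrong="$wrong" -v min="$min" '
    {
      res = ""; conf = -1
      n = split($0, part, /; */)
      for (i = 1; i <= n; i++) {
        eq = index(part[i], "=")
        key = substr(part[i], 1, eq - 1)
        val = substr(part[i], eq + 1)
        gsub(/"/, "", val)               # drop the quotes around class names
        if (key == "result") res = val
        if (key == "confidence") conf = val + 0
      }
      found = (res == wrong && conf >= min + 0)
    }
    END { exit (found ? 0 : 1) }
  '
}

# Usage inside the loop over the spam folder, with a 0.7 threshold:
#   flag_misfit "$RESULT" Innocent 0.7 && echo "$f $RESULT"
```

For the ham folder, pass Spam as the second argument instead.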

If you move a lot of messages around, run the learn and check steps again.

If it keeps flagging messages that you have verified are in the right folder, you can script a whitelist to ignore those files.
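
One possible shape for such a whitelist is sketched below. It assumes the check output has the form shown earlier (the file path, a space, then result=...) and that the whitelist is a plain text file with one path per line. The function name skip_whitelisted and the file name in the usage line are hypothetical:

```shell
#!/bin/bash
# skip_whitelisted WHITELIST
# Filter check output on stdin: drop every line whose path is listed in
# WHITELIST (one path per line), pass everything else through unchanged.
skip_whitelisted() {
  local whitelist=$1
  awk -v wl="$whitelist" '
    BEGIN {
      # Load the whitelist; a missing or empty file simply whitelists nothing.
      while ((getline line < wl) > 0)
        if (line != "") skip[line] = 1
    }
    {
      path = $0
      sub(/ result=.*$/, "", path)       # keep only the path part of the line
      if (!(path in skip)) print
    }
  '
}

# Usage (check.sh stands for the check script above):
#   ./check.sh | skip_whitelisted ~/sa/dspam.whitelist
```

Paths are compared as whole strings, so a whitelisted file does not hide other files whose names merely contain it.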