Comment: [Original edit by HenrikKrohns]

...

  • MISSING_HEADERS: if a message doesn't have all the normal headers, such as From, To, and Subject, this will fire. Be sure to hand-verify any ham and spam messages that hit this to ensure that they're formatted correctly (in RFC-2822 format).
  • NO_HEADERS_MESSAGE (or a combination of MISSING_HEADERS, MISSING_DATE, and MISSING_SUBJECT in versions < 3.2.0): generally means you've got a message without most of the important RFC-822 headers (often errors generated by MUAs/MDAs).
  • EMPTY_MESSAGE: generally zero-length files, especially if accompanied by NO_RECEIVED.
  • MISSING_HB_SEP: This is another danger sign, typically indicating that a header line has had a newline inserted incorrectly somehow, or an mbox "From" line has been inserted between RFC-822 headers.
  • ANY_BOUNCE_MESSAGE: this indicates that the mail was a bounce message, a C/R challenge, or a "virus warning" from a broken scanner. These should be removed from both the ham and spam corpora, in general.
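
One way to find the messages that hit these rules is sketched below. It assumes you already have a log file with one line per message that lists the rule names the message hit; the log layout, the example file name `spam.log`, and the helper name `rule_hits` are assumptions for illustration, not something this page prescribes.

```shell
#!/bin/sh
# Rule names taken from the list above.
RULES='MISSING_HEADERS|NO_HEADERS_MESSAGE|EMPTY_MESSAGE|MISSING_HB_SEP|ANY_BOUNCE_MESSAGE'

# Hypothetical helper: print every line of a per-message log that
# mentions one of the rules as a whole name (so T_MISSING_HEADERS or
# __EMPTY_MESSAGE would not count as a hit).
rule_hits() {
  grep -E "(^|[^A-Z0-9_])(${RULES})([^A-Z0-9_]|\$)" "$1"
}

# Usage (hypothetical log name):
#   rule_hits spam.log
```

Each line printed still needs to be hand-verified; the sketch only narrows down which messages to look at.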

Other Corpus Cleaning Methods

DSPAM

DSPAM is a well-known standalone Bayesian tool that you can use to cross-check your corpus quickly and easily.

It doesn't seem to be maintained anymore; the best available version is probably https://github.com/ensc/dspam (download the master branch). If you are not comfortable compiling software, look for a prebuilt package instead.

Example of how to build and install it in your home directory:

unzip master.zip && cd dspam-master
# autoconf/automake/gcc stuff obviously needed
./autogen.sh
./configure --prefix=$HOME/dspam --with-dspam-home=$HOME/dspam_data \
  --disable-trusted-user-security --disable-syslog
make && make install

The following steps assume your corpus is in Maildir format (one file per message).

Learn the corpus:

# Always clear old data first
rm -rf $HOME/dspam_data
$HOME/dspam/bin/dspam_train $LOGNAME /path/to/spam /path/to/ham

Check the corpus:

#!/bin/bash
# Uses the dspam binary installed under $HOME/dspam by the build step above
DSPAM="$HOME/dspam/bin/dspam"

# List spam messages that DSPAM classifies as ham
find /path/to/spam -type f | while IFS= read -r f; do
  RESULT=$("$DSPAM" --user "$LOGNAME" --classify < "$f")
  # Tune confidence >= 0.6 check if needed
  if [[ "$RESULT" =~ (result=\"Innocent\".*confidence=(1|0\.[6-9].)) ]]; then
    echo "$f ${BASH_REMATCH[1]}"
  fi
done

# List ham messages that DSPAM classifies as spam
find /path/to/ham -type f | while IFS= read -r f; do
  RESULT=$("$DSPAM" --user "$LOGNAME" --classify < "$f")
  # Tune confidence >= 0.6 check if needed
  if [[ "$RESULT" =~ (result=\"Spam\".*confidence=(1|0\.[6-9].)) ]]; then
    echo "$f ${BASH_REMATCH[1]}"
  fi
done

It will output a list of messages to check. If a message is indeed in the wrong place, move it to the correct folder.

/path/to/spam/message123 result="Innocent"; class="Innocent"; probability=0.0000; confidence=0.73
/path/to/ham/message234 result="Spam"; class="Spam"; probability=0.0005; confidence=0.61
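
If the list is long, it may help to review the most confident disagreements first. The sketch below assumes the output of the check has been saved to a file in the format shown above; the file name `report.txt` and the helper name `sort_by_confidence` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: order report lines from highest to lowest confidence.
sort_by_confidence() {
  # Prefix each line with its confidence value, sort numerically in
  # descending order, then strip the prefix again. Lines without a
  # confidence= field are dropped.
  sed -n 's/^\(.*confidence=\([0-9.]*\).*\)$/\2 \1/p' "$1" |
    LC_ALL=C sort -rn |
    cut -d' ' -f2-
}

# Usage (hypothetical file name):
#   sort_by_confidence report.txt
```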

If you move a lot of messages around, retrain and run the check again.

If it keeps flagging certain messages incorrectly, you can script a whitelist to ignore those files.
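
A minimal sketch of such a whitelist filter follows. It assumes the check output has been saved to a report file, that the whitelist is a plain text file with one full message path per line, and that paths contain no whitespace; the helper name `filter_whitelisted` and both file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: drop report lines whose message path is whitelisted.
#   $1 = report file (path is the first whitespace-separated field)
#   $2 = whitelist file (one full path per line)
filter_whitelisted() {
  # Read the whitelist first and remember its paths, then print only
  # report lines whose first field is not among them. Comparing FILENAME
  # with ARGV[1] keeps this correct even when the whitelist is empty.
  awk 'FILENAME == ARGV[1] { skip[$0] = 1; next } !($1 in skip)' "$2" "$1"
}

# Usage (hypothetical file names):
#   filter_whitelisted report.txt whitelist.txt
```

Matching on the exact path rather than on a substring avoids accidentally hiding other messages whose names merely start with a whitelisted name.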