Comment: [Original edit by HenrikKrohns]

...

This assumes your corpus is in Maildir format (one file per message).
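
The one-file-per-message assumption can be checked before training. The following is a minimal sketch, assuming each folder uses the standard Maildir cur/ and new/ subdirectories; the function name list_maildir_messages is hypothetical and not part of DSPAM.

```shell
# Hypothetical helper: list message files in one Maildir folder.
# Assumption: the folder has the standard cur/ and new/ subdirectories.
# tmp/ is skipped because it may hold partially delivered files.
list_maildir_messages() {
  for sub in cur new; do
    if [ -d "$1/$sub" ]; then
      find "$1/$sub" -type f
    fi
  done
}

# Example: count the messages in a spam folder (the path is a placeholder).
list_maildir_messages /path/to/spam | wc -l
```

If the count looks wrong, the folder may contain index or control files that a plain find -type f would also pick up.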

Make sure your PATH includes $HOME/dspam/bin if DSPAM is installed there.
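
A minimal sketch of the PATH setup, assuming DSPAM was installed under $HOME/dspam as in the examples on this page. The variable name DSPAM_BIN is only illustrative.

```shell
# Assumption: DSPAM lives under $HOME/dspam; adjust DSPAM_BIN otherwise.
DSPAM_BIN="$HOME/dspam/bin"

# Prepend the directory only if it is not already on PATH.
case ":$PATH:" in
  *":$DSPAM_BIN:"*) ;;
  *) PATH="$DSPAM_BIN:$PATH"; export PATH ;;
esac

# Report whether the dspam binary can now be found.
if command -v dspam >/dev/null 2>&1; then
  echo "dspam found at: $(command -v dspam)"
else
  echo "dspam not found on PATH; check the install location" >&2
fi
```

Put the first two steps in your shell startup file if you want the setting to persist across sessions.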

You can experiment with different learning methods. It is probably best to feed all manually verified messages first with --source=corpus. Training is not an exact science, so different methods, or a mix of them, may produce different false positives (FPs) and false negatives (FNs).

Learn the corpus (method 1):

No Format
# Always clear old data first
rm -rf $HOME/dspam/var
# This will learn the folders with --source=error
dspam_train $LOGNAME /path/to/spam /path/to/ham

Learn the corpus (method 2):

No Format

# Clear old data unless you are learning some additional corpus
rm -rf $HOME/dspam/var
# Feed your folders with --source=corpus
find /path/to/spam -type f | while read -r f; do
  dspam --user $LOGNAME --source=corpus --class=spam < "$f"
done
find /path/to/ham -type f | while read -r f; do
  dspam --user $LOGNAME --source=corpus --class=innocent < "$f"
done

Check the corpus:

No Format
#!/bin/bash
find /path/to/spam -type f | while read -r f; do
  RESULT=$($HOME/dspam/bin/dspam --user $LOGNAME --classify < "$f")
  # Spam classified as Innocent is a false negative; tune the confidence >= 0.6 check if needed
  if [[ "$RESULT" =~ (result=\"Innocent\".*confidence=(1|0\.[6-9].)) ]]; then
    echo "$f ${BASH_REMATCH[1]}"
  fi
done
find /path/to/ham -type f | while read -r f; do
  RESULT=$($HOME/dspam/bin/dspam --user $LOGNAME --classify < "$f")
  # Ham classified as Spam is a false positive; tune the confidence >= 0.6 check if needed
  if [[ "$RESULT" =~ (result=\"Spam\".*confidence=(1|0\.[6-9].)) ]]; then
    echo "$f ${BASH_REMATCH[1]}"
  fi
done
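
To compare confidence values numerically instead of editing the regular expression, the two fields can be pulled out of the classification line first. This is a minimal sketch with a hypothetical helper, parse_dspam_result. It assumes only what the patterns above rely on: the line contains result="..." followed later by confidence=<number>. The sample input is illustrative and was not captured from a real run.

```shell
# Hypothetical helper: print "<result> <confidence>" for one classification line.
# Assumption: the line contains result="..." followed later by confidence=<number>.
# Prints nothing if the line does not match.
parse_dspam_result() {
  printf '%s\n' "$1" |
    sed -n 's/.*result="\([A-Za-z]*\)".*confidence=\([0-9.]*\).*/\1 \2/p'
}

# Illustrative input only.
SAMPLE='result="Innocent"; confidence=0.73'
parse_dspam_result "$SAMPLE"
# prints: Innocent 0.73
```

The parsed confidence can then be compared against any threshold, for example with awk, which avoids encoding the cutoff in a character class such as [6-9].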

...