Rescore Mass-Check Instructions

Follow this procedure if you wish to submit data for the 3.1.0 rescoring run using MassCheck:

Clean up the corpus of mail you intend to MassCheck (see CorpusCleaning), and get an rsync account (see RsyncAccounts). The rsync account isn't needed until the end, so you can request it while mass-check is running; the 'checking for false positives and false negatives' stage of corpus cleaning can be done afterwards as well.

It's helpful, but not required, to have some or all of the helper applications installed:

  • the Mail::SPF::Query module
  • the Net::DNS module
  • Pyzor

If you're running nightly mass-checks, please feel free to disable them when running the rescore mass-check runs. Also, please note that the nightly submission accounts will work for rescore submissions as well.

Then run these commands:

  wget http://people.apache.org/~jm/devel/Mail-SpamAssassin-3.1.0-pre2.tar.gz
  tar xvfz Mail-SpamAssassin-3.1.0-pre2.tar.gz
  cd Mail-SpamAssassin-3.1.0
  perl Makefile.PL < /dev/null
  make

  cd masses
  mkdir spamassassin
  rm -f spamassassin/*
  echo "bayes_auto_learn 0" > spamassassin/user_prefs
  echo "lock_method flock" >> spamassassin/user_prefs
  echo "bayes_store_module Mail::SpamAssassin::BayesStore::SDBM" >> spamassassin/user_prefs
  echo "use_auto_whitelist 0" >> spamassassin/user_prefs

  nohup ./mass-check --bayes --net -j 4 --restart=400 --learn=35 --reuse \
        --after=1072933200 <targets>

<targets> is the list of directories, mboxes, etc., like
spam:dir:~/Mail/spam. See the comments at the top of "mass-check" for details.
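As an illustration, a target list for one ham directory and one spam directory might be put together like this. The paths are hypothetical, and the command is only echoed so you can inspect it before running it for real. The sketch uses $HOME rather than ~ because the shell does not expand a tilde in the middle of an argument.

```shell
# Hypothetical corpus locations; substitute your own.
HAM_TARGET="ham:dir:$HOME/Mail/ham"
SPAM_TARGET="spam:dir:$HOME/Mail/spam"

# Print the full command line for review instead of running it.
echo nohup ./mass-check --bayes --net -j 4 --restart=400 --learn=35 --reuse \
     --after=1072933200 "$HAM_TARGET" "$SPAM_TARGET"
```

Once the printed command looks right, drop the leading `echo` to start the run.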

This takes *ages* to run. -j 4 controls the number of processes to use; 4 should be OK for a single-processor machine, since most of the time they'll be waiting for network results to arrive. If you have adequate RAM and don't mind the load, you can use -j 6 or -j 8. There's not much benefit in going higher than -j 8.

The --after=1072933200 option tells mass-check to ignore messages more than about 18 months old (in this case, anything from before January 1 2004). The value is a Unix timestamp, in seconds. This is useful if your corpus has older messages intermingled with your newer messages.
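If you want to check the cutoff or compute a different one, something like the following should work. It assumes GNU date (for the `-d` option); the value used above corresponds to midnight on January 1 2004 at UTC-5.

```shell
# Unix timestamp for 2004-01-01 00:00:00 at UTC-5 (assumes GNU date).
after=$(date -d '2004-01-01 00:00:00 -0500' +%s)

# prints --after=1072933200
echo "--after=$after"
```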

If you have an unusual network layout, you may need to specify trusted_networks and/or internal_networks in the spamassassin/user_prefs file, though SA should be able to infer them in most cases. If RCVD_IN_XBL hits less than roughly 10% to 15% of your spam, then you might need to use these configuration parameters.
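One rough way to estimate that hit rate from a partial or finished spam.log is sketched below. `rule_hit_rate` is a hypothetical helper, and the sketch assumes the log holds one message per line, lists the rules hit as comma-separated names, and marks comment lines with a leading '#'. Check your own log before relying on it.

```shell
# Hypothetical helper: percentage of logged messages that hit a rule.
# Assumes one message per line and '#' at the start of comment lines.
rule_hit_rate() {
    rule=$1
    log=$2
    awk -v rule="$rule" '
        /^#/    { next }    # skip comment lines
        NF == 0 { next }    # skip blank lines
        { total++ }
        # Match the rule only as a whole name, bounded by space or comma.
        $0 ~ ("(^|[ ,])" rule "([ ,]|$)") { hits++ }
        END {
            if (total == 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * hits / total
        }
    ' "$log"
}

# Report the RCVD_IN_XBL rate if a spam.log is present.
if [ -f spam.log ]; then
    rule_hit_rate RCVD_IN_XBL spam.log
fi
```

A result well below 10 suggests trying the trusted_networks and internal_networks settings.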

Once it finishes:

  USER="[whatever your username is]"
  RSYNC_PASSWORD="[whatever your password is]"
  export RSYNC_PASSWORD

  rsync -CPcvuzb ham.log $USER@rsync.spamassassin.org::submit/ham-bayes-net-$USER.log
  rsync -CPcvuzb spam.log $USER@rsync.spamassassin.org::submit/spam-bayes-net-$USER.log

That's it!

The results for this run will need to be in by Wednesday July 6th. If you're still running then, submit what you have so far and beg for more time. ;)
