Rescore Mass-Check

This is the procedure we use to generate new scores. It takes quite a while and is labour-intensive, so we do it infrequently.

We generate new scores by analyzing a massive collection of mail (a "corpus") and running software that searches for the score set under which the maximum possible number of mails in that corpus are correctly classified (i.e. so that SA thinks the ham messages are nearly all ham, and the spam messages are nearly all spam).
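To make that objective concrete, here is a minimal sketch of how one candidate set of scores could be evaluated against a hand-sorted corpus. The rule names, scores, and messages are made up for illustration; the 5.0 threshold is SpamAssassin's default `required_score`. The sketch only counts misclassifications for a given candidate, it is not the optimizer used for the real rescoring.

```python
# Hypothetical per-rule scores; real rule names and values will differ.
SCORES = {"RULE_A": 3.2, "RULE_B": 2.1, "RULE_C": -1.5}
THRESHOLD = 5.0  # SpamAssassin's default required_score


def classify(hits, scores, threshold=THRESHOLD):
    """Return True (spam) if the summed scores of the rules hit reach the threshold."""
    return sum(scores.get(rule, 0.0) for rule in hits) >= threshold


def count_errors(corpus, scores, threshold=THRESHOLD):
    """corpus: iterable of (is_spam, hits) pairs.

    Returns (false_positives, false_negatives): ham classified as spam,
    and spam classified as ham.
    """
    fp = fn = 0
    for is_spam, hits in corpus:
        verdict = classify(hits, scores, threshold)
        if verdict and not is_spam:
            fp += 1
        elif is_spam and not verdict:
            fn += 1
    return fp, fn


# Tiny made-up corpus: (hand-sorted label, rules that hit).
CORPUS = [
    (True, ["RULE_A", "RULE_B"]),    # spam, above the threshold: caught
    (True, ["RULE_A"]),              # spam, below the threshold: false negative
    (False, ["RULE_B", "RULE_C"]),   # ham, below the threshold: correct
    (False, ["RULE_A", "RULE_B"]),   # ham, above the threshold: false positive
]

false_positives, false_negatives = count_errors(CORPUS, SCORES)
```

A rescoring run is, in effect, a search over `SCORES` for values that keep both counts as low as possible across the whole corpus.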

Summary

The corpus consists of many pieces (approximately 1 million) of real-world, hand-sorted mail.

A smallish number of people (about 15), including the developers themselves, work as volunteer "corpus submitters". They hand-classify their mail, run mass-check over it, and submit the output logs that mass-check generates. Occasionally people review the submitted logs for obvious mistakes, but it is largely a trust system.

If you want to see the statistics from the last corpus run, check the STATISTICS.txt files that come in the SA tarball. They tell you how many emails were used and what the hit rates of all the rules were.

Procedure

Here's the process for generating the scores as of SpamAssassin 3.1.0:

1. heads-up

Inform everyone in advance on the -users and -dev lists that we will be starting mass-checks shortly, and they should get their corpora nice and clean (see CorpusCleaning) and sign up for RsyncAccounts.

Enable all rules using the helper script:

  masses/enable-all-evolved-rules < rules/50_scores.cf  \
                           > rules/51_newscores.cf
  mv rules/51_newscores.cf rules/50_scores.cf
  svn diff     # and ensure it looks sane
  svn commit   # create a new bug attachment for review if in R-T-C mode

Build a prerelease tarball using build/update_stable. See build/README for details on the build process.

2. announce mass-check

RescoreDetails is the full announcement text (and instructions) for this phase. It's sufficient just to send out a mail something like the one we used in 3.1.0:

To: users
Cc: dev
Subject: NOTICE: 3.1.0 rescoring mass-checks

OK, if you're planning to send us mass-check logs for the 3.1.0
rescoring, now's the time!

http://wiki.apache.org/spamassassin/RescoreDetails has all the
details.

cheers!

--j.

We then take the log files rsync'd up to the server and use those logs for all 4 score sets. The initial logs are for score set 3 (the fourth); sets 0, 1, and 2 can be generated from set 3 by stripping out the network tests and/or the Bayes tests.
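The derivation of the lower score sets can be sketched as follows. The set numbering follows SpamAssassin's `score` convention (0: neither Bayes nor network tests, 1: network tests only, 2: Bayes only, 3: both). How a rule is recognised as a network or Bayes test here (a hand-maintained set of names and a `BAYES_` prefix) is an assumption for illustration, as are the rule names in `NET_RULES`; this is not how the actual scripts make that decision.

```python
# Hypothetical names of rules that need network access. Illustrative only.
NET_RULES = {"RCVD_IN_EXAMPLE_DNSBL", "URIBL_EXAMPLE"}


def is_bayes(rule):
    # Assumption: Bayes rules are recognised by a BAYES_ name prefix.
    return rule.startswith("BAYES_")


def hits_for_set(hits, score_set):
    """Reduce a set-3 hit list to what score set 0, 1, 2 or 3 would have seen."""
    if score_set not in (0, 1, 2, 3):
        raise ValueError("score set must be 0, 1, 2 or 3")
    use_net = score_set in (1, 3)    # sets 1 and 3 include network tests
    use_bayes = score_set in (2, 3)  # sets 2 and 3 include Bayes tests
    return [
        rule
        for rule in hits
        if (use_net or rule not in NET_RULES) and (use_bayes or not is_bayes(rule))
    ]


# One message's hits as logged with everything enabled (score set 3).
set3_hits = ["HTML_MESSAGE", "BAYES_99", "URIBL_EXAMPLE"]
per_set = {n: hits_for_set(set3_hits, n) for n in range(4)}
```

Because every lower set is a pure filter over the set-3 hits, submitters only need to run the full mass-check once.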

3. allow several days to complete (it takes a really long time!)

Allow enough time, including a weekend if possible, so that people who are busy with day-job stuff can get around to running it.

4. generate scores for score sets

See RunningPerceptron.

Once this is complete, update rules/50_scores.cf with the generated scores.
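As a rough illustration of what ends up in that file: a `score` line in a SpamAssassin configuration file gives a rule name followed by up to four values, one per score set in the order 0 to 3. The helper below renders such lines from a dictionary of generated scores. The function name, rule names, and numbers are hypothetical, and this is not the tooling used in the actual procedure.

```python
def format_score_lines(scores_by_rule):
    """Render 'score NAME s0 s1 s2 s3' lines, sorted by rule name.

    scores_by_rule maps a rule name to four numbers, one per score set 0..3.
    """
    lines = []
    for rule in sorted(scores_by_rule):
        values = scores_by_rule[rule]
        if len(values) != 4:
            raise ValueError(f"{rule}: expected 4 scores, got {len(values)}")
        lines.append("score " + rule + " " + " ".join(f"{v:.3f}" for v in values))
    return lines


# Made-up output of a rescoring run.
generated = {
    "RULE_B": [0, 0.5, 1, 1.5],
    "RULE_A": [1.0, 2.0, 3.0, 4.0],
}
score_lines = format_score_lines(generated)
```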
