Rescore Mass-Check
(see RescoreMassCheck310 and RescoreMassCheck320 for the procedures used in historical releases)
This is the procedure we use to generate new scores. It takes quite a while and is labour-intensive, so we do it infrequently.
...
Here's the process for generating the scores as of SpamAssassin 3.3.0:
1. heads-up
Inform everyone in advance on the users and dev lists that we will be starting mass-checks shortly, and they should get their corpora nice and clean (see CorpusCleaning) and sign up for RsyncAccounts.
...
No Format |
---|
masses/enable-all-evolved-rules < rules/50_scores.cf \
    > rules/51_newscores.cf
mv rules/51_newscores.cf rules/50_scores.cf
svn diff      [and ensure it looks sane]
svn commit    [create a new bug attachment for review if in R-T-C mode]
|
Copy the nightly-log-submission rsync accounts to the rescore-log-submission accounts (see RsyncConfig) (not clear why we don't just use one set of accounts here, but hey):
No Format |
---|
ssh spamassassin.zones.apache.org
sudo cp /home/corpus-rsync/secrets /home/corpus-rsync/secrets-submit
|
Move the old rescore logs from the previous release (if they're still around) to the archives:
No Format |
---|
ssh spamassassin.zones.apache.org
cd /home/corpus-rsync
OLDVERSION="3.2"
sudo mv corpus/submit scoregen-$OLDVERSION
sudo mkdir corpus/submit
sudo chown rsync corpus/submit
sudo gtar cvfz ARCHIVE/scoregen-$OLDVERSION.tgz scoregen-$OLDVERSION
|
...
No Format |
---|
svn export http://svn.apache.org/repos/asf/spamassassin/trunk mcsnapshot
tar cvfz mcsnapshot.tgz mcsnapshot
svn cp \
    https://svn.apache.org/repos/asf/spamassassin/trunk \
    https://svn.apache.org/repos/asf/spamassassin/tags/3_3_0_mcsnapshot_1
|
(we can't use the standard build process here anymore since the dist tarball no longer includes "masses". Use a descriptive, unique tag name.)
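One way to keep tag names descriptive and unique is to derive them from the release version plus a run counter, following the pattern of the tag used above. This is only a sketch: the function name mcsnapshot_tag and its two arguments are hypothetical and not part of the SpamAssassin tree.

```shell
# Hypothetical helper (not part of the SpamAssassin tree): build a
# snapshot tag name from a version string and a sequence number.
mcsnapshot_tag() {
  version=$1    # release being rescored, e.g. 3.3.0
  seq=${2:-1}   # bump this for each further snapshot of the same release
  # Dots are replaced with underscores to match the existing tag style.
  printf '%s_mcsnapshot_%s\n' "$(printf '%s' "$version" | tr . _)" "$seq"
}
```

For example, `mcsnapshot_tag 3.3.0 2` prints `3_3_0_mcsnapshot_2`, which can then be appended to the tags/ URL in the svn cp command.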
...
RescoreDetails is the full announcement text (and instructions) for this phase. It's sufficient just to send out a mail something like the one we used in previous releases:
No Format |
---|
To: users
Cc: dev
Subject: NOTICE: 3.3.0 rescoring mass-checks

OK, if you're planning to send us mass-check logs for the 3.3.0
rescoring, now's the time!
http://wiki.apache.org/spamassassin/RescoreDetails has all the details.

cheers!

--j.
|
...
We then take the log files rsync'd up to the server, and use those logs for all 4 score sets. The initial logs are for score set 3 (the fourth); sets 0, 1, and 2 can be generated from set 3 by stripping out the network tests and/or the Bayes tests.
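As a reminder of which tests have to be stripped for each score set, here is a small lookup sketch. The numbering follows the documented meaning of the four values of the "score" setting (set 0: no Bayes, no network; set 1: network only; set 2: Bayes only; set 3: both). The function name scoreset_strip is hypothetical.

```shell
# Hypothetical helper: given a score set number, say what must be
# stripped from the set 3 (Bayes + network) logs to obtain it.
scoreset_strip() {
  case "$1" in
    0) echo "strip network and Bayes tests" ;;   # Bayes off, network off
    1) echo "strip Bayes tests" ;;               # Bayes off, network on
    2) echo "strip network tests" ;;             # Bayes on, network off
    3) echo "strip nothing" ;;                   # Bayes on, network on
    *) echo "unknown score set: $1" >&2 ; return 1 ;;
  esac
}
```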
4.05. publish logs to ruleqa site
This will make the mass-check results visible on http://ruleqa.spamassassin.org/ (under the appropriate DateRev), using usernames starting with "rescore-". TODO: this doesn't include filtering out too-old logs (see below), so won't necessarily match the freqs produced later.
No Format |
---|
ssh spamassassin2.zones.apache.org
cd /export/home/corpus-rsync/corpus

echo '# mass-check results from someone@rescore, on Tue Sep 30 09:00:00 UTC 2009
# M:SA version 3.3.0-alpha3-r808953
# SVN revision: 808953
# Date: 20090930T090000Z
#' > /tmp/hdr

for f in submit/*.log ; do
  i=`echo $f | sed -e 's,^submit/,,' -e 's/^\(.*am\)-bayes-net-\([^\.]*\.log\)$/\1-rescore-\2/'`
  echo "$f => $i"
  sudo touch tmpf ; sudo chmod 666 tmpf
  cat < /tmp/hdr > tmpf
  sed -e '/^#/d' < $f >> tmpf
  sudo chmod 644 tmpf
  sudo mv tmpf $i ; sudo chown rsync $i
done
|
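The renaming done by the two sed expressions in the loop above is easy to get wrong, so it can help to try it on sample names first. The sketch below wraps the same expressions in a function; the name ruleqa_name and the sample user name "jm" are hypothetical.

```shell
# Same two sed expressions as in the publishing loop, wrapped in a
# function (hypothetical name) so they can be tried on sample file
# names without touching the corpus directory.
ruleqa_name() {
  echo "$1" | sed -e 's,^submit/,,' \
      -e 's/^\(.*am\)-bayes-net-\([^\.]*\.log\)$/\1-rescore-\2/'
}
```

With this, `ruleqa_name submit/ham-bayes-net-jm.log` prints `ham-rescore-jm.log`. A name that does not fit the pattern, for instance a user name containing a dot, only loses its submit/ prefix and is otherwise left unchanged.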
4.1. filter out too-old logs
No Format |
---|
ssh spamassassin.zones.apache.org
cd /home/jm/ftp/spamassassin/masses    [or wherever]
./log-grep-recent -m 38 /home/corpus-rsync/corpus/submit/ham-*.log > ham-full.log
./log-grep-recent -m 6 /home/corpus-rsync/corpus/submit/spam-*.log > spam-full.log
|
We may have to tweak the number of months specified for each type if the grep yields too much or too little mail, but 38 months for ham and 6 months for spam worked well for 3.3.0.
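To judge whether the month limits need tweaking, it is enough to count the messages that survived the grep. The sketch below assumes one message per non-comment line, as the '#' header stripping in the publishing step suggests; the function names are hypothetical.

```shell
# Count result lines (anything that is neither blank nor a '#'
# comment) in a mass-check log. Hypothetical helper name.
count_results() {
  grep -c '^[^#]' "$1"
}

# Print both totals side by side for a quick sanity check.
corpus_summary() {
  printf 'ham: %s  spam: %s\n' "$(count_results "$1")" "$(count_results "$2")"
}
```

Running `corpus_summary ham-full.log spam-full.log` after the two log-grep-recent commands gives the totals to compare before deciding on a different -m value.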
4.2. tweak rules for evolver
...
No Format |
---|
cd /path/to/checkout/of/trunk
svn co \
    https://svn.apache.org/repos/asf/spamassassin/tags/3_3_0_mcsnapshot_1/rules \
    rules-mcsnapshot
cp rules-mcsnapshot/active.list rules/active.list
make
|
...
See RunningGa. (In the past we used RunningPerceptron, but it acted up during 3.3.0 generation, so we used the GA again.)
...
No Format |
---|
cd masses
tar cvfz rescore-logs.tgz gen-set{0,1,2,3}-*
|
...
(use "gtar" on the Solaris zone.)
These can be pretty big (although nowadays the scripts use hard links for the duplicate logfiles, which saves a lot of space).
Also, check in the "config" files you used for each scoreset:
No Format |
---|
svn commit -m "runGA config files used" masses/config.set*
|
6. upload the test logs to zone
Since files like STATISTICS can never be regenerated without the (randomised) test logs, these need to be saved, too. Currently, I think the best bet is to upload the rescore-logs.tgz file somewhere on spamassassin.zones.apache.org; it doesn't have to be in a public place, ASF-committer-account-required is fine. Just mention that path in the rescoring bug's comments. Last time, I did this:
No Format |
---|
sudo mkdir /home/corpus-rsync/ARCHIVE/3.3.0
sudo mv rescore-logs.tgz /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tgz
|
6.5. mark evolved-score rules as 'always published'
Normally, rules in the sandbox are promoted to the "active" 72_active.cf ruleset, or demoted to the "test" 70_sandbox.cf ruleset, based on their accuracy in the nightly mass-checks. However, now that the evolver has assigned scores for them, they need to be always published regardless of how they might do in the previous night's checks. Run:
No Format |
---|
cd masses
./force-publish-active-rules ../rules/active.list ../rulesrc/10_force_active.cf
svn commit -m "force publish of rescored rules" ../rulesrc/10_force_active.cf
|
6.6. fix test failures
Run prove -v t/basic_lint.t and prove -v t/meta.t. Manually edit the rules files to fix any test failures caused by the new scores. For example, some meta rules may now depend on rules that have been assigned a score of 0; either make those rules into __SUBRULES, or give them a score of 0.001.
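To get a quick list of candidates before running the tests, the sketch below looks for meta rules that mention a rule whose scores are all zero. It assumes scores appear as "score NAME n [n n n]" lines and meta rules as "meta NAME expression" lines on a single line each; the function names are hypothetical, and the output is only a hint, not a substitute for the prove runs.

```shell
# List rules whose listed scores are all zero. Anything after a '#'
# on the line is treated as a comment and ignored.
zero_scored_rules() {
  awk '$1 == "score" && NF >= 3 && $3 !~ /^#/ {
         zero = 1
         for (i = 3; i <= NF; i++) {
           if ($i ~ /^#/) break
           if ($i + 0 != 0) zero = 0
         }
         if (zero) print $2
       }' "$1"
}

# Report meta rules whose expression mentions a zero-scored rule.
# Usage: metas_using_zero_rules SCORES_FILE RULES_FILE...
metas_using_zero_rules() {
  scores=$1 ; shift
  zero_scored_rules "$scores" | while read -r rule ; do
    awk -v r="$rule" '$1 == "meta" {
           for (i = 3; i <= NF; i++) {
             # Split on operators and parentheses to get bare rule names.
             n = split($i, tok, /[^A-Za-z0-9_]+/)
             for (j = 1; j <= n; j++)
               if (tok[j] == r) { print $2 " depends on " r ; next }
           }
         }' "$@"
  done
}
```

A typical call would be `metas_using_zero_rules rules/50_scores.cf rules/*.cf`; each line of output names a meta rule and the zero-scored rule it refers to.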
7. upload proposed new scores
...
No Format |
---|
svn revert rules/50_scores.cf
wget -O newscores.diff 'http://bugzilla.spamassassin.org/....attachment?id=....'
patch -p0 < newscores.diff
|
Then, a little configuration: replace these with the paths to the correct gen-setN-* directories for the 4 score sets. The test logs that the stats are measured against will be taken from these directories. NOTE: don't cut and paste these! They will be different for your runs.
No Format |
---|
genset0=/home/corpus-rsync/corpus/scoregen-3.1/gen-set0-2.0-4.0-100-nobob
genset1=/home/corpus-rsync/corpus/scoregen-3.1/gen-set1-2.0-4.0-100-nobob
genset2=/home/corpus-rsync/corpus/scoregen-3.1/gen-set2-2.0-4.625-100-nobob
genset3=/home/corpus-rsync/corpus/scoregen-3.1/gen-set3-2.0-5.0-100-nobob
|
Once those vars are set, run these commands:
No Format |
---|
cd masses

rm ham*.log spam*.log ; touch ham.log spam.log
ln -s $genset0/NSBASE/ham-test.log ham-test.log
ln -s $genset0/SPBASE/spam-test.log spam-test.log
bash ./mk-baseline-results 0 > ../rules/STATISTICS-set0.txt

rm ham*.log spam*.log ; touch ham.log spam.log
ln -s $genset1/NSBASE/ham-test.log ham-test.log
ln -s $genset1/SPBASE/spam-test.log spam-test.log
bash ./mk-baseline-results 1 > ../rules/STATISTICS-set1.txt

rm ham*.log spam*.log ; touch ham.log spam.log
ln -s $genset2/NSBASE/ham-test.log ham-test.log
ln -s $genset2/SPBASE/spam-test.log spam-test.log
bash ./mk-baseline-results 2 > ../rules/STATISTICS-set2.txt

rm ham*.log spam*.log ; touch ham.log spam.log
ln -s $genset3/NSBASE/ham-test.log ham-test.log
ln -s $genset3/SPBASE/spam-test.log spam-test.log
bash ./mk-baseline-results 3 > ../rules/STATISTICS-set3.txt

cp config.set0 config ; bash ./runGA stats
cp config.set1 config ; bash ./runGA stats
cp config.set2 config ; bash ./runGA stats
cp config.set3 config ; bash ./runGA stats
|
There'll be a lot of output along these lines:
No Format |
---|
ignoring 'TO_ADDRESS_EQ_REAL': immutable and score == 0 |
But that can be ignored. (TODO: it'd be nice to make this step a little less labour-intensive.)
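As a possible first step towards that TODO, the repeated per-set commands can be generated by a loop instead of typed out. The sketch below only prints the commands so they can be reviewed before anything is run; the function name print_baseline_commands is hypothetical, and it assumes the four gen-setN-* directories are passed in order from set 0 to set 3.

```shell
# Hypothetical helper: print (not run) the mk-baseline-results
# command sequence for each score set. Arguments are the gen-setN-*
# directories, in order, starting with set 0.
print_baseline_commands() {
  n=0
  for dir in "$@" ; do
    printf 'rm ham*.log spam*.log ; touch ham.log spam.log\n'
    printf 'ln -s %s/NSBASE/ham-test.log ham-test.log\n' "$dir"
    printf 'ln -s %s/SPBASE/spam-test.log spam-test.log\n' "$dir"
    printf 'bash ./mk-baseline-results %d > ../rules/STATISTICS-set%d.txt\n' "$n" "$n"
    n=$((n + 1))
  done
}
```

From within the masses directory, `print_baseline_commands "$genset0" "$genset1" "$genset2" "$genset3"` shows the full sequence; once it looks right, the same output can be piped to bash.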
8. upload new stats files
...
And let all and sundry vote on that, too (or just check it in depending on whether you're in R-T-C or not). Once the new scores and STATS files are approved and into SVN, and the log data is in a safe archival spot on the zone, the bugzilla bug notes that location, and the "config" files are checked in, you're done.