Running the GA to generate scores
As used in the RescoreMassCheck process.
Firstly, check that the rules and logs are both relatively clean and ready to use.
Copy/link the full source logs to "ham-full.log" and "spam-full.log" in the masses directory. Then:
cd masses make clean rm -rf ORIG NSBASE SPBASE ham-validate.log spam-validate.log ham.log spam.log svn revert ../rules/50_scores.cf ln -s ham-full.log ham.log ln -s spam-full.log spam.log make freqs SCORESET=3 less freqs
Go through the HitFrequencies report in freqs and check:
- ALL_TRUSTED hitrate on spam. This should appear only in ham.
- unfamiliar rules with high ham hitrates; they could be easily forgeable. comment them or mark them "tflags nopublish".
- NO_RECEIVED hitrate in spam.
- NO_RELAYS hitrate in spam.
Save a copy of freqs, then generate ranges:
cp freqs freqs.full make > make.out 2>&1 less tmp/ranges.data
examine tmp/ranges.data and check:
- ranges that are 0.000 0.000 0 for no obvious reason;
- rules named with a "T_" prefix. These can sometimes slip through if used in promoted meta rules. They should be fixed to not include a "T_" prefix in the rulesrc source file. (that should be the only way that a T_ rule will appear in the output; "real" sandbox T_ rules should be removed already, since you deleted the sandbox rule file.)
To prepare your environment for running the rescorer:
rm -rf ORIG NSBASE SPBASE ham-validate.log spam-validate.log ham.log spam.log mkdir ORIG for CLASS in ham spam ; do ln $CLASS-full.log ORIG/$CLASS.log for I in 0 1 2 3 ; do ln -s $CLASS.log ORIG/$CLASS-set$I.log done done
Copy a config file from "config.set0"/"set1"/"set2"/"set3" to "config", and execute the runGA script. runGA generates and uses a randomly selected corpus with 90% being used for training and 10% being used for testing.
You need to ensure an up-to-date version of perl is used. On the zone, this is /local/perl586.
export PATH=/local/perl586/bin:$PATH nohup bash runGA & tail -f nohup.out
monitor progress... once the GA is compiled, and starts running, if the FP%/FN% rates are too crappy, it may be worth CTRL-C'ing the runGA process and running a new one "by hand" with different switches:
./garescorer -b 5.0 -s 100 -t 5.0
if you do this though you will have to cut and paste the post-GA commands (in the "POST-GA COMMANDS" section of runGA) by hand!
Once the GA run is complete, and you're happy with the accuracy: You will find your results in a directory of the form "gen-$NAME-$HAM_PREFERENCE-$THRESHOLD-$EPOCHS-$NOTE-ga".
Compare the listed FP%/FN% rate on gen-*/test to gen-*/scores; gen-*/scores is the output from the perceptron, and should match within a few 0.1% to gen-*/test output (which is computed on a separate subset of the mail messages). This checks:
- that the mail messages are diverse enough to avoid overfitting (hence the different test and train sets)
- that the FP%/FN% computations are not losing precision due to C-vs-Perl floating-point bugs, or a differing idea of what rules are promoted vs not promoted between the C and Perl code.
Once you're satisfied, check in ../rules/50_scores.cf. Copy the "config" file back to "config.setN" where "N" is the current scoreset, and check that in. Then, add a comment to the rescoring bugzilla bug, noting:
- the "gen-*/test" file contents, with FP%/FN% rate
- the "gen-*" path for later reference
next, carry on with other steps from RescoreMassCheck (if that's what you're doing).
To get garescorer to build with the above "make > make.out 2>&1" command on an Ubuntu Maverick machine, I installed the libpgapack-serial1 package, and ran:
mkdir -p /local/pgapack-22.214.171.124/lib ln -s /usr/lib /local/pgapack-126.96.36.199/lib/sun4 mkdir -p /local/pgapack-188.8.131.52 ln -s /usr/include/pgapack-serial /local/pgapack-184.108.40.206/include
The first symptom you are likely to see of this problem is the error:
time: cannot run ./garescorer: No such file or directory
To take advantage of multiple CPU cores, use pgapack-mpi (which appears to be broken on ubuntu), and run it as:
mpirun -np 4 ./garescorer -b 10 -e 5500 -t 5.0
Replace "4" with your number of CPU cores. Although it looks like this causes redundant processing instead of distributed load.