As used in the RescoreMassCheck process.
Firstly, check that the rules and logs are both relatively clean and ready to use.
Copy/link the full source logs to "ham-full.log" and "spam-full.log" in the masses directory. Then:
    cd masses
    make clean
    rm -rf ORIG NSBASE SPBASE ham-validate.log spam-validate.log ham.log spam.log
    svn revert ../rules/50_scores.cf
    ln -s ham-full.log ham.log
    ln -s spam-full.log spam.log
    make freqs SCORESET=3
    less freqs
Go through the HitFrequencies report in freqs and check:
Save a copy of freqs, then generate ranges:
    cp freqs freqs.full
    make > make.out 2>&1
    less tmp/ranges.data

Examine tmp/ranges.data and check:
To prepare your environment for running the rescorer:
    rm -rf ORIG NSBASE SPBASE ham-validate.log spam-validate.log ham.log spam.log
    mkdir ORIG
    for CLASS in ham spam ; do
      ln $CLASS-full.log ORIG/$CLASS.log
      for I in 0 1 2 3 ; do
        ln -s $CLASS.log ORIG/$CLASS-set$I.log
      done
    done
Copy a config file from "config.set0"/"set1"/"set2"/"set3" to "config", and execute the runGA script. runGA generates and uses a randomly selected corpus with 90% being used for training and 10% being used for testing.
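To make the 90%/10% split concrete, here is a minimal sketch of a random line-by-line split. It is not how runGA selects its corpus (that logic lives in runGA itself); the function name `split_log`, the seed argument, and the demo file names are hypothetical, and it assumes each line of the input is one log entry.

```shell
# Hypothetical helper: split_log INPUT TRAIN TEST [SEED]
# Sends each input line to TRAIN with probability 0.9, otherwise to TEST.
split_log() {
  # Create both outputs up front so they exist even if one receives no lines.
  : > "$2"
  : > "$3"
  awk -v seed="${4:-1}" -v train_file="$2" -v test_file="$3" '
    BEGIN { srand(seed) }
    { if (rand() < 0.9) print > train_file; else print > test_file }
  ' "$1"
}

# Demo on throwaway data (1000 numbered lines), not on real mass-check logs.
tmp=$(mktemp -d)
seq 1 1000 > "$tmp/demo.log"
split_log "$tmp/demo.log" "$tmp/train.log" "$tmp/test.log" 42
wc -l "$tmp/train.log" "$tmp/test.log"
```

Because the split is random, the two files will only be approximately 90% and 10% of the input.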
You need to ensure an up-to-date version of Perl is used. On the zone, this is /local/perl586.

    export PATH=/local/perl586/bin:$PATH
    nohup bash runGA &
    tail -f nohup.out
Monitor progress. Once the GA is compiled and starts running, if the FP%/FN% rates are too poor, it may be worth CTRL-C'ing the runGA process and running a new one "by hand" with different switches:

    ./garescorer -b 5.0 -s 100 -t 5.0

If you do this, though, you will have to cut and paste the post-GA commands (in the "POST-GA COMMANDS" section of runGA) by hand!
Once the GA run is complete, and you're happy with the accuracy: You will find your results in a directory of the form "gen-$NAME-$HAM_PREFERENCE-$THRESHOLD-$EPOCHS-$NOTE-ga".
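As an illustration of how that directory name is put together, the sketch below expands the pattern with placeholder values. Every value here is hypothetical, not a runGA default; the real values come from your "config" file and the runGA run.

```shell
# All values below are hypothetical placeholders, not runGA defaults.
NAME=set3
HAM_PREFERENCE=5.0
THRESHOLD=5.0
EPOCHS=100
NOTE=demo

RESULT_DIR="gen-$NAME-$HAM_PREFERENCE-$THRESHOLD-$EPOCHS-$NOTE-ga"
echo "$RESULT_DIR"

# List any result directories that already exist in the current directory.
ls -d gen-*-ga 2>/dev/null || echo "no result directories yet"
```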
Compare the FP%/FN% rates listed in gen-*/test with those in gen-*/scores. gen-*/scores is the output from the perceptron, and should match the gen-*/test output (which is computed on a separate subset of the mail messages) to within a few tenths of a percent. This checks:
Once you're satisfied, check in ../rules/50_scores.cf. Copy the "config" file back to "config.setN" where "N" is the current scoreset, and check that in. Then, add a comment to the rescoring bugzilla bug, noting:
Next, carry on with other steps from RescoreMassCheck (if that's what you're doing).
To get garescorer to build with the above "make > make.out 2>&1" command on an Ubuntu Maverick machine, I installed the libpgapack-serial1 package, and ran:
    mkdir -p /local/pgapack-1.0.0.1/lib
    ln -s /usr/lib /local/pgapack-1.0.0.1/lib/sun4
    mkdir -p /local/pgapack-1.0.0.1
    ln -s /usr/include/pgapack-serial /local/pgapack-1.0.0.1/include
The first symptom you are likely to see of this problem is the error:
    time: cannot run ./garescorer: No such file or directory
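That error only says the binary is missing; the underlying compile or link failure should be in make.out. A minimal sketch of a check to run in the masses directory after the build follows. The function name `check_garescorer` and the grep pattern are assumptions for illustration, not part of the masses tooling.

```shell
# Hypothetical helper: check_garescorer [DIR]
# Reports whether an executable garescorer exists in DIR (default: current dir).
check_garescorer() {
  dir="${1:-.}"
  if [ -x "$dir/garescorer" ]; then
    echo "garescorer built OK"
  else
    echo "garescorer missing; possible causes from make.out:"
    # grep exits non-zero when nothing matches, so do not treat that as fatal.
    grep -n -i -E 'error|pgapack' "$dir/make.out" 2>/dev/null || true
    return 1
  fi
}

check_garescorer . || true
```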
To take advantage of multiple CPU cores, use pgapack-mpi (which appears to be broken on Ubuntu), and run it as:

    mpirun -np 4 ./garescorer -b 10 -e 5500 -t 5.0

Replace "4" with your number of CPU cores. Note, however, that this appears to cause redundant processing rather than distributing the load.
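The core count can be detected instead of typed by hand. This is a minimal sketch, assuming either `nproc` (GNU coreutils) or `getconf _NPROCESSORS_ONLN` is available; the variable name `NCORES` is just for illustration. It prints the command rather than running it, so you can review it first.

```shell
# Prefer nproc; fall back to getconf, then to a single core.
NCORES=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Print the command instead of executing it.
echo "mpirun -np $NCORES ./garescorer -b 10 -e 5500 -t 5.0"
```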