Running the GA to generate scores

As used in the RescoreMassCheck process.

First, check that the rules and the logs are both relatively clean and ready to use.
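
A quick sanity pass over the logs can catch obvious problems (a minimal sketch; the corpus paths are hypothetical, and it assumes the usual mass-check log format where result lines start with "." or "Y" and comment lines with "#"):

No Format
for F in /path/to/corpus/ham-full.log /path/to/corpus/spam-full.log ; do
  echo "== $F"
  grep -c '^[.Y]' "$F"            # count of result lines
  grep -v '^[#.Y]' "$F" | head    # anything that looks malformed
done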

Copy or link the full source logs to "ham-full.log" and "spam-full.log" in the masses directory.
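
For example (the corpus paths here are hypothetical):

No Format
ln -s /path/to/corpus/ham-full.log ham-full.log
ln -s /path/to/corpus/spam-full.log spam-full.log

Then: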

No Format
cd masses

make clean
rm -rf ORIG NSBASE SPBASE ham-validate.log spam-validate.log ham.log spam.log
svn revert ../rules/50_scores.cf
ln -s ham-full.log ham.log
ln -s spam-full.log spam.log
make freqs SCORESET=3
less freqs

Go through the HitFrequencies report in freqs and check (a quick grep sketch follows this list):

  • the ALL_TRUSTED hitrate on spam; this rule should hit only in ham.
  • unfamiliar rules with high ham hitrates; they could be easily forgeable. Comment them out or mark them "tflags nopublish".
  • the NO_RECEIVED hitrate in spam.
  • the NO_RELAYS hitrate in spam.
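
A quick way to pull the rules called out above straight from the report (assuming the usual one-rule-per-line freqs layout):

No Format
grep -E 'ALL_TRUSTED|NO_RECEIVED|NO_RELAYS' freqs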

Save a copy of freqs, then generate ranges:

No Format
cp freqs freqs.full
make > make.out 2>&1
less tmp/ranges.data

Examine tmp/ranges.data and check for (a quick grep sketch follows this list):

  • ranges that are 0.000 0.000 0 for no obvious reason;
  • rules named with a "T_" prefix. These can sometimes slip through if they are used in promoted meta rules; fix them by removing the "T_" prefix in the rulesrc source file. (That should be the only way a T_ rule can appear in the output; "real" sandbox T_ rules should already be gone, since you deleted the sandbox rule file.)
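
A rough way to surface both cases, assuming each line of tmp/ranges.data looks like "<lo> <hi> <mutable-flag> <rulename>":

No Format
grep '^0\.000 0\.000 0 ' tmp/ranges.data     # zeroed-out ranges
grep ' T_' tmp/ranges.data                   # leftover T_ rule names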

To prepare your environment for running the rescorer:

No Format
rm -rf ORIG NSBASE SPBASE ham-validate.log spam-validate.log ham.log spam.log
mkdir ORIG
for CLASS in ham spam ; do
  ln $CLASS-full.log ORIG/$CLASS.log
  for I in 0 1 2 3 ; do
    ln -s $CLASS.log ORIG/$CLASS-set$I.log
  done
done
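
Before moving on, it is worth a quick look at the result; ORIG should contain hard-linked ham.log and spam.log plus the per-scoreset symlinks pointing at them:

No Format
ls -l ORIG/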

Score generation

Copy a config file from "config.set0"/"set1"/"set2"/"set3" to "config", and execute the runGA script. runGA randomly splits the corpus, using 90% of the messages for training and 10% for testing.
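
For example, for scoreset 3 (the set used in the freqs run above):

No Format
cp config.set3 config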

Make sure an up-to-date version of perl is used; on the zone, this is /local/perl586.

No Format
export PATH=/local/perl586/bin:$PATH
nohup bash runGA &
tail -f nohup.out

Monitor progress. Once the GA has compiled and started running, if the FP%/FN% rates look poor, it may be worth interrupting the runGA process with Ctrl-C and running a new one "by hand" with different switches:

No Format
./garescorer -b 5.0 -s 100 -t 5.0

If you do this, though, you will have to cut and paste the post-GA commands (in the "POST-GA COMMANDS" section of runGA) by hand!

Once the GA run is complete and you're happy with the accuracy, you will find the results in a directory of the form "gen-$NAME-$HAM_PREFERENCE-$THRESHOLD-$EPOCHS-$NOTE-ga".

Compare the listed FP%/FN% rates in gen-*/test and gen-*/scores; gen-*/scores is the output from the perceptron, and it should match the gen-*/test output (which is computed on a separate subset of the mail messages) to within a few tenths of a percent. This checks (a grep sketch follows this list):

  • that the mail messages are diverse enough to avoid overfitting (hence the separate training and test sets)
  • that the FP%/FN% computations are not losing precision to C-vs-Perl floating-point bugs, and that the C and Perl code agree on which rules are promoted and which are not.
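
To eyeball the two summaries side by side (a minimal sketch; the exact layout of the FP%/FN% lines in these files may differ, so adjust the pattern as needed):

No Format
grep -Hi -e 'fp' -e 'fn' gen-*/test gen-*/scores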

Once you're satisfied, check in ../rules/50_scores.cf. Copy the "config" file back to "config.setN", where "N" is the current scoreset, and check that in too (an example of the check-in commands follows this list). Then add a comment to the rescoring bugzilla bug, noting:

  • the "gen-*/test" file contents, with FP%/FN% rate
  • the "gen-*" path for later reference

Next, carry on with the other steps from RescoreMassCheck (if that's what you're doing).