Versions Compared

Comment: Removing references to the long-obsolete GA

...

This "rebalancing" is done by the mass-check tests and genetic algorithm scoring run of SpamAssassin. The process takes about 4 weeks to complete for a new ruleset, and it heavily bogs down the computers of all the corpus submitters for days on end, particularly for the runs with network checks. The last time this was done, 1,690,967 messages were processed (across the 4 sets combined), the hits of each and every message (not just aggregate hit counts per rule) were fed into a GA (an error minimisation program), and the 4 scoresets were evolved.

And that's 4 weeks just for the mass-check and GA scoring process, which only starts after a set of rules has been judged good based on some smaller-scale dry runs, developer debate, testing, tweaking, and hand analysis of various bits of spam and nonspam.

While this makes the whole genetic algorithm, mass-check, and scoring process sound bad, those very aspects of SpamAssassin are what make it so effective in the first place. They are what provide the "real world" feedback into the whole system. The reality of email is a complex knotted mess of human behaviours, something not easily characterized by simple "anything with this word/phrase must be spam" rules. Using a system that analyzes real-world email and generates a set of scores that fit reality is very powerful. In some ways, it's a lot like crossing the best parts of human-generated filter rules with the statistical measurements of a Bayes system (and making it even more deeply analyzed than simple probabilities can represent).
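
To give a rough feel for what "generating scores that fit reality" means, here is a toy sketch in Python. It is not SpamAssassin's scoring tool and not a real genetic algorithm: it is a plain mutate-and-keep-if-no-worse loop, assumed here only for illustration. The rule names, the six-message corpus, and the 5.0 threshold are all made up for the example. What it does share with the process described above is that it works from the rule hits of each individual message, not from aggregate hit counts per rule.

```python
import random

THRESHOLD = 5.0  # assumed spam cutoff for this toy example


def total_score(hits, scores):
    """Sum the scores of every rule that hit one message."""
    return sum(scores[rule] for rule in hits)


def count_errors(corpus, scores):
    """Count messages that land on the wrong side of THRESHOLD."""
    errors = 0
    for hits, is_spam in corpus:
        flagged = total_score(hits, scores) >= THRESHOLD
        if flagged != is_spam:
            errors += 1
    return errors


def evolve_scores(corpus, rules, generations=2000, seed=0):
    """Nudge one rule score at a time, keeping changes that do not add errors."""
    rng = random.Random(seed)
    best = {rule: 1.0 for rule in rules}
    best_errors = count_errors(corpus, best)
    for _ in range(generations):
        candidate = dict(best)
        rule = rng.choice(rules)
        candidate[rule] += rng.uniform(-1.0, 1.0)
        errors = count_errors(corpus, candidate)
        if errors <= best_errors:
            best, best_errors = candidate, errors
    return best, best_errors


# Hypothetical hand-labelled corpus: (set of rules that hit, is_spam).
CORPUS = [
    ({"HYPOTHETICAL_PILLS", "HYPOTHETICAL_ALL_CAPS"}, True),
    ({"HYPOTHETICAL_PILLS"}, True),
    ({"HYPOTHETICAL_ALL_CAPS"}, False),
    ({"HYPOTHETICAL_MAILING_LIST"}, False),
    ({"HYPOTHETICAL_PILLS", "HYPOTHETICAL_MAILING_LIST"}, False),
    (set(), False),
]
RULES = sorted({rule for hits, _ in CORPUS for rule in hits})

if __name__ == "__main__":
    scores, errors = evolve_scores(CORPUS, RULES)
    print("misclassified messages:", errors)
    for rule in RULES:
        print(f"{rule:28s} {scores[rule]:6.2f}")
```

Note the fifth message: the same rule that marks two spams also hits a nonspam, so the only way to get everything right is for the mailing-list rule to pull the total back under the threshold. That kind of interaction between rules is exactly what aggregate per-rule hit counts cannot show, and why the per-message hits are what get fed in.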

...