Rules Project: Sandboxes

Initially, the rules project's sandbox area will consist of the existing (empty) rules directory in Subversion (SVN, the CVS replacement used by the ASF). Each committer will have their own sandbox in which to begin development in an unconstrained manner:

rules/sandbox/<username>/

Every person who has listed their rule set on the Apache SpamAssassin Wiki will be invited once the PMC approves the project. Some rule sets are only listed at SARE or Exit0, but those people are invited to join too, of course. There is absolutely no quality or experience requirement for the sandbox, although we may later provide some tools to make it easier to avoid name collisions and such.

It is expected that someone (don't know and don't care who) will eventually write scripts to test, filter, and pull rules automatically into the production rules. I am intentionally deferring decisions around that area, though.

What does providing a sandbox for everyone do?

easy to join (you just have to sign a CLA and get an @apache.org account)
no expectation of... well, much anything; no quality or experience requirement for the sandbox
easy for us to import rules (manually or automatically) into main rule body
easy to move forward with further development around automatic updates and all of the other (hard) ideas we've talked about, but I really want to keep this dirt simple.
ability to help direct future development of the rules project (as it extends beyond sandboxes, sandboxes will remain just sandboxes, of course).
can produce multiple "output rule sets" in the long run: conservative, aggressive, sub-areas: bounces, drug rules, etc.
uses SVN and therefore has version control

In other words, this solves the main part of our "rules problem" – the hurdle of getting rules "over the wall". No longer will we need individual bugs for rule submissions, or need to go to 3 different sites to look for rule ideas, etc. Many of our best rules have come from SARE and the Wiki.

Also, it's expected that many of the rules will never go into the main rules body – someone may write rules for a specific type of annoying mail (not even necessarily spam), or maybe someone will be focused on super-aggressive rules for the brave folks out there. We can even produce multiple "output rule sets" in the long run: conservative, aggressive, sub-areas: bounces, drug rules, etc.

Some notes picked out of followup discussion:

It is possible to keep rules 'private', and in your own checkout only, by not checking them into SVN.

If you do want to have the rules visible for collaboration, but not used for automatic mass-checks or promotion, that could be done by just keeping them in a file that doesn't end in ".cf". (SpamAssassin's standard is that they have to end in ".cf" to be considered valid rules files.)
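As a minimal sketch of the ".cf" convention just described: the helper names below (is_rules_file, split_sandbox) are hypothetical and are not part of SpamAssassin. They only illustrate how a mass-check or promotion script might separate active rules files from files that are parked for collaboration.

```python
def is_rules_file(filename: str) -> bool:
    """A file counts as a rules file only if its name ends in '.cf'."""
    return filename.endswith(".cf")


def split_sandbox(filenames):
    """Split sandbox file names into (active, parked).

    'active' files would be picked up by mass-checks and promotion;
    'parked' files stay visible in SVN but are otherwise ignored.
    """
    active = [f for f in filenames if is_rules_file(f)]
    parked = [f for f in filenames if not is_rules_file(f)]
    return active, parked


if __name__ == "__main__":
    active, parked = split_sandbox(["20_new.cf", "ideas.cf.wip", "notes.txt"])
    print("active:", active)
    print("parked:", parked)
```

So renaming ideas.cf to something like ideas.cf.wip would be enough to keep it out of the automated steps while leaving it checked in.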

Repository Organization

rules/core/ = standard rules directory
rules/sandbox/<username>/ = per-user sandboxes
rules/extra/<directory>/ = extra rule sets not in core

The proposal is for rules/core to become the rules directory for trunk (3.2 and later, via SVN externals which will make their inclusion seamless in the standard SA tree). The sandbox is discussed further in RulesProjMoreInput.

Extras/

We'll want to discuss the structure and process behind creating new extras directories further once we reach a critical mass of committers in the rules project, but here are some initial thoughts on typical 'extra' rulesets.

'Aggressive' rulesets, which are too likely to produce FPs for the base release
non-spam-oriented rules, such as the anti-virus-bounce ruleset
non-English-language rulesets (although see RulesNotEnglish)

Rule Promotion

Getting rules from the sandbox into the distribution:

each user gets their own sandbox as discussed on RulesProjSandboxes
checked-in rules in the sandboxes are mass-checked in the nightly mass-checks
migrating a rule from the "sandbox" (dev) to the "core" (production) ruleset uses C-T-R (commit-then-review); i.e. votes are not required in advance
also C-T-R to migrate from "sandbox" to "extra" ruleset

Rules that get promoted from a "sandbox" to "core" should pass the following criteria:

pass "--lint"!
S/O ratio of 0.95 or greater (or 0.05 or less for nice rules)
> 0.25% of target type hit (e.g. spam for non-nice rules)
< 1.00% of non-target type hit (e.g. ham for non-nice rules)

These numbers are really just ball-park figures and should be fine-tuned as we go. (DuncanFindlay)

We can automate those criteria pretty easily. We can also vote for rules that don't pass those criteria, but we think should be put into core for some reason.
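A rough sketch of how those criteria could be automated. The RuleStats record and function names are hypothetical, and the S/O formula used here (spam% / (spam% + ham%)) is an assumption about how the ratio is derived from mass-check percentages. The thresholds are the ball-park figures listed above.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    name: str
    spam_pct: float      # percentage of spam messages the rule hit
    ham_pct: float       # percentage of ham messages the rule hit
    lints: bool          # True if the rule passed "--lint"
    nice: bool = False   # True for "nice" (ham-indicating) rules


def s_o_ratio(stats: RuleStats) -> float:
    """Assumed S/O: spam% / (spam% + ham%); 0.0 if the rule hit nothing."""
    total = stats.spam_pct + stats.ham_pct
    return stats.spam_pct / total if total else 0.0


def passes_promotion(stats: RuleStats) -> bool:
    """Apply the lint, S/O, target-hit and non-target-hit criteria."""
    if not stats.lints:
        return False
    so = s_o_ratio(stats)
    if stats.nice:
        target, non_target = stats.ham_pct, stats.spam_pct
        so_ok = so <= 0.05
    else:
        target, non_target = stats.spam_pct, stats.ham_pct
        so_ok = so >= 0.95
    return so_ok and target > 0.25 and non_target < 1.00


if __name__ == "__main__":
    example = RuleStats("EXAMPLE_RULE", spam_pct=2.0, ham_pct=0.05, lints=True)
    print(example.name, passes_promotion(example))
```

Rules that fail this check could still be voted into core by hand, as noted above.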

Future criteria:

not too slow TODO: need an automated way to measure that
TODO: criteria for overlap with existing rules? see 'overlap criteria' below.

Getting There From Here

Moving files out of trunk into the new rules project

JustinMason: If we're going to start pulling rules from sandboxes into core/ in the above fashion, but we leave the current ruleset intact in the core as well, things will get messy. I propose we move the current core ruleset into a sandbox, called 'rules/sandbox/legacy/'. The good rules that pass the above selection criteria get promoted into the new 'core/', as any other rules from other sandboxes do; the old, stale rules (of which we have a few) will not get back into core.

DanielQuinlan: vetoed. Instead: code-tied rules stay with main tree in current rules directory, with the exception of 25_replace.cf which is really just another way to write body/header rules. Basically, the static stuff that is tied to code does not move to the rules project.

In more detail – files that DO NOT move to rules project:

      25_accessdb.cf    (plugins in core code)
      25_antivirus.cf
      25_dcc.cf
      25_domainkeys.cf
      25_hashcash.cf
      25_pyzor.cf
      25_razor2.cf
      25_spf.cf
      25_textcat.cf
      25_uribl.cf
      60_awl.cf
      60_whitelist_subject.cf
      20_dnsbl_tests.cf (eval tests in EvalTests.pm)
      20_html_tests.cf (rawbody ones can move to ROOT/rules/core/)
      20_net_tests.cf
      23_bayes.cf
      60_whitelist.cf
      init.pre          (Misc non-cf files)
      local.cf
      name-triplets.txt
      regression_tests.cf
      triplets.txt
      user_prefs.template
      v310.pre

Files that DO get moved:

   25_body_tests_es.cf -> ROOT/rules/lang/es/
   25_body_tests_pl.cf -> ROOT/rules/lang/pl/
   30_text_de.cf       -> ROOT/rules/lang/de/
   30_text_fr.cf       -> ROOT/rules/lang/fr/
   30_text_it.cf       -> ROOT/rules/lang/it/
   30_text_nl.cf       -> ROOT/rules/lang/nl/
   30_text_pl.cf       -> ROOT/rules/lang/pl/
   30_text_pt_br.cf    -> ROOT/rules/lang/pt_br/

   20_advance_fee.cf   -> ROOT/rules/core/
   20_drugs.cf         -> ROOT/rules/core/
   20_p**n.cf          -> ROOT/rules/core/    [wikicensorship!]

   10_misc.cf           -> ROOT/rules/core/
   20_anti_ratware.cf   -> ROOT/rules/core/
   20_body_tests.cf     -> ROOT/rules/core/
   20_compensate.cf     -> ROOT/rules/core/
   20_fake_helo_tests.cf -> ROOT/rules/core/
   20_head_tests.cf     -> ROOT/rules/core/
   20_meta_tests.cf     -> ROOT/rules/core/
   20_phrases.cf        -> ROOT/rules/core/
   20_ratware.cf        -> ROOT/rules/core/
   20_uri_tests.cf      -> ROOT/rules/core/
   25_replace.cf (odd case, but will change a lot) -> ROOT/rules/core/
   50_scores.cf         -> ROOT/rules/core/
   60_whitelist_spf.cf  -> ROOT/rules/core/

Files that get deleted: 20_anti_ratware.cf: it's empty.

JustinMason: ok, that looks good – except for one thing. We still have the problem that ROOT/rules/core/ is going to be a mix of legacy files and auto-promoted rules. What do we do about that problem?

Algorithm for auto-promotion

JustinMason: Aside from the criteria, we also need an idea of how the config file lines get from sandbox to core. Here's my proposal.

For each sandbox directory:

iterate through all files in the dir
if a config line refers to a rule name (e.g. "header", "describe", "tflags"), then:
- apply the criteria from 'Rule Promotion'. if the rule passes:
  - output the line
- else:
  - ignore the line and produce no output
if the config line doesn't refer to a rule name, output the line.
send that output to a file in ROOT/rules/core/ , named according to the sandbox directory's name. e.g. lines from all files matching ROOT/rules/sandbox/jmason/*.cf would be output to ROOT/rules/core/25_jmason.cf
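The proposal above could be sketched roughly as follows. Everything here is illustrative: the keyword list is a guess at which config directives name a rule, the passes callback stands in for the 'Rule Promotion' criteria, and the function names are hypothetical.

```python
from pathlib import Path

# Assumed (incomplete) set of directives whose second token is a rule name.
RULE_KEYWORDS = {"header", "body", "rawbody", "uri", "full", "meta",
                 "describe", "tflags", "score"}


def rule_name(line: str):
    """Return the rule name a config line refers to, or None."""
    tokens = line.split()
    if len(tokens) >= 2 and tokens[0] in RULE_KEYWORDS:
        return tokens[1]
    return None


def promote_lines(lines, passes):
    """Keep lines for passing rules and lines that name no rule at all."""
    out = []
    for line in lines:
        name = rule_name(line)
        if name is None or passes(name):
            out.append(line)
    return out


def core_filename(sandbox_name: str) -> str:
    """e.g. sandbox 'jmason' -> '25_jmason.cf'."""
    return f"25_{sandbox_name}.cf"


def promote_sandbox(sandbox_dir, core_dir, passes):
    """Filter every *.cf file in one sandbox into a single core file."""
    sandbox_dir = Path(sandbox_dir)
    lines = []
    for cf in sorted(sandbox_dir.glob("*.cf")):
        lines.extend(cf.read_text().splitlines())
    target = Path(core_dir) / core_filename(sandbox_dir.name)
    target.write_text("\n".join(promote_lines(lines, passes)) + "\n")
    return target
```

A real implementation would need a fuller notion of which lines "refer to a rule name" (for instance, language-specific describe lines), so treat the keyword set as a placeholder.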

The 'extra/' Set

A ruleset in the "extra" set would have different criteria; e.g.

the virus bounce ruleset
rules that positively identify spam from spamware, but hit <0.25% of spam
an "aggressive" rules set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without impacting significantly on ham

(ChrisSanterre: Seeing this breakdown of dirs gave me an idea. Why not set the "aggressiveness" of SA for updates? Like how SARE has ruleset0.cf (no ham hits), ruleset1.cf (few ham, high S/O), etc., with each "level" of rule set file getting slightly more aggressive, risking (though not necessarily seeing) slightly higher FP rates. Users could set some config like supdate=(1-4), with 1 being the most conservative, and 4 being the most aggressive (with the knowledge that more aggressive *could* possibly cause more FPs).

JustinMason: I think for now it's easiest to stick with the 'load aggressive rulesets by name' idea, rather than adding a new configuration variable. For example, aggressiveness is not the only criterion for what rulesets to use; we'd have to include config variables for "I want anti-viral-bounce rulesets", too.)

Overlap Criteria

BobMenschel: The method I used for weeding out SARE rules that overlapped 3.0.0 rules, was to run a full mass-check with overlap analysis, and throw away anything where the overlap is less than 50% (ie: keep only those rules which have "meaningful" overlap). Manually reviewing the remaining (significantly) overlapping rules was fairly easy. The command I use is:

      perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log |
        grep -v mid= |
        awk ' NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0 >= 0.500 { print } ' >../rules/tested/$testfile.overlap.out

DanielQuinlan: 'By "throw away", do you mean put into the bucket that is retained going forward or did you mean to say "greater than 50%"?'

BobMenschel: 'By "throw away anything where the overlap is less than 50%" I meant to discard (exclude from the final file) anything where the overlap was (IMO) insignificant. This would leave those overlaps where RULE_A hit all the emails that RULE_B also hit (100%), and RULE_B hit somewhere between 50% and 100% of the rules that RULE_A hit.'

JustinMason: Like Daniel, I'm confused here. As far as I can see, you want to keep the rules that do NOT have a high degree of overlap with other rules, and throw out the rules that do (because they're redundant). In other words, you want to throw away when the mutual overlap is greater than some high value (like 95% at a guess).
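To make the two readings concrete, here is a small sketch of one possible definition of overlap, working on sets of message IDs each rule hit. This is not how the overlap script itself works or what its output columns mean; the function names are hypothetical and the 0.95 default is just the guess mentioned above.

```python
def overlap_fraction(hits_a: set, hits_b: set) -> float:
    """Fraction of the messages hit by B that were also hit by A."""
    if not hits_b:
        return 0.0
    return len(hits_a & hits_b) / len(hits_b)


def mutual_overlap(hits_a: set, hits_b: set) -> float:
    """The smaller of the two directional overlaps."""
    return min(overlap_fraction(hits_a, hits_b),
               overlap_fraction(hits_b, hits_a))


def redundant_pairs(hits_by_rule: dict, threshold: float = 0.95):
    """List rule pairs whose mutual overlap is at or above the threshold."""
    names = sorted(hits_by_rule)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if mutual_overlap(hits_by_rule[a], hits_by_rule[b]) >= threshold:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    hits = {
        "RULE_A": {1, 2, 3, 4},
        "RULE_B": {1, 2, 3, 4},
        "RULE_C": {1, 9},
    }
    print(redundant_pairs(hits))
```

Under this definition, Bob's filter corresponds to one direction being 100% and the other at least 50%, while Justin's reading flags a pair as redundant only when both directions are high.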
