Rules Project: Promotion of Rules

from the sandboxes to the core ruleset, that is.

(this page split from RulesProjSandboxes, part of RulesProjectPlan)

JustinMason: note that SVN paths are listed as "ROOT/rules/trunk". This is the trunk; by having that, it allows branches of the rules project at e.g. "ROOT/rules/branches/vX.Y.Z", similarly to how the code SVN repo has trunk and branches. (As to what way exactly we'd branch, versions, etc. let's see how that develops in the future.)

Getting rules from the sandbox, into the distribution:

  • each user gets their own sandbox as discussed on RulesProjSandboxes
  • checked-in rules in the sandboxes are mass-checked in the nightly mass-checks
  • migrating a rule from a "sandbox" (dev) to the "core" (production) ruleset uses C-T-R (commit-then-review); i.e. votes are not required in advance
  • migrating from a "sandbox" to the "extra" ruleset is also C-T-R

Rules that get promoted from a "sandbox" to "core" should pass the following criteria:

  • pass "--lint"!
  • S/O ratio of 0.95 or greater (or 0.05 or less for nice rules)
  • > 0.25% of target type hit (e.g. spam for non-nice rules)
  • < 1.00% of non-target type hit (e.g. ham for non-nice rules)

These numbers are really just ball-park figures and should be fine-tuned as we go. (DuncanFindlay)

We can automate those criteria pretty easily. We can also vote for rules that don't pass those criteria, but we think should be put into core for some reason.
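
As a rough illustration of how the criteria above could be automated, here is a minimal Python sketch. It assumes S/O is computed as spam% / (spam% + ham%) from mass-check hit percentages; the names `RuleStats`, `so_ratio` and `passes_promotion` are hypothetical and not part of any existing tool, and the thresholds are the ball-park figures listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleStats:
    """Hypothetical per-rule mass-check summary (not an existing format)."""
    name: str
    spam_pct: float      # percentage of the spam corpus the rule hit
    ham_pct: float       # percentage of the ham corpus the rule hit
    nice: bool = False   # True for "nice" rules, whose target type is ham
    lints: bool = True   # True if the rule passed "--lint"


def so_ratio(stats: RuleStats) -> Optional[float]:
    """S/O ratio, assumed here to be spam% / (spam% + ham%)."""
    total = stats.spam_pct + stats.ham_pct
    if total <= 0:
        return None  # the rule never hit, so the ratio is undefined
    return stats.spam_pct / total


def passes_promotion(stats: RuleStats) -> bool:
    """Apply the ball-park promotion criteria to one rule."""
    if not stats.lints:
        return False
    so = so_ratio(stats)
    if so is None:
        return False
    if stats.nice:
        target, non_target = stats.ham_pct, stats.spam_pct
        so_ok = so <= 0.05
    else:
        target, non_target = stats.spam_pct, stats.ham_pct
        so_ok = so >= 0.95
    # > 0.25% of the target type hit, < 1.00% of the non-target type hit
    return so_ok and target > 0.25 and non_target < 1.00
```

Rules that fail this check could still be promoted by vote, as noted above; the sketch only covers the automatic path.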

Future criteria:

  • not too slow ;) TODO: need an automated way to measure that
  • TODO: criteria for overlap with existing rules? see 'overlap criteria' below.

Moving files out of trunk into the new rules project

JustinMason: If we're going to start pulling rules from sandboxes into core/ in the above fashion, but we leave the current ruleset intact in the core as well, things will get messy. I propose we move the current core ruleset into a sandbox called 'rules/sandbox/legacy/'. The good rules that pass the above selection criteria get promoted into the new 'core/' like rules from any other sandbox; the old, stale rules (of which we have a few) will not get back into core.

DanielQuinlan: vetoed. Instead: code-tied rules stay with main tree in current rules directory, with the exception of 25_replace.cf which is really just another way to write body/header rules. Basically, the static stuff that is tied to code does not move to the rules project.

In more detail – files that DO NOT move to rules project:

      25_accessdb.cf    (plugins in core code)
      25_antivirus.cf
      25_dcc.cf
      25_domainkeys.cf
      25_hashcash.cf
      25_pyzor.cf
      25_razor2.cf
      25_spf.cf
      25_textcat.cf
      25_uribl.cf
      60_awl.cf
      60_whitelist_subject.cf
      20_dnsbl_tests.cf (eval tests in EvalTests.pm)
      20_html_tests.cf (rawbody ones can move to ROOT/rules/trunk/core/)
      20_net_tests.cf
      23_bayes.cf
      60_whitelist.cf
      init.pre          (Misc non-cf files)
      local.cf
      name-triplets.txt
      regression_tests.cf
      triplets.txt
      user_prefs.template
      v310.pre

Files that DO get moved:

   25_body_tests_es.cf -> ROOT/rules/trunk/core/es/
   25_body_tests_pl.cf -> ROOT/rules/trunk/core/pl/
   30_text_de.cf       -> ROOT/rules/trunk/core/de/
   30_text_fr.cf       -> ROOT/rules/trunk/core/fr/
   30_text_it.cf       -> ROOT/rules/trunk/core/it/
   30_text_nl.cf       -> ROOT/rules/trunk/core/nl/
   30_text_pl.cf       -> ROOT/rules/trunk/core/pl/
   30_text_pt_br.cf    -> ROOT/rules/trunk/core/pt_br/

   20_advance_fee.cf   -> ROOT/rules/trunk/core/
   20_drugs.cf         -> ROOT/rules/trunk/core/
   20_p**n.cf          -> ROOT/rules/trunk/core/    [wikicensorship!]

   10_misc.cf           -> ROOT/rules/trunk/core/
   20_anti_ratware.cf   -> ROOT/rules/trunk/core/
   20_body_tests.cf     -> ROOT/rules/trunk/core/
   20_compensate.cf     -> ROOT/rules/trunk/core/
   20_fake_helo_tests.cf -> ROOT/rules/trunk/core/
   20_head_tests.cf     -> ROOT/rules/trunk/core/
   20_meta_tests.cf     -> ROOT/rules/trunk/core/
   20_phrases.cf        -> ROOT/rules/trunk/core/
   20_ratware.cf        -> ROOT/rules/trunk/core/
   20_uri_tests.cf      -> ROOT/rules/trunk/core/
   25_replace.cf        -> ROOT/rules/trunk/core/ [code dependent, but these will change a lot]
   50_scores.cf         -> ROOT/rules/trunk/core/
   60_whitelist_spf.cf  -> ROOT/rules/trunk/core/

Files that get deleted: 20_anti_ratware.cf: it's empty. [DONE]

JustinMason: ok, that looks good – except for one thing. We still have the problem that ROOT/rules/trunk/core/ is going to be a mix of legacy files and auto-promoted rules. What do we do about that problem?

DanielQuinlan: the auto-promoted .cf file should be 100% machine generated and overwritten each night (or whatever the period is). Once a rule is promoted into core, it'll disappear from the auto-promoted file because (a) the overlap test dictates so or (b) the non-core file that contained the rule will no longer contain it (or we could use a comment, rename the rule, etc. to indicate that it is no longer a candidate for auto-promotion if the author wants to keep it around).

JustinMason: update – here's the script that will be run to perform these renames: http://taint.org/xfer/2005/svnrenames

Algorithm for auto-promotion

JustinMason: Aside from the criteria, we also need an idea of how the config file lines get from sandbox to core. Here's my proposal.

For each sandbox directory:

  • iterate through all files in the dir
  • if a config line refers to a rule name (e.g. "header", "describe", "tflags"), then:
    • apply the validation criteria. if the rule passes:
      • output the line
    • else:
      • ignore the line and produce no output
  • if the config line doesn't refer to a rule name, output the line.
  • send that output to a file in ROOT/rules/trunk/core/ , named according to the sandbox directory's name. e.g. lines from all files matching ROOT/rules/trunk/sandbox/jmason/*.cf would be output to ROOT/rules/trunk/core/25_jmason.cf
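
The proposal above can be sketched in Python roughly as follows. This is only an illustration: `RULE_KEYWORDS` is an assumed, incomplete set of config commands that name a rule, the rule name is assumed to be the second whitespace-separated token, and `promote_lines`, `rule_name` and `core_filename` are hypothetical helper names. The `passes` callback stands in for whatever validation criteria end up being used.

```python
from typing import Callable, Iterable, List, Optional

# Assumed set of commands whose second token is a rule name (illustrative only).
RULE_KEYWORDS = {
    "header", "body", "rawbody", "uri", "full", "meta",
    "describe", "tflags", "score",
}


def rule_name(line: str) -> Optional[str]:
    """Return the rule name a config line refers to, or None if it has none."""
    tokens = line.split()
    if len(tokens) >= 2 and tokens[0] in RULE_KEYWORDS:
        return tokens[1]
    return None


def promote_lines(lines: Iterable[str], passes: Callable[[str], bool]) -> List[str]:
    """Keep lines for rules that pass validation, plus all non-rule lines."""
    out = []
    for line in lines:
        name = rule_name(line)
        if name is None or passes(name):
            out.append(line)
        # otherwise: the rule failed validation, so the line is dropped
    return out


def core_filename(sandbox_dir_name: str) -> str:
    """Output file name in core/, following the 25_jmason.cf example."""
    return f"25_{sandbox_dir_name}.cf"
```

For example, feeding it the lines of every ROOT/rules/trunk/sandbox/jmason/*.cf file and writing the result to `core_filename("jmason")` under ROOT/rules/trunk/core/ would reproduce the 25_jmason.cf case described above.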

DanielQuinlan: we'll need to work on the naming

The validation criteria

So, initially, I had this marked as "the criteria from Rule Promotion", above. However, that didn't make sense; one aim of having a 'compiler' for this stuff was to avoid "flapping", where rules would pass the criteria one day and fail the next, falling into and out of the distributable ruleset. This would happen using those criteria, as they're FP%/FN%-based.

On review, this isn't what we'd initially discussed on IRC, and didn't make sense; I'd oversimplified during transcribing.

Instead, the plan we'd agreed on was to compile the rules files from the source dir to the output dir, and select rules which were marked as "promoted" in their source files.

The mark in question is set through a build command in the source file, something like:

    publish 1

(suggestions welcome...)
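
To make the idea concrete, here is a minimal Python sketch of a compile step driven by that mark. The scope of the command is not settled above, so this assumes, purely for illustration, that "publish 1" applies to the whole source file it appears in, that the last "publish" line wins, and that build commands are stripped from the output. The function name `compile_file` is hypothetical.

```python
from typing import Iterable, List


def compile_file(lines: Iterable[str]) -> List[str]:
    """Return the lines to emit for one source file.

    Assumption: a "publish 1" line marks the whole file as promoted;
    files without it (or with any other "publish" value) emit nothing.
    """
    published = False
    body = []
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] == "publish":
            # Build command: record the flag, do not copy it to the output.
            published = len(tokens) == 2 and tokens[1] == "1"
            continue
        body.append(line)
    return body if published else []
```

Because promotion depends only on an explicit mark rather than on nightly hit rates, the output stays stable from one run to the next, which is the point of avoiding "flapping".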

other build commands include:

  • a command to select the name of the output file in the 'rules' output directory