...

  • easy to join (you just have to sign a CLA and get an @apache.org account)
  • no expectation of... well, much of anything; no quality or experience requirement for the sandbox
  • easy for us to import rules (manually or automatically) into main rule body
  • easy to move forward with further development around automatic updates and all of the other (hard) ideas we've talked about, but I really want to keep this dirt simple.
  • ability to help direct future development of the rules project as it extends beyond sandboxes (sandboxes will remain just sandboxes, of course).
  • can produce multiple "output rule sets" in the long run: conservative, aggressive, sub-areas: bounces, drug rules, etc.
  • uses SVN and therefore has version control

In other words, this solves the main part of our "rules problem" – the hurdle of getting rules "over the wall". No longer will we need individual bugs for rule submissions, or need to go to 3 different sites to look for rule ideas, etc. Many of our best rules have come from SARE and the Wiki.

...

  • rules/core/ = standard rules directory
  • rules/sandbox/<username>/ = per-user sandboxes
  • rules/extra/<directory>/ = extra rule sets not in core

The proposal is for rules/core/ to become the rules directory for trunk (3.2 and later, via SVN externals which will make their inclusion seamless in the standard SA tree). The sandbox is discussed further in RulesProjMoreInput.

Promotion of rules from sandbox to rules/core/ is discussed in RulesProjPromotion.

(Update: bug 5123 gets rid of rules/core – now that is simply part of the old "rules" directory to keep things simpler.)

Extras/

We'll want to discuss the structure and process behind creating new extras directories further once we reach a critical mass of committers in the rules project; but here are some initial thoughts on typical 'extra' rulesets.

  • 'Aggressive' rulesets, which are too likely to produce FPs for the base release
  • non-spam-oriented rules, such as the anti-virus-bounce ruleset
  • non-English-language rulesets (although see RulesNotEnglish)

Rule Promotion

Getting rules from the sandbox into the distribution:

  • each user gets their own sandbox as discussed on RulesProjSandboxes
  • checked-in rules in the sandboxes are mass-checked in the nightly mass-checks
  • migrating a rule from the "sandbox" (dev) to the "core" (production) ruleset uses C-T-R (commit-then-review); i.e. votes are not required in advance
  • migrating a rule from the "sandbox" to an "extra" ruleset is also C-T-R

Rules that get promoted from a "sandbox" to "core" should pass the following criteria:

  • pass "--lint"!
  • S/O ratio of 0.95 or greater (or 0.05 or less for nice rules)
  • > 0.25% of target type hit (e.g. spam for non-nice rules)
  • < 1.00% of non-target type hit (e.g. ham for non-nice rules)

These numbers are really just ball-park figures and should be fine-tuned as we go. (DuncanFindlay)

We can automate those criteria pretty easily. We can also vote for rules that don't pass those criteria, but we think should be put into core for some reason.
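
As a rough illustration of how the criteria above could be automated, here is a minimal Python sketch. The RuleStats record and the function names are hypothetical, S/O is assumed to mean spam% / (spam% + ham%), and the "--lint" result is assumed to come from a separate run; none of this is taken from the actual mass-check tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleStats:
    """Hypothetical per-rule summary of one nightly mass-check."""
    name: str
    spam_pct: float      # percentage of spam messages the rule hit
    ham_pct: float       # percentage of ham messages the rule hit
    nice: bool = False   # True for "nice" (ham-indicating) rules
    lints: bool = True   # outcome of a separate --lint run (assumed)


def s_o_ratio(stats: RuleStats) -> Optional[float]:
    """Assumed definition: spam% / (spam% + ham%); None if the rule never hit."""
    total = stats.spam_pct + stats.ham_pct
    if total == 0:
        return None
    return stats.spam_pct / total


def passes_promotion(stats: RuleStats) -> bool:
    """Apply the ball-park thresholds listed under 'Rule Promotion'."""
    if not stats.lints:
        return False
    so = s_o_ratio(stats)
    if so is None:
        return False
    if stats.nice:
        target, non_target = stats.ham_pct, stats.spam_pct
        so_ok = so <= 0.05
    else:
        target, non_target = stats.spam_pct, stats.ham_pct
        so_ok = so >= 0.95
    # > 0.25% of the target type, < 1.00% of the non-target type
    return so_ok and target > 0.25 and non_target < 1.00
```

Since the numbers are only ball-park figures, a real version would probably take the thresholds as parameters rather than hard-coding them.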

Future criteria:

  • not too slow (wink) TODO: need an automated way to measure that
  • TODO: criteria for overlap with existing rules? see 'overlap criteria' below.

...

Moving files out of trunk into the new rules project

...

  • rules

...

  • that

...

DanielQuinlan: vetoed. Instead: code-tied rules stay with main tree in current rules directory, with the exception of 25_replace.cf which is really just another way to write body/header rules. Basically, the static stuff that is tied to code does not move to the rules project.

In more detail – files that DO NOT move to rules project:

      25_accessdb.cf    (plugins in core code)
      25_antivirus.cf
      25_dcc.cf
      25_domainkeys.cf
      25_hashcash.cf
      25_pyzor.cf
      25_razor2.cf
      25_spf.cf
      25_textcat.cf
      25_uribl.cf
      60_awl.cf
      60_whitelist_subject.cf
      20_dnsbl_tests.cf (eval tests in EvalTests.pm)
      20_html_tests.cf (rawbody ones can move to ROOT/rules/core/)
      20_net_tests.cf
      23_bayes.cf
      60_whitelist.cf
      init.pre          (Misc non-cf files)
      local.cf
      name-triplets.txt
      regression_tests.cf
      triplets.txt
      user_prefs.template
      v310.pre

Files that DO get moved:

   25_body_tests_es.cf -> ROOT/rules/lang/es/
   25_body_tests_pl.cf -> ROOT/rules/lang/pl/
   30_text_de.cf       -> ROOT/rules/lang/de/
   30_text_fr.cf       -> ROOT/rules/lang/fr/
   30_text_it.cf       -> ROOT/rules/lang/it/
   30_text_nl.cf       -> ROOT/rules/lang/nl/
   30_text_pl.cf       -> ROOT/rules/lang/pl/
   30_text_pt_br.cf    -> ROOT/rules/lang/pt_br/

   20_advance_fee.cf   -> ROOT/rules/core/
   20_drugs.cf         -> ROOT/rules/core/
   20_p**n.cf          -> ROOT/rules/core/    [wikicensorship!]

   10_misc.cf           -> ROOT/rules/core/
   20_anti_ratware.cf   -> ROOT/rules/core/
   20_body_tests.cf     -> ROOT/rules/core/
   20_compensate.cf     -> ROOT/rules/core/
   20_fake_helo_tests.cf -> ROOT/rules/core/
   20_head_tests.cf     -> ROOT/rules/core/
   20_meta_tests.cf     -> ROOT/rules/core/
   20_phrases.cf        -> ROOT/rules/core/
   20_ratware.cf        -> ROOT/rules/core/
   20_uri_tests.cf      -> ROOT/rules/core/
   25_replace.cf (odd case, but will change a lot) -> ROOT/rules/core/
   50_scores.cf         -> ROOT/rules/core/
   60_whitelist_spf.cf  -> ROOT/rules/core/

Files that get deleted: 20_anti_ratware.cf (it's empty).

JustinMason: ok, that looks good – except for one thing. We still have the problem that ROOT/rules/core/ is going to be a mix of legacy files and auto-promoted rules. What do we do about that problem?

Algorithm for auto-promotion

JustinMason: Aside from the criteria, we also need an idea of how the config file lines get from sandbox to core. Here's my proposal.

For each sandbox directory:

  • iterate through all files in the dir
  • if a config line refers to a rule name (e.g. "header", "describe", "tflags"), then:
    • apply the criteria from 'Rule Promotion'. if the rule passes:
      • output the line
    • else:
      • ignore the line and produce no output
  • if the config line doesn't refer to a rule name, output the line.
  • send that output to a file in ROOT/rules/core/, named after the sandbox directory; e.g. lines from all files matching ROOT/rules/sandbox/jmason/*.cf would be output to ROOT/rules/core/25_jmason.cf
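
The steps above could be sketched roughly as follows in Python. This is only an illustration of the proposal, not an existing tool: the function names are made up, RULE_DIRECTIVES is a small illustrative subset of the directives that name a rule, and the passes callback stands in for whatever implements the 'Rule Promotion' criteria.

```python
# Directives whose second token is a rule name (illustrative subset only).
RULE_DIRECTIVES = {
    "header", "body", "rawbody", "uri", "full", "meta",
    "describe", "tflags", "score",
}


def rule_name(line):
    """Return the rule name a config line refers to, or None."""
    tokens = line.split()
    if len(tokens) >= 2 and tokens[0] in RULE_DIRECTIVES:
        return tokens[1]
    return None


def promote_lines(lines, passes):
    """Keep lines for rules that pass, plus every line that names no rule.

    `passes` is a callable taking a rule name and returning True or False.
    """
    output = []
    for line in lines:
        name = rule_name(line)
        if name is None or passes(name):
            output.append(line)
        # otherwise: ignore the line and produce no output
    return output


def core_filename(sandbox_user):
    """Name of the output file in ROOT/rules/core/ for one sandbox."""
    return f"25_{sandbox_user}.cf"
```

Note that comments and blank lines fall into the "doesn't refer to a rule name" case, so they are copied through unchanged.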

The 'extra/' Set

A ruleset in the "extra" set would have different criteria; e.g.

  • the virus bounce ruleset
  • rules that positively identify spam from spamware, but hit <0.25% of spam
  • an "aggressive" rules set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without impacting significantly on ham

(ChrisSanterre: Seeing this breakdown of dirs gave me an idea. Why not set the "aggressiveness" of SA for updates? This is like how SARE has ruleset0.cf (no ham hits), ruleset1.cf (few ham, high S/O), etc., with each "level" of ruleset file getting slightly more aggressive, risking (though not necessarily seeing) slightly higher FP rates. Users could set some config like supdate=(1-4), with 1 being the most conservative and 4 being the most aggressive (with the knowledge that more aggressive *could* possibly cause more FPs).

...

JustinMason: Like Daniel, I'm confused here. As far as I can see, you want to keep the rules that do NOT have a high degree of overlap with other rules, and throw out the rules that do (because they're redundant). In other words, you want to throw a rule away when its mutual overlap with another rule is greater than some high value (like 95%, at a guess).

Again, this is something we can handle further down the line.
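
For when we do get to it, one possible reading of the "mutual overlap" test is sketched below in Python. The definition of overlap used here (the fraction of one rule's hits that the other rule also hit) and the function names are assumptions made for illustration, and 0.95 is just the guess from the comment above.

```python
def overlap(hits_a, hits_b):
    """Fraction of rule A's hits that rule B also hit (assumed definition)."""
    if not hits_a:
        return 0.0
    return len(hits_a & hits_b) / len(hits_a)


def mutually_redundant(hits_a, hits_b, threshold=0.95):
    """True when each rule's hits are almost entirely covered by the other's."""
    return (overlap(hits_a, hits_b) > threshold
            and overlap(hits_b, hits_a) > threshold)
```

Here hits_a and hits_b are sets of message identifiers from the same mass-check run. Requiring the overlap in both directions means a narrow rule that is a subset of a broad one is not automatically flagged.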