Rules Project: Streamlining the rules process
(part of RulesProjectPlan)
Problem description: 'People who do write rules for their own use are not willing to go through the fairly elaborate process required to submit them to SpamAssassin (currently rules must go through Bugzilla, then through 70_testing.cf, and eventually into our distribution). What can we do to make this process easier and more inviting?'
First, the sandboxes idea greatly increases the number of people who can check rules into SVN. Second, the barriers to entry for getting a sandbox account are much lower.
Rule Promotion
Getting rules from the sandbox into the distribution:
- each user gets their own sandbox, as discussed on RulesProjMoreInput (see the example rule file after this list)
- checked-in rules in the sandboxes are mass-checked in the nightly mass-checks
- migrating a rule from the "sandbox" (dev) to the "core" (production) ruleset uses C-T-R (commit-then-review); i.e. votes are not required in advance
- migrating from "sandbox" to the "extra" ruleset is also C-T-R
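For illustration, a sandbox check-in might be nothing more than a small rules file like the one below. The path, rule name, and regex are made up for this sketch and do not refer to an existing rule:

    # rules/sandbox/jm/20_example.cf -- hypothetical sandbox rule file
    header   JM_SUBJ_FREE_OFFER  Subject =~ /free offer/i
    describe JM_SUBJ_FREE_OFFER  Subject line advertises a "free offer"
    score    JM_SUBJ_FREE_OFFER  0.5

The nightly mass-check would then report hit rates for JM_SUBJ_FREE_OFFER, which feed straight into the promotion criteria below.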
Rules that get promoted from a "sandbox" to "core" should pass the following criteria:
- pass "--lint"!
- S/O ratio of 0.95 or greater (or 0.05 or less for nice rules)
- hits > 0.25% of the target type (e.g. spam for non-nice rules)
- hits < 1.00% of the non-target type (e.g. ham for non-nice rules)
We can automate those criteria pretty easily. We can also vote on rules that don't pass them but that we think should be put into core for some reason.
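Here's a minimal sketch of such an automated check, in Perl. It assumes a hit-frequencies-style report with whitespace-separated columns OVERALL%, SPAM%, HAM%, S/O, RANK, SCORE, NAME; the column layout, the script name, and the "nice rules have S/O below 0.5" guess are all assumptions, so adjust to the real mass-check output:

    #!/usr/bin/perl -w
    # promotion-check.pl -- sketch of an automated promotion filter (hypothetical).
    # Reads frequency lines of the (assumed) form:
    #   OVERALL% SPAM% HAM% S/O RANK SCORE NAME
    # and prints the names of rules meeting the promotion thresholds.
    use strict;

    while (<>) {
        next if /^\s*#/ || /OVERALL%/;        # skip comments and the header line
        my ($overall, $spam, $ham, $so, $rank, $score, $name) = split;
        next unless defined $name;

        my $nice = ($so <= 0.5);              # assumption: low-S/O rules are "nice" rules
        my $ok   = ($nice ? $so <= 0.05 : $so >= 0.95)   # S/O criterion
                && ($nice ? $ham  : $spam) > 0.25        # > 0.25% of target type hit
                && ($nice ? $spam : $ham)  < 1.00;       # < 1.00% of non-target type hit

        print "$name\n" if $ok;
    }

Rules that fail the automated check could still be nominated for a vote, as noted above.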
Future criteria:
- not too slow (TODO: we need an automated way to measure this)
- TODO: criteria for overlap with existing rules? See 'Overlap Criteria' below.
Getting There From Here
If we're going to start pulling rules from sandboxes into core/ in the above fashion, but leave the current ruleset intact in core as well, things will get messy.
I propose we move the current core ruleset into a sandbox called 'rules/sandbox/legacy/'. The good rules that pass the above selection criteria get promoted into the new 'core/' just as rules from any other sandbox do; the old, stale rules (of which we have a few) will not make it back into core.
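Mechanically, the move might look something like the sketch below, run in a checked-out working copy. It assumes the current core rules are the *.cf files directly under rules/ and that rules/sandbox/ already exists; the exact paths and the commit message are illustrative only:

    cd rules
    svn mkdir sandbox/legacy
    for f in *.cf ; do svn mv "$f" sandbox/legacy/ ; done
    svn commit -m "move current core ruleset into rules/sandbox/legacy/"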
The 'extra/' Set
A ruleset in the "extra" set would have different criteria; e.g.
- the virus bounce ruleset
- rules that positively identify spam from spamware, but hit <0.25% of spam
- an "aggressive" rules set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without impacting significantly on ham
(ChrisSanterre: Seeing this breakdown of dirs gave me an idea. Why not set the "aggressiveness" of SA for updates? Like how SARE has ruleset0.cf (no ham hits), ruleset1.cf (few ham, high S/O), etc., with each "level" of ruleset file getting slightly more aggressive, risking (though not necessarily seeing) slightly higher FP rates. Users could set some config like supdate=(1-4), with 1 being the most conservative and 4 being the most aggressive (with the knowledge that more aggressive *could* possibly cause more FPs).
JustinMason: I think for now it's easiest to stick with the 'load aggressive rulesets by name' idea, rather than adding a new configuration variable. For one thing, aggressiveness is not the only criterion for which rulesets to use; we'd have to include config variables for "I want anti-viral-bounce rulesets", too.)
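As an illustration of the 'load by name' approach, a site could opt in to optional rulesets with include lines in its local.cf. The filenames below are hypothetical, and this assumes the extra ruleset files are installed where those include paths can find them:

    # local.cf -- opt in to optional rulesets by name (hypothetical filenames)
    include extra/70_virus_bounce.cf
    include extra/72_aggressive.cf

Naming the rulesets keeps each choice explicit (anti-viral-bounce vs. aggressive) rather than collapsing everything into a single numeric aggressiveness knob.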
Overlap Criteria
BobMenschel: The method I used for weeding out SARE rules that overlapped 3.0.0 rules was to run a full mass-check with overlap analysis and throw away anything where the overlap is less than 50% (i.e. keep only those rules which have "meaningful" overlap). Manually reviewing the remaining (significantly) overlapping rules was fairly easy. The command I use is:

    perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log \
      | grep -v mid= \
      | awk ' NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0 >= 0.500 { print } ' \
      > ../rules/tested/$testfile.overlap.out
DanielQuinlan: 'By "throw away", do you mean put into the bucket that is retained going forward or did you mean to say "greater than 50%"?'
BobMenschel: 'By "throw away anything where the overlap is less than 50%" I meant to discard (exclude from the final file) anything where the overlap was (IMO) insignificant. This would leave those overlaps where RULE_A hit all the emails that RULE_B also hit (100%), and RULE_B hit somewhere between 50% and 100%
of the rules that RULE_A hit.'
JustinMason: Like Daniel, I'm confused here. As far as I can see, you want to keep the rules that do NOT have a high degree of overlap with other rules, and throw out the rules that do (because they're redundant). In other words, you want to throw away rules where the mutual overlap is greater than some high value (like 95%, at a guess).
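For comparison, that inverse selection can be sketched by reusing Bob's pipeline and changing only the awk condition; the 0.95 cutoff is just the guess above, the field meanings are as in Bob's command, and the output filename is made up. The lines this keeps are the mutually-overlapping, probably-redundant pairs:

    perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log \
      | grep -v mid= \
      | awk ' NR == 1 { print } ; $2 + 0 >= 0.950 && $3 + 0 >= 0.950 { print } ' \
      > ../rules/tested/$testfile.redundant.out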