Rules Project: Streamlining the rules process
(part of RulesProjectPlan)
Problem description: 'People who do write rules for their own use are not willing to go through the fairly elaborate process required to submit them to SpamAssassin (currently rules must go through Bugzilla, then through 70_testing.cf, and eventually into our distribution). What can we do to make this process easier and more inviting?'
First, the sandboxes idea greatly increases the number of people who can check rules into SVN. Second, the barriers to entry for getting a sandbox account are much lower.
Rule Promotion
Getting rules from the sandbox into the distribution:
- each user gets their own sandbox, as discussed on RulesProjMoreInput (see the example rule file after this list)
- checked-in rules in the sandboxes are mass-checked in the nightly mass-checks
- migrating a rule from the "sandbox" (dev) to the "core" (production) ruleset uses C-T-R (commit-then-review); i.e. votes are not required in advance
- migrating from "sandbox" to the "extra" ruleset is also C-T-R
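For illustration, a sandbox check-in might be nothing more than a small rules file like the one below. The path, rule name, and regex are made up for this sketch and do not refer to an existing rule:

    # rules/sandbox/jm/20_example.cf -- hypothetical sandbox rule file
    header   JM_SUBJ_FREE_OFFER  Subject =~ /free offer/i
    describe JM_SUBJ_FREE_OFFER  Subject line advertises a "free offer"
    score    JM_SUBJ_FREE_OFFER  0.5

The nightly mass-check would then report hit rates for JM_SUBJ_FREE_OFFER, which feed straight into the promotion criteria below.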
Rules that get promoted from a "sandbox" to "core" should pass the following criteria:
- pass "--lint"!
- S/O ratio of 0.95 or greater (or 0.05 or less for nice rules)
- hits > 0.25% of the target type (e.g. spam for non-nice rules)
- hits < 1.00% of the non-target type (e.g. ham for non-nice rules)
We can automate those criteria pretty easily. We can also vote on rules that don't pass them but that we think should be put into core for some reason.
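Here's a minimal sketch of such an automated check, in Perl. It assumes a hit-frequencies-style report with whitespace-separated columns OVERALL%, SPAM%, HAM%, S/O, RANK, SCORE, NAME; the column layout, the script name, and the "nice rules have S/O below 0.5" guess are all assumptions, so adjust to the real mass-check output:

    #!/usr/bin/perl -w
    # promotion-check.pl -- sketch of an automated promotion filter (hypothetical).
    # Reads frequency lines of the (assumed) form:
    #   OVERALL% SPAM% HAM% S/O RANK SCORE NAME
    # and prints the names of rules meeting the promotion thresholds.
    use strict;

    while (<>) {
        next if /^\s*#/ || /OVERALL%/;        # skip comments and the header line
        my ($overall, $spam, $ham, $so, $rank, $score, $name) = split;
        next unless defined $name;

        my $nice = ($so <= 0.5);              # assumption: low-S/O rules are "nice" rules
        my $ok   = ($nice ? $so <= 0.05 : $so >= 0.95)   # S/O criterion
                && ($nice ? $ham  : $spam) > 0.25        # > 0.25% of target type hit
                && ($nice ? $spam : $ham)  < 1.00;       # < 1.00% of non-target type hit

        print "$name\n" if $ok;
    }

Rules that fail the automated check could still be nominated for a vote, as noted above.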
Future criteria:
- not too slow (TODO: we need an automated way to measure this)
- TODO: criteria for overlap with existing rules? See 'Overlap Criteria' below.
Getting There From Here
If we're going to start pulling rules from sandboxes into core/ in the above fashion, but leave the current ruleset intact in core as well, things will get messy.
I propose we move the current core ruleset into a sandbox called 'rules/sandbox/legacy/'. The good rules that pass the above selection criteria get promoted into the new 'core/' just as rules from any other sandbox do; the old, stale rules (of which we have a few) will not make it back into core.
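Mechanically, the move might look something like the sketch below, run in a checked-out working copy. It assumes the current core rules are the *.cf files directly under rules/ and that rules/sandbox/ already exists; the exact paths and the commit message are illustrative only:

    cd rules
    svn mkdir sandbox/legacy
    for f in *.cf ; do svn mv "$f" sandbox/legacy/ ; done
    svn commit -m "move current core ruleset into rules/sandbox/legacy/"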
The 'extra/' Set
A ruleset in the "extra" set would have different criteria; e.g.
- the virus bounce ruleset
- rules that positively identify spam from spamware, but hit <0.25% of spam
- an "aggressive" rules set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without impacting significantly on ham
(ChrisSanterre: Seeing this breakdown of dirs gave me an idea. Why not set the "aggressiveness" of SA for updates? Like how SARE has ruleset0.cf (no ham hits), ruleset1.cf (few ham, high S/O), etc., with each "level" of ruleset file getting slightly more aggressive, risking (though not necessarily seeing) slightly higher FP rates. Users could set some config like supdate=(1-4), with 1 being the most conservative and 4 being the most aggressive (with the knowledge that more aggressive *could* possibly cause more FPs).
JustinMason: I think for now it's easiest to stick with the 'load aggressive rulesets by name' idea, rather than adding a new configuration variable. For one thing, aggressiveness is not the only criterion for which rulesets to use; we'd have to include config variables for "I want anti-viral-bounce rulesets", too.)
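As an illustration of the 'load by name' approach, a site could opt in to optional rulesets with include lines in its local.cf. The filenames below are hypothetical, and this assumes the extra ruleset files are installed where those include paths can find them:

    # local.cf -- opt in to optional rulesets by name (hypothetical filenames)
    include extra/70_virus_bounce.cf
    include extra/72_aggressive.cf

Naming the rulesets keeps each choice explicit (anti-viral-bounce vs. aggressive) rather than collapsing everything into a single numeric aggressiveness knob.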
Overlap Criteria
BobMenschel: The method I used for weeding out SARE rules that overlapped 3.0.0 rules was to run a full mass-check with overlap analysis and throw away anything where the overlap is less than 50% (i.e. keep only those rules which have "meaningful" overlap). Manually reviewing the remaining (significantly) overlapping rules was fairly easy. The command I use is:

    perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log \
      | grep -v mid= \
      | awk ' NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0 >= 0.500 { print } ' \
      > ../rules/tested/$testfile.overlap.out
DanielQuinlan: 'By "throw away", do you mean put into the bucket that is retained going forward or did you mean to say "greater than 50%"?'
BobMenschel: 'By "throw away anything where the overlap is less than 50%" I meant to discard (exclude from the final file) anything where the overlap was (IMO) insignificant. This would leave those overlaps where RULE_A hit all the emails that RULE_B also hit (100%), and RULE_B hit somewhere between 50% and 100%
of the rules that RULE_A hit.'
JustinMason: Like Daniel, I'm confused here. As far as I can see, you want to keep the rules that do NOT have a high degree of overlap with other rules, and throw out the rules that do (because they're redundant). In other words, you want to throw away rules where the mutual overlap is greater than some high value (like 95%, at a guess).
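For comparison, that inverse selection can be sketched by reusing Bob's pipeline and changing only the awk condition; the 0.95 cutoff is just the guess above, the field meanings are as in Bob's command, and the output filename is made up. The lines this keeps are the mutually-overlapping, probably-redundant pairs:

    perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log \
      | grep -v mid= \
      | awk ' NR == 1 { print } ; $2 + 0 >= 0.950 && $3 + 0 >= 0.950 { print } ' \
      > ../rules/tested/$testfile.redundant.out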