The SpamAssassin Challenge

(THIS IS A DRAFT; see [http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376 bug 5376] for discussion)

The [http://www.netflixprize.com/ Netflix Prize] is a machine-learning challenge from Netflix which 'seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences.'

We in SpamAssassin have similar problems; maybe we can solve them in a similar way. We have:

...

Input: the test data: mass-check logs

We will take the SpamAssassin 3.2.0 mass-check logs and split them into training and test sets; a 90% training / 10% test split is traditional. Any cleanups that we had to do during [http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5270 bug 5270] will be re-applied.

The test set is saved, and not published.
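
As a rough illustration of the 90/10 split described above, here is a minimal Python sketch. It is not the project's actual tooling: the function name split_log_lines is hypothetical, each non-blank log line is treated as one opaque message entry, and lines starting with "#" are assumed to be comments and skipped. A fixed seed makes the split reproducible, so the held-back test set stays the same between runs.

```python
import random


def split_log_lines(lines, test_fraction=0.1, seed=0):
    """Split log entries into (training, test) lists.

    Assumption: one message per line; blank lines and lines starting
    with '#' are not message entries and are dropped.
    """
    entries = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    rng = random.Random(seed)  # fixed seed gives a reproducible shuffle
    rng.shuffle(entries)
    n_test = round(len(entries) * test_fraction)
    # The first n_test shuffled entries form the test set, the rest training.
    return entries[n_test:], entries[:n_test]


if __name__ == "__main__":
    # Placeholder entries, not real mass-check log lines.
    demo = ["# header comment"] + [f"entry-{i}" for i in range(20)]
    training, test = split_log_lines(demo)
    print(len(training), "training entries,", len(test), "test entries")
```

In practice the ham and spam logs would probably be split separately so that both sets keep the same ham/spam ratio; that choice is left open here.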

...

Listing the rule name, and score, one per line, for each mutable rule. We can then use the "masses/rewrite-cf-with-new-scores" script to insert those scores into our own scores files, and test FP% / FN% rates with our own test set of logs.
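
To make the evaluation step concrete, the following Python sketch reads a "rule name and score, one per line" listing and computes FP% and FN% for a set of messages. Several things here are assumptions rather than part of the challenge definition: the whitespace-separated "NAME score" layout, the function names parse_scores and error_rates, the placeholder rule names RULE_A and RULE_B, and the use of 5.0 as the spam threshold (SpamAssassin's default required_score). It does not replace "masses/rewrite-cf-with-new-scores"; it only shows the arithmetic.

```python
def parse_scores(text):
    """Parse 'RULE_NAME score' lines into a dict (assumed layout)."""
    scores = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, value = line.split()
        scores[name] = float(value)
    return scores


def error_rates(scores, ham_hits, spam_hits, threshold=5.0):
    """Return (FP%, FN%) for lists of per-message rule hits.

    ham_hits and spam_hits are non-empty lists; each item is the list
    of rule names that fired on one message. Rules without a score
    count as 0.
    """
    if not ham_hits or not spam_hits:
        raise ValueError("need at least one ham and one spam message")

    def total(hits):
        return sum(scores.get(rule, 0.0) for rule in hits)

    # False positive: ham message scoring at or above the threshold.
    fp = sum(1 for hits in ham_hits if total(hits) >= threshold)
    # False negative: spam message scoring below the threshold.
    fn = sum(1 for hits in spam_hits if total(hits) < threshold)
    return 100.0 * fp / len(ham_hits), 100.0 * fn / len(spam_hits)


if __name__ == "__main__":
    scores = parse_scores("RULE_A 3.0\nRULE_B 2.5\n")
    ham = [["RULE_A"], ["RULE_A", "RULE_B"]]
    spam = [["RULE_A", "RULE_B"], ["RULE_B"]]
    fp_pct, fn_pct = error_rates(scores, ham, spam)
    print(f"FP% = {fp_pct:.1f}, FN% = {fn_pct:.1f}")
```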

Runtime Limits

The code needs to be "fire and forget" automated; hand-tweaking of settings must not be necessary. It must be possible to just give it the "rules_N.pl", let it gronk, and get the scores output.

Evaluation Criteria

TODO, still talking about this

Licensing

The code produced would need to be usable without any sort of patent license, and available under the Apache License, Version 2.0 (i.e. suitable for inclusion in the Apache SpamAssassin source tree).