The SpamAssassin Challenge
(THIS IS A DRAFT; see [http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376 bug 5376] for discussion)
The [http://www.netflixprize.com/ Netflix Prize] is a machine-learning challenge from Netflix which 'seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences.'
We in SpamAssassin have similar problems; maybe we can solve them in a similar way. We have:
...
Input: the test data: mass-check logs
We will take the [SpamAssassin] 3.2.0 mass-check logs and split them into training and test sets; a 90% training / 10% test split is traditional. Any cleanups that we had to do during [http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5270 bug 5270] are re-applied.
The test set is saved, and not published.
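The 90/10 split described above could be sketched roughly as follows. This is only an illustration, not the project's tooling: the function name `split_logs`, the fixed seed, and the assumption that lines beginning with `#` are comments rather than message results are all hypothetical.

```python
import random


def split_logs(lines, test_fraction=0.1, seed=0):
    """Split mass-check log lines into (train, test) lists.

    Assumption: blank lines and lines starting with '#' are not
    message results, so they are dropped before splitting.
    """
    entries = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(entries)
    n_test = round(len(entries) * test_fraction)
    # First n_test shuffled entries become the held-back test set.
    return entries[n_test:], entries[:n_test]
```

Shuffling before cutting avoids putting, say, one submitter's whole corpus into the test set; whether a purely random split is the right choice (as opposed to splitting per corpus or by date) is left open here.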
...
Listing the rule name, and score, one per line, for each mutable rule. We can then use the "masses/rewrite-cf-with-new-scores" script to insert those scores into our own scores files, and test FP% / FN% rates with our own test set of logs.
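As a rough illustration of that evaluation step, the sketch below reads a "rule name, score" listing and computes FP% and FN% over a set of messages. Several things here are assumptions rather than anything specified above: the whitespace separator between name and score, the `(is_spam, hits)` message representation, the 5.0 threshold, and the choice to express FP% relative to ham and FN% relative to spam. The helper names `parse_scores` and `fp_fn_rates` are hypothetical.

```python
def parse_scores(text):
    """Parse lines of the form 'RULE_NAME score' into a dict.

    Assumption: name and score are separated by whitespace.
    """
    scores = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.split()
        scores[name] = float(value)
    return scores


def fp_fn_rates(messages, scores, threshold=5.0):
    """Return (FP%, FN%) for an iterable of (is_spam, rule_hits) pairs.

    A message is flagged as spam when the summed scores of the rules
    it hit reach the threshold. Rules without a score count as 0.
    """
    ham = spam = fp = fn = 0
    for is_spam, hits in messages:
        total = sum(scores.get(rule, 0.0) for rule in hits)
        flagged = total >= threshold
        if is_spam:
            spam += 1
            if not flagged:
                fn += 1  # spam that slipped through
        else:
            ham += 1
            if flagged:
                fp += 1  # ham wrongly flagged
    fp_pct = 100.0 * fp / ham if ham else 0.0
    fn_pct = 100.0 * fn / spam if spam else 0.0
    return fp_pct, fn_pct
```

For example, `fp_fn_rates([(True, ["RULE_A"]), (False, ["RULE_B"])], parse_scores("RULE_A 6.0\nRULE_B 1.0"))` flags the spam message and passes the ham message, giving zero for both rates.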
Runtime Limits
The code needs to be "fire and forget" automated; hand-tweaking of settings must not be necessary. It must be possible to just give it the "rules_N.pl", let it gronk, and get the scores output.
Evaluation Criteria
TODO, still talking about this
Licensing
The code produced would need to be usable without any sort of patent license, and available under the Apache License 2.0 (i.e. suitable for inclusion in the Apache SpamAssassin source tree).