...
A smallish number of people (about 157), including some of the developers themselves, work as volunteer "corpus submitters". They hand-classify their mail and then run mass-check over it. They submit the output logs mass-check generates. Occasionally people review the submitted logs for obvious mistakes, but it is largely a trust system.
...
    ssh spamassassin.zones.apache.org
    cd /home/jm/ftp/spamassassin/masses    # or wherever
    ./log-grep-recent -m 38 /home/corpus-rsync/corpus/submit/ham-*.log > ham-full.log
    ./log-grep-recent -m 6 /home/corpus-rsync/corpus/submit/spam-*.log > spam-full.log

We may have to tweak the number of months specified for each type if the grep yields too much or too little mail, but 38 months of ham and 6 months of spam worked well for 3.2.0.
(TODO: should we do some sanity checks here? corrupt-message rules like MISSING_HB_SEP for example?)
4.2 tweak rules for perceptron
TODO: describe. This consists of removing sandbox rules, then going through the rulesrc dir and commenting out all "score" lines except those for rules whose scores you think are accurate (such as carefully vetted net rules or 0.001 informational rules), and finally grepping for bad rules.
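Until this step is written up properly, the "comment out score lines" part could be sketched roughly as below. This is only an illustration, not an existing script in masses/: the function name `comment_out_scores` and the keep-list file (one rule name per line, listing the rules whose scores you trust) are assumptions made for the example. It treats a rule as informational when every score on its line is exactly 0.001.

```shell
# Sketch only: comment_out_scores and the keep-list file are hypothetical,
# not part of the SpamAssassin tree.
# Usage: comment_out_scores KEEP_LIST RULES_FILE   (result goes to stdout)
comment_out_scores() {
  awk -v keepfile="$1" '
    BEGIN {
      # Load the keep-list: one trusted rule name per line.
      while ((getline name < keepfile) > 0) keep[name] = 1
      close(keepfile)
    }
    # True when every score on the line (up to a trailing comment) is 0.001.
    function informational(   i, n) {
      n = 0
      for (i = 3; i <= NF && $i !~ /^#/; i++) {
        if ($i != "0.001") return 0
        n++
      }
      return n > 0
    }
    # Comment out score lines that are neither trusted nor informational.
    $1 == "score" && !($2 in keep) && !informational() { print "#" $0; next }
    # Everything else passes through unchanged.
    { print }
  ' "$2"
}
```

Because the function writes to stdout rather than editing in place, the result can be diffed against the original file before anything is overwritten.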
5. generate scores for score sets
...