Comment: [Original edit by JustinMason] update with current work

...

A smallish number of people (about 157), including some of the developers themselves, work as volunteer "corpus submitters". They hand-classify their mail and then run mass-check over it. They submit the output logs mass-check generates. Occasionally people review the submitted logs for obvious mistakes, but it is largely a trust system.

...

ssh spamassassin.zones.apache.org
cd /home/jm/ftp/spamassassin/masses    [or wherever]

./log-grep-recent -m 38 /home/corpus-rsync/corpus/submit/ham-*.log > ham-full.log

./log-grep-recent -m 6 /home/corpus-rsync/corpus/submit/spam-*.log > spam-full.log

We may have to tweak the number of months specified for each type if the grep yields too much or too little mail, but 38 months for ham and 6 months for spam worked well for 3.2.0.
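One rough way to judge whether the month counts need tweaking is to count the messages in each merged log. The following is a minimal sketch, assuming that header and comment lines in mass-check logs start with "#" and that every other line describes one message. count_messages is a hypothetical helper name, not part of the masses tools.

```shell
# Hypothetical helper: count message lines in a mass-check log.
# Assumption: lines starting with '#' are headers/comments,
# and every other line is one scanned message.
count_messages() {
    grep -vc '^#' "$1"
}
```

Running count_messages ham-full.log and count_messages spam-full.log then gives the two totals to compare before deciding whether to change the -m values.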

(TODO: should we do some sanity checks here? corrupt-message rules like MISSING_HB_SEP for example?)
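If such a sanity check were added, it might start by counting how many logged messages hit a given corrupt-message rule. This is only a sketch under assumptions: it assumes the rule names of each message appear as a comma-separated list on its log line, and count_rule_hits is a hypothetical helper name.

```shell
# Hypothetical helper: count log lines on which a given rule name appears.
# $1 = rule name, $2 = log file.
# Assumption: rule names are separated by commas or spaces on each line,
# and lines starting with '#' are headers/comments.
count_rule_hits() {
    grep -v '^#' "$2" | grep -Ec "(^|[ ,])$1([ ,]|\$)"
}
```

For example, count_rule_hits MISSING_HB_SEP ham-full.log would report how many ham messages in the merged log hit that rule; a large count could point at a corrupt corpus.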

4.2 tweak rules for perceptron

TODO: describe. This consists of removing sandbox rules, then going through the rulesrc dir and commenting out all "score" lines except those for rules whose scores you think are accurate (such as carefully-vetted net rules or 0.001 informational rules), and grepping for bad rules.
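The commenting-out step could be sketched as a filter like the one below. It is an illustration under assumptions, not the procedure actually used: comment_out_scores and the keep-list file are hypothetical, and it assumes "score" lines have the form "score RULE_NAME value [value ...]" with no trailing comment.

```shell
# Hypothetical filter: read a rules file on stdin, write it to stdout with
# "score" lines commented out, except when
#   - the rule name is listed in the keep-list file ($1, one name per line), or
#   - every score value on the line is 0.001 (informational rules).
comment_out_scores() {
    awk -v keepfile="$1" '
        function informational(   i) {
            for (i = 3; i <= NF; i++) if ($i != "0.001") return 0
            return NF >= 3
        }
        BEGIN { while ((getline name < keepfile) > 0) keep[name] = 1 }
        $1 == "score" && !($2 in keep) && !informational() { print "#" $0; next }
        { print }
    '
}
```

It would be run once per file, for example comment_out_scores keep.txt < some.cf > some.cf.new, where keep.txt and some.cf are placeholder names, and the result reviewed by hand before replacing the original.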

5. generate scores for score sets

...