Versions Compared

Comment: [Original edit by JustinMason] add age-grep step

...

We then take the log files rsync'd up to the server and use those logs for all 4 score sets. The initial logs are for score set 3 (the fourth); sets 0, 1, and 2 can be generated from set 3 by stripping out the network tests and/or the Bayes tests.

4.1. filter out too-old logs

ssh spamassassin.zones.apache.org
cd /home/jm/ftp/spamassassin/masses    # or wherever

./log-grep-recent -m 18 /home/corpus-rsync/corpus/submit/ham-*.log > ham.log

./log-grep-recent -m 6 /home/corpus-rsync/corpus/submit/spam-*.log > spam.log

We may have to tweak the number of months specified for each type if the grep yields too much or too little mail, but 18 months for ham / 6 months for spam seems a good start.
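One quick way to judge whether the month limits need tweaking is to count the results left in each filtered log. This is only a minimal sketch: it assumes the logs hold one message result per line and that comment lines start with '#'. count_results is a hypothetical helper name, not part of the masses tools.

```shell
# Count result lines in a mass-check log, skipping comment and blank lines.
# Assumption: comment lines start with '#', and every other non-blank
# line is the result for one message.
count_results() {
    grep -vc -e '^#' -e '^[[:space:]]*$' "$1"
}

# Report the counts for the two logs produced by the grep step, if present.
for f in ham.log spam.log; do
    if [ -f "$f" ]; then
        printf '%s: %s messages\n' "$f" "$(count_results "$f")"
    fi
done
```

If one of the counts looks far too large or too small, rerun log-grep-recent with a different -m value for that type and count again.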

(TODO: should we do some sanity checks here? Corrupt-message rules like MISSING_HB_SEP, for example?)

(TODO: add a filtering step to remove "too-old" spam from the logs!)

5. generate scores for score sets

...