...
We then take the log files rsync'd up to the server and use them for all four score sets. The initial logs are for score set 3 (the fourth set); sets 0, 1, and 2 can be generated from set 3 by stripping out the network tests and/or the Bayes tests.
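As a rough illustration of what "stripping out" tests from a log could look like, here is a minimal sketch that removes Bayes hits from a set-3 log. It is not the project's own tooling. It assumes each result line has the form `result score path tests extras`, with the hit rules as a comma-separated list in the fourth field, and that Bayes rules are exactly those named `BAYES_*`. The function name `strip_bayes` is made up for this example. Network tests cannot be picked out by name in the same way, so they are not covered here.

```shell
# Hypothetical helper: drop BAYES_* hits from the tests field of a log.
# Assumption: field 4 is the comma-separated list of rules that hit.
# Caveats: the score in field 2 is left untouched, runs of spaces are
# collapsed to one, and a message that hit only BAYES_* rules is left
# with an empty tests field.
strip_bayes() {
    awk '
        /^#/ { print; next }            # pass comment lines through
        {
            n = split($4, tests, ",")
            out = ""
            for (i = 1; i <= n; i++) {
                if (tests[i] !~ /^BAYES_/) {
                    out = (out == "" ? tests[i] : out "," tests[i])
                }
            }
            $4 = out
            print
        }
    ' "$1"
}
```

It would be used as `strip_bayes spam-set3.log > spam-set1.log`, where both file names are placeholders.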
4.1. filter out too-old logs
No Format |
---|
ssh spamassassin.zones.apache.org
cd /home/jm/ftp/spamassassin/masses   # or wherever
./log-grep-recent -m 18 /home/corpus-rsync/corpus/submit/ham-*.log > ham.log
./log-grep-recent -m 6 /home/corpus-rsync/corpus/submit/spam-*.log > spam.log
|
We may have to tweak the number of months specified for each type if the grep yields too much or too little mail, but 18 months for ham and 6 months for spam seems a good start.
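To judge whether the month limits need tweaking, it helps to see how many messages ended up in each filtered log. The following is a minimal sketch, assuming the logs hold one result line per message and that comment lines start with `#`. The function name `count_results` is made up for this example.

```shell
# Hypothetical helper: count result lines in a mass-check log,
# ignoring blank lines and '#' comment lines.
count_results() {
    grep -c '^[^#]' "$1"
}

# Report the size of each filtered log, if it exists in this directory.
for f in ham.log spam.log; do
    if [ -f "$f" ]; then
        printf '%s: %s messages\n' "$f" "$(count_results "$f")"
    fi
done
```

If one count looks far too large or too small, rerun the matching `log-grep-recent` command with a different `-m` value.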
(TODO: should we do some sanity checks here? corrupt-message rules like MISSING_HB_SEP, for example?)
(TODO: add a filtering step to remove "too-old" spam from the logs!)
5. generate scores for score sets
...