Rescore Mass-Check
(see RescoreMassCheck310 or RescoreMassCheck320 for historical releases)
This is the procedure we use to generate new scores. It takes quite a while and is labour-intensive, so we do it infrequently.
...
Here's the process for generating the scores as of SpamAssassin 3.3.0:
1. heads-up
Inform everyone in advance on the users and dev lists that we will be starting mass-checks shortly, and they should get their corpora nice and clean (see CorpusCleaning) and sign up for RsyncAccounts.
...
No Format |
---|
masses/enable-all-evolved-rules < rules/50_scores.cf \
    > rules/51_newscores.cf
mv rules/51_newscores.cf rules/50_scores.cf
svn diff     [and ensure it looks sane]
svn commit   [create a new bug attachment for review if in R-T-C mode]
|
...
No Format |
---|
ssh spamassassin.zones.apache.org
sudo cp /home/corpus-rsync/secrets /home/corpus-rsync/secrets-submit
|
Move the old rescore logs from the previous release (if they're still around) to the archives:
No Format |
---|
ssh spamassassin.zones.apache.org
cd /home/corpus-rsync
OLDVERSION="3.2"
sudo mv corpus/submit scoregen-$OLDVERSION
sudo mkdir corpus/submit
sudo chown rsync corpus/submit
sudo gtar cvfz ARCHIVE/scoregen-$OLDVERSION.tgz scoregen-$OLDVERSION
|
...
No Format |
---|
svn export http://svn.apache.org/repos/asf/spamassassin/trunk mcsnapshot
tar cvfz mcsnapshot.tgz mcsnapshot
svn cp \
    https://svn.apache.org/repos/asf/spamassassin/trunk \
    https://svn.apache.org/repos/asf/spamassassin/tags/3_3_0_mcsnapshot_1
|
(We can't use the standard build process here anymore, since the dist tarball no longer includes "masses". Use a descriptive, unique tag name.)
...
RescoreDetails is the full announcement text (and instructions) for this phase. It's sufficient to send out a mail like the one we used in previous releases:
No Format |
---|
To: users
Cc: dev
Subject: NOTICE: 3.3.0 rescoring mass-checks

OK, if you're planning to send us mass-check logs for the 3.3.0
rescoring, now's the time!

http://wiki.apache.org/spamassassin/RescoreDetails has all the details.

cheers!

--j.
|
...
We then take the log files rsync'd up to the server, and use those logs for all 4 score sets. The initial logs are for score set 3 (the fourth set); sets 0, 1, and 2 can be generated from set 3 by stripping out the network tests and/or the Bayes tests.
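As a rough illustration of how the four score sets relate, the sketch below prints which classes of tests would need to be stripped from the set-3 logs to emulate each set. The numbering follows SpamAssassin's documented convention (0 = no network tests and no Bayes, 1 = network tests only, 2 = Bayes only, 3 = both). The function name `tests_to_strip` is hypothetical, and the sketch only prints labels; it does not modify any logs.

```shell
#!/bin/sh
# Hypothetical helper: given a score set number, print the classes of tests
# that must be removed from set-3 (Bayes + net) logs to emulate that set.
tests_to_strip() {
    case "$1" in
        0) echo "net bayes" ;;   # set 0: neither network tests nor Bayes
        1) echo "bayes" ;;       # set 1: network tests only
        2) echo "net" ;;         # set 2: Bayes only
        3) echo "" ;;            # set 3: nothing to strip
        *) echo "unknown score set: $1" >&2; return 1 ;;
    esac
}

for set in 0 1 2 3; do
    printf 'set %s: strip [%s]\n' "$set" "$(tests_to_strip "$set")"
done
```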
4.05. publish logs to ruleqa site
This will make the mass-check results visible on http://ruleqa.spamassassin.org/ (under the appropriate DateRev), using usernames starting with "rescore-". TODO: this doesn't include filtering out too-old logs (see below), so won't necessarily match the freqs produced later.
No Format |
---|
ssh spamassassin2.zones.apache.org
cd /export/home/corpus-rsync/corpus
echo '# mass-check results from someone@rescore, on Tue Sep 30 09:00:00 UTC 2009
# M:SA version 3.3.0-alpha3-r808953
# SVN revision: 808953
# Date: 20090930T090000Z
#' > /tmp/hdr
for f in submit/*.log ; do
  i=`echo $f | sed -e 's,^submit/,,' -e 's/^\(.*am\)-bayes-net-\([^\.]*\.log\)$/\1-rescore-\2/'`
  echo "$f => $i"
  sudo touch tmpf ; sudo chmod 666 tmpf
  cat < /tmp/hdr > tmpf
  sed -e '/^#/d' < $f >> tmpf
  sudo chmod 644 tmpf
  sudo mv tmpf $i ; sudo chown rsync $i
done
|
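To see what the renaming step in that loop does before running it against real logs, the same two sed expressions can be tried on sample file names. This is a small sketch: the function name `rescore_name` and the username "alice" are made up for illustration, and nothing here touches the server.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed expressions used in the loop above:
# strip the leading "submit/" and turn "-bayes-net-" into "-rescore-".
rescore_name() {
    printf '%s\n' "$1" | sed -e 's,^submit/,,' \
        -e 's/^\(.*am\)-bayes-net-\([^\.]*\.log\)$/\1-rescore-\2/'
}

# "alice" is a placeholder username.
for f in submit/ham-bayes-net-alice.log submit/spam-bayes-net-alice.log; do
    echo "$f => $(rescore_name "$f")"
done
```

A file name that does not match the `-bayes-net-` pattern only loses its `submit/` prefix, so it is worth eyeballing the `=>` lines printed by the loop for names that were not rewritten.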
4.1. filter out too-old logs
No Format |
---|
ssh spamassassin.zones.apache.org
cd /home/jm/ftp/spamassassin/masses     [or wherever]
./log-grep-recent -m 38 /home/corpus-rsync/corpus/submit/ham-*.log > ham-full.log
./log-grep-recent -m 6 /home/corpus-rsync/corpus/submit/spam-*.log > spam-full.log
|
We may have to tweak the number of months specified for each type, if there's too much or too little mail resulting from the grep, but 38 months for ham and 6 months for spam worked well for 3.3.0.
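One quick way to judge whether the month limits need tweaking is to count the result lines in each filtered log. The sketch below assumes one message per non-comment line, which is an assumption about the log layout rather than something stated on this page. The function name `count_messages` is hypothetical, and the demo runs on throwaway placeholder files rather than real logs.

```shell
#!/bin/sh
# Hypothetical helper: count lines in a log, skipping '#' comment/header lines.
count_messages() {
    grep -vc '^#' "$1"
}

# Demo on temporary files; the contents are placeholders, not real results.
tmpdir=$(mktemp -d)
printf '# header\nplaceholder-ham-1\nplaceholder-ham-2\nplaceholder-ham-3\n' \
    > "$tmpdir/ham-full.log"
printf '# header\nplaceholder-spam-1\n' > "$tmpdir/spam-full.log"

for f in ham-full.log spam-full.log; do
    printf '%s: %s messages\n' "$f" "$(count_messages "$tmpdir/$f")"
done
rm -rf "$tmpdir"
```

On the server, the same function could be pointed at the real `ham-full.log` and `spam-full.log` after each run of `log-grep-recent`.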
4.2. tweak rules for evolver
...
No Format |
---|
cd /path/to/checkout/of/trunk
svn co \
    https://svn.apache.org/repos/asf/spamassassin/tags/3_3_0_mcsnapshot_1/rules \
    rules-mcsnapshot
cp rules-mcsnapshot/active.list rules/active.list
make
|
...
See RunningGa. (In the past we used RunningPerceptron, but it acted up during 3.3.0 generation, so we used the GA again.)
...
No Format |
---|
sudo mkdir /home/corpus-rsync/ARCHIVE/3.3.0
sudo mv rescore-logs.tgz /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug5270bug6155.tgz
|
6.5. mark evolved-score rules as 'always published'
...