Rescore Mass-Check
(see RescoreMassCheck310 for the 3.1.x historical page or RescoreMassCheck320 for historical releases)
This is the procedure we use to generate new scores. It takes quite a while and is labour-intensive, so we do it infrequently.
...
Here's the process for generating the scores as of SpamAssassin 3.3.0:
1. heads-up
Inform everyone in advance on the users and dev lists that we will be starting mass-checks shortly, and they should get their corpora nice and clean (see CorpusCleaning) and sign up for RsyncAccounts.
...
No Format |
---|
masses/enable-all-evolved-rules < rules/50_scores.cf \
> rules/51_newscores.cf
mv rules/51_newscores.cf rules/50_scores.cf
svn diff [and ensure it looks sane]
svn commit [create a new bug attachment for review if in R-T-C mode]
|
Copy the nightly-log-submission rsync accounts to the rescore-log-submission accounts (see RsyncConfig) (not clear why we don't just use one set of accounts here, but hey):
No Format |
---|
ssh spamassassin.zones.apache.org sudo cp /home/corpus-rsync/secrets /home/corpus-rsync/secrets-submit |
Move the old rescore logs from the previous release (if they're still around) to the archives:
No Format |
---|
ssh spamassassin.zones.apache.org
cd /home/corpus-rsync
OLDVERSION="3.2"
sudo mv corpus/submit scoregen-$OLDVERSION
sudo mkdir corpus/submit
sudo chown rsync corpus/submit
sudo gtar cvfz ARCHIVE/scoregen-$OLDVERSION.tgz scoregen-$OLDVERSION
|
...
No Format |
---|
svn export http://svn.apache.org/repos/asf/spamassassin/trunk mcsnapshot
tar cvfz mcsnapshot.tgz mcsnapshot
svn cp \
https://svn.apache.org/repos/asf/spamassassin/trunk \
https://svn.apache.org/repos/asf/spamassassin/tags/3_3_0_mcsnapshot_1
|
(we can't use the standard build process here anymore since the dist tarball no longer includes "masses". Use a descriptive, unique tag name.)
2. announce mass-check
RescoreDetails is the full announcement text (and instructions) for this phase. It's sufficient just to send out a mail something like the one we used in previous releases:
No Format |
---|
To: users
Cc: dev
Subject: NOTICE: 3.3.0 rescoring mass-checks

OK, if you're planning to send us mass-check logs for the 3.3.0 rescoring, now's the time!

http://wiki.apache.org/spamassassin/RescoreDetails has all the details.

cheers!

--j.
|
...
We then take the log files rsync'd up to the server, and use those logs for all 4 score sets. The initial logs are for score set 3 (the fourth); sets 0, 1, and 2 can be generated from set 3 by stripping out the network tests and/or the Bayes tests.
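The stripping idea can be illustrated with a small sketch. This is not the tooling in "masses" that actually derives the other score sets; it assumes a log layout of result, score, message path, comma-separated rule hits, then extra fields, and the sample line and the strip_bayes helper are made up for illustration. Removing the Bayes hits from a set 3 line gives the hits a set 1 run (network tests on, Bayes off) would have produced; note that the score field is left untouched here.

```shell
# Hypothetical sample line in the assumed mass-check log layout:
#   result  score  message-path  rule-hits  extra
sample='Y 12 /corpus/spam/1 BAYES_99,URIBL_BLACK,HTML_MESSAGE time=1254300000'

# strip_bayes: drop BAYES_* entries from the 4th field of each input line.
# Illustration only; the score in field 2 is NOT recomputed.
strip_bayes() {
  awk '{
    n = split($4, hits, ",")
    out = ""
    sep = ""
    for (i = 1; i <= n; i++) {
      if (hits[i] !~ /^BAYES_/) {
        out = out sep hits[i]
        sep = ","
      }
    }
    $4 = out
    print
  }'
}

stripped=$(printf '%s\n' "$sample" | strip_bayes)
printf '%s\n' "$stripped"
```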
4.05. publish logs to ruleqa site
This will make the mass-check results visible on http://ruleqa.spamassassin.org/ (under the appropriate DateRev), using usernames starting with "rescore-". TODO: this doesn't include filtering out too-old logs (see below), so won't necessarily match the freqs produced later.
No Format |
---|
ssh spamassassin2.zones.apache.org
cd /export/home/corpus-rsync/corpus
echo '# mass-check results from someone@rescore, on Wed Sep 30 09:00:00 UTC 2009
# M:SA version 3.3.0-alpha3-r808953
# SVN revision: 808953
# Date: 20090930T090000Z
#' > /tmp/hdr
for f in submit/*.log ; do
i=`echo $f | sed -e 's,^submit/,,' -e 's/^\(.*am\)-bayes-net-\([^\.]*\.log\)$/\1-rescore-\2/'`; echo "$f => $i" ;
sudo touch tmpf ;
sudo chmod 666 tmpf;
cat < /tmp/hdr > tmpf;
sed -e '/^#/d' < $f >> tmpf;
sudo chmod 644 tmpf;
sudo mv tmpf $i ; sudo chown rsync $i; done
|
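Before running the loop for real, it may help to preview the renaming it performs. The sketch below reuses the two sed expressions from the loop on made-up file names (the "alice" submitter and the rescore_name helper are hypothetical) and only prints the mapping, without touching any files.

```shell
# rescore_name: apply the same sed expressions as the publishing loop
# to a single path and print the resulting target file name.
rescore_name() {
  echo "$1" | sed -e 's,^submit/,,' -e 's/^\(.*am\)-bayes-net-\([^\.]*\.log\)$/\1-rescore-\2/'
}

# Dry run on hypothetical submission names; nothing is copied or moved.
for f in submit/ham-bayes-net-alice.log submit/spam-bayes-net-alice.log ; do
  echo "$f => $(rescore_name "$f")"
done
```

Files whose names do not match the "-bayes-net-" pattern only lose the "submit/" prefix, so a preview like this is a cheap way to spot unexpected names in the submit directory.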
4.1. filter out too-old logs
No Format |
---|
ssh spamassassin.zones.apache.org
cd /home/jm/ftp/spamassassin/masses [or wherever]
./log-grep-recent -m 3872 /home/corpus-rsync/corpus/submit/ham-*.log > ham-full.log
./log-grep-recent -m 62 /home/corpus-rsync/corpus/submit/spam-*.log > spam-full.log
|
We may have to tweak the number of months specified for each type, if there's too much or too little mail resulting from the grep, but 38 months / 6 months worked well for 3.3.0.
(TODO: should we do some sanity checks here? corrupt-message rules like MISSING_HB_SEP for example?)
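To judge whether the month limits left too much or too little mail, a quick count of the filtered logs is usually enough. This is a minimal sketch, assuming each non-comment line in a log corresponds to one message and that comment lines start with "#"; count_msgs is a hypothetical helper, not part of "masses".

```shell
# count_msgs: number of message lines in a log, i.e. every line
# that does not start with a '#' comment marker.
count_msgs() {
  grep -vc '^#' "$1"
}

# Report on the two filtered logs if they exist in the current directory.
for f in ham-full.log spam-full.log ; do
  if [ -f "$f" ] ; then
    echo "$f: $(count_msgs "$f") messages"
  fi
done
```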
4.2 tweak rules for evolver
TODO: describe. This consists of removing sandbox rules, then going through the rulesrc dir and commenting out all "score" lines except those for rules whose scores you think are accurate (such as carefully-vetted net rules, or 0.001 informational rules), and grepping for bad rules.
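The "comment out score lines" part can be sketched roughly as follows. This is an illustration under assumptions, not an existing script: comment_scores is a hypothetical helper, the DEMO_* rule names are invented, and the regex of rules to keep has to come from your own review. The helper prints the edited file to stdout rather than modifying it, so the result can be diffed first.

```shell
# comment_scores KEEP_REGEX FILE
# Print FILE with every "score" line commented out, except lines whose
# rule name matches KEEP_REGEX (the rules whose scores you trust).
comment_scores() {
  awk -v keep="$1" '
    $1 == "score" && $2 !~ keep { print "#" $0; next }
    { print }
  ' "$2"
}

# Demo on a throwaway file with invented rules.
demo=$(mktemp)
cat > "$demo" <<'EOF'
body   DEMO_BODY_RULE  /example/
score  DEMO_BODY_RULE  2.5
score  DEMO_NET_RULE   1.0
score  DEMO_INFO_RULE  0.001
EOF

comment_scores '^DEMO_(NET|INFO)_RULE$' "$demo"
rm -f "$demo"
```

In the demo only the DEMO_BODY_RULE score line gains a leading "#"; the vetted net rule and the 0.001 informational rule keep their scores.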
4.3 resync to mcsnapshot rules list
Resync the active rules list to the "active" set as it was in the mass-check snapshot, required since rules/active.list is regenerated every night! Note: if you've made changes to the ruleset that mean you can't use the same set of active rules now, you have a big problem...
We don't just use the entire "rules" dir as it was back then, since every time we run the evolver we first have to fix a few minor bugs in the "rules" files, for example scores incorrectly marked as immutable in 50_scores.cf.
No Format |
---|
cd /path/to/checkout/of/trunk
svn co \
https://svn.apache.org/repos/asf/spamassassin/tags/3_3_0_mcsnapshot_1/rules \
rules-mcsnapshot
cp rules-mcsnapshot/active.list rules/active.list
make
|
Remove the sandbox ruleset so the evolver doesn't trust them:
No Format |
---|
mv rules/70_sandbox.cf 70_sandbox_off.cf
|
5. generate scores for score sets
See RunningGa. (In the past we used RunningPerceptron, but it acted up during 3.3.0 generation, so we used the GA again.)
Once this is complete, rules/50_scores.cf will have the generated scores, created by runGA. (TODO: I think.)
...
No Format |
---|
cd masses
tar cvfz rescore-logs.tgz gen-set{0,1,2,3}-*
|
...
(use "gtar" on the solaris zone.)
These can be pretty big (although nowadays the scripts use hard links for the duplicate logfiles, which saves a lot of space).
Also, check in the "config" files you used for each scoreset:
No Format |
---|
svn commit -m "runGA config files used" masses/config.set*
|
6. upload the test logs to zone
Since stuff like the STATISTICS cannot ever be regenerated without the (randomised) test logs, these need to be saved, too. Currently, I think the best bet is to upload the rescore-logs.tgz file somewhere on spamassassin.zones.apache.org; it doesn't have to be in a public place, ASF-committer-account-required is fine. Just mention that path in the rescoring bug's comments. Last time, I did this:
No Format |
---|
sudo mkdir /home/corpus-rsync/ARCHIVE/3.3.0
sudo mv rescore-logs.tgz /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tgz
|
6.5. mark evolved-score rules as 'always published'
Normally, rules in the sandbox are promoted to the "active" 72_active.cf ruleset, or demoted to the "test" 70_sandbox.cf ruleset, based on their accuracy in the nightly mass-checks. However, now that the evolver has assigned scores for them, they need to be always published regardless of how they might do in the previous night's checks. Run:
No Format |
---|
cd masses
./force-publish-active-rules ../rules/active.list ../rulesrc/10_force_active.cf
svn commit -m "force publish of rescored rules" ../rulesrc/10_force_active.cf
|
6.6. fix test failures
Run prove -v t/basic_lint.t and prove -v t/meta.t. Manually edit the rules files to fix any test failures caused by the new scores. For example, some meta rules may now depend on rules that have been assigned a score of 0; either make those rules into __SUBRULES, or give them a score of 0.001.
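To find candidates for that fix, it can help to list the rules whose scores are zero in every score set. The sketch below is an assumption-laden helper, not part of the test suite: zero_scored is hypothetical, it expects plain "score NAME value [value value value]" lines, and the demo rule names are invented. The names it prints still need to be checked by hand against the meta rules that reference them.

```shell
# zero_scored FILE...
# Print the name of every rule whose "score" line contains only zeros.
zero_scored() {
  awk '$1 == "score" && NF >= 3 {
    allzero = 1
    seen = 0
    for (i = 3; i <= NF; i++) {
      if ($i ~ /^#/) break            # ignore a trailing comment
      seen++
      if ($i + 0 != 0) allzero = 0
    }
    if (seen > 0 && allzero) print $2
  }' "$@"
}

# Demo on a throwaway file with invented rules.
demo=$(mktemp)
cat > "$demo" <<'EOF'
score DEMO_ZERO_ONE   0
score DEMO_ZERO_FOUR  0 0 0 0 # disabled by the evolver
score DEMO_PARTIAL    0 1.5 0 0
score DEMO_LIVE       2.0
EOF

zero_scored "$demo"
rm -f "$demo"
```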
7. upload proposed new scores
...
No Format |
---|
svn revert rules/50_scores.cf
wget -O newscores.diff 'http://bugzilla.spamassassin.org/....attachment?id=....'
patch -p0 < newscores.diff
|
Then, a little configuration: replace these with the paths to the correct gen-setN-* directories for the 4 score sets. The test logs the stats are measured against will be taken from these directories. NOTE: don't cut and paste these! They will be different for your runs.
No Format |
---|
genset0=/home/corpus-rsync/corpus/scoregen-3.1/gen-set0-2.0-4.0-100-nobob
genset1=/home/corpus-rsync/corpus/scoregen-3.1/gen-set1-2.0-4.0-100-nobob
genset2=/home/corpus-rsync/corpus/scoregen-3.1/gen-set2-2.0-4.625-100-nobob
genset3=/home/corpus-rsync/corpus/scoregen-3.1/gen-set3-2.0-5.0-100-nobob
|
Once those vars are set, run these commands:
No Format |
---|
cd masses

rm ham*.log spam*.log ; touch ham.log spam.log
ln -s $genset0/NSBASE/ham-test.log ham-test.log
ln -s $genset0/SPBASE/spam-test.log spam-test.log
bash ./mk-baseline-results 0 > ../rules/STATISTICS-set0.txt

rm ham*.log spam*.log ; touch ham.log spam.log
ln -s $genset1/NSBASE/ham-test.log ham-test.log
ln -s $genset1/SPBASE/spam-test.log spam-test.log
bash ./mk-baseline-results 1 > ../rules/STATISTICS-set1.txt

rm ham*.log spam*.log ; touch ham.log spam.log
ln -s $genset2/NSBASE/ham-test.log ham-test.log
ln -s $genset2/SPBASE/spam-test.log spam-test.log
bash ./mk-baseline-results 2 > ../rules/STATISTICS-set2.txt

rm ham*.log spam*.log ; touch ham.log spam.log
ln -s $genset3/NSBASE/ham-test.log ham-test.log
ln -s $genset3/SPBASE/spam-test.log spam-test.log
bash ./mk-baseline-results 3 > ../rules/STATISTICS-set3.txt

cp config.set0 config ; bash ./runGA stats
cp config.set1 config ; bash ./runGA stats
cp config.set2 config ; bash ./runGA stats
cp config.set3 config ; bash ./runGA stats
|
There'll be a lot of output along these lines:
No Format |
---|
ignoring 'TO_ADDRESS_EQ_REAL': immutable and score == 0 |
But that can be ignored. (TODO: it'd be nice to make this step a little less labour-intensive.)
8. upload new stats files
...
And let all and sundry vote on that, too (or just check it in depending on whether you're in R-T-C or not). Once the new scores and STATS files are approved and into SVN, and the log data is in a safe archival spot on the zone, the bugzilla bug notes that location, and the "config" files are checked in, you're done.