Comment: Changed months used with log-grep-recent to what is currently being used in score generation

Rescore Mass-Check

(see RescoreMassCheck310 for the 3.1.x historical page or RescoreMassCheck320 for historical releases)

This is the procedure we use to generate new scores. It takes quite a while and is labour-intensive, so we do it infrequently.

...

Here's the process for generating the scores as of SpamAssassin 3.3.0:

1. heads-up

Inform everyone in advance on the users and dev lists that we will be starting mass-checks shortly, and they should get their corpora nice and clean (see CorpusCleaning) and sign up for RsyncAccounts.

...

No Format
  masses/enable-all-evolved-rules < rules/50_scores.cf  \
                           > rules/51_newscores.cf
  mv rules/51_newscores.cf rules/50_scores.cf
  svn diff     [and ensure it looks sane]
  svn commit   [create a new bug attachment for review if in R-T-C mode]

...

No Format

ssh spamassassin.zones.apache.org
sudo cp /home/corpus-rsync/secrets /home/corpus-rsync/secrets-submit

Move the old rescore logs from the previous release (if they're still around) to the archives:

No Format
ssh spamassassin.zones.apache.org
cd /home/corpus-rsync
OLDVERSION="3.2"
sudo mv corpus/submit scoregen-$OLDVERSION
sudo mkdir corpus/submit
sudo chown rsync corpus/submit
sudo gtar cvfz ARCHIVE/scoregen-$OLDVERSION.tgz scoregen-$OLDVERSION

...

No Format
svn export http://svn.apache.org/repos/asf/spamassassin/trunk mcsnapshot
tar cvfz mcsnapshot.tgz mcsnapshot

svn cp \
        https://svn.apache.org/repos/asf/spamassassin/trunk \
        https://svn.apache.org/repos/asf/spamassassin/tags/3_3_0_mcsnapshot_1

(we can't use the standard build process here anymore since the dist tarball no longer includes "masses". Use a descriptive, unique tag name.)

...

RescoreDetails is the full announcement text (and instructions) for this phase. It's sufficient just to send out a mail something like the one we used in previous releases:

No Format
To: users
Cc: dev
Subject: NOTICE: 3.3.0 rescoring mass-checks

OK, if you're planning to send us mass-check logs for the
3.3.0 rescoring, now's the time!

http://wiki.apache.org/spamassassin/RescoreDetails has all
the details.

cheers!

--j.

...

We then take the log files rsync'd up to the server, and use those logs for all 4 score sets. The initial logs are for score set 3 (the fourth); sets 0, 1, and 2 can be generated from set 3 by stripping out the network tests and/or the Bayes tests.
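
As a rough illustration of what "stripping out" tests from a log means, here is a minimal sketch. It is not the masses tooling itself. It assumes each non-comment log line is whitespace-separated with the comma-separated list of rule hits in the fourth field, and `strip_rules` is a hypothetical helper name. It leaves the per-message score field untouched and normalizes whitespace to single spaces.

```shell
# strip_rules PATTERN LOGFILE
# Hypothetical helper: drop every rule hit whose name matches PATTERN
# (an awk regex) from the 4th field of each non-comment log line.
# Assumes the line layout: result score path RULE1,RULE2,... metadata
strip_rules() {
  awk -v pat="$1" '
    /^#/ { print; next }                 # pass header comments through
    {
      n = split($4, hits, ","); kept = ""
      for (i = 1; i <= n; i++)
        if (hits[i] !~ pat)              # keep only non-matching rules
          kept = kept (kept == "" ? "" : ",") hits[i]
      $4 = kept                          # rebuilds the line with single spaces
      print
    }
  ' "$2"
}
```

For example, `strip_rules '^BAYES_' ham-full.log` would print the log with the Bayes rule hits removed. Network tests cannot be picked out by name alone, so a real pattern for them would have to be built from the rule definitions.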

4.05. publish logs to ruleqa site

This will make the mass-check results visible on http://ruleqa.spamassassin.org/ (under the appropriate DateRev), using usernames starting with "rescore-". TODO: this doesn't include filtering out too-old logs (see below), so won't necessarily match the freqs produced later.

No Format

ssh spamassassin2.zones.apache.org
cd /export/home/corpus-rsync/corpus

echo '# mass-check results from someone@rescore, on Tue Sep 30 09:00:00 UTC 2009
# M:SA version 3.3.0-alpha3-r808953
# SVN revision: 808953
# Date: 20090930T090000Z
#' > /tmp/hdr

for f in submit/*.log ; do
  # e.g. submit/ham-bayes-net-USER.log => ham-rescore-USER.log
  i=`echo "$f" | sed -e 's,^submit/,,' -e 's/^\(.*am\)-bayes-net-\([^\.]*\.log\)$/\1-rescore-\2/'`
  echo "$f => $i"
  sudo touch tmpf
  sudo chmod 666 tmpf
  # replace the log's own header comments with the canned header
  cat < /tmp/hdr > tmpf
  sed -e '/^#/d' < "$f" >> tmpf
  sudo chmod 644 tmpf
  sudo mv tmpf "$i"
  sudo chown rsync "$i"
done
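
Before running the loop for real, it can help to preview the renaming. The sketch below reuses the same sed expressions and only prints the mapping, without creating or moving any files. `rescore_name` is a hypothetical helper name, not something in the SpamAssassin tree.

```shell
# rescore_name PATH
# Hypothetical helper: print the published name for a submitted log,
# using the same sed expressions as the loop above.
rescore_name() {
  echo "$1" | sed -e 's,^submit/,,' \
    -e 's/^\(.*am\)-bayes-net-\([^\.]*\.log\)$/\1-rescore-\2/'
}

# Dry run: show "source => target" for each submitted log.
for f in submit/*.log; do
  [ -e "$f" ] || continue   # skip the literal glob when nothing matches
  echo "$f => $(rescore_name "$f")"
done
```

Names that do not contain "-bayes-net-" only lose the "submit/" prefix, so the preview also shows which logs would not be picked up under a "rescore-" username.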

4.1. filter out too-old logs

No Format
ssh spamassassin.zones.apache.org
cd /home/jm/ftp/spamassassin/masses    [or wherever]

./log-grep-recent -m 72 /home/corpus-rsync/corpus/submit/ham-*.log > ham-full.log

./log-grep-recent -m 2 /home/corpus-rsync/corpus/submit/spam-*.log > spam-full.log

We may have to tweak the number of months specified for each type, if there's too much or too little mail resulting from the grep, but 38 months / 6 months worked well for 3.3.0.
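
A quick way to judge whether the month limits need tweaking is to count how many messages survived the grep. This is a minimal sketch, assuming one message per non-comment line in the filtered logs. `count_msgs` is a hypothetical helper name.

```shell
# count_msgs LOGFILE
# Hypothetical helper: count the lines that are not "#" header comments.
count_msgs() {
  grep -vc '^#' "$1"
}

# Report the size of each filtered log, if it exists.
for f in ham-full.log spam-full.log; do
  if [ -f "$f" ]; then
    echo "$f: $(count_msgs "$f") messages"
  fi
done
```

If one of the counts looks far out of line with the other, re-run log-grep-recent with a different -m value for that type and count again.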

4.2 tweak rules for evolver

...

No Format
cd /path/to/checkout/of/trunk
svn co \
  https://svn.apache.org/repos/asf/spamassassin/tags/3_3_0_mcsnapshot_1/rules \
  rules-mcsnapshot
cp rules-mcsnapshot/active.list rules/active.list
make

...

See RunningGa. (In the past we used RunningPerceptron, but it acted up during 3.3.0 generation, so we used the GA again.)

...

No Format
sudo mkdir /home/corpus-rsync/ARCHIVE/3.3.0
sudo mv rescore-logs.tgz /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tgz

6.5. mark evolved-score rules as 'always published'

...