What?

Nightly MassCheck runs are currently the primary vehicle for evaluating the quality of rules checked into SpamAssassin. Every night contributors check out a specific revision of SpamAssassin from SVN and run MassCheck on their corpora. They upload their MassCheck logs to an rsync server, where lots of analysis takes place, visible through the RuleQaApp.

(There's also an older, clunkier version of the analysis scripts running on DanielQuinlan's server; see http://www.pathname.com/~corpus .)

There are three ways to do this: using a script we distribute, doing it yourself, or just uploading your corpus to our server.

How? (The Easiest Way)

If you rsync up your corpus to our server, as described in UploadedCorpora, it can be mass-checked there. Unfortunately you have to share your mail corpus with whoever might have access to that machine. It's not expected that anyone will ever actually look, but it's there nonetheless. If you are very concerned about privacy, you may be advised to strip out the more private mails before uploading, or mass-check on your own machine instead. (This is what I do --jm)

Details for PMC members on how to set up new accounts are at NewUploadedCorporaUser.

How? (Less Easy, The Corpus-Nightly Script)

The corpus-nightly script in the masses/rule-qa/ directory of the SpamAssassin tree can be used to set up a mass-checker on your mail. Here's a step-by-step account of the process.

First off, you'll need to ask for RsyncAccounts and make sure you get a "nightly" account rather than a release-time account. You also need to install Subversion to get the "svn" command.

Then run:

mkdir $HOME/nightlymc $HOME/nightlymc/tmp
cd $HOME/nightlymc
svn co http://svn.apache.org/repos/asf/spamassassin/trunk
cp trunk/masses/rule-qa/corpus.example ~/.corpus

Edit '~/.corpus' to have values something like this, replacing /home/jm with whatever your own $HOME is.

# contents of ~/.corpus (open it with, e.g., "vi ~/.corpus")
# temporary working directory for summary results
tmp=/home/jm/nightlymc/tmp

# subversion directory location
# [this is the directory you have already checked out!]
tree=/home/jm/nightlymc/trunk

# rsync username and password (see RsyncAccounts)
username=jm
password=xyzzy

# weekly and nightly mass-check options
opts_weekly="--restart=500 --tail=15000 --net -j 8 -f /home/jm/nightlymc/targets"
opts_nightly="--restart=500 --tail=15000 -f /home/jm/nightlymc/targets"

# weekly and nightly mass-check user_prefs files
prefs_weekly=/home/jm/nightlymc/user_prefs.weekly
prefs_nightly=/home/jm/nightlymc/user_prefs.nightly

Now, create those two user_prefs files. Here are suggested (basic) settings:

user_prefs.nightly:

use_bayes 0
use_auto_whitelist 0
internal_networks 127/8
trusted_networks 127/8

I suggest just "cp"'ing that file to user_prefs.weekly as well, but if you wanted different settings to control network rules, go ahead. It might make sense to extend those with full trusted-networks data, if you like.

Edit ~/nightlymc/targets:

ham:detect:/local/cor/recent/ham/*
spam:detect:/local/cor/recent/spam/*

That's it – now run
bash /home/jm/nightlymc/trunk/masses/rule-qa/corpus-nightly and watch as it starts mass-checking. Once you're happy enough with it, set that command to run in cron.

Note: the best time to run a mass-check is as soon as possible after 0900 UTC. Daylight saving time in some local timezones can make this troublesome, so the script compensates: if it detects that it was started during the 0800 UTC hour, it sleeps for an hour. You no longer have to worry about that.
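As a rough illustration of that adjustment, here is a minimal sketch. It is not the code from corpus-nightly; the function name seconds_to_wait is made up for this example, and the only rule it encodes is the one stated above (wait an hour if started in the 0800 UTC hour).

```shell
#!/bin/sh
# Hypothetical helper, not taken from corpus-nightly: given a two-digit UTC
# hour, print how many seconds to wait before starting a mass-check.
seconds_to_wait() {
    case "$1" in
        08) echo 3600 ;;   # started in the 0800 UTC hour: wait one hour
        *)  echo 0 ;;      # any other hour: start immediately
    esac
}

hour=$(date -u +%H)        # current hour in UTC, always two digits
echo "would sleep $(seconds_to_wait "$hour") seconds before mass-checking"
```

With a check like this, a cron entry set in local time keeps working on both sides of a daylight saving change, as long as it fires in either the 0800 or the 0900 UTC hour.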

How? (For Hackers, The DIY Version)

Here's more detail on that process, if you don't want to use the "corpus-nightly" script.

Get hold of http://rsync.spamassassin.org/$VERS-versions.txt, where $VERS is either "nightly" or "weekly". "nightly" is updated a little before 0900 UTC Sunday through Friday. "weekly" is updated at the same time on Saturdays, and is meant to be a net-enabled run. In other words, wait until at least 0900 UTC before trying to do a corpus run. The above files are also available via the standard rsync system.

Get a "nightly" rsync account (see 'How?' above).

Each of the above files consists of lines of the form "date <tab> revision <LF>", with the date in YYYY-MM-DD format and the revision being the value that comes out of SVN. New lines are added to the bottom of the file.

So: grab the file, find the right line (you can either grep for the date, or just take the last line of the file), and use the second column to update your SpamAssassin checkout to that revision. For example:

# Take the revision (second column) from the last line of the versions file
REV=$(tail -n 1 nightly-versions.txt | awk '{print $2}')
cd /path/to/spamassassin-checkout
svn update -r "$REV"
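If you would rather look up a specific date than trust the last line, something like the following sketch may help. The function name rev_for_date and the sample file contents are invented for illustration; only the "date <tab> revision" layout comes from the description above.

```shell
#!/bin/sh
# rev_for_date FILE DATE: print the revision recorded for DATE (YYYY-MM-DD).
# If the date appears more than once, the last matching line wins.
# Prints nothing when the date is not in the file.
rev_for_date() {
    awk -F '\t' -v d="$2" '$1 == d { rev = $2 } END { if (rev != "") print rev }' "$1"
}

# Made-up sample data in the "date <tab> revision" format.
sample=$(mktemp)
printf '2005-11-20\t345400\n2005-11-21\t345462\n' > "$sample"

rev_for_date "$sample" 2005-11-21
rm -f "$sample"
```

An empty result means the file has not been updated for that date yet, which is a reasonable signal to wait rather than run against a stale revision.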

Alternatively, if you would prefer to pick it up via rsync:

rsync -vrz --delete \
     rsync://rsync.spamassassin.org/tagged_builds/nightly_mass_check .

(replace "nightly" with "weekly" for the weekly builds.)

Then use that build of SpamAssassin to perform a MassCheck, and when that completes, upload the results as per the instructions in http://spamassassin.org/dist/masses/CORPUS_SUBMIT_NIGHTLY .

Note: The result log-files must have an SVN revision line in the output, like so:

# mass-check results from jm@jalapeno, on Mon Nov 21 09:10:15 UTC 2005
# M:SA version 3.2.0-r322462
# SVN revision: 345462
# Perl version: 5.008003 on i386-linux-thread-multi
# Switches: '--progress --tail=20000 -j 4 -f /home/jm/cor/tgts'

If that line isn't present, the rule-QA reporting system cannot correlate the logs with the source revision, and instead ignores them.

If you do not use SVN to retrieve the SpamAssassin source tree, this may not be present, since "mass-check" cannot use "svn info" to get the current revision data. However, there's a workaround. Before running "mass-check", run "svn info" and redirect the output into a file called "svninfo.tmp" in the "masses" directory. Mass-check will read that and use its data for the "SVN revision:" line.
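Before uploading, it may be worth checking that the revision line really is in your logs. The sketch below is one way to do that; the function name has_svn_revision is hypothetical, and the sample header is copied from the example log shown earlier.

```shell
#!/bin/sh
# has_svn_revision LOGFILE: succeed if the log contains a line of the form
# "# SVN revision: <number>".
has_svn_revision() {
    grep -q '^# SVN revision: [0-9][0-9]*' "$1"
}

# Sample log header for demonstration only.
log=$(mktemp)
cat > "$log" <<'EOF'
# mass-check results from jm@jalapeno, on Mon Nov 21 09:10:15 UTC 2005
# M:SA version 3.2.0-r322462
# SVN revision: 345462
EOF

if has_svn_revision "$log"; then
    echo "revision line present"
else
    echo "revision line missing: the rule-QA system would ignore this log"
fi
rm -f "$log"
```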

Nightly MassCheck runs are the way people submit data on the effectiveness of current rules on their recent spam and ham. The data is used to generate the very rule scores that determine the effectiveness of SpamAssassin (distributed via sa-update), and to evaluate rules via the RuleQaApp. The accuracy of SpamAssassin is directly related to the number of people contributing to nightly MassChecks.

This does not require sending us your email, just logs of which rules hit your emails.

Usually a script is run from cron which automatically downloads the latest development version of SpamAssassin, runs it against your spam and ham, and then uploads a log of the results: one line per email, with a list of the SpamAssassin rules each email hit. Your actual email is not uploaded with this method.

An advantage to participating is that it makes SpamAssassin more accurate on your email. Even a few hundred varied emails per month would be a good help for the project. But please make sure you are committed to maintaining a clean corpus and willing to monitor the RuleQA Mailing List for queries and updates to masscheck participants.

How?

  1. Send an email to private@spamassassin.apache.org requesting an rsync account for nightly mass-checks. It is helpful if you include a few sentences with your background and expertise for vetting purposes. NOTE: New masscheck contributors have been accepted again since about 2012-08-09.
  2. When your request is processed, you'll be notified and added to the RuleQA Mailing List for SpamAssassin.
  3. Download automasscheck-minimal.sh and automasscheck-minimal.cf.dist
  4. A local SpamAssassin installation is not needed and will not be used in any way; the script downloads a private one for its own use. But make sure you have the required Perl modules installed.
  5. Copy automasscheck-minimal.sh to ~/bin/ or another suitable location.
  6. Copy automasscheck-minimal.cf.dist to ~/.automasscheck.cf (hardcoded location in script).
  7. Modify ~/.automasscheck.cf to point at your ham and spam folders. Be sure to configure properly for mbox (mbox) or Maildir (dir) folder formats. Leave the RSYNC options unchanged for now, because you will be running automasscheck in test mode at first. Set WORKDIR to a suitable location.
  8. The masschecker is set to run 8 threads by default. Depending on your box's resources this could be too much or too little. Modify your ~/.automasscheck.cf to change JOBS as needed.
  9. Set TRUSTED_NETWORKS and INTERNAL_NETWORKS properly in ~/.automasscheck.cf. Without this, network tests might use the wrong relay.
  10. Ensure there is no router/firewall blocking connections to rsync.spamassassin.org port 873 (rsync protocol).
  11. Run automasscheck-minimal.sh.
    • Look in ~/masscheckwork/nightly_mass_check/masses/ for ham-*.log and spam-*.log files. (Or weekly_mass_check on Saturday.)
    • Are the filenames good? They should be named something like ham-username.log or ham-net-username.log.
    • Read CorpusCleaning and HandClassifiedCorpora for guidelines on how to identify ham in your spam folder and spam in your ham folder, and which messages should simply be deleted.
    • If you move/delete messages, do not forget to "Compact Folder" to be sure they are actually gone.
    • Repeat automasscheck until you are certain both folders are cleaned.
  12. Edit ~/.automasscheck.cf and set RSYNC_USERNAME and RSYNC_PASSWORD with values from step 1.
  13. Run automasscheck-minimal.sh, which will upload your results.
  14. Ask a more experienced participant (probably the person who recruited you) to check your results on the server. They can see the uploaded log files by running a command like rsync --old-d username@rsync.spamassassin.org::corpus/. You can also verify that your corpora show up on http://ruleqa.spamassassin.org/ - the green box near the top shows all included usernames.
  15. If your upload looks good, then you're probably ready to automate nightly checks. Configure automasscheck to run as a cron job as your non-root user at or after 9AM UTC. (That is, after weekly-versions.txt / nightly-versions.txt gets updated in rsync.spamassassin.org::corpus. If you run it earlier, it will break things.)
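For the filename check in step 11, a small sketch along these lines could save some squinting. The function name looks_like_masscheck_log and the exact patterns are assumptions based only on the example names given there (ham-username.log, ham-net-username.log); the real naming rules may be stricter.

```shell
#!/bin/sh
# looks_like_masscheck_log NAME: succeed for names such as ham-username.log
# or spam-net-username.log. The patterns are a guess from the examples in
# step 11, not an authoritative rule.
looks_like_masscheck_log() {
    case "$1" in
        ham-?*.log|spam-?*.log) return 0 ;;
        *) return 1 ;;
    esac
}

for name in ham-username.log ham-net-username.log spam.log; do
    if looks_like_masscheck_log "$name"; then
        echo "ok: $name"
    else
        echo "unexpected: $name"
    fi
done
```

To check real results, loop over the basenames of the files in ~/masscheckwork/nightly_mass_check/masses/ instead of the three sample names.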

Alternative Methods

You can do it manually: see ManualNightlyMassCheck (but you really should not; it is listed just for reference).

(The version of the tree available at rsync://rsync.spamassassin.org/tagged_builds/nightly_mass_check and .../weekly_mass_check already has the "svninfo.tmp" file included.)