Using mass-check To Test Rules
"mass-check" is a tool included with the SpamAssassin source distribution in the [wiki:MassesOverview 'masses' directory] to test rules for accuracy and hit-rate. If you're writing custom rules, you really should use this to test them.
First, you need HandClassifiedCorpora. Let's say that's made up of two maildir folders, "/path/to/ham" and "/path/to/spam".
Next, cd into the "masses" directory of the source distribution:
cd masses
./mass-check --progress \
    ham:dir:/path/to/ham \
    spam:dir:/path/to/spam
This creates two files, "ham.log" and "spam.log", which list the rules that hit each message. The rules are read from the rules dir "../rules" and applied to the corpus. Each line of the two log files holds the details for one email message, and there is a line for every message.
mass-check also takes other options to control whether network tests are run, whether multiple processes are run in parallel, how the output is presented, etc.; read the comments at the top of the file for details. Here are the key points:
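Because every message produces exactly one log line, counting lines is a quick way to confirm that the whole corpus was scanned. The following is a minimal sketch, not part of mass-check: `count_logged` is a hypothetical helper name, and it assumes that lines beginning with "#" are comments rather than message results.

```shell
#!/bin/sh
# count_logged LOGFILE
# Print the number of message lines in a mass-check log file.
# Assumption: lines starting with "#" are comments, not message results.
count_logged() {
    # -c prints a count, -v inverts the match so comment lines are skipped.
    # grep exits non-zero when the count is 0, so mask that with "|| true".
    grep -vc '^#' "$1" || true
}

# Usage, from the "masses" directory once mass-check has finished:
#   count_logged ham.log
#   count_logged spam.log
```

If the counts are lower than the number of messages in the corpus folders, some messages were not scanned.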
Using network tests
For mass-checks of scoresets 1 or 3, which use network tests, you need to provide the --net switch. Ensure that Net::DNS, Mail::SPF::Query, Razor, Pyzor and DCC are installed.
Network tests are slow unless you use the -j switch to allow mass-check to start multiple parallel scanning processes.
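The --net and -j switches can be combined in one run. The sketch below wraps that call in a function so the number of processes is easy to change. The helper name `run_net_check` and the `MASS_CHECK` override variable are hypothetical, the paths are the placeholder corpus folders from earlier, and passing the -j value as a separate argument is assumed to be accepted.

```shell
#!/bin/sh
# run_net_check JOBS HAM_DIR SPAM_DIR
# Run mass-check with network tests enabled and JOBS parallel processes.
# MASS_CHECK is a hypothetical override for the path to the script;
# it defaults to ./mass-check in the current ("masses") directory.
run_net_check() {
    njobs=$1
    ham_dir=$2
    spam_dir=$3
    "${MASS_CHECK:-./mass-check}" --net -j "$njobs" --progress \
        "ham:dir:$ham_dir" "spam:dir:$spam_dir"
}

# Usage, from the "masses" directory:
#   run_net_check 4 /path/to/ham /path/to/spam
```

Four processes is only an illustrative value; pick a number that suits the machine and the network services being queried.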
Using Bayes
This is controlled using the mass-check configuration file. Do this:
cd masses
mkdir -p spamassassin
rm -f spamassassin/bayes*
echo "use_bayes 1" > spamassassin/user_prefs
Once mass-check completes
The next step is to run hit-frequencies: see HitFrequencies for details.
Usage
usage: mass-check [options] target ...

  -c=file        set configuration/rules directory
  -p=dir         set user-prefs directory
  -f=file        read list of targets from <file>
  -j=jobs        specify the number of processes to run simultaneously
  --net          turn on network checks!
  --mid          report Message-ID from each message
  --debug        report debugging information
  --progress     show progress updates during check
  --rewrite=OUT  save rewritten message to OUT (default is /tmp/out)
  --showdots     print a dot for each scanned message
  --rules=RE     only test rules matching the given regexp RE
  --restart=N    restart all of the children after processing N messages
  --deencap=RE   extract SpamAssassin-encapsulated spam mails only if they
                 were encapsulated by servers matching the regexp RE
                 (default = extract all SpamAssassin-encapsulated mails)

  log options
  -o             write all logs to stdout
  --loghits      log the text hit for patterns (useful for debugging)
  --loguris      log the URIs found
  --hamlog=log   use <log> as ham log ('ham.log' is default)
  --spamlog=log  use <log> as spam log ('spam.log' is default)

  message selection options
  -n             no date sorting or spam/ham interleaving
  --after=N      only test mails received after time_t N (negative values
                 are an offset from current time, e.g. -86400 = last day)
                 or after date as parsed by Time::ParseDate (e.g. '-6 months')
  --before=N     same as --after, except received times are before time_t N
  --cache        use cached information about atime (generates files in
                 corpus area)
  --all          don't skip big messages
  --head=N       only check first N ham and N spam (N messages if -n used)
  --tail=N       only check last N ham and N spam (N messages if -n used)

  simple target options (implies -o and no ham/spam classification)
  --dir          subsequent targets are directories
  --file         subsequent targets are files in RFC 822 format
  --mbox         subsequent targets are mbox files
  --mbx          subsequent targets are mbx files

  Just left over functions we should remove at some point:
  --bayes        report score from Bayesian classifier

  non-option arguments are used as target names (mail files and
  folders), the target format is: <class>:<format>:<location>
  class          is "spam" or "ham"
  format         is "dir", "file", "mbx", or "mbox"
  location       is a file or directory name. Globbing of ~ and * is supported.
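To make the <class>:<format>:<location> target format concrete, here is a small sketch that splits a target into its three parts and rejects classes or formats not listed above. It is an illustration only, not mass-check's own parser: `parse_target` is a hypothetical helper, and it does not perform the ~ and * globbing that mass-check supports.

```shell
#!/bin/sh
# parse_target SPEC
# Split a "<class>:<format>:<location>" target and print the three parts,
# one per line. Returns non-zero if the spec is malformed.
parse_target() {
    spec=$1

    # Require at least two colons so all three fields are present.
    case $spec in
        *:*:*) ;;
        *) echo "expected <class>:<format>:<location>: $spec" >&2; return 1 ;;
    esac

    class=${spec%%:*}      # text before the first colon
    rest=${spec#*:}        # everything after the first colon
    format=${rest%%:*}     # text before the second colon
    location=${rest#*:}    # remainder; may itself contain colons

    case $class in
        ham|spam) ;;
        *) echo "unknown class: $class" >&2; return 1 ;;
    esac
    case $format in
        dir|file|mbx|mbox) ;;
        *) echo "unknown format: $format" >&2; return 1 ;;
    esac

    printf '%s\n%s\n%s\n' "$class" "$format" "$location"
}

# Usage:
#   parse_target ham:dir:/path/to/ham
```

Only the first two colons are treated as separators, so a location that contains a colon is passed through unchanged.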