...

"mass-check" is a tool included with the SpamAssassin source distribution in the 'masses' directory, which can be found in the SVN repository, to test rules for accuracy and hit-rate. If you're writing custom rules, you really should use this to test them.

First, you need HandClassifiedCorpora. Let's say that's made up of two mbox folders, "/path/to/ham" and "/path/to/spam". (If your corpus is divided into two maildir directories instead, use the "dir" target format in place of "mbox".)

...

No Format
cd masses
./mass-check --progress \
    ham:mbox:/path/to/ham \
    spam:mbox:/path/to/spam

This will create two files, "ham.log" and "spam.log", recording which rules hit each message when the rules read from the rules dir "../rules" are applied to that corpus. Each line of the two log files represents details about one email message, and there's a line for every message.

(mass-check also takes other options to control whether network tests are run, whether multiple processes are run in parallel, how the output is presented, etc.; read the comments at the top of the "mass-check" file for details.)

Using hit-frequencies

Next, to turn that into a freqs report, run this command:

No Format
make clean
make freqs

That will take "ham.log" and "spam.log" and generate a "freqs" file from the data. This gives you the frequency with which each rule hits and details of its accuracy in hitting spam vs. ham. Its format looks like this:

No Format
OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
   6317     2614     3703    0.414   0.00    0.00  (all messages)
100.000  41.3804  58.6196    0.414   0.00    0.00  (all messages as %)
  2.153   5.2028   0.0000    1.000   1.00    4.30  RCVD_IN_OPM_HTTP
  1.219   0.0000   2.0794    0.000   1.00   -0.10  RCVD_IN_BSP_OTHER
  0.364   0.8799   0.0000    1.000   0.99    4.30  RCVD_IN_OPM_SOCKS
  0.332   0.0000   0.5671    0.000   0.99   -4.30  RCVD_IN_BSP_TRUSTED
  0.063   0.1530   0.0000    1.000   0.99    4.30  RCVD_IN_OPM_WINGATE
  1.061   2.5249   0.0270    0.989   0.96    0.64  RCVD_IN_NJABL_SPAM
  0.697   1.6067   0.0540    0.967   0.90    1.10  RCVD_IN_SORBS_SMTP
  1.520   3.4430   0.1620    0.955   0.87    1.10  RCVD_IN_SORBS_HTTP

The columns are:

OVERALL%: the percentage of mail overall that the test hits
SPAM%: the percentage of spam mails hit by the rule
HAM%: the percentage of ham mails hit by the rule
S/O: "spam over overall", the Bayesian probability that, when the rule fires, it hits on a spam message
RANK: an artificial number indicating how "good" the rule is
SCORE: the score listed in the "../rules/50_scores.cf" file for that rule
NAME: the rule's name

The first two lines list the number of messages in the corpora, and the percentage makeup of the corpus as ham vs. spam (so in this example, the corpus is 41.38% spam vs 58.62% ham).

S/O needs more explanation, as it's a key figure. A rule with S/O 1.0 is very accurate at hitting spam without hitting ham; a rule with S/O 0.0 hits only ham, no spam; but a rule with S/O 0.5 hits both evenly (and is therefore pretty useless).

Here are some key bits:

Configuration File

Mass-check reads a "user_prefs" file in "spamassassin/user_prefs". You need to create this yourself; it will not be created for you.

To test your own rules, you'll need to put them in this file, and include a line containing "allow_user_rules 1".

Using network tests

For mass-checks for scoresets 1 or 3, which use network tests, you need to provide the --net switch. Ensure Net::DNS, Mail::SPF, Mail::DKIM (at least 0.31, preferably 0.36_5 or later), Razor (InstallingRazor), Pyzor (InstallingPyzor) and DCC (InstallingDCC) are installed.

Network tests are slow unless you use the -j switch to allow mass-check to start multiple parallel scanning processes.

Using Bayes

This is controlled using the mass-check configuration file. To turn it on:

No Format
cd masses
mkdir spamassassin
rm spamassassin/bayes*
echo "use_bayes 1" >> spamassassin/user_prefs

or to turn it off:

No Format
cd masses
mkdir spamassassin
echo "use_bayes 0" >> spamassassin/user_prefs
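
Because the commands above append with ">>", switching Bayes on and off several times leaves several "use_bayes" lines in "spamassassin/user_prefs". A minimal sketch of one way to keep a single line per setting follows. The helper name set_pref is hypothetical and is not part of mass-check; it only edits the file and does not remove the "bayes*" database files.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with SpamAssassin.
# set_pref FILE KEY VALUE: make sure FILE ends up with exactly one
# "KEY VALUE" line, replacing any earlier line for the same KEY.
set_pref() {
    file=$1 key=$2 value=$3
    mkdir -p "$(dirname "$file")"
    touch "$file"
    # Copy every line except existing settings for KEY.
    # grep exits non-zero when nothing is copied, which is fine here.
    # KEY is used as a regular expression, so keep it to plain option names.
    grep -v "^$key[[:space:]]" "$file" > "$file.tmp" || true
    # Append the new setting and swap the file into place.
    printf '%s %s\n' "$key" "$value" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Example, run from the masses directory:
#   set_pref spamassassin/user_prefs use_bayes 1
#   set_pref spamassassin/user_prefs allow_user_rules 1
```

Other lines in the file, such as your own rules, are left untouched.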

Once mass-check completes

If you're using mass-check to test your own rules, the next step is to run hit-frequencies: see HitFrequencies for details. Alternatively, if you're submitting data for a new scoreset, see RescoreMassCheck, or NightlyMassCheck for the nightly QA test.

Usage

mass-check [options] target ...

-c=file	set configuration/rules directory
-p=dir	set user-prefs directory
-f=file	read list of targets from <file>
-j=jobs	specify the number of processes to run simultaneously
--net	turn on network checks!
--mid	report Message-ID from each message
--debug	report debugging information
--progress	show progress updates during check
--rewrite=OUT	save rewritten message to OUT (default is /tmp/out)
--showdots	print a dot for each scanned message
--rules=RE	Only test rules matching the given regexp RE
--restart=N	restart all of the children after processing N messages
--deencap=RE	Extract SpamAssassin-encapsulated spam mails only if they were encapsulated by servers matching the regexp RE (default = extract all SpamAssassin-encapsulated mails)

log options

-o	write all logs to stdout
--loghits	log the text hit for patterns (useful for debugging)
--loguris	log the URIs found
--hamlog=log	use <log> as ham log ('ham.log' is default)
--spamlog=log	use <log> as spam log ('spam.log' is default)

message selection options

-n	no date sorting or spam/ham interleaving
--after=N	only test mails received after time_t N (negative values are an offset from current time, e.g. -86400 = last day) or after date as parsed by Time::ParseDate (e.g. '-6 months')
--before=N	same as --after, except received times are before time_t N
--cache	Use cached information about atime (generates files in corpus area)
--all	don't skip big messages
--head=N	only check first N ham and N spam (N messages if -n used)
--tail=N	only check last N ham and N spam (N messages if -n used)
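
To illustrate the numeric form of --after and --before: a negative value is an offset from the current time, anything else is taken as an absolute time_t. The sketch below only mirrors that description; the function name resolve_time is hypothetical, and it does not handle date strings such as '-6 months', which the text says are parsed by Time::ParseDate.

```shell
#!/bin/sh
# Hypothetical helper that mirrors the numeric --after/--before rule.
# resolve_time N [NOW]: print the absolute time_t that N stands for.
# NOW defaults to the current time in seconds since the epoch.
resolve_time() {
    n=$1
    now=${2:-$(date +%s)}
    if [ "$n" -lt 0 ]; then
        # Negative values count back from NOW.
        echo $((now + n))
    else
        # Non-negative values are already absolute.
        echo "$n"
    fi
}

# One day before a fixed reference time:
resolve_time -86400 1700000000   # prints 1699913600
```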

simple target options (implies -o and no ham/spam classification)

--dir	subsequent targets are directories
--file	subsequent targets are files in RFC 822 format
--mbox	subsequent targets are mbox files
--mbx	subsequent targets are mbx files

Leftover options that should be removed at some point:

--bayes	report score from Bayesian classifier

Usage: Targets

Non-option arguments are used as target names (mail files and folders). The target format is: <class>:<format>:<location>

class	is "spam" or "ham"
format	is "detect", "dir", "file", "mbx", or "mbox"
location	is a file or directory name. Globbing of ~ and * is supported.
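
As a rough illustration of the target syntax, here is a sketch that splits a target into its three parts and imitates the "detect" format described in this section. Both function names are hypothetical and the code is not taken from mass-check. In particular, checking "is not a directory" before testing for ".mbox" is an assumption about the order of the checks.

```shell
#!/bin/sh
# Hypothetical helpers that illustrate the target syntax only.

# parse_target TARGET: split "<class>:<format>:<location>" and print
# the three parts separated by spaces. Only the first two colons split,
# so a location may itself contain ':'.
parse_target() {
    target=$1
    case $target in
        *:*:*) ;;
        *) echo "expected <class>:<format>:<location>: $target" >&2
           return 1 ;;
    esac
    class=${target%%:*}      # text before the first ':'
    rest=${target#*:}        # text after the first ':'
    format=${rest%%:*}
    location=${rest#*:}
    case $class in
        ham|spam) ;;
        *) echo "bad class: $class" >&2; return 1 ;;
    esac
    case $format in
        detect|dir|file|mbx|mbox) ;;
        *) echo "bad format: $format" >&2; return 1 ;;
    esac
    printf '%s %s %s\n' "$class" "$format" "$location"
}

# detect_format PATH: guess the format the way "detect" is described.
# Prints "dir" for what the text calls "directory".
detect_format() {
    path=$1
    # Assumption: only non-directories are tested against /\.mbox/i.
    if [ ! -d "$path" ] && printf '%s\n' "$path" | grep -qi '\.mbox'; then
        echo mbox
    elif [ -d "$path" ]; then
        echo dir
    else
        echo file
    fi
}

parse_target "spam:mbox:/path/to/spam"   # prints: spam mbox /path/to/spam
```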

"detect" is the easiest format to use. This assumes "mbox" for any file whose path contains the pattern "/\.mbox/i", "directory" for anything that is a directory, or "file" otherwiseA good rule has a very extreme S/O (near as possible to 1.0 or 0.0) and a high percentage of hits in the correct category. In other words,
RCVD_IN_OPM_HTTP is a very good rule, because it hits 5.2028% of all spam mails without hitting any ham at all (no false positives).
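
The S/O figures in the sample report can be reproduced from the SPAM% and HAM% columns. The sketch below assumes S/O = SPAM% / (SPAM% + HAM%). That formula is inferred from the sample rows on this page, not taken from the hit-frequencies script, so treat it as an approximation of what the tool does. The function name so_ratio is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: so_ratio SPAM_PCT HAM_PCT
# Prints SPAM% / (SPAM% + HAM%) rounded to three decimals,
# or "n/a" when the rule hit nothing at all.
so_ratio() {
    awk -v s="$1" -v h="$2" 'BEGIN {
        if (s + h == 0) { print "n/a"; exit }
        printf "%.3f\n", s / (s + h)
    }'
}

so_ratio 2.5249 0.0270   # RCVD_IN_NJABL_SPAM row, prints 0.989
so_ratio 5.2028 0.0000   # RCVD_IN_OPM_HTTP row, prints 1.000
```

Since the inputs are percentages of each class rather than raw message counts, the result does not depend on how much ham versus spam the corpus happens to contain.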
