Using mass-check To Test Rules

"mass-check" is a tool included with the SpamAssassin source distribution to test rules for accuracy and hit-rate. If you're writing custom rules, you really should use this to test them.

First, you need HandClassifiedCorpora. Let's say that's divided into two maildir folders, "/path/to/ham" and "/path/to/spam".

Next, cd into the "masses" directory of the source distribution:

    cd masses
    ./mass-check --progress ham:dir:/path/to/ham spam:dir:/path/to/spam

This creates two files, "ham.log" and "spam.log", recording how the rules in the rules directory "../rules" hit when applied to each corpus.

(mass-check also takes other options to control whether network tests are run, whether multiple processes are run in parallel, etc.; read the comments at the top of the file for details.)

Using hit-frequencies

Next, to turn those logs into a freqs report, run these commands:

    make clean
    make freqs

That takes "ham.log" and "spam.log" and generates a "freqs" file from the data. The file shows how often each rule hits and how accurately it separates spam from ham. Its format looks like this:

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
   6317     2614     3703    0.414   0.00    0.00  (all messages)
100.000  41.3804  58.6196    0.414   0.00    0.00  (all messages as %)
  2.153   5.2028   0.0000    1.000   1.00    4.30  RCVD_IN_OPM_HTTP
  1.219   0.0000   2.0794    0.000   1.00   -0.10  RCVD_IN_BSP_OTHER
  0.364   0.8799   0.0000    1.000   0.99    4.30  RCVD_IN_OPM_SOCKS
  0.332   0.0000   0.5671    0.000   0.99   -4.30  RCVD_IN_BSP_TRUSTED
  0.063   0.1530   0.0000    1.000   0.99    4.30  RCVD_IN_OPM_WINGATE
  1.061   2.5249   0.0270    0.989   0.96    0.64  RCVD_IN_NJABL_SPAM
  0.697   1.6067   0.0540    0.967   0.90    1.10  RCVD_IN_SORBS_SMTP
  1.520   3.4430   0.1620    0.955   0.87    1.10  RCVD_IN_SORBS_HTTP

The columns are:

OVERALL%: the percentage of mail overall that the test hits
SPAM%: the percentage of spam mails hit by the rule
HAM%: the percentage of ham mails hit by the rule
S/O: "spam over overall" – the Bayesian probability that, when the rule fires, it hits on a spam message
RANK: an artificial number indicating how "good" the rule is
SCORE: the score listed in the "../rules/50_scores.cf" file for that rule
NAME: the rule's name

The first two lines after the header give the number of messages in the corpora and the percentage makeup of the corpus as spam vs. ham (so in this example, the corpus is 41.38% spam vs. 58.62% ham).
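
The percentages on the second of those lines follow directly from the counts on the first. Here is a minimal sketch of that arithmetic, using the counts from the sample report above. The helper name "pct" is made up for illustration and is not part of the SpamAssassin tools.

```shell
# pct PART TOTAL: print PART as a percentage of TOTAL, to four decimal places.
# (Hypothetical helper for illustration; not part of mass-check or hit-frequencies.)
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.4f\n", 100 * part / total }'
}

pct 2614 6317   # spam messages as a share of all messages in the sample
pct 3703 6317   # ham messages as a share of all messages in the sample
```

The two results should match the SPAM% and HAM% columns of the "(all messages as %)" line.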

S/O needs more explanation, as it's a key figure. A rule with S/O 1.0 hits only spam, no ham; a rule with S/O 0.0 hits only ham, no spam; and a rule with S/O 0.5 hits both evenly (and is therefore pretty useless for telling them apart).
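
The per-rule S/O values in the sample report above are consistent with dividing SPAM% by the sum of SPAM% and HAM%. The sketch below assumes that formula; it is inferred from the sample figures, not taken from the hit-frequencies source, and the helper name "so" is made up for illustration.

```shell
# so SPAM_PCT HAM_PCT: print SPAM% / (SPAM% + HAM%) to three decimal places.
# Assumed formula, inferred from the sample report; hypothetical helper name.
so() {
    awk -v s="$1" -v h="$2" 'BEGIN {
        if (s + h == 0) { print "n/a"; exit }   # rule never fired, S/O undefined
        printf "%.3f\n", s / (s + h)
    }'
}

so 2.5249 0.0270   # SPAM% and HAM% of RCVD_IN_NJABL_SPAM
so 0.0000 2.0794   # SPAM% and HAM% of RCVD_IN_BSP_OTHER
so 1.0 1.0         # a made-up rule that hits spam and ham at the same rate
```

Working from percentages rather than raw message counts means the result does not shift just because the corpus holds more ham than spam.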

A good rule has a very extreme S/O (as near as possible to 1.0 or 0.0) and a high percentage of hits in the correct category. For example, RCVD_IN_OPM_HTTP is a very good rule, because it hits 5.2028% of all spam mails without hitting any ham at all (no false positives).
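
To pick such rules out of a long report, you can filter on the S/O and hit-rate columns. This is only a sketch: it assumes the column order shown in the sample report, the thresholds (S/O of at least 0.95 or at most 0.05, and at least 1% hits in the matching category) are arbitrary, and the function name "good_rules" is made up for illustration.

```shell
# good_rules: read a freqs report on stdin and print rules with an extreme S/O
# and at least 1% hits in the matching category.
# Output columns: NAME, category, S/O, hit percentage in that category.
# Thresholds are arbitrary illustrations; hypothetical helper name.
good_rules() {
    awk '
        $1 !~ /^[0-9.]+$/ { next }   # skip the header line
        $7 ~ /^\(/        { next }   # skip the "(all messages)" summary lines
        $4 >= 0.95 && $2 >= 1.0 { print $7, "spam", $4, $2 }
        $4 <= 0.05 && $3 >= 1.0 { print $7, "ham", $4, $3 }
    '
}

# Try it on a few lines copied from the sample report above.
good_rules <<'EOF'
OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
   6317     2614     3703    0.414   0.00    0.00  (all messages)
  2.153   5.2028   0.0000    1.000   1.00    4.30  RCVD_IN_OPM_HTTP
  1.219   0.0000   2.0794    0.000   1.00   -0.10  RCVD_IN_BSP_OTHER
  0.364   0.8799   0.0000    1.000   0.99    4.30  RCVD_IN_OPM_SOCKS
EOF
```

On a real report, run it as "good_rules < freqs" from the "masses" directory. Note that RCVD_IN_OPM_SOCKS is filtered out despite its perfect S/O, because it hits under 1% of spam.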
