Hit Frequencies

"hit-frequencies" is a script in the 'masses' directory of the SpamAssassin source distribution, used to measure rule accuracy and hit-rates, based on the output log files from MassCheck.

Once you've run MassCheck, you have a "ham.log" and a "spam.log" file. To turn those into a useful summary, you run "hit-frequencies" to generate a "freqs report". Here's how – run:

...

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
   6317     2614     3703    0.414   0.00    0.00  (all messages)
100.000  41.3804  58.6196    0.414   0.00    0.00  (all messages as %)
  2.153   5.2028   0.0000    1.000   1.00    4.30  RCVD_IN_OPM_HTTP
  1.219   0.0000   2.0794    0.000   1.00   -0.10  RCVD_IN_BSP_OTHER
  0.364   0.8799   0.0000    1.000   0.99    4.30  RCVD_IN_OPM_SOCKS
  0.332   0.0000   0.5671    0.000   0.99   -4.30  RCVD_IN_BSP_TRUSTED
  0.063   0.1530   0.0000    1.000   0.99    4.30  RCVD_IN_OPM_WINGATE
  1.061   2.5249   0.0270    0.989   0.96    0.64  RCVD_IN_NJABL_SPAM
  0.697   1.6067   0.0540    0.967   0.90    1.10  RCVD_IN_SORBS_SMTP
  1.520   3.4430   0.1620    0.955   0.87    1.10  RCVD_IN_SORBS_HTTP

The columns are:

OVERALL%	the percentage of all mail that the rule hits
SPAM%	the percentage of spam mails hit by the rule
HAM%	the percentage of ham mails hit by the rule
S/O	"spam over overall ratio": the probability that, when the rule fires, it hits on a spam message
RANK	an artificial ranking that indicates how "good" the rule is
IG	information gain of the rule, normalized to a value between 0 and 1. Intuitively, this shows how much knowing whether the rule fired helps to guess the correct classification of an e-mail. In general, RANK works better.
SCORE	the score listed in the "../rules/50_scores.cf" file for that rule
NAME	the rule's name
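
To make the IG column more concrete, here is a minimal sketch of the textbook definition of information gain, normalized by the entropy of the ham/spam split so that it falls between 0 and 1. This is an assumption for illustration only: the function names are hypothetical, and the exact formula used inside "hit-frequencies" may differ.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def normalized_information_gain(spam_hits, ham_hits, n_spam, n_ham):
    """Textbook information gain of a rule, scaled into the range 0..1.

    Hypothetical helper: not taken from the hit-frequencies source.
    """
    total = n_spam + n_ham
    class_entropy = entropy([n_spam / total, n_ham / total])
    if class_entropy == 0:
        return 0.0  # a corpus with only one class has nothing to gain

    hits = spam_hits + ham_hits
    misses = total - hits

    # Entropy of the ham/spam label that remains once we know
    # whether the rule fired.
    conditional = 0.0
    if hits:
        conditional += hits / total * entropy(
            [spam_hits / hits, ham_hits / hits])
    if misses:
        conditional += misses / total * entropy(
            [(n_spam - spam_hits) / misses, (n_ham - ham_hits) / misses])

    return (class_entropy - conditional) / class_entropy
```

Under this definition, a rule that fires on every spam message and on no ham scores 1, and a rule that never fires scores 0.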

The first two lines list the number of messages in the corpus and its percentage makeup as spam vs. ham (so in this example, the corpus is 41.38% spam vs. 58.62% ham).
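
The relationship between those two summary lines can be checked with a short script. The following is a sketch that assumes the whitespace-separated column layout shown in the example above; "parse_freqs_row" is a hypothetical helper, not part of SpamAssassin.

```python
def parse_freqs_row(line):
    """Split one row of a freqs report into named fields.

    Assumes the seven-column layout from the example report:
    OVERALL%, SPAM%, HAM%, S/O, RANK, SCORE, NAME.
    """
    fields = line.split(None, 6)
    overall, spam, ham, s_o, rank, score = (float(x) for x in fields[:6])
    return {
        "overall": overall,
        "spam": spam,
        "ham": ham,
        "s_o": s_o,
        "rank": rank,
        "score": score,
        "name": fields[6].strip(),
    }


# The "(all messages)" row holds message counts rather than percentages.
counts = parse_freqs_row(
    "   6317     2614     3703    0.414   0.00    0.00  (all messages)")

spam_share = 100.0 * counts["spam"] / counts["overall"]
ham_share = 100.0 * counts["ham"] / counts["overall"]
print(f"{spam_share:.4f}% spam, {ham_share:.4f}% ham")
```

The computed shares match the "(all messages as %)" row of the example report.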

...

S/O stands for "spam / overall", for which the formula is "spam% / (ham% + spam%)"; in other words, the proportion of the total hits that were spam messages. As such, it is equivalent to a Bayesian probability, or to Positive Predictive Value in bioinformatics or medicine.
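
As a worked example, the formula can be applied to the SPAM% and HAM% columns of the report shown earlier. This is a minimal sketch; "spam_over_overall" is a hypothetical name, and returning None for a rule with no hits is an assumption made here, not documented behaviour of the script.

```python
def spam_over_overall(spam_pct, ham_pct):
    """Compute S/O as spam% / (ham% + spam%)."""
    total = spam_pct + ham_pct
    if total == 0:
        return None  # the rule never fired, so the ratio is undefined
    return spam_pct / total


# SPAM% and HAM% values for RCVD_IN_NJABL_SPAM from the example report.
print(round(spam_over_overall(2.5249, 0.0270), 3))
```

Note that the formula works on the per-corpus percentages rather than on raw hit counts, so the spam and ham corpora are weighted equally regardless of their sizes.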

...

Alternatively, "hit-frequencies" has the -o switch to measure overlap. Be warned, however, that this can be considerably slower and more RAM-hungry than running without it, as the script then needs to track much more data internally.

Usage

usage: hit-frequencies [-c rules dir] [-f] [-m RE] [-M RE] [-X RE] [-l LC] [-s SC] [-a] [-p] [-x] [-i] [spam log] [ham log]

-c p	use p as the rules directory, default: "../rules"
-f	falses. count only false-negative or false-positive matches
-m RE	print rules matching regular expression
-t RE	print rules with tflags matching regular expression
-M RE	only consider log entries matching regular expression
-X RE	don't consider log entries matching regular expression
-l LC	also print language specific rules for lang code LC (or 'all')
-L LC	only print language specific rules for lang code LC (or 'all')
-a	display all tests
-p	percentages. implies -x
-x	extended output, with S/O ratio and scores
-s SC	which scoreset to use
-i	use IG (information gain) for ranking

Options -l and -L are mutually exclusive.

Options -M and -X are *not* mutually exclusive.

If either the spam or the ham log is unspecified, the defaults are "spam.log" and "ham.log" in the current working directory.
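
The options above can be combined into a single command line. The sketch below only assembles an argument list from switches documented on this page and does not run anything; the helper name and the "./hit-frequencies" script path (that is, running from inside the 'masses' directory) are assumptions.

```python
def build_hit_frequencies_command(spam_log="spam.log", ham_log="ham.log", *,
                                  rules_dir=None, scoreset=None,
                                  name_regex=None, all_tests=False,
                                  percentages=False, extended=False,
                                  script="./hit-frequencies"):
    """Assemble an argv list for hit-frequencies (hypothetical helper)."""
    cmd = [script]
    if rules_dir is not None:
        cmd += ["-c", rules_dir]      # rules directory
    if name_regex is not None:
        cmd += ["-m", name_regex]     # print only rules matching this RE
    if scoreset is not None:
        cmd += ["-s", str(scoreset)]  # which scoreset to use
    if all_tests:
        cmd.append("-a")              # display all tests
    if percentages:
        cmd.append("-p")              # percentages; implies -x
    elif extended:
        cmd.append("-x")              # extended output
    cmd += [spam_log, ham_log]
    return cmd


print(" ".join(build_hit_frequencies_command(all_tests=True, percentages=True)))
```

The resulting list could be passed to subprocess.run with its output redirected to a file to save the report.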
