Hit Frequencies

"hit-frequencies" is a script in the "masses" directory (see MassesOverview) of the SpamAssassin source distribution, used to measure rule accuracy and hit rates, based on the output log files from MassCheck.

Once you've run MassCheck, you have a "ham.log" and a "spam.log" file. To turn those into a useful summary, you run "hit-frequencies" to generate a "freqs report". Here's how – run:

...

OVERALL%

the percentage of mail overall that the rule hits

SPAM%

the percentage of spam mails hit by the rule

HAM%

the percentage of ham mails hit by the rule

S/O

"spam over overall ratio" – the probability that, when the rule fires, it hits on a spam message

RANK

An artificial ranking that indicates how "good" the rule is.

IG

Information gain of the rule, normalized to a value between 0 and 1. Intuitively, this shows how much knowing whether the rule fired helps in guessing the correct classification of an e-mail. In general, RANK works better.

SCORE

the score listed in the "../rules/50_scores.cf" file for that rule

NAME

the rule's name

The first two lines list the number of messages in the corpora, and the percentage makeup of the corpus as ham vs. spam (so in this example, the corpus is 41.38% spam vs 58.61% ham).
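As a rough illustration of how the percentage columns relate to raw hit counts, here is a small Python sketch. It is not the script's own code: the function name and the sample counts are hypothetical, and the S/O formula used here (SPAM% divided by the sum of SPAM% and HAM%) is an assumption, chosen so that the ratio does not shift with the ham/spam makeup of the corpus.

```python
def freq_columns(spam_hits, ham_hits, num_spam, num_ham):
    """Derive OVERALL%, SPAM%, HAM% and S/O for one rule from raw counts.

    All names here are illustrative; they do not come from hit-frequencies.
    """
    spam_pct = 100.0 * spam_hits / num_spam
    ham_pct = 100.0 * ham_hits / num_ham
    overall_pct = 100.0 * (spam_hits + ham_hits) / (num_spam + num_ham)

    # Assumption: S/O is taken over the percentages rather than the raw
    # counts, so a corpus with far more ham than spam does not skew it.
    total_pct = spam_pct + ham_pct
    # A rule that never fires has no meaningful ratio; None marks that case.
    s_o = spam_pct / total_pct if total_pct > 0 else None

    return {
        "OVERALL%": overall_pct,
        "SPAM%": spam_pct,
        "HAM%": ham_pct,
        "S/O": s_o,
    }


if __name__ == "__main__":
    # Hypothetical rule: hits 100 of 500 spam mails and 20 of 2000 ham mails.
    for column, value in freq_columns(100, 20, 500, 2000).items():
        print(column, value)
```

Under this assumption, a rule with a high SPAM% and a low HAM% gets an S/O close to 1, and a rule that mostly hits ham gets an S/O close to 0.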

...

Alternatively, "hit-frequencies" has the -o switch to measure overlap. Be warned, however, that this can be considerably slower and more RAM-hungry than running without it, as the script then needs to track much more data internally.

Usage

usage:

hit-frequencies [-c rules dir] [-f] [-m RE] [-M RE] [-X RE] [-l LC] [-s SC] [-a] [-p] [-x] [-i] [spam log] [ham log]

-c p

use p as the rules directory, default: "../rules"

-f

falses. count only false-negative or false-positive matches

-m RE

print rules matching regular expression

-t RE

print rules with tflags matching regular expression

-M RE

only consider log entries matching regular expression

-X RE

don't consider log entries matching regular expression

-l LC

also print language specific rules for lang code LC (or 'all')

-L LC

only print language specific rules for lang code LC (or 'all')

-a

display all tests

-p

percentages. implies -x

-x

extended output, with S/O ratio and scores

-s SC

which scoreset to use

-i

use IG (information gain) for ranking

options -l and -L are mutually exclusive.

options -M and -X are *not* mutually exclusive.

if either the spam log or the ham log is unspecified, the defaults are "spam.log" and "ham.log" in the current working directory.
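To show how the options above combine, here is a minimal Python sketch that assembles a command line from a handful of them. The helper name, its parameters, and the "./hit-frequencies" script path are all hypothetical. It uses only switches listed above, refuses to combine -l with -L, and fills in the documented default file name when only one of the two logs is given.

```python
def build_argv(spam_log=None, ham_log=None, rules_dir=None, match=None,
               also_lang=None, only_lang=None, scoreset=None,
               extended=False, percentages=False,
               script="./hit-frequencies"):
    """Build an argument list for hit-frequencies (illustrative helper only)."""
    # -l and -L are documented as mutually exclusive.
    if also_lang is not None and only_lang is not None:
        raise ValueError("-l and -L are mutually exclusive")

    argv = [script]  # assumed path: run from inside the masses directory
    if rules_dir is not None:
        argv += ["-c", rules_dir]
    if match is not None:
        argv += ["-m", match]
    if also_lang is not None:
        argv += ["-l", also_lang]
    if only_lang is not None:
        argv += ["-L", only_lang]
    if percentages:
        argv.append("-p")  # -p implies -x, so -x is not added as well
    elif extended:
        argv.append("-x")
    if scoreset is not None:
        argv += ["-s", str(scoreset)]

    # The logs are positional, spam first. If only one is given, pass the
    # documented default name for the other so the order stays unambiguous.
    if spam_log is not None or ham_log is not None:
        argv += [spam_log or "spam.log", ham_log or "ham.log"]
    return argv


if __name__ == "__main__":
    print(build_argv(percentages=True, scoreset=0))
```

The resulting list can be handed to `subprocess.run` from the directory that contains the script and the logs.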

...